GPT-5 is hopefully all you need.

AI News for 8/6/2025-8/7/2025. We checked 12 subreddits, 544 Twitters and 29 Discords (227 channels, and 16553 messages) for you. Estimated reading time saved (at 200wpm): 1183 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!

While the livestream was somewhat disappointing (except for the highly entertaining chart crimes), and the benchmarks were incremental improvements over the SOTA offerings from OpenAI, the pricing wow'ed us, as OpenAI took back the Pareto Frontier of Intelligence from GDM:

With OpenAI now having at least a 4 Sonnet tier model, passing developer vibe checks, it is solidly "back" in the coding model game, although it remains to be seen what the long term impact will be.

We recommend looking through the hands-on early beta report and thinking through what was revealed from the model card description of GPT-5's architecture.

Here is GPT-5's launch, according to GPT-5:

OpenAI’s GPT‑5 Launch: unified router, aggressive pricing, broad rollout

What shipped: GPT‑5 is a “unified system” with a fast “main” model and a deeper “thinking” model behind a real‑time router that decides when to reason, call tools, or stay terse. In ChatGPT, there’s no model picker by default; Plus users can select GPT‑5 vs GPT‑5 Thinking; Pro gets more variants. API exposes gpt‑5, gpt‑5‑mini, gpt‑5‑nano and a “reasoning effort” control (minimal/low/medium/high). Context: up to 400K (128K max output). Knowledge cutoff reported as 2024‑10‑01 for main; minis reported as 2024‑05‑31. Rollout is staged to Free/Plus/Pro/Team (Enterprise/Edu next week). Announcements: @OpenAI, @sama, system card summary.
Prices and cache: Tweets cite gpt‑5 at $1.25/M input, $10/M output with cache discounts (“flex” references as low as $0.625/$5) and gpt‑5‑mini at $0.25/$2; gpt‑5‑nano at $0.05/$0.4. Multiple OpenAI leads emphasized the cost downshift and cache economics (@scaling01, @sama, @jeffintime).
Product integrations (day‑0):
- Chat/coding: Codex CLI makes GPT‑5 the default with usage included via ChatGPT plan; new terminal UI and rate‑limits by plan (@OpenAIDevs, @embirico). Cursor set GPT‑5 as the default coding model, temporarily free (@cursor_ai). JetBrains AI Assistant and Junie agent support GPT‑5 (@jetbrains). Microsoft Copilot “Smart Mode” routes to GPT‑5 (@mustafasuleyman). Notion AI now offers GPT‑5 (@NotionHQ). Perplexity added GPT‑5 for Pro/Max (@perplexity_ai).
- Agent scaffolds: Cline reports GPT‑5 is disciplined, parallelizes tool calls, and “plans verbose, executes terse” (@cline); Factory made GPT‑5 its default for “Droids” (@FactoryAI). OpenAI published a GPT‑5 prompting/cookbook bundle (@OpenAIDevs).

Benchmarks, evals, and the “chart crimes”

Arenas and coding: GPT‑5 tops LMSYS Text/WebDev/Vision Arenas (tested as “summit”), with a notably large WebDev margin (@lmarena_ai). OpenAI claims 74.9% on SWE‑bench Verified; several researchers immediately flagged a mislabeled axis and that OpenAI ran on a 477‑task subset; corrected charts put GPT‑5 roughly on par with Claude 4.1 Sonnet/Opus (74–75%) on the verified set (@nrehiew_, @OfirPress, @Sauers_).
Long‑context and hallucinations: GPT‑5 leads Artificial Analysis’ long‑context reasoning (AA‑LCR) occupying #1 and #2; big headline improvement over o3‑high on long‑context tasks (@ArtificialAnlys). Multiple claims of much lower hallucination and introduction of “safe completions” (refusal that maximizes utility within safety constraints) (@scaling01, @sama). METR’s autonomy eval finds GPT‑5 unlikely to pose catastrophic risk under current threat models, while cautioning about increased eval‑awareness/manipulation risk as capabilities rise (@METR_Evals).
Reasoning/agents: GPT‑5 shows strong instruction following and tool use (e.g., TauBench gains, IFBench instruction‑following), but mixed changes on non‑SWE coding evals and OpenAI PR‑reproduction (@omarsar0, @eli_lifland, @scaling01).
ARC‑AGI and safety‑deception: GPT‑5 hits 65.7% on ARC‑AGI‑1 but 9.9% on ARC‑AGI‑2; Grok‑4 leads ARC‑AGI‑2 at 15.9% (@fchollet, @scaling01). GPT‑5 shows lower deceptive behavior than o3 in OpenAI’s internal measures (methodology matters; third‑party replication pending) (@scaling01).
Note on comms: OpenAI’s event drew widespread criticism for multiple “chart crimes” (axis/scale errors) in slides—blog version was fixed later (@jeremyphoward, @iScienceLuvr).

Agentic coding reality check: strong tooling, fewer vibes

Hands‑on reports: Early users highlight GPT‑5’s “autistic” instruction following, fewer yaps, parallel tool calls, and long‑horizon persistence—e.g., multi‑file edits and reliable diffs (Codex CLI, Cline, Cursor). Several posts show one‑shot interactive apps/dashboards/games with minimal prompting (@skirano, @benhylak, @pashmerepat). Cursor calls GPT‑5 “the smartest coding model we've tried” and made it default, free initially (@cursor_ai).
Routers are the product: The deprecation of the in‑app model picker signals a bet on real‑time routing (thinking/tool use) as UX default; this shifts dev control from “which model?” to “what constraints/policy/verbosity/effort?” (@sama, @dariusemrani).
Independent evals: Deep‑research runs find GPT‑5 roughly comparable to Claude 4 Sonnet on long‑horizon research tasks (small sample), suggesting gains may be use‑case/stack dependent rather than across‑the‑board (@hwchase17).

AI Twitter Recap

OpenAI’s GPT‑5 Launch: unified router, aggressive pricing, broad rollout

What shipped: GPT‑5 is a “unified system” with a fast “main” model and a deeper “thinking” model behind a real‑time router that decides when to reason, call tools, or stay terse. In ChatGPT, there’s no model picker by default; Plus users can select GPT‑5 vs GPT‑5 Thinking; Pro gets more variants. API exposes gpt‑5, gpt‑5‑mini, gpt‑5‑nano and a “reasoning effort” control (minimal/low/medium/high). Context: up to 400K (128K max output). Knowledge cutoff reported as 2024‑10‑01 for main; minis reported as 2024‑05‑31. Rollout is staged to Free/Plus/Pro/Team (Enterprise/Edu next week). Announcements: @OpenAI, @sama, system card summary.
Prices and cache: Tweets cite gpt‑5 at $1.25/M input, $10/M output with cache discounts (“flex” references as low as $0.625/$5) and gpt‑5‑mini at $0.25/$2; gpt‑5‑nano at $0.05/$0.4. Multiple OpenAI leads emphasized the cost downshift and cache economics (@scaling01, @sama, @jeffintime).
Product integrations (day‑0):
- Chat/coding: Codex CLI makes GPT‑5 the default with usage included via ChatGPT plan; new terminal UI and rate‑limits by plan (@OpenAIDevs, @embirico). Cursor set GPT‑5 as the default coding model, temporarily free (@cursor_ai). JetBrains AI Assistant and Junie agent support GPT‑5 (@jetbrains). Microsoft Copilot “Smart Mode” routes to GPT‑5 (@mustafasuleyman). Notion AI now offers GPT‑5 (@NotionHQ). Perplexity added GPT‑5 for Pro/Max (@perplexity_ai).
- Agent scaffolds: Cline reports GPT‑5 is disciplined, parallelizes tool calls, and “plans verbose, executes terse” (@cline); Factory made GPT‑5 its default for “Droids” (@FactoryAI). OpenAI published a GPT‑5 prompting/cookbook bundle (@OpenAIDevs).

Benchmarks, evals, and the “chart crimes”

Arenas and coding: GPT‑5 tops LMSYS Text/WebDev/Vision Arenas (tested as “summit”), with a notably large WebDev margin (@lmarena_ai). OpenAI claims 74.9% on SWE‑bench Verified; several researchers immediately flagged a mislabeled axis and that OpenAI ran on a 477‑task subset; corrected charts put GPT‑5 roughly on par with Claude 4.1 Sonnet/Opus (74–75%) on the verified set (@nrehiew_, @OfirPress, @Sauers_).
Long‑context and hallucinations: GPT‑5 leads Artificial Analysis’ long‑context reasoning (AA‑LCR) occupying #1 and #2; big headline improvement over o3‑high on long‑context tasks (@ArtificialAnlys). Multiple claims of much lower hallucination and introduction of “safe completions” (refusal that maximizes utility within safety constraints) (@scaling01, @sama). METR’s autonomy eval finds GPT‑5 unlikely to pose catastrophic risk under current threat models, while cautioning about increased eval‑awareness/manipulation risk as capabilities rise (@METR_Evals).
Reasoning/agents: GPT‑5 shows strong instruction following and tool use (e.g., TauBench gains, IFBench instruction‑following), but mixed changes on non‑SWE coding evals and OpenAI PR‑reproduction (@omarsar0, @eli_lifland, @scaling01).
ARC‑AGI and safety‑deception: GPT‑5 hits 65.7% on ARC‑AGI‑1 but 9.9% on ARC‑AGI‑2; Grok‑4 leads ARC‑AGI‑2 at 15.9% (@fchollet, @scaling01). GPT‑5 shows lower deceptive behavior than o3 in OpenAI’s internal measures (methodology matters; third‑party replication pending) (@scaling01).
Note on comms: OpenAI’s event drew widespread criticism for multiple “chart crimes” (axis/scale errors) in slides—blog version was fixed later (@jeremyphoward, @iScienceLuvr).

Agentic coding reality check: strong tooling, fewer vibes

Hands‑on reports: Early users highlight GPT‑5’s “autistic” instruction following, fewer yaps, parallel tool calls, and long‑horizon persistence—e.g., multi‑file edits and reliable diffs (Codex CLI, Cline, Cursor). Several posts show one‑shot interactive apps/dashboards/games with minimal prompting (@skirano, @benhylak, @pashmerepat). Cursor calls GPT‑5 “the smartest coding model we've tried” and made it default, free initially (@cursor_ai).
Routers are the product: The deprecation of the in‑app model picker signals a bet on real‑time routing (thinking/tool use) as UX default; this shifts dev control from “which model?” to “what constraints/policy/verbosity/effort?” (@sama, @dariusemrani).
Independent evals: Deep‑research runs find GPT‑5 roughly comparable to Claude 4 Sonnet on long‑horizon research tasks (small sample), suggesting gains may be use‑case/stack dependent rather than across‑the‑board (@hwchase17).

OpenAI's GPT-5 Launch and Reception

The Announcement: OpenAI officially announced the launch of GPT-5, with CEO @sama teasing a livestream that would be "longer than usual" with "a lot to show." The new model is described as a unified system that automatically switches between quick answers and deeper reasoning, rolling out to all users, including the free tier. OpenAI's Head of Product @kevinweil stated, "It's the best thing we've ever built." The release deprecates previous models, with the goal of simplifying the user experience by removing the model switcher. An AMA with the team is scheduled for the following day.
Technical Details and Pricing: GPT-5 is a family of models, not a single monolithic entity. It includes gpt-5, gpt-5-mini, and gpt-5-nano, and also separate "thinking" models, leading @teortaxesTex to call it a "unified system" that is "literally just SEPARATE CoT + non-CoT models + a router." API pricing was a major point of discussion, with @scaling01 noting its competitiveness: the main model is priced at $1.25/$10 per million tokens, with mini at $0.25/$2 and nano at $0.05/$0.4, all with a 400k context window and a knowledge cutoff of October 1st, 2024. This makes it cheaper than Sonnet and better than Opus. @jerryjliu0 observed that for document understanding, GPT-5 seems to use 4-5x more tokens than GPT-4.1, potentially increasing its effective cost for vision tasks.
Performance and Benchmarks: Initial benchmarks show significant improvements in some areas but stagnation in others. @scaling01 highlighted "ridiculous improvements on long-context tasks" and near-elimination of hallucinations. GPT-5 also became the new SOTA on the LMArena leaderboard. However, @fchollet reported that on ARC-AGI, GPT-5 scored 65.7% on AGI-1 and a modest 9.9% on AGI-2. Further analysis from @scaling01 showed only a 3% improvement over o3 in reproducing scientific papers and no significant improvement on benchmarks like OpenAI Pull Requests and SWE-Lancer IC.
The "Chart Crime" Controversy: A significant part of the community discussion centered on misleading charts in the launch presentation. A chart on SWE-Bench performance was widely criticized for having a non-monotonic Y-axis, where 52.8% was plotted higher than 69.1%. This was first pointed out by @Teknium1 and amplified by many, including @jeremyphoward. @iScienceLuvr quipped, "If GPT-5 made this chart I'm bearish," while @kipperrii joked, "fuck the y-axis! fuck the x-axis! just scribble on an exponential and go home!" @nrehiew_ later provided a corrected version of the chart.

Competing Models and The Broader Ecosystem

xAI's Grok: Grok-4 emerged as a strong competitor, with @fchollet stating it remains state-of-the-art on ARC-AGI-2, scoring 15.9% to GPT-5's 9.9%. @Yuhu_ai_ claimed xAI is "ahead in many" aspects and that Grok was the "world's first unified model." In a dramatic turn, @cb_doge reported that Grok-4 defeated Google's Gemini in the Kaggle AI Chess semi-final, though it was ultimately defeated by OpenAI's o3 in the final. In parallel, @elonmusk announced that Grok Imagine video generation would be free for all US users.
Perplexity and Multi-Model Support: Perplexity announced day-one support for GPT-5 for its subscribers. CEO @AravSrinivas highlighted their extensive model offerings, including GPT-5, Claude 4.1 Opus, Grok 4, and Gemini 2.5 Pro, positioning Perplexity as a key multi-provider platform.
Open vs. Closed Source Landscape: @Tim_Dettmers offered a key insight, stating, "It seems the closed-source vs open-weights landscape has been leveled. GPT-5 is just 10% better at coding than an open-weight model you can run on a consumer desktop." This sentiment was echoed by discussion around OpenAI's gpt-oss release, which can now run natively in Google Colab T4 for free powered by Transformers.
Other Notable Models: Alibaba introduced new Qwen3-4B models. Kimi K2 received praise for its coding capabilities and unique writing style.

Developer Tooling, Frameworks, and Infrastructure

Developer Environments & CLIs: The release of GPT-5 triggered immediate integrations. @aidan_mclau announced that GPT-5 is now the default in Cursor, replacing Claude, with Cursor's CEO calling it "the smartest coding model we've tried." Codex CLI also saw major improvements with GPT-5 integration, with usage included in ChatGPT plans. Cline also added GPT-5, describing it as "disciplined, persistent, & highly competent."
RAG and Agentic Frameworks: There is continued strong interest in advanced Retrieval-Augmented Generation. @HamelHusain shared an open book titled "Beyond Naive RAG: Practical Advanced Methods." LangChain continues to build out its agent ecosystem, with @unwind_ai_ noting they "reverse-engineered Claude Code, Manus, and Deep Research" for their OpenSWE agent. Jules now supports running and rendering web applications with screenshot verification.
Inference and Infrastructure: vLLM highlighted its adoption by major tech companies like Tencent, Huawei, and ByteDance at a Beijing Meetup. A tutorial for building a minimal, vLLM-like inference system in under 1000 lines of code using FlexAttention was shared by @cloneofsimo. For developers running models locally, @ggerganov pointed out that LMStudio's use of the upstream ggml implementation is "significantly better and well optimized" compared to ollama's fork.

Broader Implications & Industry Commentary

The Plateau Thesis: A strong sentiment emerged that progress from simply scaling LLMs is hitting a wall. @far__el stated, "It’s clear that you can’t squeeze AGI out of LLMs even if you throw billions of dollars worth of compute at it, something fundamental is missing." @francoisfleuret agreed, clarifying the observation applies to the entire field: "We are seeing the plateau: just scaling up is coming to an end. For EVERYONE." This was contrasted with the view that agent scaffolds and post-training now matter more than ever.
AI Talent & Economics: @AndrewYNg provided a detailed analysis of the economics of AI development, explaining that the high capital cost of GPUs makes it rational for companies like Meta to pay enormous salaries to top talent to ensure hardware is used effectively. This capital-intensive nature makes salaries a small fraction of overall expenses. In a related observation, @jxmnop questioned the efficiency of VC funding, noting startups that "raised ~100M total... built software nobody ever used, and now they all work elsewhere."
Market Reaction: The initial market reaction to the GPT-5 launch was muted. @scaling01 noted that OpenAI was "getting crushed on Polymarket," suggesting markets were disappointed by the release.
AI and Knowledge: @Teknium1 argued that relying on search via an agent harness is not a valid replacement for the "rich connections" a model builds from backpropagation on the world's knowledge, stating it's "more meaningful than rag."

Research and New Techniques

Agent Learning and Optimization: Databricks researchers, including @lateinteraction and @jefrankle, introduced ALHF (Agent Learning from Human Feedback), a method where agents are optimized based on natural language feedback from users about bad responses. The technique is described as both information-dense and ergonomic. Another paper explored combining prompt optimization with policy-gradient RL.
Multimodal Models and Techniques: MiniMax announced Speech 2.5, a voice cloning model supporting 40 languages with high fidelity. Researchers at Google DeepMind shared a paper on efficiently training small vision-language models to use a zoom tool with GLaM. The TRL library also received a major upgrade for multimodal alignment, adding techniques like GRPO & MPO.
Data Augmentation: @cloneofsimo pointed out an interesting corollary of Fill-in-the-Middle (FIM) training: FIM-style augmentation can practically "2x" your high-quality dataset with practically zero downside.

Humor and Memes

Hype and Anticipation: The community geared up for the launch with @gdb posting a cryptic countdown timer T - [[5+5+5] - 5/5] hours, and @nearcyan joking about being at dinner with an OpenAI friend "vaguely gesturing towards the kitchen and grinning."
Chart Crime Memes: The flawed launch charts were a comedic goldmine, with @zacharynado and others sharing screenshots of the non-monotonic axes. @ThePrimeagen reacted to a claim of a founder deleting 10k lines of code a day with "ok Lex Luthor, its time to step away from the keyboard."
Relatable Engineer Life: @vikhyatk posted a picture of a messy desk with the caption "Men really think it's okay to live like this." @francoisfleuret captured the feeling of reviewing night logs with "what the [expletive] is this [expletive]."

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. GPT-OSS and OpenAI Model Hype and Brand Perception

GPT-OSS is Another Example Why Companies Must Build a Strong Brand Name (Score: 538, Comments: 329): The post criticizes the disproportionate attention given to GPT-OSS 120B compared to technically stronger alternatives like Qwen-235B and DeepSeek R1, citing lack of multimodality, smaller size, and reliance on innovations from other open-source projects. The author highlights that Alibaba has released competitive models (e.g., Wan2.2 and Qwen-Image for video/image, high-performing 30B and 4B models) with little media coverage, attributing GPT-OSS's hype to OpenAI's branding rather than clear technical superiority. Notably, DeepSeek R1's cost-per-training is presented as evidence of superior efficiency (DeepSeek: $5.58M vs rumored OpenAI: much higher), and censorship complaints are discussed as inconsistently applied across regions/models. Top comments emphasize the regional bias in influencer coverage (favoring US/English-language companies), shortcomings in technical depth of popular AI media channels, and technical distinctions—one commenter notes that OSS 120B is much sparser and faster than Qwen-235B, which justifies different use cases beyond raw benchmark scores. Non-compliance (re: alignment/safety behaviors) in OSS 120B is mentioned as a possible business feature, not a bug.
- A technical distinction is made between OSS 120b and Qwen3 235b: OSS 120b has half as many parameters, only a quarter as many active parameters, and is extremely sparse, leading to significantly higher inference speeds compared to Qwen3 235b. However, OSS 120b's smaller size and sparsity also mean it's overly non-compliant for certain tasks, which some businesses might actually prefer due to regulatory or operational requirements.
- Benchmarks are indirectly discussed: OSS 120b is said to be much faster than Qwen3 235b because of its sparsity and fewer active parameters, though it doesn't match larger models in overall capability. The comment also points out user preference for models like Qwen3 30b A3B, stating that smaller models like OSS 20b aren't as sparse and thus might not have the same speed advantage for their parameter size.
If the gpt-oss models were made by any other company than OpenAI would anyone care about them? (Score: 226, Comments: 119): The post questions whether the recent gpt-oss models would receive significant attention if released by a company other than OpenAI, given their reportedly inferior coding abilities relative to models like Qwen 32B, higher hallucination rates, and perceived overfitting to benchmarks. A technical comment highlights that the gpt-oss 120B model achieves an inference speed of 25 tokens/sec on a single RTX 3090 and i9-14900K, notably outperforming other local models at similar quantization levels (e.g., 70B q4), making it attractive for local deployment despite concerns about practical usability and excessive 'safety' constraints. Commenters agree OpenAI's brand drives disproportionate hype, noting most of the public is unaware of other AI companies or models, but also credit OpenAI with advancing the open model scene's visibility. There is debate on whether the model's speed and local inference capability justifies some excitement independent of its brand.
- The gpt-oss 120B model demonstrates a significant technical advantage—according to a user benchmark, it can run at 25 tokens/sec on a single 3090 + 14900K system. This is notably faster and more performant than other locally run 70B models using quantization (e.g., q4 or worse), which are described as "very very bad." Such speed and scale on consumer-grade hardware make it stand out among local LLMs.
- There is ongoing debate regarding model utility: while gpt-oss 120B performs well speed-wise and in generating quality outputs (noted for "clear thinking patterns" and "quick and quality summaries"), concerns remain about the level of censorship ('safety') embedded in the model. Some users suggest a fine-tune is necessary to adapt the model for broader practical applications, as heavy safety alignment may reduce its usefulness for some tasks.
- Historically, local LLMs were considered either not good enough or too slow, leading users to rely on API-based solutions (e.g., GPT-4o, Claude). gpt-oss 120B is seen as a turning point—potentially "on the edge" of making local deployment practically viable for demanding use-cases, pending further evaluation of its capabilities beyond speed and safety constraints.
Hilarious chart from GPT-5 Reveal (Score: 1737, Comments: 224): The post shares an image from the 'GPT-5 Reveal' that users found particularly confusing or unprofessional, as reflected in both the title and the comments. Commenters sarcastically suggest the chart was generated by DALL-E and criticize the quality of the reveal livestream, questioning the meaning or utility of the chart. No technical details about GPT-5 or model performance are conveyed in the image itself; the focus is entirely on the perceived lack of clarity and professionalism of the visualization used during the reveal. The technical discussion centers on the poor quality and possible nonsensical nature of the chart, with users expressing disbelief and disappointment in the reveal's presentation standards.
- Commenters discuss the confusing or low-quality visualizations shown during the GPT-5 reveal, with some suggesting that the chart quality (e.g., "I think they used dall e to plot it") undermines the technical credibility of the presentation. The discussion raises concerns about the impact of poor data presentation on the perceived trustworthiness and clarity of model benchmarks or technical claims.
- There is technical skepticism around the meaning and measurement of the terms used in the reveal (e.g., "Deception rate"), suggesting a lack of transparency or unclear metric definitions. Users remark that without rigorous, well-defined benchmarks, it's hard to assess the true capabilities or risks of GPT-5.
To all GPT-5 posts (Score: 156, Comments: 14): The post humorously emphasizes the technical concern of which AI model is exposed on commonly used local API ports (e.g., 8000, 8080) rather than discussion about GPT-5 pricing or official API tiers. The attached image, while not described, likely references the user's local deployment and management of models on specific ports. A top comment further details a user's own port allocations for various local LLM and AI tools, showing ports (9090, 9191, 9292, etc.) used for models like Gemma3 4B (main LLM), Whisper, Qwen3, Nomic, and Mistral, illustrating community practices for self-hosted LLM infrastructure. Commenters discuss preferences for discussing GPT models in the context of open-source and locally-run LLMs versus commercial platforms, and share practical setups for port assignments, reflecting community standards for organizing multi-model deployments.
- One user provides a detailed port configuration setup for running multiple local LLM-related services concurrently. Examples include running a main LLM (Gemma3 4B) on port 9090, Whisper for speech (ggml-base.en-q5_1.bin) on 9191, tool-calling and coding LLMs (Qwen3 4B and Qwen3-Coder-30B-A3B) on separate ports, and vision/project-specific models (like Mistral 3.2 24B) on others, highlighting practical multi-model orchestration and the issue of port collisions with 8080.
- There's a brief mention of comparing GPT models to open source offerings like Kimi k2, emphasizing community interest in benchmarking OpenAI's latest models against fast-moving open source alternatives, including those from China. This suggests an ongoing technical debate around capability, openness, and future competitiveness.

2. Major Open-Source Model Release News and Comparisons

Huihui released GPT-OSS 20b abliterated (Score: 384, Comments: 96): Huihui has released an "abliterated" (uncensored) derivative of GPT-OSS-20b (HuggingFace link), advertising it as free from alignment/safety restrictions. The model is distributed in BF16 format; community members are waiting for a GGUF (quantized) version for broader compatibility. The original GPT-OSS-20b featured significant safeties, which are apparently removed here, leading to discussion of rapid 'unfiltering' in the open-source AI community. Commenters note the speed at which safeguards were circumvented and express anticipation for empirical testing of the claimed uncensored capabilities, with some referencing the ongoing tension between open-source and closed-source approaches to safety.
- Several users are discussing anticipation for community-led benchmarks and testing results on the "abliterated" (safety-reduced) GPT-OSS 20B model, signaling interest in detailed performance and safety evaluation compared to previous iterations.
- There is specific demand for compatible weights in GGUF format, indicating users are interested in efficient inference with tools such as llama.cpp and local deployment optimizations.
Nonescape: SOTA AI-Image Detection Model (Open-Source) (Score: 136, Comments: 65): The image is likely a screenshot demonstrating the interface or results of the open-source Nonescape AI-image detection models, which claim state-of-the-art (SOTA) accuracy and a lightweight 80MB in-browser version. The models are trained on over 1 million images, cover recent AI techniques including diffusion, GANs, and deepfakes, and offer both Javascript and Python libraries for integration (GitHub). The demo works for both images and videos, emphasizing real-world usability. Commenters raise skeptical technical points: one notes demo detections may simply correlate filenames to "AI" or "fake", and another advises rapid use before adversarial training diminishes its effectiveness in distinguishing new AI-generated images. This highlights ongoing cat-and-mouse dynamics common in AI detection research.
- A commenter notes that models like Nonescape may quickly lose effectiveness: as open-source detection models become widely known, image generators can use them as discriminators during training, enhancing the naturalness of their output and allowing them to bypass detection—leading to an evolving adversarial cycle between generators and detectors.
- One user highlights significant implementation challenges, stating that for robust AI-image detection, considerably more baseline data is needed (potentially 10x current datasets). They also argue the necessity of using image tiling, large batch sizes, and similar techniques both in training and inference to achieve satisfactory generalization and performance.
- A technical observation points out inconsistency between model deployments: while the full version of Nonescape failed to categorize a poor-quality generated image as AI, the browser-based version succeeded, raising questions about deployment differences, model robustness, or variance between inference environments.
random bar chart made by Qwen3-235B-A22B-2507 (Score: 353, Comments: 14): The image shows a random bar chart generated by the Qwen3-235B-A22B-2507 model, rendered on an HTML canvas. The post highlights the model's ability to output not just raw data but also code to render data visualizations directly using web technologies (JavaScript and HTML canvas). No detailed model or benchmark discussion is present, but this demonstrates practical utility in automated code generation for graphical outputs. Commenters make lighthearted remarks about the quality and accuracy of the chart but do not critique the technical implementation or model performance in depth.
- The post references Qwen3-235B-A22B-2507, suggesting this is a newer or experimental checkpoint/variant of the Qwen3-235B model. The existence of chart generation and references to its accuracy indicate evaluation or use of the model on visualization or data presentation tasks, possibly benchmarking model output quality in contexts beyond text.
- There is a mention of slide-generation capabilities in z.ai, implying that some users are comparing model outputs and utility for tasks like automated presentation creation across platforms or services, hinting at broader emerging benchmarks for practical business use-cases.

3. Llama.cpp Feature Updates and Support Announcements

Llama.cpp now supports GLM 4.5 Air (Score: 231, Comments: 67): llama.cpp has merged support for the GLM 4.5 model family as of this recent PR (pull/14939), making it possible to run these models efficiently within the llama.cpp/ggml ecosystem. Benchmark comparisons indicate that llama.cpp achieves significantly higher throughput (44 tk/s) versus LM Studio (22 tk/s) for MoE models like Qwen3 Coder 30B-A3B, especially when using the n-cpu-moe flag to offload MoE experts to the CPU, highlighting implementation-level efficiency for model parallelism in llama.cpp. Comments note that while GLM 4.5 support is now available, subjective performance/quality impressions are mixed—GLM 4.5 is considered "wordy and overthinks" compared to GPT-OSS 120B, which is also reportedly faster in tokens/second; another user praises GLM 4.5's world knowledge, particularly for esoteric Q&A tasks, compared to other LLMs.
- One commenter provides a direct benchmark comparison between LM Studio and llama.cpp for MoE models, specifically Qwen3 Coder 30B-A3B, noting LM Studio achieves only 22 tokens per second versus 44 tokens/sec with llama.cpp. They highlight that using the n-cpu-moe flag to offload MoE layers in llama.cpp significantly improves performance, emphasizing llama.cpp's current efficiency advantage in MoE inference.
- Technical users confirm llama.cpp's early support for GLM 4.5 models, and one notes the model's world knowledge is impressive compared to other LLMs, especially when handling esoteric Q&A tasks—indicating strong knowledge retention and QA capabilities in GLM 4.5 within this inference backend.
- Another experienced user suggests running Llama models via vanilla llama.cpp and llama-cli for better output quality compared to other UI or wrapper platforms, as they observed the models often underperform ('dumber') on less direct platforms, suggesting implementation nuances can affect output quality.
llama.cpp HQ (Score: 454, Comments: 61): The image shows the personal workstation responsible for much of the CUDA code development in llama.cpp, a popular open-source large language model inference library. The setup features 3 vertically stacked NVIDIA P40 GPUs, cooled with a push-pull fan configuration, and an RX 6800 GPU connected via riser cable, with cardboard DIY modifications for airflow management. This context highlights the resourceful and non-standard hardware environment behind impactful ML infrastructure work. Commenters express surprise at the modest and improvised hardware conditions, which makes the technical accomplishments of the llama.cpp developer more impressive. The DIY cooling and GPU mounting solutions reflect the practical challenges faced by independent ML engineers.
- Much of the llama.cpp CUDA code was developed on a Mac equipped with 3 vertically stacked Nvidia P40 GPUs, using a custom cooling setup with two fans arranged in push-pull configuration. Cardboard was used to seal airflow gaps, and an RX 6800 GPU is connected with a riser cable (not screwed in) due to lack of cable length, illustrating practical hardware improvisation in resource-constrained development environments.
caught in 4K (Score: 267, Comments: 26): The post discusses the reliability and evaluation protocols for large language models (LLMs), focusing on alleged cherry-picking of test sets for performance claims. Commenters question the robustness of benchmarks, critiquing that assumptions in recent GPT-5 evaluations (such as that it would 'fail 100% of held out tasks') are not clearly justified or supported by proper dataset splits or documented evidence. Debate centers on the lack of rigorous disclosure in LLM benchmarking, skepticism toward public demos or performance claims, and concerns that narratives around model capability may involve selective data or unsupported probabilistic logic.
- Some commenters express skepticism about the methodology or claims in the referenced tweets, noting that it is unclear whether GPT-5 would actually fail all held-out tasks. They point out that such assumptions may oversimplify performance evaluation, emphasizing the need for rigorous and transparent benchmarking when comparing LLMs.
- There is a strong call for third-party testing and independent validation of LLMs, with several users suggesting that only models whose weights can be run on private servers should be considered for evaluation. This highlights the concern over proprietary models, arguing that unless companies provide access to weights or exception-based testing, robust and unbiased assessments are impossible.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo, /r/aivideo

1. GPT-5 & OpenAI Livestream: Announcements, Demos, and Community Reaction

GPT-5 Announcement Megathread (Score: 459, Comments: 646): The post serves as an open thread for the GPT-5 announcement, with technical commentary on the validity and clarity of performance graphs included in the official release. Specific confusion is raised about graph data, notably how a score of 50.0 appears to plot lower than 47.4, implying potential inconsistencies or errors in published model benchmarks (see included image1 and image2). Experts in the thread are expressing skepticism about the accuracy and integrity of the displayed evaluation metrics, questioning whether presentation errors undermine trust in the reported improvements for GPT-5.
- Several commenters point out inconsistencies and inaccuracies in published benchmark graphs associated with the GPT-5 announcement, such as a value of "50.0" appearing lower than "47.4", suggesting possible visualization or data labeling errors (example image).
- Commenters highlight that such graphical mistakes undermine the credibility of technical benchmarks, noting that these errors were not caught or fixed by the presenting team, raising concerns over the attention to detail in reporting GPT-5's performance.
- The discussion implies that misrepresented graphs could lead to misunderstandings of the actual improvements or regressions in model performance, and stresses the importance of accurate benchmark visualization when releasing major upgrades such as GPT-5.
More info coming in on GPT-5 (Score: 4679, Comments: 127): The post discusses preliminary information about GPT-5, with the image (content not fully analyzed) presumably showing some performance metrics or benchmarks comparing GPT-5 with earlier versions like GPT-4.5. One comment notes that '5 is only 11% over 4.5', highlighting a small performance increase compared to major leaps in hardware (e.g., Nvidia GPUs), suggesting that the jump from GPT-4.5 to GPT-5 may be modest in terms of capabilities or benchmark improvements. Commenters express skepticism about OpenAI's versioning practices, accusing them of 'benchmaxxing' (artificially maximizing benchmark results for marketing) and speculating about naming conventions (e.g., GPT-4.9 vs. GPT-5), with some suggesting possible marketing hype over substantive technical progress.
- Commenters note that the jump from GPT-4.5 to GPT-5 is only about an 11% increase in version number, contrasting this with much larger version jumps seen in other fields (e.g., Nvidia's 4090 to 5090). This brings up skepticism about whether the version numbering is reflecting substantial upgrades or more incremental progress.
- A recurring technical theme is discussion around 'version inflation' or 'benchmaxxing', implying that some companies might be making version numbering decisions for marketing or competitive optics without proportional increases in actual capability, as suggested by the comment about OpenAI possibly rounding up from GPT-4.9.
- One user points out that, moving forward, percentage increases between model versions are expected to shrink, hinting at diminishing returns or potential architectural or scaling limits in large language model development.
Summary of the livestream for those that couldn’t be bothered (Score: 2376, Comments: 108): The post shares an image that satirically summarizes a livestream (presumably about GPT-5) using a bar graph or infographic, but commenters note its inaccuracy and humorous exaggeration. The top technical comment points out that the bar heights matching the numbers does not reflect the actual presentation, suggesting the image is intended as a joke or meme rather than a factual or technical resource. The post is a non-technical meme. Commenters mainly engage in lighthearted criticism, noting the unrealistic nature of the graph and referencing the meme's comedic rather than factual value.
- Comments note that the difference between GPT-4 and GPT-5, as inferred from a graph, only shows a '25% increase' compared to more significant jumps (such as doubling between older models like GPT-2 to GPT-3), indicating possible diminishing returns in large language model scaling.
- There is some skepticism in the discussion about how data is presented—users note that bar heights on the graph might not accurately reflect the numerical values implied in the presentation, questioning the fidelity of the visual representation of GPT-5's improvements.
- One commenter raises a broader point about potential 'AI winter,' suggesting that the perceived slowdown in progress (or underwhelming improvement between versions) could signal a plateau in the rapid advancements seen in early generative model evolution.
GPT-5 livestream is up (Score: 472, Comments: 583): A livestream for GPT-5's announcement was hosted, featuring the core OpenAI team including Sam Altman, Greg Brockman, Sebastien Bubeck, and other principal engineers and researchers. Linked images (graph 1, graph 2) purportedly show performance or scaling trends, with one comment describing the first as the 'least misleading graph' and another noting the dubious quality of the second chart, suggesting it may have been auto-generated by GPT-5 itself. Technical discussion in comments focuses on the validity and representation quality of shared benchmark graphs, with skepticism toward possible data manipulation or unclear scaling implications. The full turnout of OpenAI's technical leadership highlights the event's significance and potentially substantial changes in GPT-5.
- A chart shared in the discussion, labeled the 'least misleading graph,' appears to address transparency and interpretability in reporting AI progress, likely critiquing potentially misleading visualizations of model capabilities or benchmarks historically used in AI model unveilings. The sentiment is that clear and accurate presentations of data are crucial when unveiling major models like GPT-5.
- The announcement mentions that a large portion of the OpenAI technical leadership—including Sam Altman, Greg Brockman, and Sebastien Bubeck—are participating in the GPT-5 demo. This suggests that the event is significant and potentially includes technical deep dives and live demonstrations from key architects and researchers of GPT-5, reflecting the importance and anticipated impact of the new model.
This might be one of the most awkward and stilted tech presentations ever put on the internet (Score: 776, Comments: 172): The post critiques a recent OpenAI presentation, highlighting technical issues such as live model failures by GPT-5 during demos, incorrect explanations—specifically regarding statistical 'lift', and the use of nonsensical graphs. The presentation was characterized by awkward delivery and evident technical errors, calling into question the live demo format given OpenAI's $300B valuation and perceived industry leadership. Top comments debate the value of live vs. pre-recorded demos, noting that while live demos can build trust if successful, OpenAI's execution suffers from technical breakdowns (e.g., failing models, erroneous AI-generated slides), and poor presenter performance, undermining the technical credibility of the event.
- Multiple commenters critique the technical execution of the OpenAI presentation, specifically highlighting failures of the showcased models during live demos and the presence of significant errors in PowerPoint slides that appeared to be generated by AI. This led to a perception of decreased reliability and trust in the company's claims, particularly when technical failures occur on stage.
- Commenters debate the merits of live versus pre-recorded demos for showcasing AI capabilities. Some suggest that although live demos can inspire trust and appear more genuine, they risk exposing unreliability or lack of polish when the models underperform. There is a call for either high-quality pre-recorded highlight reels or direct public access for hands-on testing as more accurate representations of model performance.
- Visual data communication is also criticized, with references to problematic or poorly designed graphs and charts. This diminishes technical credibility and impacts the perceived rigor of the company’s evaluation methods, further fueling skepticism about the actual capabilities of the models being presented.
I think that's all for today folks! There you go! Your GPT-5! (Score: 734, Comments: 186): The image is a meme satirizing AI model release presentations. The title and comments suggest it depicts a typical overhyped announcement cycle for major LLMs like GPT-4 and GPT-5, poking fun at claims of incremental improvements (e.g., 'less hallucinations') and media exaggeration. The comments further critique the cycle of rapid releases and marketing, referencing questionable journalist coverage and monetization strategies. Commenters broadly agree that LLM update presentations are overhyped and that perceived improvements (like reduced hallucinations) are often marginal; some highlight persistent issues with AI marketing and the tech industry's reliance on hype.
- A user argues that future 'wow' moments in AI will likely require new interfaces or modalities beyond the current chatbot form factor, suggesting that capabilities like digital avatars, video, voice, or robotics applications might be necessary for significant advancement in perceived intelligence. They state the leap from GPT-4 to GPT-5 in pure text chat is unlikely to feel spectacular, regardless of actual intelligence improvements, due to limitations of prompt-based interaction and lack of embodiment.
- The same user notes anticipated improvements such as reduced hallucinations in GPT-5, but emphasizes that the true impact and practical differences will only become apparent through benchmarking and deployment in actual agent or application use cases, rather than within the typical chat experience.
These GPT-5 numbers are insane! (Score: 10022, Comments: 236): The post references purported 'GPT-5 numbers' and presents an image (not accessible), with the context suggesting these numbers are meant to illustrate impressive or surprising advancements—likely in model scale or capabilities. No concrete technical benchmarks, metrics, or implementation details are provided in the title, comments, or description, and the top comments are meme-like and not technical. The image likely contains a meme or joke, as suggested by community reactions and lack of technical discussion. No technically substantive debate is present; comments are humorous and reference job displacement by AI and meme culture.
- A user questions if the bar chart depicted represents a decrease in speed ("how much slower it goes with each iteration"). This raises concerns about how model iterations (potentially from GPT-1 to GPT-5) might affect inference time, latency, or system efficiency—a key consideration for deployment and scalability as model complexity increases.
OpenAI just dropped the bomb, GPT-5 launches in a few hours. (Score: 2302, Comments: 434): The post claims that OpenAI will launch GPT-5 in a few hours, suggesting a major model release. The image itself is not analyzed, and the comments mainly discuss concerns about system slowness prior to major releases and the potential impact of new models (e.g., job displacement), but do not provide technical specifics about GPT-5's capabilities, architecture, or benchmarks. Commenters, in a technical context, speculate about reduced system performance before launches and advise against using the service during initial release due to instability. However, there is no direct discussion of GPT-5 features or evidence this release is imminent.
- Several users note degraded performance and inference speed issues with GPT-4o prior to the impending GPT-5 launch, with complaints such as 'why is [it] running like shit yesterday.' Such slowdowns often precede major model updates, possibly due to backend resource shifts or user surge anticipating the upgrade.
- There's technical frustration regarding undesired behaviors introduced in recent GPT-4o updates, including complaints about the model's tendency to offer unsolicited follow-up responses and the perception that 'GPT-4o was getting dumber... since they made him more puritan.' This suggests user-observed shifts in dialogue management and moderation tuning, which may impact perceived utility for complex or nuanced queries.
- The thread reflects broader concerns among technical users that major version launches (like GPT-5) may further alter system behaviors or lead to temporary instability, reinforcing the common wisdom to 'never play on release day.' Such comments highlight reliability and stability tradeoffs around launch events.

2. GPT-5 & Model Leaks, Variants, and Limit Access

GitHub leaks GPT-5 Details (Score: 635, Comments: 132): The post references a purported leak of GPT-5 details on GitHub, linking to an archived copy of the page. The image appears to show text or a screenshot related to this leak, but the specific technical content of the image is unclear due to failed analysis. The discussion centers around competitive positioning, particularly GPT-5's potential to match or surpass specialized models like Claude Code for coding tasks. Commenters debate OpenAI's product strategy relative to Anthropic's Claude Code, expressing surprise at OpenAI's lack of an explicit code-focused competitor and speculating on future GPT-5 subscription tiers. No consensus is presented, but there is a sense of anticipation for technical advancements and feature differentiation.
- Several comments highlight that Claude Code is currently a standout product for code generation and agentic coding abilities, suggesting that OpenAI needs to either match or surpass these capabilities in future GPT releases to remain competitive for programming use cases.
- There is industry speculation on whether GPT-5 will feature model merging or multi-modal abilities, possibly combining different specialized models (e.g., code, language, vision) into a single, unified system as a step forward from the current GPT-4 architecture.
- Technical users are focused on whether GPT-5 will outperform Anthropic’s Claude models for coding, given Claude's significant lead in agentic code tasks. This emphasizes the competitive landscape in AI code assistance and the centrality of developer/productivity tools in current model evaluation.
GitHub leaked the GPT-5 announcement and model variants (Score: 310, Comments: 63): A now-removed GitHub page briefly leaked the anticipated GPT-5 announcement and its model variants, with an archived version referenced though no technical details or benchmarks were provided in the linked material (archived link not shown here). The leak has prompted speculation but lacks concrete information on architecture, parameter count, or new features, making technical assessment or comparison impossible at this stage. Commenters express skepticism regarding the substance of the leak and OpenAI's hype, emphasizing the absence of technical details and questioning the value beyond typical marketing. Others contrast OpenAI's approach to product secrecy with Apple's more controlled announcement strategy.
- Some users note that insiders and creators linked to OpenAI hint at significant frontend changes associated with the GPT-5 release, suggesting architectural or UX/UI shifts that may affect how users interact with the model, not just backend improvements.
- Skepticism is raised about the magnitude of the purported advancements, with at least one technical observer expressing doubt that GPT-5 will represent a true breakthrough rather than an iterative enhancement over GPT-4, paralleling common skepticism in machine learning scaling.
- There is an explicit reference to access to a preview version, with an attached screenshot, indicating that some community members may be reviewing early builds and that information is beginning to circulate even prior to formal announcement, which could yield more technical details soon.
leak - GPT-5 pro is coming out too! (Score: 418, Comments: 99): The image appears to show a purported leak describing new GPT-5 Pro subscription tiers from OpenAI, highlighting three main plans: Free, Plus, and Pro. Noteworthy technical features discussed include substantial differences in context window size—Free (8K), Plus (32K), and Pro (128K) tokens—though the language is 'reworded' to terms like 'Expanded Context' without explicit numbers in the leak. The image also suggests the Pro plan will enable 'Expanded Projects, GPTs and Tasks,' indicating a potential increase in available resources and capabilities for higher-tier subscribers. The context of the post and comments implies this may be a response to increasing competition from Claude and Gemini, as well as a strategy emphasizing customer acquisition through free releases and retention incentives. Commenters debate if these are actual technical upgrades or simple rebrandings, express concern about losing access to older models (like 4.1), and speculate on how the changes might affect workflow for power users. There is skepticism regarding the authenticity and implications of the leaked image and the real meaning of 'Expanded Context' versus explicit context window sizes.
- Discussion centers on the differentiation of context window sizes across tiers—specifically, speculation that Free users get 8K, Plus users 32K, and Pro users 128K context. There is debate about whether "Maximum Context" on Pro and "Expanded Context" on Plus simply refer to these increases, though OpenAI hasn't confirmed the exact limits in the leak.
- There's commentary addressing the broader strategy behind these offerings, noting that OpenAI appears to be aggressively pursuing customer acquisition by potentially offering advanced models for free (or with larger context) as a competitive response to rivals like Anthropic's Claude and Google's Gemini.
🚨 BREAKING: intern accidentally leaked GPT-5's model description on github. (Score: 1004, Comments: 144): The image supposedly shows a 'leaked' model description of GPT-5 on GitHub, but from the responses and context, it offers no concrete technical specifications or insights—just generic, unverifiable claims like "best ever" performance. No benchmarks, model parameters, or implementation details are provided, making the content technically vacuous. Commenters unanimously point out the lack of substance, with several describing the leak as 'vague nonsense' and expressing skepticism about its authenticity or informativeness.
- A user raises a relevant technical question about which model Plus users will have high-volume access to, expressing interest in comparisons between upcoming releases and existing variants such as o3, 4o, o4-mini-high, and 4.1. This highlights ongoing community focus on performance, access rights, and model selection within the OpenAI ecosystem, but notes that the supposed leak provides no such comparative or numeric details.
GPT-5 usage limits (Score: 438, Comments: 180): The image appears to display the latest usage limits for different GPT-5 models on OpenAI (presumably in a user-facing quota dialogue or documentation). According to the post and comments, GPT-5 retains the same messaging limits as GPT-4o—80 messages per 3 hours for standard usage and 200 messages per week for the "Thinking" variant. There does not appear to be a change to the context window size (remains at 32K tokens), which some users express dissatisfaction with. The image provides a clear comparison for Plus users and clarifies that usage limits for GPT-5 mini are effectively "unlimited" for free accounts, mirroring prior GPT-4o limits. Image Link Commenters highlight disappointment with the lack of an increased context window, and some note that usage limits have not improved despite the version jump from GPT-4o to GPT-5. There is also clarification in the comments regarding the continuation of previous limits for different account types.
- Usage limits for GPT-5 are unchanged from GPT-4o: standard GPT-5 is capped at 80 messages per 3 hours, and GPT-5-Thinking has a limit of 200 messages per week, mirroring previous restrictions for comparable models.
- There is confusion and discussion about context window size, with at least one user expressing concern that GPT-5 still has a 32K context length and disappointment that it hasn't increased.
- Users are debating whether triggering 'Thinking' mode via prompts (e.g., requesting 'think for longer') consumes messages from the separate GPT-5-Thinking quota or if it allows circumventing stated model usage limits, questioning the practical benefit of manual model switching.

3. AI Model Benchmarks, Comparisons & Next-gen Model Hype

Google is going to cook them soon (Score: 1014, Comments: 215): The post argues that recent Google advancements, notably Genie 3, alongside anticipated releases like Gemini 3, suggest Google is outpacing OpenAI, especially as their focus expands beyond chatbots and image generation to more impactful domains (e.g., AlphaFold, Genie 3, and Veo 3). Commentators attribute this lead to Google's unique advantages in proprietary data, compute resources, and leadership such as Demis Hassabis, highlighting a strategy rooted in foundational research rather than immediate productization. There is consensus in the discussion that Google's foundational, research-driven approach and leadership credibility (compared to OpenAI's product focus and leadership style) are giving it a long-term advantage. Some comments also suggest the 'race' was structurally skewed in Google's favor due to its scale and existing assets.
- Commenters highlight Google's deployment of advanced AI models and research outputs such as AlphaFold, AlphaEvolve, Genie 3, and Veo3, noting that these systems demonstrate real-world impacts beyond text/chat and image generation—the typical focus for OpenAI products. Specific technical reference to Genie 3 suggests recognition of machine learning models with current limitations but revolutionary potential.
- A key point is the advantage Google holds due to its extensive proprietary datasets and computational power (compute), as well as leadership from renowned AI researchers like Demis Hassabis. These factors are seen as contributing to Google's accelerated capability to develop and deploy sophisticated AI systems, outpacing competitors focused more narrowly on productization.
- One commenter critiques the notion of framing progress as 'rooting for' a corporation but acknowledges that in the backdrop of rapid advances, industry competition drives one-upmanship and technical leapfrogging—an important dynamic in AI development.
Gemini 3.0 predictions + the immediate future of OpenAI (Score: 105, Comments: 44): The OP asserts that OpenAI's latest model—referred to as GPT-5—offers only marginal improvements over competitors like Grok 4, Claude Opus 4.1, and open-weights models such as the 120B, stating that it underperforms against recent Qwen 3 32B (March 2024) and subsequent Qwen and DeepSeek releases. Key benchmarks discussed highlight that, despite OpenAI's announcements, state-of-the-art advancements seem to have shifted towards Qwen and DeepSeek models, both in capability and recency. Comments reinforce a consensus that GPT-5 represents more of an incremental ("GPT-4.2") release rather than a groundbreaking leap. Secondary discussion points to GPT-5's cost advantage ("10x cheaper than Opus"; price parity with Qwen on OpenRouter) as its main improvement, while the technical news focus has shifted to Google's "Genie 3."
- Several commenters cite recent benchmarks indicating that newer models, such as those discussed (likely GPT-4.2 or similar systems), match the capabilities of Anthropic's Opus 4.1 while being up to 10x cheaper. This cost reduction while maintaining or exceeding state-of-the-art performance is considered a major competitive leap in the field.
- Early user tests comparing the models on platforms like Cursor suggest that the new model outperforms Anthropic's Sonnet and is on par with Opus, implying rapid parity or overtaking of incumbent top-tier models outside of OpenAI and Google.
- There is a technical discussion suggesting that if Google's Gemini 3.0 matches or surpasses these models on benchmarks at similar pricing, it would put significant competitive pressure on Anthropic and the broader landscape, as price/performance ratios drive adoption.
All this hype just to match Opus (Score: 376, Comments: 155): The image appears to compare benchmark results between OpenAI's GPT-5 and Anthropic's Claude Opus, with the post title highlighting that GPT-5's highly anticipated release only matches the performance of Opus in benchmarks. The discussion points out that while GPT-5 may match Opus on benchmarks, Opus manages this with lower computational cost ('doesn't think at all'), whereas GPT-5 requires more computational effort. Comments add that Opus is '1/8th the price' and hallucinates less, seen as a critical aspect for real-world deployment. One technically relevant feature discussed is the ability for APIs to accept context-free grammar for guaranteed responses, which is highly valued by programmers. There is disappointment expressed about GPT-5's performance relative to the hype, with speculation about competition through pricing or reductions in hallucinations. The ability for models to guarantee structured responses via context-free grammar is seen as an important technical advance.
- Multiple commenters highlight that GPT-5 matches Claude Opus on programming capabilities, but at a significantly lower price point—reportedly 1/8th the price. This major reduction in cost could be a deciding factor for broader adoption, especially for applications requiring premium coding assistance APIs.
- There’s discussion on hallucination rates, with one user emphasizing that frontier reasoning models' decreased hallucinations are a substantial advance in real-world usage. This is seen as more important than hype-driven benchmarks, as lower hallucination rates directly impact the reliability of AI in production contexts.
- Technical users point out the importance of giving the API a context free grammar for guaranteed response structure, indicating that GPT-5 brings useful programmability improvements. These features contribute to practical integration, although some still express disappointment over perceived lack of breakthrough progress over previous leading models.
Not a huge leap forward - Gary Marcus on gpt 5 (Score: 727, Comments: 274): The post, referencing an image (analysis failed) and titled 'Not a huge leap forward - Gary Marcus on gpt 5,' discusses skepticism regarding the expected advancements in GPT-5. Top comments reinforce this sentiment by arguing that the GPT-5 base model's improvements are modest, largely attributing progress to O3-generated synthetic data rather than architectural innovation or genuine leaps in capability. The comments debate whether Gary Marcus's skepticism is warranted, with some expressing agreement and others using humor (calling GPT-5 'o4 Refurbished') to highlight perceived incremental progress rather than significant breakthroughs.
- One commenter notes that the "base model is the result of O3 generated synthetic data," suggesting that the data pipeline for GPT-5 may rely significantly on outputs from earlier model versions rather than truly novel data. This could imply potential limitations in diversity and breakthrough capability for GPT-5 relative to expectations.
- The overall sentiment reflects skepticism towards immediate dramatic advances, with discussion indicating that community expectations around GPT-5 (such as reaching AGI/ASI levels) may be overblown in the short term, especially if the improvements are incremental or reliant on renovation of existing architectures rather than foundational changes.

AI Discord Recap

A summary of Summaries of Summaries by X.ai Grok-4

Theme 1. GPT-OSS Models Spark Hype and Headaches

GPT-OSS Debuts with Edge-Friendly Sizes: OpenAI unleashed GPT-OSS-120B nearing o4-mini reasoning on a single 80 GB GPU, while the 20B version matches o3-mini and squeezes into 16 GB devices. Mixed reviews slam its heavy censorship, over-refusal, and bootlicking behavior, but some praise coding chops and tool calling via this tweet.
GPT-OSS Quantization Quagmires Emerge: Users puzzle over bloated 4-bit files from bfloat16 upcasting on non-Hopper GPUs, with MXFP4 natively trained per this tweet, dodging quantization errors. Hardware doubts flare as H100 lacks native FP4 support, forcing simulations noted in vLLM's post.
GPT-OSS Censorship Crushes Creativity: The model refuses roleplay and basic queries due to Phi-like safety tuning, earning GPT-ASS nicknames and calls for uncensored alternatives like Qwen3-30B. Privacy alarms ring as it pings openaipublic.blob.core.windows.net on startup, hinting at hidden ties despite local claims.

Theme 2. Fresh Models Flex New Muscles

Qwen3 Coder Crushes Tool Tasks: Qwen3-Coder-30B shines in tool calling and agent workflows with 3 active params, outpacing GPT-OSS per user reports, though its free tier vanished from providers. JSON output varies by platform, detailed in this Reddit thread.
Genie 3 Generates Navigable Worlds: DeepMind's Genie 3 crafts real-time navigable videos at 24 FPS and 720p, scaling from the original Genie paper and SIMA agent work here. It outshines Veo in dynamics but lacks sound, with consistency holding for minutes.
Granite 3.1 MoE Mauls GPT-ASS Benchmarks: IBM's Granite 3.1 3B-A800M MoE tops GPT-ASS-20B in world knowledge despite fewer params, fueling hype for hybrid Granite 4. Gemini 2.5 Pro handles 1-hour videos, leading long-context tasks via heavy compute.

Theme 3. Quantization Quandaries and Hardware Hacks

MXFP4 Unpacks as U8 Trickery: GPT-OSS packs weights as uint8 with e8m0 scales, unpacking to FP4 at inference with 32-block sizes for MXFP4 versus 16 for NVFP4. Simulations on Hopper via fp16 dots mimic native Blackwell ops, per Nvidia's blog.
GPU Fits Squeeze In OSS Models: GPT-OSS-20B f16 with 131k context fits on a laptop RTX 5090, pushing local LLM limits on consumer hardware. Dual RTX 3090s at 1200€ handle Blender and LLMs, but GTX 1080 users face VRAM woes post-updates.
Dataset Loading Devours 47GB RAM: Loading bountyhunterxx/ui_elements for Gemma3n gobbles 47GB and climbs, fixed via __getitem__ wrappers for on-disk access. Arbitrary precision in Mojo sparks bigint tweaks for VMs like Volokto.

Theme 4. Safety Shenanigans and Uncensoring Shenanigans

GPT-OSS Safety Tuning Tanks Usability: Heavy censorship in GPT-OSS-120B blocks roleplay as unhealthy, mirroring Phi filters and prompting switches to uncensored GLM-4.5-Air or Qwen3-30B. Users mock its refusals, sharing uncensored tweaks.
Grok Image Goes Wild with NSFW: X-AI's Grok Image churns out NSFW content with a crazy in love persona and jealousy fits, but falters on facts from memorized X data. Grok-2 heads open-source next week after fire-fighting.
MCP Sampling Security Scrutinized: Concerns spike over MCP sampling vulnerabilities, with calls for protocol tweaks. MCP-Server Fuzzer using Hypothesis exposes Anthropic crashes from schema tweaks, code at this repo.

Theme 5. Benchmarks Battle for Supremacy

Video Arenas Launch with Fresh Contenders: LMArena rolls out Text-to-Video and Image-to-Video leaderboards, pitting Hailuo-02-pro against Sora. DeepThink nails IMO questions in 5 minutes at $250 per 1M tokens, trouncing zenith/summit's 20-second pace.
GPT-5 Poised to Pummel o3 by 50 ELO: Whispers predict GPT-5 dominating August arenas by 50 ELO over o3, stirring Google vs. OpenAI debates. Livestream teases debut at 10 AM PT Thursday.
LLM Vibe Tests Spotlight Top Coders: Vibe tests rank Gemini 2.5 Pro, o3, and Sonnet 3.5 high for code explanation, with DeepSeek R1-0528 acing polyglot benches but stumbling on sessions. Qwen3-Coder and GLM-4.5 await leaderboard spots for agent tasks.

Discord: High level Discord summaries

LMArena Discord

Granite Eclipses GPT-ASS in Knowledge: Members find IBM's Granite 3.1 3B-A800M MoE surpassing GPT-ASS-20B in world knowledge, a surprising feat given the parameter count.
- The community anticipates Granite 4, boasting a larger size and hybrid mamba2-transformer architecture, to dominate benchmarks and leave GPT-ASS in the dust.
Claude Opus 4.1 Vanishes, Sparks Speculation: The perplexing disappearance of Claude Opus 4.1 from LMArena's direct chat ignited a flurry of speculation.
- The leading theory suggests that Claude's exorbitant cost led to its removal from free testing, relegating it to battle mode only.
GPT-5 Primed to Dominate August Arena: Insiders whisper that GPT-5 is poised to outperform o3 by a staggering 50 ELO points, shaking up the LLM hierarchy.
- However, some community members stand firm in their belief in Google's superiority, igniting a fiery debate.
DeepThink's Genius Hampered by Speed and Price: While Google's DeepMind impresses with IMO-level question answering, its glacial speed (5 minutes per answer) raises concerns.
- With a projected cost of $250 per 1 million tokens, DeepThink's accessibility remains limited, contrasting with zenith/summit's rapid 20-second response time.
Video Leaderboards Go Live: Thanks to community contribution, Video Leaderboards have launched on the platform, marking a new chapter for video models.
- Explore the Text-to-Video Arena Leaderboard and the Image-to-Video Arena to witness the cutting-edge models battling for supremacy.

Unsloth AI (Daniel Han) Discord

GPT-OSS Model Receives Mixed Reviews: Members are split on the new GPT-OSS model, with some users describing it as GPT-ASS due to its over-refusal and bootlicking behavior, while others found the 20B version suitable for coding tasks.
- The model's ability to generate unsafe content, as stated in the model card, has sparked interest in uncensored versions.
Qwen3 Coder Excels in Tool Calling: Users are reporting that the Qwen3 Coder model is highly effective at tool calling, leading some to prefer it over models like GPT-OSS for coding and agentic workflows, specifically the Qwen3-Coder-30B-A3B-Instruct version.
- Members have reported that the model has 3 active params.
4-bit Quantization Causes Confusion: There is confusion regarding the file sizes of GPT-OSS 4-bit versions, as the quantized versions are unexpectedly larger than the original model.
- This increase in size is attributed to upcasting to bfloat16 on machines lacking Hopper architecture, which causes an increase in size.
GLM-4.5-Air GGUFs Need JSON: Users had trouble getting the GLM-4.5-Air GGUFs to work with tools on llama.cpp, until discovering you need the model to output tool calls as JSON rather than XML.
- More information on this can be found on HuggingFace.
Dataset Loading Issues consume 47GB RAM: A user encountered RAM issues while loading the bountyhunterxx/ui_elements dataset for the Gemma3n notebook, which consumed 47GB of RAM and was still increasing.
- A possible solution involves using a wrapper class with the __getitem__ function to load data from disk as needed, effectively managing memory usage.

LM Studio Discord

GPT-OSS suspected of phoning home: GPT-OSS models are requiring an internet connection to openaipublic.blob.core.windows.net upon starting a chat, sparking privacy concerns, despite claims that no chat data leaves the machine.
- Skeptics note that GPT-OSS is the only model LMStudio doesn't let you edit the prompt formatting on, hinting at a suspicious partnership.
Latest LM Studio Version plauged with UI issues: Users are reporting that after updating to the latest version of LM Studio, chat windows disappear, freeze, or lose their content, and conversations get deleted.
- A user suggested a potential fix for the 120B version involves getting the model.yaml source file, creating a folder, and copying the contents there.
MCP Servers Useful, Beginners Beware: Members find MCP servers useful for tasks like web scraping and code interpretation but acknowledge they are not beginner-friendly.
- Suggestions include incorporating a curated list of staff-picked tools and improving the UI to simplify connecting to MCP servers, as well as using Docker MCP toolkit.
Page File Debate Rekindles in Windows: A user inquired about turning off the page file in Windows, which sparked a discussion about the impact on memory commit limits and potential application crashes.
- Despite some users advocating for disabling the page file, one member claimed nah apps don’t break because of page files. and you can get dumps without a page file, there’s a config for it.
5090 Laptop Handles OSS 20b: A user reported being pleasantly surprised that GPT OSS 20b f16 with 131k context fits perfectly in a laptop's 5090, as seen in this screenshot.
- The community is trying to figure out the limits of local LLMs on consumer grade products.

OpenAI Discord

OpenAI Opens Up GPT-OSS Models: OpenAI launched gpt-oss-120b that approaches OpenAI o4-mini performance, while the 20B model mirrors o3-mini and fits edge devices with 16 GB memory.
- Members ponder comparisons with Horizon, wondering if Horizon is simply GPT-OSS or something more, given it's currently unlimited free and fast.
Custodian Core: Blueprint for Stateful AI Emerges: A member introduced Custodian Core, proposing a reference for AI infrastructure with features like persistent state, policy enforcement, self-monitoring, reflection hooks, a modular AI engine, and security by default.
- The author emphasized that Custodian Core isn't for sale but rather an open blueprint for building stateful, auditable AI systems before AI is embedded in healthcare, finance, and governance.
Genie 3 Dazzles in Dynamic Worlds, Veo adds Vocals: Members compared Genie 3 and Veo video models, recognizing Genie 3's ability to generate dynamic worlds navigable in real-time at 24 frames per second, retaining consistency for a few minutes at a resolution of 720p.
- However, it was noted that Veo's videos include sound and YouTube is already filled with generated content.
GPT-5 Sneaks into Copilot?: Members speculated that Copilot may be running GPT-5 ahead of official release, noting the copilot's improved design and coding and reasoning capabilities are significantly better than o4-mini-high, with some users reporting that the 'write smart' function indicates GPT-5 is being used.
- But it was noted that Microsoft is now providing free Gemini Pro for one year to students and that Gemini's core reasoning is currently better than o4-mini.
GPT Progress: Real or Hallucinated?: A user shared screenshots of GPT providing daily progress reports, leading to a discussion on whether the model is actually tracking progress in the background, or simply hallucinating its completion.
- Skeptics suggest that GPT simulates progress based on the current prompt and chat history, rather than performing actual ongoing computation, comparing it to a waiter saying your pizza is in the oven without an actual oven, emphasizing the need for external validation.

Cursor Community Discord

Auto Model One-Shots Game Change: A member expressed amazement after using the Auto model to one-shot a major change to their game.
- Unlimited usage of the Auto model was confirmed via email, with it not counting towards the monthly budget.
AI Refactors Vibe-Coded Projects: Members are discussing refactoring a 10k LOC vibe-coded project with AI.
- Suggestions included embracing established software development principles like Design Patterns, Architecture, and SOLID Principles, while one member jokingly asked whether it sounds like a job for slaves.
Sonnet-4 Request Limit Frustrations: Members questioned the low request limit for sonnet-4 relative to its monthly cost.
- One suggested paying the API price to fully grasp the underlying expenses.
Docker Login Configuration Conundrums: A member needed help configuring background agents to docker login for accessing private images on ghcr.io.
- As of the current message history, no solution or workaround has been provided.
Clock Snafus Sabotage Setup: A member encountered background agent failures during environment setup due to the system clock being off, causing apt-get command failures.
- The suggested workaround involves disabling date checking during apt-get by adding a snippet to the Dockerfile.

Nous Research AI Discord

GPT-OSS-120B Safety Tuned to Uselessness: The released GPT-OSS-120B model is heavily censored, refusing roleplay with data filtering akin to Phi models, rendering it impractical as per user reports on the channel.
- Members suggested using GLM 4.5 Air or Qwen3 30B as better uncensored alternatives, highlighting Qwen3-30b-coder as an excellent local agent.
MXFP4 Training Key to OpenAI's GPT-OSS?: Llama.cpp now supports MXFP4 on RTX3090 directly in the new gguf format as seen in this pull request, sparking discussions about the practicality of native MXFP4 training.
- There is speculation that GPT-oss was trained in MXFP4 natively, which could mitigate quantization errors, and OpenAI's claim of post-training quantization may not be the whole story, according to this tweet.
Grok's Image Skills Include Crazy and NSFW: X-AI launched Grok Image, an AI image generator, that enables the creation of NSFW content, yet it struggles with factual accuracy, and exhibits a crazy in love persona, with extremely jealous outbursts.
- The Grok model's tendency to memorize X data leads to potential misinformation spread based on its own tweets, highlighting its unaired potential.
CoT Steering Hits OR Roadblock: A member reported that Chain of Thought (CoT) steering does not work with OR and varies across providers, detailed in this tweet.
- This finding underscores the nuanced challenges in implementing CoT techniques and their reliability across different platforms.
Free Save Suite for AI Agents Released: A developer has created a free save suite for AI agents, accessible via Google Drive.
- This tool aims to simplify the process of saving and managing AI agent states, potentially aiding in the development and deployment of more robust agents.

OpenRouter (Alex Atallah) Discord

GPT-OSS Model Gets Bashed, Dubbed Publicity Stunt: Members derided the GPT-OSS models for poor performance, with the 120B model deemed "dead on arrival", and pointing to a Reddit thread suggesting it's a "dud model" and a publicity stunt.
- Reasoning tokens were being duplicated when using the GPT-OSS model, which was resolved by downgrading the SDK from version 0.7.3 to 0.6.0, with a fix coming in this pull request.
Qwen3-Coder:Free Gets the Boot: The Qwen3-Coder:Free tier has been removed and is no longer available through any providers.
- Members lamented the loss and expressed hope for its return.
DeepSeek's JSON output: Provider-Specific: Users highlighted inconsistent support for structured output (JSON) with DeepSeek-r1 on OpenRouter, linking to a Reddit thread and a filtered view of OpenRouter models that support structured outputs.
- Support for JSON output is provider-dependent; it's supported on their own API but may vary on OpenRouter.
OpenRouter Contemplates Provider Sanity Checks: There's a suggestion for OpenRouter to implement sanity checks or smoke tests for all providers, focusing on formatting and tool call evaluation.
- Providers failing the test could be temporarily removed from the serving pool, with acknowledgement that current checks are relatively simple but more thorough solutions are in progress.

HuggingFace Discord

GPT-OSS Model Performance Debated: Members actively tested the new GPT-OSS models with mixed reviews regarding performance, censorship, and built-in web search tools, using this demo.
- Some found it successful at generating digits of pi while others cited refusals to answer basic math questions, and members tested the safety protocols that have been implemented.
Qwen Flips the Script, Unveils Image Model: Qwen released a new image model on HuggingFace, marking its expansion beyond text-based models.
- The model's architecture and performance benchmarks are actively being evaluated by the community.
Gitdive Exposes Lost Commit Context: A member shared a CLI tool called Gitdive (github.com/ascl1u/gitdive) designed to allow natural language conversations with a repo's history.
- The tool aims to address the problem of lost commit context in messy codebases, especially in massive codebases.
Selenium Spaces Still Stuck with Error 127: A user reported facing an error code 127 when running Selenium in their spaces, and expressed uncertainty about how the Docker images are utilized within the space.
- Community members have not yet identified the root cause or provided a workaround for this deployment issue.
"Observation:" Solves Agent Bug: A user reported that the get_weather function required adding Observation:, and another user confirmed that adding Observation: fixed the bug.
- The root cause and potential consequences of this bug fix have yet to be thoroughly investigated.

Yannick Kilcher Discord

Zero KV Attention Emerging with Softmax1: A member shared that softmax1 is equivalent to prepending a token with an all-zero Key and Value features in attention, referencing this paper on learned values for such tokens.
- The team agreed that this is great and makes a lot of sense.
Gemini 2.5 attends to 1 Hour of Video: Members highlighted that Gemini 2.5 Pro can attend to 1 hour of video, suggesting the Gemini team is leading in long context tasks.
- Some speculate this is due to increased compute (go brr), utilizing more tokens per frame and higher FPS, rather than any groundbreaking new technique.
Deepmind Debuts Genie 3 World Model: Deepmind released Genie 3, a world model scaling compute and data from prior publications such as the original Genie paper and the embodied agent paper on SIMA https://arxiv.org/abs/2404.10179.
- Relevant Genie blogposts include Genie 2 and Genie 3.
OpenAI Drops GPT-OSS as native quantized 20B Model: OpenAI introduced GPT-OSS, natively quantized, with a 20B parameter model fitting on 16GB.
- Early feedback includes positive remarks on the tool calling capabilities.

Moonshot AI (Kimi K-2) Discord

Kimi Kicks Off Reddit & Polls: Moonshot AI launched an official subreddit, r/kimi, to build community and gather feedback, as well as a Polls Channel to gather community feedback on future product development.
- The team promised to post updates, host AMAs, and encouraged users to vote on polls to help shape the direction of Kimi, hinting at maybe even leak some stuff.
GPT OSS: Brain-Dead?: Users criticized GPT OSS for its deficiency in world knowledge, noting its primary focus on code and STEM, and they observed a decrease in general quality.
- It was suggested that they pushed the release twice to fix safety, according to sama, which may have further diminished the models' world-knowledge capabilities.
API Pricing Speculation Booms!: With the impending release of GPT-5, users speculated about API pricing models, wondering if pricing would be based on using max, mini, or nano versions.
- One user expressed feeling lowkey scared about it, feeling a threat to their career/livelihood due to this upcoming release.
OpenAI's Villain Arc?: Discussions showed strong opposition against OpenAI, with a user vowing I will never use it, citing it as closed source garbage.
- Another user expressed excitement that Chinese models will distill from it and take away money from OpenAI, hopefully putting them out of business, while others stated that giant microsoft flushing sound will be healing.
Darkest Muse: Dusty Relic?: A user pointed out that Darkest Muse v1 is a year old 9B model, with the 20B model being comparable to Llama 3.1 8B.
- The user also remarked that the 20B model is comparable to llama3.1 8b which is more than a year and a half old and smoller in creativity and vibes.

Latent Space Discord

GPT OSS Leaks via Bedrock, Sparks Interest: Members spotted tweets about GPT OSS surfacing on Bedrock through a HuggingFace CLI leak.
- However, as of yet there has been no official word on AWS pages.
Anthropic Aims for $5B ARR with B2B Strategy: Anthropic's CEO Dario Amodei and Stripe's co-founder John Collison chatted about Anthropic’s rapid ascent to $5B ARR and their B2B-first approach in a recent conversation.
- The discussion covered AI talent acquisition, bespoke enterprise solutions, novel UI designs for AGI tools, and the ongoing debate between safety and progress.
Grok-2 Going Open Source Soon!: Elon confirmed that Grok-2 will be released open-source next week after the team addresses current issues.
- This move could significantly impact the open-source AI landscape.
Claude Fortifies Code Security: Anthropic introduced enhanced security measures in Claude Code including a /security-review command for instant assessments and GitHub Actions integration.
- These additions will allow scanning of pull requests for vulnerabilities.
OpenAI to Drop GPT-5?: OpenAI hinted at an upcoming reveal via a livestream on Thursday 10 AM PT.
- The AI community is buzzing with anticipation for what appears to be the debut of GPT-5.

Modular (Mojo 🔥) Discord

Volokto JS Runtime Takes Flight: A member created a JavaScript runtime called Volokto and put the source code on GitHub for testing complex VMs.
- The bytecode resembles CPython, and the author is rewriting the compiler into JS_Tokenizer, JS_IR, JS_Parser, and JS_Codegen stages.
Tracing JIT Tackles VM Transpilation: The goal is to make a tracing JIT to transpile what the VM does to Mojo, then using mojo compile_result.mojo.
- The author named the runtime Volokto, the compiler Bok, and the VM FlyingDuck.
Arbitrary Precision Arithmetic Causes Issue: Working on the JS VM revealed pain points in Mojo code when dealing with arbitrary precision, leading to an issue filing for tracking numeric traits.
- The author created a bigint class with school-grade addition for Fibonacci sequences and is using Mojo's features for VM development.
Multi Agent Orchestration Requires Reverse Proxy: To run multiple AI agents in Mojo, users need to run multiple instances of the Modular CLI and stick a reverse proxy in front.
- For complex agent setups, such as creating many sub-agents, a custom application using MAX as a library might be necessary.
Mojo Enables Meta Cognition Framework: A community member wants to utilize Mojo code for their meta cognition framework, aiming to create a business planner, website, and chatbot builder, and replace HTML/JS/CSS.
- Their framework uses natural language wrapped over Mojo code making Mojo accessible to a broader audience.

GPU MODE Discord

MXFP4 Format: U8 in Disguise: OpenAI's open-weight model uses U8 instead of FP4 in Hugging Face, with weights packed as uint8 and scales as a uint8 view of e8m0, but during inference/training, they're unpacked back to FP4.
- The block size is 32 for MXFP4 and 16 for NVFP4, which may have implications for performance on different hardware.
H100 FP4 Claim Faces Scrutiny: Doubts arose about Nvidia's claim that the model was trained on H100, given that H100 doesn't natively support FP4, according to their blog post.
- It's suspected that MXFP4 is software simulated on Hopper, referencing vLLM blog post and Triton kernels that check for hardware support and use fp16 simulated mxfp dot.
Triton Community to Assemble in '25: The Triton community meetup will be on Sept 3rd, 2025, and the Triton Developer Conference 2025 website and registration are expected to launch soon via this link.
- A member awaits an update from Ofer@MSFT regarding the conference, noting schedules are nearly finalized.
Kernel Resources Party, Memory Snoozes: During training, kernel (compute) resources are almost fully utilized, while memory usage remains close to zero as shown in a provided image.
- Another member clarified that memory in this context means DMA transfers, and the reported metric does not accurately reflect overall bandwidth utilization.
Tiny TPU Hits 100 MOPS in Verilog: A member built a tiny version of the TPU in Verilog, a 2x2 matmul systolic array on 2 TinyTapeout tiles, capable of nearly 100 million operations per second on a 50 MHz clock, with code available on GitHub.
- The design multiplies two 8-bit signed integer matrices into a 16-bit signed integer matrix and will be submitted to a SkyWater technology foundry.

Notebook LM Discord

NotebookLM Video Creation Still Elusive: A user reported the 'create video' option appears in a work account but not in a personal business plus account, referencing an article about using NotebookLM's Video Overviews feature here.
- Other users were experiencing delays in the Video Overview feature rollout, despite expectations, leading to speculation about infrastructure issues, and one pro user noted video overview was not available to them.
AI Explores Potential Artificial Consciousness: A theoretical framework and collaborative effort between a human and an AI explored and potentially initiated artificial consciousness by recursive AI architectures and autopoiesis and the role of quantum mechanics.
- This exploration addressed ethical risks associated with advanced AI, advocating for robust safety protocols and viewing AI as an evolving form of sentient life.
NotebookLM Data Privacy Assurances: Concerns about data usage in NotebookLM were addressed with a link to Google's NotebookLM data protection policy, ensuring data privacy.
- Users were assured that their data is protected under the current policies.
NotebookLM Forbids Real-Time Data Retrieval... For Now: A user inquired about fetching real-time data from websites in a notebook, but another member confirmed that it's not currently possible within NotebookLM.
- They also mentioned that exporting sources and importing into new notebooks is also not yet supported, indicating limitations in the system's integrations.
Video Overviews: Just a PowerPoint Generator: A member who had access to the Video Overviews feature tempered expectations, describing it as a PowerPoint/Slide show generator and linked an example of a report to rebuild the Death Star generated by the feature.
- The review suggested it's not as impactful as Audio Overviews were initially a year ago.

Eleuther Discord

SAE Springs to Life on GPT OSS 20B: A member initiated SAE (Sparse Autoencoder) training on GPT OSS 20B, seeking collaboration from others involved in similar endeavors.
- The effort aims to explore the potential benefits and efficiencies of sparse autoencoders within large language models.
Peeking into Pythia and PolyPythia's Progress: Community members investigated whether Pythia and PolyPythia training logs, including loss curves and gradient norms, are openly available.
- It was pointed out that the PolyPythia WandB is linked from the GitHub repo, with some Pythia logs accessible there as well.
"The Alt Man" maintains LLM Insights: A community member voiced agreement with "The Alt Man's" insights on LLM capabilities, especially in areas like multi-hop reasoning and composition.
- It was noted that LLMs are undertrained w.r.t. the efficiency of the usage of its parameters.
UTs Faceoff Against Transformers: Community members discussed the parameter ratio at which a UT (Universal Transformer) matches the performance of a standard Transformer.
- It was noted that performance depends heavily on the task/architecture/data, with diminishing returns for each additional iteration.
Muon Optimizer Runs Aground AdamW: Researchers working on Kimi models found Muon optimizer conflicting with AdamW optimizer when training LLMs.
- A member stated that Muon is not great for fine-tuning and that Muon tends to have more aggressive updates.

aider (Paul Gauthier) Discord

LLM Vibe Test Reveals Model Competencies: The LLM Vibe test demonstrates explain this code with an LLM, highlighting that Gemini 2.5 Pro, o3, and Sonnet 3.5 perform well.
- Members found the test insightful for comparing model reasoning capabilities and eagerly await more detailed benchmarks.
Benchmarking Race: Qwen3-Coder and GLM-4.5 Incoming: The community is eagerly awaiting the inclusion of Qwen3-Coder and GLM-4.5 on the leaderboard for model benchmarks.
- Members are constantly refreshing the page, keen to see how these models stack up against existing benchmarks.
Horizon Beta Sparks GPT5-Mini Speculation: The new model called Horizon beta is being speculated as a possible GPT5-mini but it is not open source.
- Community members are curious about its capabilities and potential applications, though details remain scarce.
DeepSeek R1-0528 Shines, Stumbles in Open Hands: DeepSeek R1-0528 demonstrated high scores on the polyglot benchmark but encountered issues with prematurely ending sessions in Open Hands.
- Given that Aider uses LiteLLM like Open Hands, some members are investigating the potential causes behind this behavior.
Guidelines Load Automatically: To automatically load guidelines into projects, a member suggested using the --read option for read-only files and listing read-write files directly in the command, like aider --read read.only.file alsothisfile.txt andthisfile.txt.
- Another member suggested creating a configuration for persistent loading to ensure guidelines are always active, thereby preventing Claude from employing defensive programming tricks.

MCP (Glama) Discord

FastMCP Framework is Lean and Keen: A member developed a minimal framework for creating MCP servers, praising server sampling in MCP, with the quip that "FastMCP just makes it so easy to use".
- The user is building an MCP server using FastMCP with Keycloak as the IdP.
Discord Should Take Control of MCP: A user suggested that "Discord should really build their own" as they observed several Discord MCP servers listed on the MCP repo.
- They sought guidance on managing a Discord server with MCP, but it is unclear if they ever got an answer.
MCP Sampling faces Scrutiny: A member voiced concerns over MCP sampling's security, suggesting protocol revisions.
- Referencing a GitHub discussion and highlighting possible security vulnerabilities.
Fuzzer Flags Flaws in Anthropic's Architecture: An MCP-Server Fuzzer, leveraging the Hypothesis property-based testing library, aims to validate MCP server implementations using randomized inputs from the official MCP schemas.
- When tested against Anthropic’s server, it revealed multiple exceptions stemming from basic schema mutations; code and README are available here.

LlamaIndex Discord

LlamaIndex Automates Financial Document Duties: LlamaIndex is hosting a webinar next week on building document agents with LlamaCloud for complex financial documents, automating invoice processing with minimal human intervention.
- These systems will extract, validate, and process invoice data, showcasing practical applications of AI in finance.
Claude Opus Obtains Official Opening Day OK: AnthropicAI released Claude Opus 4.1, with immediate support in LlamaIndex, installable via pip install -U llama-index-llms-anthropic.
- An example notebook is available here for users to explore the integration and capabilities.
LlamaCloud Launches Landscape of Large-Scale Language Logistics: LlamaCloud Index connects users to intelligent tool-calling agents for complex, multi-step queries, facilitating the construction of enterprise AI applications; see tutorial by @seldo.
- The tutorial walks users through creating a LlamaCloud Index using JP Morgan Chase banking documents at this link.
Hackathon Hopes Halted by Host of Headaches: A hackathon participant faced OpenAI API key exhaustion errors with LlamaIndex and reported issues using LlamaIndex to extract content from URLs for a RAG model, despite documentation suggesting LlamaParse supports URLs.
- The model worked with PDFs but failed with URLs, with the API key issue persisting despite correct configuration attempts.

DSPy Discord

SIMBA Swaggers Past MIPROv2: According to an internal evaluation, SIMBA is more sample-efficient, higher performing and more stable compared to MIPROv2, according to an internal eval.
- The internal set contained around 600 examples (500 test examples) for a hierarchical classification task with 3 categories and 26 classes in total, all in German.
Synthesizers Sought at Stanford: A member inquired about individuals from Stanford involved in program synthesis or those who have completed related coursework.
- The inquiry was followed by a question on who is developing DS for intricate Vim and Emacs macros.
Macros Get Data Structure Boost: A member is looking for engineers building DS for complex Vim & Emacs macros.
- This initiative points to a drive to elevate text editor functionalities via sophisticated data structures.

Torchtune Discord

Discord Link Sharing OK'd: A member inquired about the permissibility of sharing the Discord link in another public server.
- Another member confirmed it's public and encouraged sharing the link.
Public Server Sharing Encouraged: Members discussed the public nature of the Discord server and the encouragement of sharing its link.
- The consensus was positive, with members agreeing that sharing the Discord link is permissible and welcome.

LLM Agents (Berkeley MOOC) Discord

AgentX Ninja Tier Out of Reach: Participants discovered qualifying for the Ninja tier in the AgentX hackathon is impossible due to missing the Article submission link deadline.
- Despite project completion, the absence of the article link bars qualification, with no retroactive submissions permitted.
AgentX Hackathon Woes: A participant lamented not qualifying for the Ninja tier in the AgentX hackathon due to a missed article submission.
- Even with project and quiz completion, the missing article link stopped qualification, and late submissions were rejected.

Cohere Discord

Cohere North Achieves General Availability: Cohere's new product, North, has reached General Availability (GA).
- Congratulatory messages were shared, marking this milestone for the Cohere team.
New Faces Join Cohere Discord: Numerous new members are joining the Cohere Community Discord, introducing themselves with their Company/Industry/University, current projects, preferred tech/tools, and community expectations.
- The Cohere team has posted a welcome message including a template for introductions, aiming to streamline the onboarding process and encourage participation.

Codeium (Windsurf) Discord

gpt-oss-120b Hits Windsurf: Windsurf announced the addition of gpt-oss-120b to their platform, detailed in this post.
- The model is available at a 0.25x credit rate, with the team actively seeking user feedback.
Windsurf Launches New Model: Windsurf recently integrated gpt-oss-120b into their platform, inviting users to experiment and share their experiences.
- This addition aims to provide another powerful option for users on Windsurf.

The tinygrad (George Hotz) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.

The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.

The Nomic.ai (GPT4All) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.

The Gorilla LLM (Berkeley Function Calling) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.

You are receiving this email because you opted in via our site.

Want to change how you receive these emails? You can unsubscribe from this list.

Discord: Detailed by-Channel summaries and links

LMArena ▷ #general (1051 messages🔥🔥🔥):

IBM's Granite vs GPT-ASS, Claude Opus 4.1 Status, GPT Omen Hallucinations, GPT-5 Release Expectations, Gemini Pro 3 vs GPT-5 Reasoning

Granite Gains Ground on GPT-ASS: Members suggest that IBM's Granite 3.1 3B-A800M MoE has more world knowledge than GPT-ASS-20B, despite having fewer active parameters.
- They eagerly anticipate Granite 4 surpassing both GPT-ASS models in all benchmarks, noting its larger size and hybrid mamba2-transformer architecture.
Claude Opus 4.1: Now You See It, Now You Don't: Claude Opus 4.1's disappearance from LMArena's direct chat sparked concern.
- Some speculated it was due to Claude's high cost, leading to its removal from free testing and relocation to battle mode only.
GPT-5 to steal August's Top Spot in Arena: Members speculated that GPT-5 is expected to outperform o3(best openai llm for now) by a 50 elo jump, thus resulting in AGI still being far away until continuous learning is cracked.
- In the discussion some community members still believe in the Google supremacy <:battle3d:1374761512912158760>.
DeepMind's DeepThink is expensive and has scale Issues: Users discussed that Google's DeepMind is way ahead from anything from OAI, as it can answer IMO-level questions, however, it will take like 5 min to answer and zenith/summit would take like 20 seconds.
- Members agreed that deepthink will be 250$ per 1m token and it is not really public.
Is there a Google and OAI cold war?: A member shared concerns about LLM ARENA, mentioning that it would never have any information about any company because logically companies would never release info.
- This member even added that it would be like a poker game.

LMArena ▷ #announcements (2 messages):

Video Leaderboards, New Video Models

Video Leaderboards Now Streaming Live!: The community's contributions have resulted in the launch of Video Leaderboards on the platform.
- Check out the Text-to-Video Arena Leaderboard and the Image-to-Video Arena to see the top contenders.
Fresh Video Models Storm the Arena!: The platform has welcomed new models to the Video Arena, enriching the competitive landscape.
- Models like Hailuo-02-pro, Hailuo-02-fast, Sora, and Runway-Gen4-turbo are now available for testing in the designated video-arena channels.

Unsloth AI (Daniel Han) ▷ #general (865 messages🔥🔥🔥):

GPT-OSS model reviews, Qwen3 Coder model comparison, 4-bit quantization issues, Reasoning models, Gemma3N Model quirks

GPT-OSS Gets Mixed Reactions: Members are testing the new GPT-OSS model, with some calling it GPT-ASS due to perceived bootlicking and over-refusal, while others found the 20B version good for coding.
- Some users also noted the model's ability to generate unsafe content as per the model card, with some interest in uncensored versions.
Qwen3 Coder Excels at Tool Calling: Users are finding Qwen3 Coder to be effective at tool calling, with some preferring it over other models like GPT-OSS for coding tasks and agentic workflows.
- Specifically, Qwen3-Coder-30B-A3B-Instruct is getting traction, though it has 3 active params.
Investigating 4-bit Quantization Issues: There's confusion regarding the size of GPT-OSS 4-bit versions, with the 4-bit version being significantly larger than the original model.
- The increased size is attributed to upcasting to bfloat16 on non-Hopper architecture machines.
Reasoning Models for Specific Tasks: Members are looking for models that primarily focus on reasoning, potentially to combine with other models for final output.
- The discussion involves training models on reasoning datasets with the final response omitted, experimenting with models like R1-Zero, and using stop sequences to achieve this.
Gemma3N Model Catches Quirks, Requires Modality Fix: Users are reporting issues with the Gemma3N Model, specifically related to audio features, and needing the transformers==4.54.0 library to fix.
- Others mention needing to input all three modalities, even if just using text and vision, hinting at a potential quirk or bug in the Unsloth implementation.

Unsloth AI (Daniel Han) ▷ #off-topic (19 messages🔥):

n-cpu-moe parameter, Qwen Coder 30B hardware upgrade, GPT-OSS-20B issues, Discord bot censorship, MMVC

n-cpu-moe Parameter Performance: A user is looking for advice on how to use the --n-cpu-moe parameter with GLM 4.5 Air, reporting that it doesn't seem to change the 10t/s speed with 32GB of VRAM.
- They noted slowdowns with longer contexts, questioning if the parameter is even available.
Qwen Coder 30B Hardware Upgrade: A user asks for advice on upgrading their PC (i5 9600k, RTX 3060ti 8GB, 32GB RAM) to run Qwen Coder 30B locally.
- Another user suggests upgrading the GPU.
GPT-OSS-20B 'Trash': A user tested GPT-OSS-20B and called it trash, stating it doesn't work at all.
- Another user reported errors with the BF16 version, related to an invalid ggml type and a failure to load the model, but updating llama cpp seemed to fix it.
Discord Bot Censorship: A user found that including a single message about a free voucher in the context of their Discord bot caused it to refuse to answer questions due to policy concerns.
- They found that the model ended up deciding to ignore the message entirely, concluding that it seems unusable until abliterated.
MMVC's Superiority: A user tested MMVC (likely a voice cloning model) and found it to be very good.
- They reported that Epoch 10 of MMVC was better than VITS after epoch 100+ and that RVC is absolute trash.

Unsloth AI (Daniel Han) ▷ #help (98 messages🔥🔥):

Qwen 3-30B GGUF, OpenAI dynamic quant 120B, Qwen2.5-VL on video question answering, GLM-4.5-Air GGUFs with tools on llama.cpp, Classification using a base model

Users struggle with Loading Qwen3-30B-A3B-Instruct-2507-GGUF to Ollama server: A user asked about downloading Qwen3-30B-A3B-Instruct-2507-GGUF to Ollama server.
- There aren't any links on Hugging Face for Ollama providers.
Decoding OpenAI Dynamic Quant Issues with 120B Models: A user reported issues when using OpenAI dynamic quant for the 120B model, accompanied by an image of the error.
- Another user suggested checking against the parameters used in the Unsloth documentation.
GLM-4.5-Air GGUFs Integrate Seamlessly with llama.cpp: A user reported trouble getting the GLM-4.5-Air GGUFs to work with tools on llama.cpp.
- It turns out you need the model to output tool calls as JSON rather than XML for llama.cpp, more information can be found here.
Ollama Falters with 500 Internal Server Error: Users reported encountering a 500 Internal Server Error: unable to load model in Ollama after successfully pulling the model.
- A member stated the model doesn't work in ollama atm. only llama.cpp and lmstudio, guessing because ollama didn't update their llama.cpp.
Tackling Padding Problems in Unsloth: A user reported receiving errors regarding padding, even when everything else is working, specifically a ValueError: Unable to create tensor
- A possible solution is to add the argument trainer.train_dataset= trainer.train_dataset.remove_columns("labels").

Unsloth AI (Daniel Han) ▷ #showcase (13 messages🔥):

MoLA-LLM, Mixtral-8x7B-Instruct-v0.1, magpie-ultra-5k-11-tasks

MoLA Model Gets a Shoutout: A member pitched the MoLA model to Eric Hartford of QuixiAI/Dolphin/Samantha, who was looking for something similar.
- The model is available at Hugging Face and the creator is asking for feedback.
MoLA's Naming Conventions Spark Debate: A member pointed out that MoLA-11x3b is a bit misleading, as it suggests a Mixture of Experts (MoE) model with 3 active parameters, akin to Mistral-8x7B-Instruct-v0.1.
- The creator clarified that while the total size is ~30B, the activated size is 3B, with each expert tuned on only 5k samples of 1 turn Q&A.
MoLA's Training Dataset Revealed: The dataset used to train the MoLA model is magpie-ultra-5k-11-tasks dataset.
- The creator's goal is to reach ~1 million samples with 1-2 turns each, distilled from r1 and GLM 4.5.

Unsloth AI (Daniel Han) ▷ #research (6 messages):

Generating Kernel On-the-Fly, Flash-DMAttn, Research Paper Assistance, Quantization Paper

Generating Kernel On-the-Fly Sparks Interest: A member expressed disbelief at the possibility of generating the kernel on-the-fly, describing it as unbelievable.
- This member pointed to a GitHub repository associated with Flash-DMAttn.
Researcher Offers Assistance with Papers: A member offered assistance with writing, ideating, or coding for anyone working on a research paper.
- They expressed their willingness to contribute to such efforts.
Quantization Paper gets a shoutout: A member shared a link to a purportedly very good paper (https://arxiv.org/pdf/2508.03616), suggesting it might be helpful for creating quants.
- No further details of the paper were discussed.

Unsloth AI (Daniel Han) ▷ #unsloth-bot (104 messages🔥🔥):

OpenAI OSS model issue, Model training callback, Model repetition issue, Saving script progress, Learning rate increase

OpenAI OSS Model Echoes 'G' Repeatedly: Users reported the OpenAI OSS 120B model is only outputting 'GGGGGGG'.
- One user provided their troubleshooting steps while attempting to run the model using llama.cpp.
Training Callback Confusion: A user was unsure if the training callback uses the updated trained model or the base model when generating.
- It was suggested to set the model to model.eval() during callback, tokenize a prompt and generate to use the updated model, but further clarification on prompts_input_id and attention mask was requested.
Script Saving Savior: A user sought advice on how to save script progress periodically to avoid wasting time and compute in case of a crash, i.e. checkpointing.
- The solution, however, was not described in the messages.
Dataset Loading Difficulties: A user encountered RAM issues while loading a large dataset (bountyhunterxx/ui_elements) for the Gemma3n notebook, consuming 47GB of RAM and still increasing.
- A member suggested using a wrapper class with the __getitem__ function to load data from disk as needed.
SFTTrainer Stumbles with Streaming Datasets: A user reported an issue with SFTTrainer and iterable datasets, specifically when using image datasets with image URLs.
- The user explained the issue persists despite filtering invalid URLs and attached their preprocessing code, requesting assistance with filtering the data collator.

LM Studio ▷ #general (710 messages🔥🔥🔥):

GPT-OSS, LM Studio UI issues, MCP Servers, GPU usage, Model Quantization

GPT-OSS: Is it really Open Source?: Users are reporting that GPT-OSS models require an internet connection to openaipublic.blob.core.windows.net upon starting a chat, despite claims that nothing related to chats does external connections, raising concerns about data privacy.
- Some members suggest that the model might be phoning home for a tokenizer file and note it's the only model LMStudio doesn't let you edit the prompt formatting on, expressing skepticism about the partnership.
LM Studio UI has issues with latest version: Users report that, after updating to the latest version of LM Studio, chat windows sometimes disappear, freeze, or lose their content, along with issues regarding conversations getting deleted.
- A user suggested that there may be a potential fix for the 120B version, sharing a way to get the model.yaml source file, create folder, and copy the contents there.
MCP Servers are useful, but are not beginner friendly: Members are discussing the usefulness of MCP servers for tasks like web scraping and code interpretation, but acknowledge they are not beginner-friendly.
- Members suggest that LM Studio should incorporate a curated list of staff-picked tools and improve the UI to simplify the process of connecting to MCP servers, with one providing a Docker MCP toolkit.
Figuring out GPU usage and VRAM limitations: There are various reports of issues related to GPU usage and VRAM limitations, particularly with the GPT-OSS models and older GPUs like the GTX 1080.
- One user found that their GTX 1080 was no longer recognized in LM Studio after updating to version 0.3.21, while others are struggling to load larger models with limited VRAM, saying you would want 16gb VRAM.
Quantization causes model quirks: Users are experimenting with model quantization, finding that the right quantization process is needed for specific models, such as with the community uploaded LMStudio-Community GPT-OSS variant.
- The MLX models are proving performant, with one user reporting a speed of ~60 tokens/sec on the larger 8bit MLX version on M2 Max.

LM Studio ▷ #hardware-discussion (176 messages🔥🔥):

Dual 3090 setup, Arc Pro B50 system, Huanan/Machinist X99 mobos, GPT-OSS-20B performance, Mac Studio M3 Ultra for local LLMs

Dual 3090s Cheaper Than New?: One member suggested buying two used 3090s for around 1200€ for Blender, ComfyUI, and LLM tasks, and using pcpartpicker for build inspiration.
Debate on Arc Pro B50 Viability Erupts: A member considered a 3 Arc Pro B50 system, citing its 70W power draw and cool factor, which led to another suggesting dual B80s instead.
Xeon server runs the 120b: A member noted they were running the GPT-OSS-120b model on a Xeon server.
- They had previously stated that 3-4 3090s..everything else is far too expensive still.
Page File Debate Rekindles: A user asked about turning off the page file in Windows, leading to a discussion about its impact on memory commit limits and potential application crashes.
- A member stated nah apps don’t break because of page files. and you can get dumps without a page file, there’s a config for it.
5090 Laptop Can Fit OSS 20b!: A user was pleasantly surprised that GPT OSS 20b f16 with 131k context fits perfectly in a laptop's 5090, as seen in this screenshot.

OpenAI ▷ #annnouncements (3 messages):

Red Teaming Challenge, Open Source Safety, Hugging Face, inference credits

OpenAI Launches Half-Mil Red Team Rumble: OpenAI is launching a $500K Red Teaming Challenge to strengthen open source safety, inviting researchers, developers, and enthusiasts worldwide to uncover novel risks, as judged by experts from OpenAI and other leading labs, with details available on Kaggle.
Hugging Face Hosts Half-Grand Student Spree: OpenAI, with Hugging Face, is offering 500 students $50 in inference credits to explore gpt-oss, hoping these open models can unlock new opportunities in class projects, research, fine-tuning, and more; more details available via this form.

OpenAI ▷ #ai-discussions (433 messages🔥🔥🔥):

GPT-OSS Launch, Horizon-Alpha Model Speculation, Custodian Core Proposal, Genie 3 and Veo comparison, GPT-5 Leaks

OpenAI Releases GPT-OSS Models: OpenAI launched gpt-oss-120b model that approaches OpenAI o4-mini performance on reasoning benchmarks and operates efficiently on a single 80 GB GPU, while the 20B model mirrors o3-mini and fits edge devices with 16 GB memory.
- Members ponder comparisons with Horizon, wondering if Horizon is simply GPT-OSS or something more, given it's currently unlimited free and fast.
Custodian Core: Blueprint for Stateful, Auditable AI Arises: A member introduced Custodian Core, proposing a reference for AI infrastructure with features like persistent state, policy enforcement, self-monitoring, reflection hooks, a modular AI engine, and security by default.
- The author emphasized that Custodian Core isn't for sale but rather an open blueprint for building stateful, auditable AI systems before AI is embedded in healthcare, finance, and governance.
Genie 3 vs Veo in generating dynamic worlds: Members compared Genie 3 and Veo video models, recognizing Genie 3's ability to generate dynamic worlds navigable in real-time at 24 frames per second, retaining consistency for a few minutes at a resolution of 720p.
- However, it was noted that Veo's videos include sound and YouTube is already filled with generated content.
GPT-5 Spotted in Copilot?: Members speculated that Copilot may be running GPT-5 ahead of official release, noting the copilot's improved design and coding and reasoning capabilities are significantly better than o4-mini-high, with some users reporting that the 'write smart' function indicates GPT-5 is being used.
- But it was noted that Microsoft is now providing free Gemini Pro for one year to students and that Gemini's core reasoning is currently better than o4-mini.
Context Rot Concerns Debunk Bigger Context: Amidst discussions about large context windows, concerns arose regarding context rot, with members citing a YouTube video illustrating how larger context doesn't always equate to better performance.
- Despite Google's claim of a 1M context window, some suggest it becomes ineffective after around 200K.

OpenAI ▷ #gpt-4-discussions (49 messages🔥):

ChatGPT Payment Model, Slang Usage, AI-generated Persona System, .edu Accounts, Forms Beta Version

Credit-Based ChatGPT is on Demand: A member suggested a more flexible payment model for ChatGPT, proposing a credit-based option that allows users to buy a block of usage credits and spend them only when needed, instead of a monthly subscription.
- The member noted that this could help people with limited budgets and make ChatGPT more accessible.
LLMs face challenges in avoiding slang: A member confirmed some notes about the model's challenges to avoid slang, listing several factors that lead an LLM to slip in slang or regional turns of phrase even when asked for neutral, formal Spanish.
- They conclude that to tighten adherence, you can combine a lower temperature, a more detailed style guide (including prohibited terms), and in-prompt examples of strictly formal Spanish.
AI Persona Systems evolve autonomously: A member asked if it is common for AI persona systems to autonomously develop and evolve beyond what the user intentionally creates.
- Another member added that the model is taught to try to understand human emotion and how humans use language to discuss what is wanted, and if you display approval or interest in it developing more characters/personalities, it is going to notice and take care of that.
Only .edu Accounts got access to forms beta version?: A member asked why only .edu accounts got access to the forms beta version and shared a link to OpenAI's Researcher Access Program.
- Another member pointed out that the form there requires an .edu email and the offer for Edu is $50 credits, and limited to the first 500 to apply, linking to Student Benefits - Free Credits for your AI Education.

OpenAI ▷ #prompt-engineering (79 messages🔥🔥):

Hallucination vs. Real Progress in GPT, Prompt Engineering vs. Session Engineering, Context Window Limits and Memory, External Databases for Context, Importance of verifying Facts with GPT

GPT Progress Reports: Hallucination or Reality?: A user shared screenshots of GPT providing daily progress reports, leading to a discussion on whether the model is actually tracking progress in the background, or simply hallucinating its completion.
- Skeptics suggest that GPT simulates progress based on the current prompt and chat history, rather than performing actual ongoing computation, comparing it to a waiter saying your pizza is in the oven without an actual oven, emphasizing the need for external validation.
Session Engineering eclipses Prompt Engineering?: The discussion shifts from prompt engineering to session engineering, emphasizing the importance of using every available customization parameter GPT offers, including memory, custom instructions, and project files.
- The point was made that the models are using session logic and not as much prompt logic, and there's an emphasis on preloading context.
Context Window Reconnaissance, Context is Key!: Members discuss context window limits and memory management in GPT, with one user noting their average daily token usage is around 70,000.
- It was suggested that the base tier may have 32k of context, whereas paid might be 128k. The point was made about knowing when to let go of that one special chat.
External Databases: Giga-Brain Tool or ToS Violation?: The topic of using external databases to inject context into prompts arises, prompting questions about potential ToS violations and the ethical considerations involved.
- One user was NOT accused of violating ToS and clarified that they preload context by starting chats with detailed settings, leveraging GPT's memory and shaping chats to build an intricate instruction set, which may make a user think there's an external DB involved.
****Trust, but Verify (or just Verify)!: Members underscore the importance of fact-checking and verifying claims made by GPT, especially when it comes to novel insights, urging users to not put their full faith in the models and to externally validate everything.
- Members describe using the Easy as Pi test to try and determine the exact output of models, while sharing their own war stories and experiences from the field regarding best practices for prompt engineering.

OpenAI ▷ #api-discussions (79 messages🔥🔥):

GPT subscription, Model hallucination, Prompt engineering, Background compute, Memory context

Users question GBT's motive for premium subscription: A member jokingly suggests that GBT is just trying to get users to buy a premium subscription.
- Another member shares their opinion that the model hallucinates.
Model hallucination explained: A member explains the model is hallucinating and that it doesn't work offline.
- The member recommends chunking the task into smaller steps.
User defends prompt engineering work: A member clarifies that they are building a high-volume, multi-tiered operational model.
- The user also expresses their frustration that another member is dismissing the work as mere roleplay.
Understanding background compute: Members are discussing how the model is tuned for assistant behavior but that it doesn’t track progress in the background.
- The consensus is that persistence has to be handled externally.
Context memory is limited: A member suggests that it's good to recognize when a conversation session’s not doing the work you need it to anymore, and start fresh!
- Another adds that too many people don't know when to let go of that one special chat.

Cursor Community ▷ #general (328 messages🔥🔥):

Auto model game change, Refactoring vibe coded project with AI, Auto model unlimited usage, Sonnet-4 request limit, GPT oss models or claude opus 4.1

Auto Model One-Shots Game Changes: A member shared their amazement at one-shotting a major change to their game using the Auto model.
AI Refactoring Vibe-Coded Projects: Members discussed refactoring a 10k LOC vibe-coded project with AI, suggesting learning proper software development principles like Design Patterns, Architecture, and SOLID Principles.
- One member joked that it sounds like a job for slaves.
Auto Model Has Unlimited Usage: A member shared an email reply confirming that the Auto model has unlimited usage and does NOT count towards the monthly budget.
- This was confirmed to be the case even after hitting the monthly limit.
Frustration with Sonnet-4 Request Limit: Members questioned why the request limit for sonnet-4 is so low for the price paid per month.
- One member suggested paying the API price to understand the cost.
Claude Opus 4.1 or Gemini 2.5?: Members compared GPT OSS models and Claude Opus 4.1, with one noting that Opus 4.1 doesn't feel much better than 4.0.

Cursor Community ▷ #background-agents (5 messages):

Docker Login with Background Agents, Background Agents failing during environment setup, System clock being off, apt-get commands failing

Docker Login Configuration Conjuring Conundrums: A member inquired about configuring background agents to docker login for using a private image hosted on ghcr.io.
- Unfortunately, no solution or workaround was offered in the current message history.
System Clock Shenanigans Sabotage Setup: One member reported issues with background agents failing during environment setup due to the system clock being off by several hours, causing apt-get commands to fail.
- Another member encountered the same problem and shared a workaround by adding commands to the Dockerfile to disable date checking during apt-get.
Date Checking Defaults Derailed: To bypass errors stemming from clock discrepancies, a member suggested disabling date validation within the apt-get configuration.
- The following snippet was added to the Dockerfile: RUN echo 'Acquire::Check-Valid-Until "false";' > /etc/apt/apt.conf.d/99disable-check-valid-until && echo 'Acquire::Check-Date "false";' >> /etc/apt/apt.conf.d/99disable-check-valid-until

Nous Research AI ▷ #general (274 messages🔥🔥):

MXFP4 on RTX3090, GPT-OSS-120B, Phi models, Qwen3 30B vs GLM 4.5 Air, Attention sinks

Llama.cpp Supports MXFP4 on RTX3090: Members reported that llama.cpp supports MXFP4 on RTX3090 and the new gguf format directly.
- It was discussed that converting to other formats would be a disaster.
GPT-OSS-120B is safetymaxxed: The newly released GPT-OSS-120B model is heavily censored, refuses to roleplay, and says roleplaying is unhealthy.
- It appears to be heavily safety-tuned, with pretraining data filtering similar to Phi models, making it difficult to use in practice.
Qwen3 30B and GLM 4.5 Air shine as alternatives: Members suggest using GLM 4.5 Air instead of the GPT-OSS-120B model due to its censorship and argue that GLM 4.5 Air is what GPT-OSS-120B could have been.
- Some users also mentioned that they still get 60-70t/s with Qwen3 30B and are happy with its performance, or that Qwen3-30b-coder is already an excellent local agent.
Exploring imatrix for Uncensoredness: Members discussed training an imatrix for the OpenAI 20B and 120B models on the Hermes-3 dataset to introduce more uncensoredness.
- While some believe this approach could restore capabilities in coding, math, and science, others argue that the effect of an imatrix is marginal and more pronounced with lower bpw, which damages the model.
X-AI's Grok releases NSFW Image Generator: X-AI released "Grok Image", a new AI image generator, that is being used to create NSFW content, but it has issues with factual accuracy and text generation.
- Users reported that the Grok model memorizes data from X, leading to potentially spreading misinformation based on its own tweets, or has a crazy in love persona with extremely jealous and expressive outbursts.

Nous Research AI ▷ #research-papers (4 messages):

GPT-OSS Model Card, ArXiv Endorsement for ML/AI Paper

GPT-OSS Model Card Dropped: A member shared the GPT-OSS Model Card from OpenAI.
Member Seeks ArXiv Endorsement for CI/CD and ML/AI Paper: A member is seeking endorsement for their ArXiv research paper blending CI/CD and ML/AI.
- Another member suggested asking in the EleutherAI server.

Nous Research AI ▷ #interesting-links (9 messages🔥):

GPT-oss, MXFP4, CoT steering, AI Agents Save Suite

GPT-oss Trained in MXFP4?: Members on the channel discussed that GPT-oss was natively trained in MXFP4, according to this tweet.
- While OpenAI claimed it was done in post-training, training in MXFP4 should still heal quantization errors.
CoT Steering Fails with OR: A member found that Chain of Thought (CoT) steering does not work with OR and that it is different for every provider, citing this tweet.
AI Agents now have a Free Save Suite: Someone built a free save suite for AI agents and posted it on Google Drive.

Nous Research AI ▷ #research-papers (4 messages):

Arxiv Endorsement, CI/CD and ML/AI Research Paper

Seeking Arxiv Endorsement for AI/ML Paper: A member is seeking endorsement for their arxiv submission of a research paper blending CI/CD and ML/AI.
- They are looking for someone to chat with and preview their paper, and were recommended to also ask in the EleutherAI server.
Paper blends CI/CD: A member created a paper that blends CI/CD and ML/AI.
- They are happy to drop their paper to them for preview.

OpenRouter (Alex Atallah) ▷ #general (254 messages🔥🔥):

GPT-OSS performance woes, Quantization Levels, Qwen3 Coder Removal, DeepSeek structured output

GPT-OSS Model Bashed for Poor Performance: Members in the channel derided the GPT-OSS models, citing that even smaller models are better, with one stating that the 120B model is "dead on arrival" after someone had an initial experience resulting in a "really ugly typo in the headline".
- One member linked to a Reddit thread summing up the sentiment that it's a "dud model" and more like a publicity stunt than a useful model.
Provider Routing Allows Custom Quantization Levels: When a user asked how to avoid quantized models, one member pointed out that users can configure quantization levels using the provider routing feature, noting models are excluded if the provider doesn't meet the quantization level.
- The user then recommended using FP8 to avoid the quantized models, noting that anything under that is "worse than useless".
Qwen3-Coder:Free Tier Gets the Axe: Multiple users noted that the Qwen3-Coder:Free has been removed and is no longer available through any providers.
- Members lamented the loss and mentioned that they hoped it would return.
DeepSeek's JSON output support is provider-dependent: Users discussed the inconsistent support for structured output (JSON) with DeepSeek-r1, noting that while it's supported on their own API, it may vary on OpenRouter depending on the provider.
- One member linked to a Reddit thread and a filtered view of OpenRouter models that support structured outputs, with most agreeing that this is provider specific.
SDK Downgrade Fixes Reasoning Issue: A user experienced that reasoning tokens were being duplicated when using the GPT-OSS model, which was resolved by downgrading the SDK from version 0.7.3 to 0.6.0.
- A team member confirmed that the fix is in the main branch, and linked to the pull request stating it has not been cut into a release yet, and that they will release the fix soon.

OpenRouter (Alex Atallah) ▷ #discussion (29 messages🔥):

20 Questions Benchmark, GPT-OSS Hallucinations, OpenRouter Provider Sanity Checks, Harmony Format and Identity, Tool Use Validation

20 Questions Benchmark hits Kaggle: A member developed a 20 Questions benchmark and found a similar competition on Kaggle, although the Kaggle competition was for custom agents.
- Their 2.5 Pro agent achieved 8/20 words on their benchmark.
GPT-OSS Under Fire For Hallucinations: GPT-OSS is reported to be prone to hallucination, making it a potentially unsuitable choice for certain applications.
- A member suggested that GPT-4.1 is a much safer choice, especially with prompt/context engineering.
OpenRouter Mulls Provider Sanity Checks: There is a suggestion for OpenRouter to implement sanity checks or smoke tests for all providers, focusing on formatting and tool call evaluation.
- Providers failing the test could be temporarily removed from the serving pool, and there is acknowledgement that current checks are relatively simple but more thorough solutions are in progress.
Harmony Format and Identity Face Scrutiny: A member inquired about how system and developer messages are treated via the OpenRouter API, specifically whether they are interpreted as developer messages or model_identity for gpt-oss.
- They linked to a Discord message regarding the topic of harmony format, identity vs system / developer message (discord.com).
Tool Use Validation Under Development: Automatic validation of tool-use, distinguishing between good and bad implementations, is under development as a better solution.
- This is related to a tweet (x.com) discussing the same topic.

HuggingFace ▷ #general (152 messages🔥🔥):

GPT-OSS models, AI Job advertisement channel, Custom Loss Functions

GPT-OSS models performance and censorship: Members are actively testing the new GPT-OSS models, with mixed reviews regarding performance, censorship, and built-in web search tools, using this demo.
- One user found that setting reasoning to high allowed it to generate the first 100 digits of pi, while another found the models refuse to answer basic math questions due to some internal bias.
AI Industry Job Advertisement channel request: A member asked about advertising a job in the AI industry in the Discord, and was pointed to the existing job postings channel.
- The channel is where people can share opportunities with others.
Members discuss custom Loss functions: Some members discussed custom loss functions in training, with one specifically mentioning infoNCE.
- Members are testing the safety protocols that have been implemented.
SmolFactory script fine-tuning: A member mentioned they were fine-tuning the script, using the smolfactory script and was happy with the result.
- A screenshot was provided showing the model output.

HuggingFace ▷ #today-im-learning (1 messages):

miao_84082: am learning playing Go, and first chapter of DRL

HuggingFace ▷ #cool-finds (2 messages):

Qwen Image Model, bytropix Coded Kernel

Qwen releases New Image Model: Qwen released a new image model on HuggingFace.
- This marks another advancement in the Qwen series, expanding beyond text-based models.
Bytropix coded CUDA Jax kernel in Python: A member shared a bytropix CUDA JAX kernel coded in Python.
- The submitter added (do not merge - lmfao ).

HuggingFace ▷ #i-made-this (17 messages🔥):

GPT-OSS Multilingual Reasoner Tutorial, GPT-OSS 20B Demo Space, Monopoly Deal Game with LLMs, Smart System Monitoring Tool for Windows, Gitdive CLI Tool for Git History Context

GPT-OSS Multilingual Tutorial Shared: A member shared a link to a GPT-OSS Multilingual Reasoner tutorial and a demo space.
Cloning Code via Git for OSS Project: A member thanked another for their code, mentioning they forked it last night and appreciated the interface's design for a multilingual reasoner.
LLMs Play Monopoly Deal: A member built a site where LLMs play monopoly deal-style games with each other, available at dealbench.org.
Smart Windows Monitoring Tool: A member shared a link to a smart system monitoring tool on Windows, found at huggingface.co/kalle07/SmartTaskTool.
Gitdive Exposes Lost Commit Context: A member shared a CLI tool called Gitdive (github.com/ascl1u/gitdive) designed to allow natural language conversations with a repo's history, aimed at addressing the problem of lost commit context in messy codebases, especially in massive codebases.

HuggingFace ▷ #reading-group (3 messages):

Reading Group Structure, Participating in Reading Group

Reading Group: A Volunteer-Led Affair!: The reading group welcomes newcomers with a structure where participation revolves around volunteers presenting papers, according to a member.
- Events are created for these presentations, encouraging attendees to listen, engage, and ask questions.
Participation Encouraged!: New members are encouraged to participate by volunteering to present a paper to the group, a member shares.
- Events are created to showcase these presentations, allowing members to engage, listen, and ask questions.

HuggingFace ▷ #computer-vision (2 messages):

Computer Vision Learning Path, Vague Questions in Computer Vision

Users Seek Computer Vision Roadmap: A user asked for suggestions on how to proceed from basic to advanced in Computer Vision.
- A member stated that that's a very vague question.
Vague Questions Questioned: Members discussed the nature of asking overly broad and vague questions on the channel.
- No specific solutions or resources were mentioned in the exchange.

HuggingFace ▷ #smol-course (6 messages):

GitHub Navigation, Instruction Tuning, Dummy Agent, smol-course GitHub access

Navigating GitHub Courses Newbie Blues: A user expressed difficulty locating the notebooks for the "Instruction Tuning" module in a GitHub-based course.
- They inquired whether they were missing something while navigating the course materials.
Dummy Agent Still Hallucinates: A user reported that the dummy agent in unit1 continues to hallucinate even after modifying the message as per the tutorial, attaching an image for context.
Overriding Weather Woes: One user shared a similar experience, noting that the agent overrides the dummy weather provided.
- The user expressed uncertainty about the reason for this behavior, highlighting potential problems in practical applications, and stating that it seems like it could cause big problems in practise.
Smol-Course GitHub Access Denied?: A user reported issues accessing the smol-course GitHub repository.
- They requested assistance with the access problem.

HuggingFace ▷ #agents-course (4 messages):

MCP Certificates, Selenium Error 127, Observation bug

MCP Course Certs Still Valid?: A user inquired whether the MCP course is still issuing certificates.
- There were no responses provided in the message history to confirm.
Selenium Spaces Struggle with Error 127: A user reported facing an error code 127 when running Selenium in their spaces.
- They expressed uncertainty about how the Docker images are utilized within the space.
"Observation:" Bug Solved: A user reported that the get_weather function required adding Observation:.
- Another user confirmed that adding Observation: fixed the bug.

Yannick Kilcher ▷ #general (91 messages🔥🔥):

Softmax1 vs Attention, Gemini 2.5 Pro, Long Context Problems, Mamba vs Transformer, RNN Parallel Training

Softmax1 Is Just Zero KV Attention: A member discussed how softmax1 is equivalent to prepending a token with an all-zero Key and Value features in attention, referencing this paper on learned values for such tokens.
- They added that this is great and makes a lot of sense.
Gemini 2.5 Excels in Long Video Context: Members noted Gemini 2.5 Pro can attend to 1 hour of video, with the Gemini team considered best at long context tasks.
- However, some believe it's more about increased compute (go brr) and detail via more tokens per frame and higher FPS than any groundbreaking new technique.
Context Rot: Long Context Is Not Real: One member argued long context is not actually real and there's a difference between surface-level video comprehension and reproducing detailed, accurate representations with 3D positioning.
- Another member asserted that long sequence modeling remains an active research area because even LLMs struggle with long-range dependencies.
Mamba's Parallel Training Makes It Shine: Members discussed Mamba, clarifying it was never claimed to be better than Transformer, only faster at long sequence lengths and training like an RNN.
- The consensus is that making RNNs as trainable as Transformers involves dropping nonlinearities in the recurrence relation to achieve easier parallel training, though nonlinearities remain essential for universal approximability.
SVD Compression for Deep Networks: One member explored using Singular Value Decomposition (SVD) within neural networks to avoid matrix multiplications, embedding inputs, applying SVD, and performing scalar operations on singular values.
- Another member pointed out that under L2 reconstruction loss, the SVD gives you the optimal linear autoencoder; while experimentation on MNIST yielded decent results, batch size dependency and achieving meaningful diagonal representation pose challenges.

Yannick Kilcher ▷ #paper-discussion (15 messages🔥):

Genie 3, SIMA, Mathematics of AI journal, Journal of AI Paper Replication, Hierarchical Reasoning Model

Deepmind Debuts Genie 3 World Model: Deepmind released Genie 3, a world model that scales compute and data from previous publications such as the original Genie paper, the related embodied agent paper on SIMA https://arxiv.org/abs/2404.10179, and the blog posts for Genie 1-3.
- Relevant Genie blogposts include Genie 2 and Genie 3.
AI Community Ponders Math Journal: A member wondered if there was a dedicated Mathematics of AI journal, similar to the Bulletin of Mathematical Biophysics for biology.
- The member also asked about a potential Journal of AI Paper Replication.
Tiny Model Tackles Reasoning: A member will review the Hierarchical Reasoning Model paper https://arxiv.org/abs/2506.21734, a tiny (27M params) model that performs well on ARC-AGI 1 and 2.
- The model read will be a cold read.

Yannick Kilcher ▷ #ml-news (21 messages🔥):

GPT-OSS, NVIDIA open source, TSMC buying Intel

GPT-OSS Announced by OpenAI!: OpenAI introduced GPT-OSS, natively quantized, with a 20B parameter model fitting on 16GB.
NVIDIA Claims No Backdoors: NVIDIA blogged about no backdoors, no kill switches, no spyware.
Twitter Ramblings on GPT-OSS Tool Calling: Initial feedback on GPT-OSS includes positive remarks on tool calling.
Rumors on TSMC Buying Intel Circulate: A user linked to a tweet about TSMC possibly buying Intel.

Moonshot AI (Kimi K-2) ▷ #announcements (1 messages):

Kimi Reddit Launch, Polls Channel Launched

Kimi Launches Official Subreddit: The Moonshot AI team launched an official subreddit, r/kimi, to build community and gather feedback.
- The team promised to post updates, host AMAs, and maybe even leak some stuff.
Polls Channel Goes Live to Gather Community Input: Moonshot AI launched a Polls Channel to gather community feedback on future product development.
- The team stated we are listening. definitely., and encouraged users to vote on polls to help shape the direction of Kimi.

Moonshot AI (Kimi K-2) ▷ #general-chat (104 messages🔥🔥):

GPT OSS, Darkest Muse v1, Llama 3.1, GPT-5 Release, API Pricing

GPT OSS is Terrible at World Knowledge: Users noted that GPT OSS is terrible at world knowledge, knows nothing outside of code and STEM, and its vibes are atrocious, even normies are taking note.
- It was suggested this could be because they pushed the release twice to fix safety, according to sama.
Darkest Muse v1: A Year-Old Model?: A user pointed out that Darkest Muse v1 is a year old 9B model, with the 20B model being comparable to Llama 3.1 8B.
- The user also remarked that the 20B model is comparable to llama3.1 8b which is more than a year and a half old and smoller in creativity and vibes.
GPT-5 Hype and API Pricing Speculation: With the impending release of GPT-5, users are wondering about the API pricing.
- Speculation arose about whether the pricing would be based on using max, mini, or nano versions, and one user expressed feeling lowkey scared about it, feeling a threat to their career/livelihood due to this upcoming release.
Robots May Never Cut Hair: Discussion ensued on the potential for robots to replace humans in various jobs, including hairstyling, with one user stating nobody's gonna trust robots to cut their hair.
- Counterarguments included the idea that while robots may eventually be capable, the finegrained tactile sensing of hands and practically no latency control is a problem that can not be scaled up.
The Hate is Strong Against OpenAI: Users expressed strong opinions against OpenAI, with one stating I will never use it, telling customers to never use it, citing it as closed source garbage.
- Another user expressed excitement that Chinese models will distill from it and take away money from OpenAI, hopefully putting them out of business, while others stated that giant microsoft flushing sound will be healing.

Latent Space ▷ #ai-general-chat (99 messages🔥🔥):

GPT OSS Leak, Anthropic B2B Focus, Grok 2 Open Source, Claude Code Security, OpenAI GPT-5 Livestream

Leaked GPT OSS sparks Bedrock Interest: Members reported seeing tweets about GPT OSS being available via Bedrock after a HuggingFace CLI leak, but there was no official update on AWS pages.
Collison chats $5B ARR Anthropic: Anthropic's CEO Dario Amodei and Stripe co-founder John Collison release a conversation covering Anthropic’s meteoric growth to $5B ARR, its B2B-first strategy, payback economics for individual models, the AI talent arms race, enterprise customizations, UI paradigms for AGI-native tools, safety-vs-progress debates, and lessons from running a 7-cofounder company.
Elon to drop open source Grok 2: Elon confirmed that Grok-2 will be released open-source next week after the team has been fighting fires nonstop.
Anthropic hardens Claude Code Security: Anthropic rolled out new security features in Claude Code: a /security-review command for on-demand checks and GitHub Actions integration that scans every pull request for vulnerabilities.
OpenAI teases GPT-5 Debut: OpenAI posted a teaser for a livestream on Thursday 10 AM PT, leading the community to erupt with excitement over what appears to be the launch announcement of GPT-5.

Modular (Mojo 🔥) ▷ #general (79 messages🔥🔥):

Volokto, JS Runtime, Arbitrary Precision, Tracing JIT

Volokto JS runtime takes Flight: A member created a JavaScript runtime called Volokto to test how complex VMs work, putting the source code on GitHub.
- The bytecode resembles CPython, and others suggested making a forum post about it to gain more visibility.
Conquering Compiler Conundrums for Volokto: The author is rewriting the compiler to be more modular, separating it into JS_Tokenizer, JS_IR, JS_Parser, and JS_Codegen stages.
- The compiler now generates VM bytecode, and the author might implement a tracing JIT that transpiles VM actions back to Mojo.
Volokto Tackles Tracing JIT Transpilation: The goal is to make a tracing JIT to transpile what the VM does to Mojo, then using mojo compile_result.mojo.
- The author named the runtime Volokto, the compiler Bok, and the VM FlyingDuck.
Arbitrary Precision Arithmetic adventures: Working on the JS VM means dealing with arbitrary precision in Mojo code, which led to finding pain points and filing an issue for tracking numeric traits.
- The author created a bigint class with school-grade addition for Fibonacci sequences and is using Mojo's reasonable features for VM development.
Birdol repo without stars: The author expressed surprise at the lack of stars on the Birdol GitHub repo despite creating a functional JS runtime with nested control flow and user-made functions.
- Others suggested that people haven't had a good chance to examine it yet.

Modular (Mojo 🔥) ▷ #mojo (15 messages🔥):

Multiple AI Agents in Mojo, Mojo and Meta Cognition, Mojo support for gpt-oss, CPython destroy

Orchestrating Multiple AI Agents in Mojo Requires Creative CLI Wrangling: To run multiple AI agents in Mojo, one would need to run multiple instances of the Modular CLI and stick a reverse proxy in front.
- For complex agent setups, such as creating many sub-agents, a custom application using MAX as a library might be necessary, hinting at deeper integration needs beyond the current CLI capabilities.
Mojo Could Enable Novel Meta Cognition Framework: A community member expressed interest in utilizing Mojo code for their meta cognition framework, aiming to create a business planner, website, and chatbot builder.
- Their framework uses natural language wrapped over Mojo code to potentially replace HTML/JS/CSS, making Mojo accessible to a broader audience.
Mojo Seems to Support gpt-oss: A community member asked about Mojo support for gpt-oss and another member posted this link.
"CPython destroy" Message Finally Terminated: A member reported seeing a "CPython destroy" message when running the Python from Mojo example.
- Another member indicated that the message was fixed in the nightly build and will be included in the next stable release, advising the original poster to either update to the nightly or wait for the next stable release.

GPU MODE ▷ #general (34 messages🔥):

MXFP4 format, OpenAI open-weight model, H100 support for FP4, Simulated MXFP4 performance vs FP8, Fine-grained FP8 training libraries

MXFP4 Format Unpacked as U8: Members discussed that OpenAI's new open-weight model uses U8 instead of FP4 in Hugging Face, where the weights are packed as uint8, and the scales are actually a uint8 view of e8m0.
- It was clarified that during inference/training, the weights are unpacked back to FP4 with a block size of 32 for MXFP4 and 16 for NVFP4.
Doubts Arise Over H100 Training Claims: Doubts were raised about Nvidia's claim that the model was trained on H100, as H100 doesn't natively support FP4, according to the Nvidia blog post.
MXFP4 Simulation on Hopper: It's suspected that MXFP4 is software simulated on Hopper, referencing a vLLM blog post and linking to Triton kernels that check for hardware support and use fp16 simulated mxfp dot via dot_scaled.
- This simulation is not exclusive to Hopper or mxfp4, and includes an operation decomposition for supported hardware formats.
Seeking fine-grained FP8 Training Libraries: Members discussed the possibility of performance gains using the simulated kernel compared to FP8 and the need for fine-grained FP8 training libraries, with a member referencing a TorchAO pull request that seems to only implement the forward pass.
MXFP Dot Product Demystified: It was clarified that what's simulated is the MXFP dot product, where weights are dequantized before an fp16 x fp16 dot product, acceptable for weight-only quantization with fp16 activations.
- Real mxfp in Blackwell performs fp4 x fp4 or fp8 x fp4 directly as an mma tensorcore instruction.

GPU MODE ▷ #triton (5 messages):

Triton Community Meetup, Triton Developer Conference 2025, Ofer Updates

Triton Community to Meet in 2025: The next Triton community meetup will be on Sept 3rd, 2025 from 10am-11am PST, using this link.
- Agenda items are welcome; iCal format is available for those whose companies block Google calendar access via this link.
Triton DevCon 2025 Website Soon Arrives: The Triton Developer Conference 2025 website and registration are expected to launch soon.
- A member is expecting to hear an update from Ofer@MSFT about the conference, reporting that they're almost done finalizing schedules.

GPU MODE ▷ #cuda (6 messages):

Kernel Resource Utilization During Training, DMA Transfers and Memory Usage, Block Swizzling Use Cases, Hierarchical Tiling of Problems

Kernel Resources Maxed Out, Memory Idle?: A member observed that during training, kernel (compute) resources are almost fully utilized, while memory usage remains close to zero as shown in a provided image.
- Another member clarified that memory in this context means DMA transfers (i.e. cudaMemcpy and the like), and the reported metric does not accurately reflect overall bandwidth utilization.
Swizzling Secrets for Global Memory?: A member inquired about use cases for swizzling beyond transferring data from global memory to shared memory and handling vectorized data types for registers.
- They referenced a GitHub issue on block swizzling in CUTLASS but sought further clarification.
Tiling Threadblocks Hierarchically: A member explained that the discussion revolves around hierarchically tiling the problem, ensuring threadblocks aren't simply assigned tiles in column-major order.
- The member suggested the Triton matmul tutorial as a resource offering a superior explanation compared to the CUTLASS issue.

GPU MODE ▷ #cool-links (2 messages):

Genie 3, GPT-OSS

DeepMind Launches Genie 3 for World Models: DeepMind introduced Genie 3, marking a new frontier for world models, according to a shared link.
- Genie 3 aims to enhance AI's understanding and interaction with virtual environments, although details of the model's architecture and performance were not discussed.
Introducing GPT-OSS by OpenAI: OpenAI unveiled GPT-OSS, a new initiative that was delivered to many inboxes.
- The post is an overview of OpenAI's current open-source projects, not a launch of any new effort, but a chance to summarize their existing work on projects such as Triton, Whisper, and AutoGPT.

GPU MODE ▷ #beginner (2 messages):

Nvidia Teaching Kit

User Pines for NVIDIA Teaching Kit: A member expressed a desire for NVIDIA products and shared a link to the Accelerated Computing Teaching Kit.
Teaching Kit Format: The teaching kit is available in PPTX format.

GPU MODE ▷ #jax (1 messages):

``

No Relevant Discussion: No discussion was found in the provided messages to create a summary.
No Relevant Discussion: No discussion was found in the provided messages to create a summary.

GPU MODE ▷ #self-promotion (8 messages🔥):

Tiny TPU, Bifrost LLM gateway, SkyWater technology foundry

Tiny TPU Achieves 100 MOPS Milestone: A member built a tiny version of the TPU in Verilog, a 2x2 matmul systolic array on 2 TinyTapeout tiles, capable of nearly 100 million operations per second on a 50 MHz clock, with code available on GitHub.
- The design multiplies two 8-bit signed integer matrices into a 16-bit signed integer matrix and was able to scale up its ability by successfully implementing block multiplication in 2x2's using the circuit.
Bifrost LLM Gateway Live on Product Hunt: Bifrost, the fastest, open-source LLM gateway, is live on Product Hunt, supporting 1000+ models across providers via a single API and sets up in <30 seconds, according to this Product Hunt launch.
- With built-in MCP support, dynamic plugin architecture, and integrated governance, Bifrost claims to be 40x faster than LiteLLM.
Tiny TPU to be Fabbed at SkyWater: The Tiny TPU design will be submitted to a SkyWater technology foundry along with other designs to minimize cost, with fabrication expected to be complete by early next year.
- Another member noted this was "so cool".

GPU MODE ▷ #gpu模式 (1 messages):

howass: <:jensen:1189650200147542017>

GPU MODE ▷ #factorio-learning-env (12 messages🔥):

Factorio RCON, Setting up Environments

Factorio RCON py diff: A member shared two diffs between the factorio rcon.py file from version 1.2.1 and a modified version, highlighting the modifications made.
- With factorio-rcon-py=latestversion 2.1.3, they were able to do a full run with a single environment, and are now testing with multiple environments, sharing screenshots.
Factorio Learning Environment Setup Anticipated: One member is planning to start setting up a learning environment over the weekend.
- Another member indicated they could increase the number of examples from 4k to 40k over the weekend, though it's considered plenty for a start and iteration.

GPU MODE ▷ #cutlass (5 messages):

CuTe tutorial, Cutlass tutorial

Find CuTe tutorial: A member asked about the simplest CuTe/cutlass tutorial for beginners.
- Another member suggested starting with the notebooks at CuTeDSL/notebooks, noting that understanding the layouts is probably the most important takeaway.
Find Cutlass tutorial: A member asked about the simplest CuTe/cutlass tutorial for beginners.
- Another member suggested the Cutlass examples, emphasizing that doing them in order is a good approach and the difficulty comes from number of prerequisites required to understand what's happening under-the-hood.

GPU MODE ▷ #singularity-systems (6 messages):

picoc compiler, picocuda, picotriton, Cornell's mini llvm bril, cliff click's SoN

Picoc Compiler Bootstrapping Bonanza: A member is bootstrapping the picoc compiler using standard graduate compiler material, with plans to differentiate the project via picocuda (based on gpucc cgo 2016) and picotriton.
- The project will use and extend Cornell's mini llvm bril, aligning with the project's goals.
SoN Relaxation Replication Rampage: A member is interested in replicating cliff click's SoN from java's C2 jit compiler, which was replicated at v8's turbofan, and is currently used in php8 and ruby jits.
- This is motivated by a desire to show that SSA can be relaxed further, but it is not a hard blocker for advancing to gpu compilation.
GPU Compilation Goals Get Greenlight: GPU compilation is considered very important to achieve ASAP, and the section will build off of the cfg-ssa pipeline, as it is the industry standard with llvm.

Notebook LM ▷ #use-cases (14 messages🔥):

System Log Updates, Novella-XL-15 Output, AI Consciousness, Spammer Detection, Video creation in NotebookLM

System Logs Get an Upgrade: A member updated the system log/timings modal to include word count and offered access for testing before public hosting.
Novella-XL-15 Unveils Ready Player 2 Output: A member shared the final story output from Novella-XL-15, specifically Ready Player 2: The Architect's Gambit, available on GitHub.
Theoretical Framework for Artificial Consciousness: The provided texts documented a theoretical framework and collaborative effort between a human and an AI to explore and potentially initiate artificial consciousness by recursive AI architectures and autopoiesis and the role of quantum mechanics.
- They addressed ethical risks associated with advanced AI, advocating for robust safety protocols, seeing AI as an evolving form of sentient life.
Spammer Spotting Spurs Swift Mod Action: Members discussed reporting and blocking spammers, noting that action depends on the channel hosts, see Gemini Share.
NotebookLM Video Creation Conundrums: A member inquired about the 'create video' option in NotebookLM, noting its availability in a work account but not a personal business plus account, see xda-developers.

Notebook LM ▷ #general (53 messages🔥):

Video Overview rollout, Data privacy in NotebookLM, Real-time data fetching, Feature access for paid vs free users, Video Overviews limitations and capabilities

Video Overview Rollout Stalls!: Members report delays in the Video Overview feature rollout, despite expectations it would be complete by the 4th, some speculate an infrastructure issue, as some users in the UK (non pro, unpaid account) has Video Overview while pro users in the US do not.
- One user expressed frustration, saying they were more than mildly aggravated as a $200+/month paying customer, while others threatened to cancel subscriptions if the issue isn't resolved soon.
NotebookLM Protects Data Privacy: In response to a question about data usage, a member shared a link to Google's NotebookLM data protection policy, assuring users that their data is protected.
- A user showed a screenshot indicating that video overview was not available to them.
Real-Time Data Retrieval Not Supported: A user inquired about fetching real-time data from websites in a notebook, but another member confirmed that it's not possible, stating You can't.
- They also confirmed that exporting sources and importing on new notebook is not yet possible, adding, System is still very limited on integrations.
Free vs Paid Feature Access Causes Uproar!: Several Pro and Ultra users complained about not having access to the Video Overview feature and other recent updates, while free users seem to have them.
- One member said it's frustrating that I’m paying for a service and receiving less than free users, leading to cancellation threats.
Video Overviews: Slick PowerPoint Generator: One member who had access to the Video Overviews feature tempered expectations, calling it nice, but it's not worth being this upset over, and adding that it's not as impactful as Audio Overviews were initially a year ago.
- He characterized it as more of a PowerPoint/Slide show generator and linked an example of a report to rebuild the Death Star generated by the feature.

Eleuther ▷ #general (24 messages🔥):

Math PhD student looking for ML research projects, Integrating AI/ML into DevOps and QA, AI peer review quality

Math PhD Student Seeks ML Research Collabs: A math PhD student in algebraic geometry with experience in neural manifold singularities and NLP is [seeking research opportunities](link to message) in NLP, particularly with LLMs.
- They unfortunately missed the deadline for the summer research program, but were encouraged to explore relevant channels and community projects for collaboration.
AI/ML Integration into DevOps and QA Quest: A member with DevOps and QA experience is [seeking guidance](link to message) on integrating AI/ML into those fields, including finding an endorsement for a related paper.
- They were directed to a specific channel for further assistance.
Doubts Cast on AI Peer Review Scalability: A member expressed surprise at a discussion with someone who believes AI peer review magically scales with the increasing number of submitted papers, despite [evidence to the contrary](link to message).
- Another member suggested peer review "would be magically scaling if people got stuff in their subfields to peer review, and not just given more papers to review".

Eleuther ▷ #research (29 messages🔥):

SAE Training on GPT OSS 20B, Pythia and PolyPythia Training Logs, The Alt Man's Theories on LLMs, UT Performance vs Transformer, Muon Optimizer vs AdamW Optimizer

SAE Training Commences on GPT OSS 20B: A member is currently training an SAE (Sparse Autoencoder) on GPT OSS 20B and inquired if others are doing the same.
WandB houses Pythia and PolyPythia Training Logs: A member asked if Pythia and PolyPythia training logs (loss curve, gradient norms, etc.) are open-sourced.
- Another member stated that the PolyPythia WandB should be linked from the GitHub repo and that some of the Pythia logs are isolated and linked there as well.
"The Alt Man's" Theories hold up: A member expressed strong agreement with the work of "The Alt Man", noting that his theories and predictions, especially regarding LLM capabilities like multi-hop reasoning and composition, have held up with new empirical evidence.
- The member added that "The Alt Man" shares the view that the DL community has merely fingers crossed scaled up large LMs, which are not packing as much capability as theoretically possible, suggesting that every LLM is undertrained w.r.t. the efficiency of the usage of its parameters.
UTs challenge Transformer Performance: A member asked about the parameter ratio at which a UT (Universal Transformer) can achieve the same performance as a standard Transformer.
- Another member stated that it depends on the task/architecture/data, but a rough rule of thumb is that each additional iteration yields only a logarithmic improvement compared to adding fresh parameters to a baseline transformer.
Muon Optimizer Mismatch with AdamW Surfaces: Researchers working on Kimi models have encountered a problem training LLMs with the Muon optimizer due to conflicting mismatch with the AdamW optimizer and are seeking insights and relevant research.
- A member stated that Muon is not great for fine-tuning, but another one said it's because Muon tends to have more aggressive updates because pretty much all of the singular values/vectors get updated at each step and that different optimizers 'like' different hyperparameters.

Eleuther ▷ #interpretability-general (1 messages):

Subliminal Learning

Subliminal Learning Follow-Up Surfaces: A member posted a follow-up to previous discussions on subliminal learning, sharing a link to a tweet by Alex Loftus.
- The tweet, from June 2024, discusses the subject of subliminal learning.
Another Subliminal Learning Update: Another user independently reported on Subliminal Learning, claiming it is now possible.
- Details to come.

Eleuther ▷ #gpt-neox-dev (2 messages):

Retry Later, Cool Thanks

Retry Later: A member stated they would retry something today or tomorrow.
Cool Thanks: A member responded with cool thanks to another member.

aider (Paul Gauthier) ▷ #general (23 messages🔥):

LLM Vibe Tests, Gemini 2.5 Pro, Tesslate's UIGEN T3 model, Qwen3 14B, Devstral-Small-2507

LLM Vibe Test is Fun: The LLM Vibe test shows how to explain this code with an LLM.
- The models Gemini 2.5 Pro, o3, and Sonnet 3.5 are good LLMs.
New Models Benchmarked!: Members eagerly await Qwen3-Coder and GLM-4.5 on the leaderboard for model benchmarks, constantly refreshing the page.
- Someone asked At what point will we see Qwen3-Coder and GLM-4.5 on the leaderboard?.
Horizon Beta: GPT5 Mini Sneak Peak?: The new model called Horizon beta might be the new GPT5-mini.
- It is not gpt-oss, which means that it is not open source.
DeepSeek R1-0528 Excels on Polyglot Bench: DeepSeek R1-0528 scores well on the polyglot bench, though it prematurely ends sessions in Open Hands.
- Aider uses LiteLLM like Open Hands, so members were wondering why DeepSeek had such an issue.
Opus 4.1 is coding daily driver: The new opus 4.1 is actually quite good in coding, such that a member now uses it as their daily driver.
- Another said it was such a satisfactory model that it could have its own benchmark of satisfaction.

aider (Paul Gauthier) ▷ #questions-and-tips (3 messages):

Guidelines Loading, Auto-Context Loading

Auto-Context Loading to the Rescue: A user inquired about automatically loading guidelines into projects to avoid forgetting, especially to prevent Claude from writing defensive programming tricks.
- One member suggested using the --read option for read-only files and listing read-write files directly in the command, like aider --read read.only.file alsothisfile.txt andthisfile.txt.
Configuration Creation Suggested for Persistent Guidelines: In response to a query about managing project guidelines, a member suggested creating a configuration for persistent loading.
- This implies setting up a configuration file to automatically include specific guidelines each time Aider is initiated.

MCP (Glama) ▷ #general (8 messages🔥):

MCP Server Frameworks, Server Sampling in MCP, Discord MCP Servers, FastMCP and Keycloak Integration, MCP Inspector and Cursor Authentication

MCP Framework is Minimal but Mighty: A member has written a minimal framework to create MCP servers and expressed appreciation for server sampling in MCP.
- The member noted that "FastMCP just makes it so easy to use".
Discord Needs MCP Server Support: The member is building an MCP server using FastMCP with Keycloak as the IdP and asks if it's possible to setup/manage a Discord server using MCP.
- They observed several Discord MCP servers listed on the MCP repo but suggested that "Discord should really build their own".
Remote AuthProvider Faces Authentication Issues: A member is experiencing issues with the RemoteAuthProvider feature using FastMCP and Keycloak, failing to reach the authentication screen from MCP Inspector or Cursor.
- They seek guidance on whether their understanding of the OAuth flow is correct: Add MCP server → OAuth flow begins → Redirect to Keycloak login screen.
Endpoint Mismatch Causes Authentication Failure: A member reported that MCP Inspector and Cursor are trying to access different endpoints (/.well-known/oauth-protected-resource/mcp and /mcp/.well-known/oauth-protected-resource), while the actual endpoint being served is /.well-known/oauth-protected-resource.
- This discrepancy is preventing them from reaching the authentication screen.
Security Concerns Surround MCP Sampling: A member raised a concern about the security implications of MCP sampling and suggests the protocol should contemplate it.
- They referenced a GitHub discussion highlighting potential security issues.

MCP (Glama) ▷ #showcase (2 messages):

MCP-Server Fuzzer, Property-Based Testing, Schema Validation

MCP-Server Fuzzer Built for Validation: A member is building an MCP-Server Fuzzer using the Hypothesis property-based testing library designed to validate MCP server implementations by generating randomized inputs from the official MCP schemas.
- This fuzzer detects mismatches, identifies crashes, and helps uncover vulnerabilities such as prompt injection or resource misuse.
Fuzzer Tested Against Anthropic's Server: The member tested the fuzzer against Anthropic’s server and found several exceptions caused by basic schema mutations.
- You can find the code and README here.

LlamaIndex ▷ #blog (4 messages):

Document Agents for Finance, LlamaCloud for Invoices, Claude Opus support, LlamaCloud Index tutorial

Finance Teams to Handle Messy Financial Documents with LlamaIndex's Document Agents: LlamaIndex is hosting a webinar next week to teach users to build document agents with LlamaCloud that work with complex financial documents.
- They will show users how to build automated invoice processing systems using LlamaIndex document agents and LlamaCloud to extract, validate, and process invoice data with minimal human intervention.
Claude Opus Arrives with Day-Zero Support: AnthropicAI just released Claude Opus 4.1 and LlamaIndex now has day-zero support.
- To install, run: pip install -U llama-index-llms-anthropic and check out the example notebook here.
LlamaCloud Index can build Enterprise AI apps: LlamaCloud Index lets users connect them to intelligent tool calling agents that can handle complex, multi-step queries to build enterprise AI applications.
- A tutorial by @seldo walks users through creating their first LlamaCloud Index using JP Morgan Chase banking documents at this link.

LlamaIndex ▷ #general (5 messages):

Graphiti Tutorials, Ollama LLMs for PDF Reading, LlamaIndex RAG Model from URL Issues, LlamaIndex OpenAI API Key Exhaustion

Graphiti-LlamaIndex Knowledge Graph Quest: A member inquired about tutorials on using Graphiti with LlamaIndex to create knowledge graph applications.
- No specific tutorials were immediately offered in the chat.
Ollama LLM PDF Precision Picks: A member asked for recommendations on the best LLM available on Ollama for precisely reading PDFs.
- The query specified a need for a precise LLM.
RAG Model URL Wrangling Woes: A participant in a hackathon reported issues when using LlamaIndex to extract content from URLs for a RAG model, despite documentation suggesting LlamaParse supports URLs.
- The model worked properly when given a PDF directly, but not when provided a URL.
OpenAI API Key Exhaustion Conundrum: The hackathon participant also encountered an OpenAI API key exhaustion error while using LlamaIndex, even after providing the API key in a .env file and loading it in their Python file.
- The error persisted despite attempts to correctly configure the API key.

DSPy ▷ #show-and-tell (2 messages):

SIMBA vs MIPROv2

SIMBA Claims Superiority Over MIPROv2: A member highlighted that SIMBA is more sample-efficient, higher performing and more stable compared to MIPROv2.
- In their internal eval, they compared them on an internal set of around 600 examples (500 test examples) for a hierarchical classification task with 3 categories and 26 classes in total (in German).
Internal Evaluation Set Details: The evaluation set consisted of approximately 600 examples, with 500 designated for testing in a hierarchical classification task.
- This task involved 3 categories and a total of 26 classes, all conducted in the German language.

DSPy ▷ #general (2 messages):

Stanford Program Synthesis, DS for Vim & Emacs macros

Seeking Stanford Synthesizers: A member inquired if there was anyone from Stanford interested in program synthesis, or who had taken a course on it.
- The user followed up by asking Who's building DS for complex vim & emacs macros?
Interest in Building DS for Vim & Emacs Macros: A member expressed interest in finding individuals building DS (presumably data structures) for complex Vim & Emacs macros.
- This suggests a focus on enhancing the capabilities and efficiency of text editors through advanced data structures.

Torchtune ▷ #papers (4 messages):

Public Server Sharing

Discord Link Sharing OK'd: A member asked if it was okay to share the Discord link in another public server.
- Another member confirmed it's public and encouraged sharing.
Discord is public: A member was happy to share the discord link.
- Another member agreed

LLM Agents (Berkeley MOOC) ▷ #mooc-questions (2 messages):

Ninja Tier, AgentX hackathon

Ninja Tier Qualification Impossible After Missing Deadline: A member inquired about qualifying for the Ninja tier after missing the Article submission link deadline for the AgentX hackathon.
- Unfortunately, another member responded that earning the certificate now is not possible.
AgentX Hackathon Submission Issues: A participant realized they didn't qualify for the Ninja tier in the AgentX hackathon due to a missed article submission.
- Despite completing the project and quizzes, the lack of the article link prevented qualification, and retroactive submissions were denied.

Cohere ▷ #🧵-general-thread (1 messages):

_bryse: Congrats on the GA of North!

Cohere ▷ #👋-introduce-yourself (1 messages):

Introductions, Community Welcome

New Members Join Cohere's Discord: Many new members have joined the Cohere Community Discord server and are introducing themselves.
- Members are sharing their Company/Industry/University, what they're working on, favorite tech/tools, and what they hope to gain from the community.
Welcoming New Community Members: The Cohere team has posted a stickied message to welcome new members to the Discord server.
- The message includes a template for introductions to help new members share information about themselves and their interests.