a quiet day.

AI News for 3/23/2026-3/24/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!


AI Twitter Recap


AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. New Model and Benchmark Launches

  • Mistral AI to release Voxtral TTS, a 3-billion-parameter text-to-speech model with open weights that the company says outperformed ElevenLabs Flash v2.5 in human preference tests. The model runs on about 3 GB of RAM, achieves 90-millisecond time-to-first-audio, supports nine languages. (Activity: 1306): Mistral AI has announced the release of Voxtral TTS, a 3-billion-parameter text-to-speech model with open weights, claiming it surpasses ElevenLabs Flash v2.5 in human preference tests. The model is designed to run efficiently on approximately 3 GB of RAM, achieving a 90-millisecond time-to-first-audio and supports nine languages. The open weights are available for free, as detailed in VentureBeat. Commenters express skepticism about Mistral’s past models but note significant improvement with Voxtral TTS, highlighting its impressive output quality. There is anticipation for the release of the model’s weights, with some users already testing it on the Mistral Console and reporting positive results.

    • The Voxtral TTS model by Mistral AI is a 3-billion-parameter model that reportedly outperforms ElevenLabs Flash v2.5 in human preference tests. It operates efficiently, requiring only about 3 GB of RAM and achieving a 90-millisecond time-to-first-audio, which is significant for real-time applications. The model supports nine languages, making it versatile for various linguistic needs.
    • A user expressed skepticism about Mistral’s previous models, noting that ‘Small 4 was turbo ass’ and ‘Large 3 was also incredibly disappointing.’ However, after testing Voxtral on the Mistral Console, the user was impressed with the output quality, indicating a significant improvement over past models. This suggests that Mistral has made substantial advancements in their TTS technology.
    • There is a comparison being drawn between Voxtral and other TTS models like Qwen-3 TTS and TADA. A user inquired about the latency and streaming capabilities of Qwen-3 TTS on VLM-omni, questioning if its low latency streaming claims are verified. This highlights the competitive landscape in TTS technology, where latency and streaming capabilities are critical performance metrics.
  • nvidia/gpt-oss-puzzle-88B · Hugging Face (Activity: 436): NVIDIA’s gpt-oss-puzzle-88B is a deployment-optimized large language model derived from OpenAI’s gpt-oss-120b, utilizing the Puzzle framework for post-training neural architecture search (NAS). This model is specifically optimized for NVIDIA H100-class hardware, achieving a 1.63× throughput improvement in long-context scenarios and 1.22× in short-context scenarios, while reducing parameters to 88B (approximately 73% of the parent model). It maintains or slightly exceeds the parent model’s accuracy across reasoning tasks. The architecture is a Mixture-of-Experts Decoder-only Transformer with a modified global/window attention pattern. One comment suggests that the gpt-oss-puzzle-88B may outperform the gpt-oss-120b, while another highlights that AMD should pursue similar optimization strategies, implying a competitive edge for NVIDIA in this domain.

    • A user expressed skepticism about NVIDIA’s models, noting that despite strong benchmark performances, they often find local models to be more versatile and effective. They describe NVIDIA’s models as ‘one trick ponies,’ suggesting that while they may excel in specific tasks, they lack the general applicability or adaptability of other models.

2. Intel GPU Launches

  • Intel will sell a cheap GPU with 32GB VRAM next week (Activity: 1723): Intel is set to release a new GPU with 32GB VRAM on March 31, priced at $949. The GPU offers a bandwidth of 608 GB/s and a power consumption of 290W, positioning it slightly below the NVIDIA 5070 in terms of bandwidth. This GPU is anticipated to be beneficial for local AI applications, particularly for models like Qwen 3.5 27B at 4-bit quantization. More details can be found in PCMag’s article. Commenters express skepticism about the price being considered ‘cheap’ at $989, while others compare it to the R9700 AI PRO, noting similar VRAM and bandwidth but with slightly higher power consumption. There is interest in how Intel’s offering will compete, particularly for AI and LLM applications.

    • Clayrone discusses their experience with the R9700 AI PRO, highlighting its 32GB VRAM and 640 GB/s bandwidth, which they find satisfactory for their needs. They mention using llama.cpp built for Vulkan, which operates well within a 300W power limit. They express interest in how Intel’s upcoming GPU will compare, suggesting it could be a direct competitor in terms of performance and efficiency.
    • KnownPride suggests that Intel’s decision to release a GPU with 32GB VRAM is strategic, as it caters to the growing demand for hardware capable of supporting large language models (LLMs). This indicates a market trend where consumers are increasingly interested in GPUs that can handle AI workloads efficiently.
    • qwen_next_gguf_when raises a question about the feasibility of producing GPUs with 96GB VRAM, hinting at potential technical challenges or market considerations that might limit such configurations. This reflects ongoing discussions in the tech community about balancing VRAM capacity with cost and performance.
  • Intel launches Arc Pro B70 and B65 with 32GB GDDR6 (Activity: 541): Intel has launched the Arc Pro B70 and B65 GPUs, featuring 32GB GDDR6 memory. The B70 is priced at $949 and offers 387 int8 TOPS with a memory bandwidth of 602 GB/s, compared to the NVIDIA RTX 4000 PRO’s 1290 int8 TOPS and 672 GB/s. The B70’s power draw is 290W, higher than the RTX 4000 PRO’s 180W. A 4-pack of B70s costs $4,000, offering 128GB of GPU memory, which is competitive against the RTX 4000 PRO’s price range of $6,400-$7,200. The collaboration with vLLM ensures day-one support for these GPUs, enhancing their performance potential. Commenters note that while the B70 offers more memory and efficiency, it has slower inference speeds compared to the RTX 3090 and lacks CUDA support. However, its price-per-GB makes it attractive for local inference of large models.

    • The Intel Arc Pro B70 and B65 GPUs have been integrated into the mainline vLLM, ensuring day-one support and solid performance. However, the B70’s performance lags behind the RTX 4000 PRO, with the B70 achieving 387 int8 TOPS compared to the RTX’s 1290. The B70 offers 32GB VRAM and 602 GB/s memory bandwidth, while the RTX 4000 PRO has 24GB VRAM and 672 GB/s bandwidth. The B70’s power draw is higher at 290W compared to the RTX’s 180W. Pricing for a 4-pack of B70s is $4,000, making it a competitive option for those needing 128GB of GPU memory.
    • The Arc Pro B70’s 32GB VRAM at $949 positions it as a cost-effective option for local inference, particularly for 70B models. Despite slower inference speeds compared to the RTX 3090 and lack of CUDA support, the B70 offers more memory and improved prompt processing efficiency, making it a viable alternative for specific use cases.
    • While the Arc Pro B70 offers tempting hardware specifications, users express frustration with Intel’s driver support. Comparatively, the B70 is similar to the AMD R9700 in class but is slightly slower and cheaper, with inferior software support, indicating that it doesn’t bring significant innovation to the market.

3. Innovative AI Techniques and Tools

  • RotorQuant: 10-19x faster alternative to TurboQuant via Clifford rotors (44x fewer params) (Activity: 480): RotorQuant introduces a novel approach to vector quantization using Clifford Algebra, achieving 10-19x speed improvements over TurboQuant with 44x fewer parameters. The method replaces the d×d random orthogonal matrix with Clifford rotors in Cl(3,0), reducing the computational load from 16,384 FMAs to approximately 100 by chunking vectors into 3D groups and applying a rotor sandwich product. Benchmarks show a cosine similarity of 0.990 compared to TurboQuant’s 0.991, with significant speed gains on both CUDA and Metal platforms. The trade-off involves higher synthetic MSE on random vectors, but real-model performance remains robust with QJL correction. GitHub Paper A key debate centers on the theoretical versus practical implications of RotorQuant. While it offers significant speed and parameter efficiency, it lacks TurboQuant’s global random rotation property, which optimizes scalar quantization by spreading energy across dimensions. This limitation affects low-bit quantization performance, particularly for worst-case vectors. However, RotorQuant’s practical utility in real-world KV cache distributions is acknowledged, suggesting a valuable speed/quality trade-off.

    • Juan_Valadez highlights a key theoretical difference between RotorQuant and TurboQuant, noting that TurboQuant’s global random rotation (Haar) spreads energy across all dimensions, making scalar quantization near-optimal. In contrast, RotorQuant only mixes within 3D blocks, which limits its ability to spread energy and affects low-bit quantization performance, particularly in worst-case vectors like one-hot vectors. Despite this, RotorQuant may still be effective in practical scenarios, such as KV cache distributions, where vectors are not adversarial.
    • Dany0 draws parallels between TurboQuant and techniques used in graphics programming, specifically referencing QuiP from 2023. They express skepticism about the novelty and effectiveness of TurboQuant, noting that while the math behind RotorQuant seems sound, the presentation and visualizations are less convincing. They liken the approach to using quaternions instead of Euler angles, suggesting that the efficiency comes from the fact that most multiplications result in zeros.
    • sean_hash comments on the unexpected application of Clifford algebras in quantization, noting that this cross-pollination from geometric algebra is surprising to those outside of graphics fields. This highlights the interdisciplinary nature of the innovation behind RotorQuant, which leverages mathematical concepts from one domain to optimize performance in another.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo, /r/aivideo

1. Claude Code Usage and Issues

  • Open Letter to the CEO and Executive Team of Anthropic (Activity: 1607): The open letter to Anthropic’s CEO and Executive Team highlights significant issues with the reliability and transparency of the Claude AI service, particularly concerning opaque usage limits and inadequate customer support. Users report that the advertised 1M context windows and MAX x20 usage plans do not align with actual performance, as tasks like analyzing a 100k document can deplete a premium account in minutes. The letter calls for transparency on dynamic throttling, functional context windows, and human support for paid tiers, emphasizing that the current service reliability is driving users to alternative local LLMs like Qwen and DeepSeek. The letter is a plea for improved service to prevent further erosion of professional trust in Claude. Commenters express disbelief at the severity of the token limit issues, with some not experiencing the same problems, suggesting variability in user experiences. The lack of human support for paying customers is a recurring point of contention.

  • A very serious thank you to Claude Code (Activity: 817): The post criticizes Claude Code for its restrictive usage limits, highlighting a scenario where a user hit a 5-hour usage limit after minimal interaction, specifically after asking two questions involving two files with ten lines changed. The user expresses frustration over the lack of responsiveness from the company regarding these limitations, contrasting it with Codex, which reportedly resets limits and offers a better user experience. The issue seems to be related to a project involving 5 Python files for database reformatting, where the usage limit was unexpectedly consumed by 55% from a single prompt with minimal output. Commenters express dissatisfaction with Claude Code’s customer service and usage limit policies, noting that Codex provides a more reliable alternative. One user mentions switching to Codex due to these issues, indicating a preference for its handling of usage limits and overall service.

    • Users are experiencing issues with Claude’s usage limits, with some reporting that limits are being reached unusually quickly. For example, msdost noted that after a 5-hour limit reset, a simple task using Opus 4.6 exhausted the limit in just 8 minutes, generating only 200-300 lines of test code. This suggests potential dynamic limit calculations based on resource availability, as indicated by the ongoing outage on Claude’s status page.
    • Codemonkeyzz and others express frustration over Claude’s handling of caching and usage limit calculations, noting a lack of communication or apology from the company. This contrasts with Codex, which reportedly resets limits more reliably. Users are considering alternatives like Codex due to these issues, as highlighted by chalogr, who finds Codex to be a viable substitute.
    • Opening-Cheetah467 reports a sudden change in usage patterns, hitting the 5-hour limit easily despite no changes in workflow. This aligns with other users’ experiences of increased throttling, possibly due to technical issues on Claude’s end, as they dynamically adjust limits based on available capacity.
  • In 13 minutes 100% usage , happened yesterday too! Evil I’m cancelling subscription (Activity: 1717): The image and post highlight a potential bug in a subscription service’s usage tracking system, where the user experiences an unexpected 100% usage notification within just 13 minutes of use. This issue has led to significant frustration, as the user has already spent an additional $30 and is considering canceling their subscription due to the perceived error. The image shows detailed usage statistics, including a high percentage of extra usage costs, suggesting a possible miscalculation or system error in tracking usage limits. Commenters express empathy and share similar experiences, with one noting that they are on a similar plan without issues, suggesting the problem might be isolated or regional. Another commenter expresses hope for alternative models to replace the current service, indicating dissatisfaction with the current provider.

    • ArWiLen reports hitting their daily limit after just three prompts using ‘sonnet 4.6 extended’, which they find absurd and led to canceling their subscription. This suggests potential issues with the model’s usage tracking or quota management, especially for users engaging in debugging tasks.
    • jadhavsaurabh shares a personal experience of unexpectedly high usage charges, mentioning a $34 overage and a quick hit to 100% usage upon reset. This highlights potential problems with the subscription model’s transparency and the effectiveness of customer support in addressing these issues.
    • TriggerHydrant notes a discrepancy in usage experiences, as they are on the ‘5Max’ plan in the EU and use Claude extensively without hitting limits. This suggests that the issue might be region-specific or related to specific account settings, indicating a need for further investigation into the service’s regional performance consistency.
  • Saying ‘hey’ cost me 22% of my usage limits (Activity: 1235): The Reddit post discusses a significant issue with Claude Code where revisiting open sessions after a period of inactivity results in a substantial increase in usage limits, reportedly up to 22% for a simple message. This is attributed to the system’s caching mechanism, where every message resends the entire conversation context, including system prompts and conversation history, to the API. The cache, which is cheaper to read from, expires after 5 minutes on Pro and 1 hour on Max plans, leading to expensive cache writes when sessions are resumed. Additionally, the usage tracking uses 5-hour rolling windows, causing context from previous sessions to be charged against new windows, exacerbating the issue. A GitHub issue highlights that workloads consuming 20-30% of usage previously are now taking 80-100%, with no official response from Anthropic yet. The recommended workaround is to start fresh sessions or use /clear and /compact commands to manage conversation history efficiently. Commenters note that this issue is widely discussed online but not officially acknowledged by Claude. Some users suggest that the problem worsens when Claude retries prompts during system issues, leading to excessive usage.

    • Fearless_Secret_5989 explains that Claude Code’s architecture involves resending the entire conversation context with each message, which includes system prompts, tool definitions, and conversation history. This can lead to high token usage, especially when session caches expire (5 minutes on Pro, 1 hour on Max plans), causing a full cache write that is 1.25x more expensive than regular input. A GitHub trace showed 92% of tokens in resumed sessions were cache reads, consuming 192K tokens per API call with minimal output.
    • Fearless_Secret_5989 also highlights a rate limit window boundary issue where Claude Code uses 5-hour rolling windows for usage tracking. Resuming a session in a new window can charge the accumulated context from the old session against the new window, leading to sudden high usage. Users have reported up to 60% usage consumed instantly due to this rollover, with some experiencing increased consumption since March 23rd, potentially due to a backend change or bug.
    • Fearless_Secret_5989 suggests practical solutions to mitigate high token usage, such as starting fresh sessions instead of resuming old ones, using /clear to switch tasks, or /compact to compress conversation history. The official documentation advises clearing stale context to avoid wasting tokens. Users can also use /cost or /stats to monitor token consumption and prevent exceeding usage limits.
  • WTAF? (Activity: 1906): A physician with extensive coding experience since the late 70s shares their positive experience using Claude, an AI coding assistant, to work on a project involving esp32 hardware and Slink bus commands for Sony jukeboxes. They highlight how Claude accelerates their workflow by iterating through complex code, allowing them to focus on functionality rather than low-level details. The user compares this technological leap to historical shifts in programming paradigms, such as moving from assembly to compiled languages and then to modern scripting languages. They emphasize the democratizing potential of AI in coding, enabling non-developers to create functional projects without deep technical expertise.

    • The discussion highlights the divide between the anti-AI and pro-AI communities. The anti-AI crowd often dismisses AI-generated work as meaningless, while the pro-AI crowd critiques the technical execution, such as improper linting and database architecture errors. This reflects a broader debate on the value and quality of AI-assisted creations, especially in personal projects where scalability and technical perfection may not be the primary goals.
    • A physician with a background in programming shares their experience of launching an app on the App Store after taking a year off. This underscores the potential of AI and coding agents to empower individuals to realize their projects, even those with extensive non-technical careers. The comment emphasizes the transformative impact of AI in enabling personal projects that might have been too complex or time-consuming otherwise.
    • The comment by ‘kurushimee’ points out that AI is particularly beneficial for hobby projects, which might otherwise be too tedious or require too much effort. This highlights AI’s role in democratizing access to technology, allowing individuals to pursue personal interests and projects without the traditional barriers of time and complexity.

2. Sora Shutdown and Implications

  • Sora shutdown is a good early example of what private AI companies will do when they achieve AGI (Activity: 1037): The post speculates that the shutdown of Sora, a private AI company, is indicative of a future where AI companies will prioritize achieving Artificial Superintelligence (ASI) over maintaining consumer services. The argument suggests that as companies approach AGI, they will redirect resources to accelerate ASI development, potentially leading to increased costs for consumers and higher hardware prices due to increased demand for compute resources. Commenters argue that Sora’s shutdown was primarily due to financial losses rather than strategic shifts towards ASI. They suggest that the technology, while advanced, was not yet viable for the general public, leading to significant financial losses for companies like OpenAI and Google.

    • CatalyticDragon points out that the shutdown of Sora was primarily due to financial reasons, emphasizing that the service was not profitable. This highlights a common challenge in AI ventures where cutting-edge technology does not always translate to immediate financial success.
    • solbob argues that Sora’s shutdown indicates the limitations of their state-of-the-art video generation technology, suggesting it was not practical for widespread use and resulted in significant financial losses. This reflects a broader issue in AI development where advanced capabilities may not meet market needs.
    • eddyg987 mentions that open-source models from China outperformed Sora, suggesting that competition from freely available alternatives can significantly impact proprietary AI services. This underscores the competitive pressure in the AI field where open-source solutions can rapidly advance and challenge commercial offerings.

3. Google TurboQuant and Gemini Updates

  • Google just dropped TurboQuant – 6x less memory, 8x faster inference, zero accuracy loss. Could this be the biggest efficiency boost for LLMs yet? (Activity: 98): Google Research has introduced a new compression algorithm called TurboQuant, which claims to reduce key-value cache memory by 6x and speed up inference by 8x without any accuracy loss. This is achieved through adaptive precision and entropy-aware grouping, targeting the KV cache that often constitutes 80-90% of inference memory, especially for long contexts. Although the research paper is not yet published, Google has reportedly deployed TurboQuant internally for some Gemini workloads. The potential impact includes significantly reduced inference costs, enabling 1M+ token contexts on consumer GPUs, and facilitating more AI applications on edge devices. Some commenters are skeptical, noting that the paper is allegedly 11 months old and that the improvements only affect the KV cache, which is a small part of the model (10%). There is also skepticism about the claim of zero accuracy loss, with some questioning the validity of the sources.

    • Bakanyanter points out that the TurboQuant paper is not new, being 11 months old, and highlights that its impact is limited to the kvcache, which constitutes only about 10% of the model. This suggests that the claimed efficiency improvements might not be as significant as suggested, especially since the kvcache is a relatively small component of the overall model architecture.
    • Old_Stretch_3045 mentions that TurboQuant is already deployed internally for some Gemini workloads, implying that Google has been testing and possibly refining this technology for some time. This internal deployment could indicate that the technology is mature enough for practical use, although the comment sarcastically suggests dissatisfaction with its performance.
    • Bakanyanter questions the claim of zero accuracy loss, indicating skepticism about the marketing claims. This highlights a common concern in AI model optimization where improvements in efficiency might come at the cost of model accuracy, and the need for clear evidence or benchmarks to support such claims.
  • Google Research: TurboQuant achieves 6x KV cache compression with zero accuracy loss (Activity: 93): Google Research has unveiled TurboQuant, a novel quantization technique that achieves 6x compression of key-value (KV) caches without any loss in accuracy. This advancement is particularly significant for large language models and vector search engines, as it optimizes high-dimensional vector storage, thereby enhancing retrieval speeds and reducing memory costs. The technique is expected to alleviate memory bottlenecks and improve efficiency in AI systems. More details can be found in the original article. Some users express hope that Google will implement TurboQuant in their systems soon, while others are considering integrating it into projects like llama.cpp due to its potential to address specific use cases.

    • The TurboQuant method achieves a 6x compression of the KV cache without any loss in accuracy, which is significant for optimizing memory usage in large-scale models. This could be particularly beneficial for models like llama.cpp, where memory efficiency is crucial for performance on limited hardware resources.
    • There is a discussion about the potential implementation of TurboQuant in existing systems, with some users expressing hope that Google will integrate this into their systems soon. The implication is that while the theoretical improvement is substantial, practical implementation and real-world performance gains are yet to be fully realized.
    • A user expressed interest in integrating TurboQuant into llama.cpp, highlighting its potential to address specific use cases that require efficient memory management. This suggests that TurboQuant’s compression capabilities could be particularly useful for developers working with models that need to run on constrained hardware.
  • Gemini 3.1 Flash Live is here! (Activity: 130): Gemini 3.1 Flash Live has been released, focusing on improvements in voice model performance. The update addresses previous issues such as ‘robotic sounding echo and reverberation,’ enhancing the overall audio quality. However, the release strategy has raised questions, as the voice model was deployed before the standard 3.1 Flash model, which some users find unusual. The previous live model was considered outdated, making this update a significant improvement. Some users express confusion over the deployment order, questioning why the voice model was prioritized over the standard model. Despite this, the update is generally seen as a positive step forward, addressing key audio quality issues.

    • TheMildEngineer notes the unusual deployment sequence of the Gemini 3.1 voice model before the standard 3.1 flash model, highlighting a potential strategic decision by the developers. They also observe that the update has resolved issues with ‘robotic sounding echo and reverberation,’ indicating an improvement in audio processing quality.
    • Zemanyak comments on the outdated nature of the previous live model, suggesting that the new release is a significant upgrade. However, they express a preference for the release of the full 3.1 Flash model, indicating that the current update may not fully meet user expectations for comprehensive improvements.
    • douggieball1312 mentions the global rollout of ‘Search Live in AI Mode/Google Lens’ alongside this release, noting its prior availability in the UK. This suggests a broader strategy to integrate AI capabilities across different regions, potentially enhancing user experience with more advanced search functionalities.
  • Gemini 2.5 Pro was so Goated, they had to bring it Back! 🙏 (Activity: 248): The image highlights the Google Gemini interface, specifically focusing on the ‘Deep Research with 2.5 Pro’ feature, suggesting its significance or popularity among users. This feature is part of the Gemini 3 suite, which includes capabilities like fast answers, solving complex problems, and advanced math and code with 3.1 Pro. The emphasis on bringing back the 2.5 Pro version indicates that it may have had unique or superior functionalities that users appreciated, prompting its reintroduction. One comment questions whether the ‘deep research’ capability in 2.5 Pro is superior to that in 3.1 Pro, indicating a potential debate about the effectiveness of different versions. Another comment expresses frustration with Google’s user interface, comparing it to OpenAI’s, suggesting a broader dissatisfaction with tech UI design.

    • Head_Map4196 raises a technical question about the comparative performance of Google’s Gemini 2.5 Pro versus 3.1 Pro, specifically in the context of ‘deep research’ capabilities. This suggests a focus on how these versions handle complex queries or data analysis tasks, though no specific benchmarks or performance metrics are provided in the comment.
    • hasanahmad speculates whether the reintroduction of Gemini 2.5 Pro indicates that versions 3 and 3.1 may have underperformed or not met user expectations. This implies a potential gap in performance or features between these versions, though no specific technical shortcomings are detailed.
    • ameeno1 notes a potential regional availability issue with Google AI Pro features, questioning if being in the UK affects access to Gemini 2.5 Pro. This highlights a common technical issue with software rollouts where features may be region-locked or subject to phased releases.

AI Discords

Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.