a quiet day.

AI News for 4/27/2026-4/28/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Inference Systems, vLLM 0.20, and the Hardware/Kernel Race Around DeepSeek V4

vLLM’s latest release is heavily about memory and MoE serving efficiency: vLLM v0.20.0 shipped with TurboQuant 2-bit KV cache for 4× KV capacity, FA4 re-enabled for MLA prefill on SM90+, a new vLLM IR foundation, fused RMSNorm for a reported 2.1% end-to-end latency improvement, plus support updates spanning DeepSeek V4 MegaMoE on Blackwell, Jetson Thor, ROCm, Intel XPU, and easier GB200/Grace-Blackwell setup. In parallel, SemiAnalysis highlighted early DeepSeek V4 Pro serving results on B200/B300/H200/GB200 disaggregated setups, claiming B300 can be up to 8× faster than H200 for this workload and pointing to upcoming vLLM 0.20 benchmarking with DeepGEMM MegaMoE, which fuses EP dispatch + EP combine + GEMMs + SwiGLU into a single mega-kernel.
The ecosystem is converging on fast day-0 support for new open models: vLLM added Day-0 support for Poolside’s Laguna XS.2, and separately for Ling-2.6-flash, while vLLM also published Day-0 support for NVIDIA’s Nemotron 3 Nano Omni. Outside vLLM, several posts focused on serving tradeoffs: Jeremy Howard noted DeepSeek V4’s support for prefill as a capability many providers have dropped, while Maharshi pointed out the overheads of dynamic activation quantization, arguing that static quantization often wins on inference speed despite calibration cost. There was also growing interest in alternate stack portability: teortaxesTex argued DeepSeek is structurally moving away from CUDA lock-in via TileKernels, suggesting model vendors may increasingly optimize for heterogeneous or domestic accelerator fleets rather than NVIDIA-only deployment.

Open Model Releases: Poolside Laguna XS.2, NVIDIA Nemotron 3 Nano Omni, and TRELLIS.2

Poolside made its first public model release with an unusually deployment-friendly open-weight coder: @poolsideai announced Laguna XS.2, a 33B total / 3B active MoE coding model trained fully in-house, released under Apache 2.0, and advertised as able to run on a single GPU. Poolside’s broader release also included Laguna M.1 and an agent harness, emphasizing that the company trained from scratch on its own data, training infra, RL, and inference stack. Community summaries added more color: Aymeric Roucher described two coder models—225B/23B active and 33B/3B active—with hybrid attention, FP8 KV cache, and claimed performance near Qwen-3.5; Ollama shipped it immediately.
NVIDIA’s Nemotron 3 Nano Omni was the day’s biggest infra-native model launch: @NVIDIAAI introduced Nemotron 3 Nano Omni, an open 30B / A3B multimodal MoE with 256K context built for agentic workloads spanning text, image, video, audio, and documents. Distribution was immediate across the stack: OpenRouter, LM Studio, Ollama, Unsloth, fal, Fireworks, DeepInfra, Together, Baseten, Canonical, and others all announced same-day availability. Key specs surfaced in follow-on posts: Piotr Żelasko described it as NVIDIA’s first omni release with speech/audio understanding backed by a Parakeet encoder, English-only for now, and a 5.95% WER on the Open ASR leaderboard. Several hosts cited ~9× throughput versus comparable open omni models.
Other notable model/paper releases: Microsoft’s TRELLIS.2 is an open-source 4B image-to-3D model producing up to 1536³ PBR textured assets, built on native 3D VAEs with 16× spatial compression. On the world-model side, World-R1 claims existing video models already encode 3D structure and can be “woken up” with RL, requiring no architecture changes, no extra video training data, and no added inference cost.

Agents, Local-First Tooling, and Production Orchestration

Agent builders are shifting from demos to production primitives: Mistral launched Workflows in public preview as an orchestration layer aimed at turning enterprise AI processes into durable, observable, fault-tolerant production systems. Related posts echoed the same theme: Sydney Runkle framed durable execution as a key requirement for long-running agents, and threepointone described work on subagents / agents-as-tools with persistence, streaming, and resumption.
Local/offline agents moved from aspiration to credible workflow: Teknium asserted “totally offline agents are possible”, while Niels Rogge demoed Pi + local models for desktop cleanup and Google Gemma shared a tutorial for local coding agents. Hugging Face’s local push also showed up in adoption numbers: Clement Delangue said 300,000 users have added hardware specs to the Hub to discover what can run locally. Complementing this, Ammaar open-sourced a vibe-coding app running Gemma 4 fully on-device with MLX, and Kimmonismus highlighted Sigma, a private browser-based local-agent concept using open models.
Hermes and adjacent agent harnesses are gaining real-world traction: multiple posts reported Hermes outperforming OpenClaw in instruction-following or practical workflows, including SecretArjun, somewheresy, and users deploying Hermes through Telegram or for medical literature extraction. On the research-agent side, Hugging Face’s ML Intern was trending among Spaces, and later gained native metric logging + Trackio integration to make its training jobs observable rather than black-box.

Benchmarks, Evals, and Research Findings Worth Watching

Model benchmarking remains fragmented, but a few signals stood out: Epoch reported GPT-5.5 Pro reaching 159 on the Epoch Capabilities Index and new highs on FrontierMath—52% on Tiers 1–3 and 40% on Tier 4—including two Tier 4 problems not previously solved by any model. Separately, Greg Kamradt said ARC-AGI-3 testing for GPT-5.5 and Opus 4.7 had completed, with failure modes now under analysis.
Several new benchmarks target more realistic agent and engineering behavior: Lysandre announced a benchmark for making Transformers more agent-friendly, and VibeBench proposed subjective testing by 1,000 qualified software engineers to measure how models actually feel in real work. On document intelligence, LlamaIndex’s ParseBench emphasized that OCR benchmarks miss semantic formatting such as strikethroughs and superscripts, which materially alter meaning for agents.
Research notes with concrete engineering implications: Rosinality flagged bugs in DeepSpeed and OpenRLHF that reduce SFT performance, with implications for prior studies. Arjun Kocher published a faithful implementation of Compressed Sparse Attention from the DeepSeek-V4 paper. che_shr_cat showed single-block transformers can solve Extreme Sudoku only with an explicit scratchpad and inverted routing init, otherwise performance is zero. On optimization, Keller Jordan released a lightweight Modded-NanoGPT optimizer benchmark designed to compare methods like Muon and AdamW on a reproducible speedrun-style task.

Platform Economics, API Pricing, and Closed-Model Reliability Concerns

Open-model economics are becoming a real forcing function: Aidan Gomez argued private deployments matter because controlling the model means controlling cost, and Vtrivedy made the case that many Haiku/Flash workloads should be re-evaluated against open models, citing large price gaps and improving quality from families like DeepSeek, Minimax, GLM, and Nemotron. DeepSeek itself amplified that narrative with aggressive V4 Pro pricing cuts and cache discounts, later extended through end of May.
Closed-model dependence is being framed as an operational risk, not just a preference issue: Gergely Orosz summarized Anthropic’s recent silent changes and customer-impacting behavior as evidence that closed models are “massive risks,” while Zach Mueller documented regressions in Claude 4.7 for his coding workflow and ultimately switched away. Tokenization economics also came under scrutiny: Aran Komatsuzaki quantified a strong non-English token tax, especially for Anthropic, later extending the comparison across more model-language pairs and finding that Gemini and Qwen were among the least punitive for non-English text.
Top tweets (by engagement, filtered for tech relevance)
- Codex usage expansion: OpenAI’s team temporarily reset Codex rate limits for all paid plans to spur more GPT-5.5 building.
- Claude outage / concentration risk: Yuchen Jin’s joke about Claude Code being down and “the whole Silicon Valley” reacting captured how central coding agents have become to daily workflows.
- OpenAI on AI-assisted mathematics: OpenAI promoted a podcast on GPT-5.4 Pro helping solve a 60-year Erdős problem, a notable example of frontier models’ growing role in formal research.
- GPT-5.5 adoption signal: Sam Altman noted strong enthusiasm for 5.5, while Epoch’s ECI post supplied the harder benchmark signal behind that sentiment.

AI Governance and Defense: Google’s Pentagon Deal Draws Sharp Internal Backlash

The most contentious policy story was Google’s classified Pentagon AI deal: Kimmonismus summarized reporting that Google signed an agreement allowing use of its AI for classified work and “any lawful government purpose”, with contract language reportedly enabling the government to request modifications to safety filters while offering only non-binding “not intended for” limits on surveillance or autonomous weapons. This drew unusually public criticism from inside Google/DeepMind, including BlackHC calling it “shameful” and saying there had been no internal announcement or discussion beforehand.
The response matters because it sharpens distinctions between frontier labs’ red lines: S. Ó hÉigeartaigh argued Google DeepMind should be scrutinized by the same standards applied to OpenAI, while TurnTrout said Google’s terms were weaker than OpenAI’s fig-leaf restrictions. The story also reinforced Anthropic’s contrasting posture in public debate, since earlier reporting suggested its refusal to drop certain red lines had created procurement friction. For engineers, the practical takeaway is less about politics than platform governance: safety policy, deployment control, and contract language are increasingly part of the product surface for frontier AI providers.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 Model Benchmarks and Performance

Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation (Activity: 731): The image provides a benchmark comparison of the Qwen 3.6 27B model across three quantization variants: BF16, Q4_K_M, and Q8_0 GGUF, evaluated using llama-cpp-python and Neo AI Engineer. The benchmarks include HumanEval for code generation, HellaSwag for commonsense reasoning, and BFCL for function calling. The Q4_K_M variant stands out for its practical benefits, offering 1.45x faster throughput than BF16, using 48% less peak RAM, and having a 68.8% smaller model size, while maintaining nearly identical function calling scores. Despite a slight drop in HumanEval accuracy, Q4_K_M is recommended for local/CPU deployment unless maximum quality is required, in which case BF16 is preferred. Commenters appreciate the detailed comparison across quantization variants, though some express concern about the lack of error bars and potential sampling errors, particularly regarding the Q8_0 model’s performance. There is interest in extending these evaluations to other models or sizes, and a request for the full code used, as some suspect potential issues with the Q8_0 results, such as possible quantization of the KV cache.
- audioen raises concerns about the lack of error bars in the evaluation of Qwen 3.6 27B BF16, Q4_K_M, and Q8_0 GGUF models. They suggest that the unexpected ordering of Q4_K_M outperforming Q8_0 could be due to sampling error, highlighting the importance of statistical rigor in benchmarking processes.
- spaceman_ and Look_0ver_There express skepticism about the Q8_0 model’s performance, suspecting that the quantization of the KV cache might have affected the results. spaceman_ requests the full code used for the evaluation to verify if the KV cache was quantized, as this could explain the unexpected performance drop.
- One_Key_8127 points out discrepancies in the reported HumanEval scores for Qwen 3.6 27B, noting that it should be scoring significantly higher based on comparisons with other models like Gemma 3 4B and Llama3-8b. They reference external benchmarks to support their claim that the current results might be inaccurate.
Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090 (Activity: 982): Luce DFlash is a new implementation of speculative decoding for the Qwen3.6-27B model, optimized to run on a single RTX 3090 GPU using a standalone C++/CUDA stack built on top of ggml. This setup achieves up to 1.98x throughput compared to autoregressive decoding across benchmarks like HumanEval, GSM8K, and Math500, without requiring retraining. The system uses advanced techniques such as DDTree tree-verify speculative decoding, KV cache compression, and sliding-window flash attention to optimize performance and memory usage, allowing for efficient processing of large contexts up to 256K tokens. Commenters appreciate the innovation in local AI inference, noting the potential for significant speed improvements. However, there is concern about the impact of quantization on accuracy, particularly for use cases involving coding or tool calling, where precision is critical.
- drrck82 highlights the potential of using dual RTX 3090 GPUs for running Qwen3.6-27B, noting the appeal of achieving up to 2x throughput compared to their current setup with Q6_K_XL. This suggests significant performance improvements in local AI inference, particularly for users with high-end hardware setups.
- Tiny_Arugula_5648 raises concerns about the impact of quantization on model accuracy, emphasizing that while increased throughput is attractive, it may not be suitable for all use cases. They caution that heavy quantization can lead to significant accuracy loss, especially in tasks like coding or tool calling, where precision is critical.
- Deep90 expresses a need for centralized benchmarking resources to navigate the growing number of AI model options. This reflects a broader community challenge in evaluating and selecting models based on performance metrics, which are crucial for informed decision-making in AI deployments.
To 16GB VRAM users, plug in your old GPU (Activity: 797): The post discusses leveraging an old GPU with at least 6GB VRAM alongside a primary 16GB VRAM GPU to run dense models like the Qwen3.6-27B using llama-server. The setup involves using a 5070Ti and a 2060 to achieve a combined 22GB VRAM, approaching the performance of a 24GB class card. The configuration includes settings like dev=Vulkan1,Vulkan2 to enable both GPUs, no-mmap to keep the model off RAM, and n-gpu-layers=999 to maximize GPU offloading. Benchmarks show significant speed improvements, with 186 tokens/s for prompt processing and 19 tokens/s for generation at 128k max context, compared to 4 tokens/s on a single card. Commenters debate the use of Vulkan over CUDA, with some suggesting CUDA for better performance. Others note that while additional VRAM from a secondary GPU can improve performance, it may bottleneck the primary GPU, as seen with a 3090 Ti and 2070 setup.
- Mysterious_Role_8852 discusses the performance bottleneck when using a 3090 Ti and a 2070 together. They note that the 2070 significantly bottlenecks the 3090 Ti, resulting in a decrease from 30t/s to 20t/s when splitting tasks between the GPUs. This highlights the importance of matching GPU capabilities to avoid performance degradation, especially when handling large models like Qwen 3.6 27b Q6 Quant.
- mac1e2 provides a detailed account of running Qwen3.6-35B-A3B on a constrained system with a GTX 1650 4GB and 62GB RAM. They emphasize the importance of understanding hardware limitations and optimizing configurations, such as using --cpu-moe, --mlock, and specific cache settings, to achieve around 20-21 tok/s. This showcases how disciplined resource management can still yield effective results on older hardware.
- jacek2023 mentions using a 3060 as an additional GPU alongside three 3090s, but only for the largest models. This suggests a strategic approach to leveraging available hardware resources, where additional GPUs are utilized selectively to maximize performance for demanding tasks, rather than uniformly across all workloads.

2. New Model and Tool Announcements

Something from Mistral (Vibe) tomorrow (Activity: 312): The image is a social media post from “Mistral Vibe” teasing a significant announcement scheduled for the following day. The post has garnered moderate engagement, suggesting anticipation or interest in the announcement. The comments speculate on potential developments, such as a new model release or a tool upgrade, with some users expressing hope for improvements to match industry standards like Qwen 3.6 27B. There is also speculation about potential military contracts, which could impact the company’s focus on state-of-the-art (SOTA) advancements. Commenters express skepticism about Mistral’s current offerings, with one user describing the current model as “meh” and hoping for improvements. Another comment suggests that military contracts might delay advancements in state-of-the-art technology.
- LegacyRemaster mentions a benchmark score, noting ‘Devstral SWE Bench 81.00+’, which suggests a high performance level for the model in question. This indicates that the model might be competitive in specific technical benchmarks, potentially aligning with industry standards.
- new__vision clarifies that the announcement might not be about a new model but rather related to the Mistral Vibe X account, which is distinct from Mistral AI X. They suggest that Vibe is a ‘coding agent’ that integrates well with local models, hinting at a possible announcement related to a ‘coding harness’ rather than a new model release.
- AvidCyclist250 speculates about the possibility of another military contract, which could delay state-of-the-art (SOTA) developments. This implies that resource allocation towards military projects might impact the timeline for releasing cutting-edge models.
Deepseek Vision Coming (Activity: 318): Deepseek Vision is anticipated to be released soon, as indicated by a post from Xiaokang Chen on 𝕏. The infrastructure for Deepseek Vision is largely in place, with base models already developed, suggesting that the integration of multimodality will follow the pretraining phase. This implies a potentially short interval between the preview of Deepseek V4 and its full release, given that Deepseek V4 was deployed approximately 2-3 weeks ago. Commenters express a preference for a unified model, such as a V4.1 with native multimodality, rather than separate vision-specific models, emphasizing the importance of integrated multimodal capabilities.
- Few_Painter_5588 discusses the infrastructure readiness for Deepseek Vision, noting that the base models are already in place, which simplifies the integration of multimodality. They suggest that the transition from Deepseek V4-preview to the full V4 version might be swift, given that V4 was deployed 2-3 weeks ago, indicating a potentially short development cycle for the vision capabilities.
- dampflokfreund expresses a preference for a unified model approach, hoping for a V4.1 release that includes native multimodality rather than separate vision-specific models. This reflects a broader desire for integrated solutions that seamlessly handle multiple data types, emphasizing the importance of native multimodality in modern AI systems.
Microsoft Presents “TRELLIS.2”: An Open-Source, 4b-Parameter, Image-To-3D Model Producing Up To 1536³ PBR Textured Assets, Built On Native 3D VAES With 16× Spatial Compression, Delivering Efficient, Scalable, High-Fidelity Asset Generation. (Activity: 786): Microsoft has introduced “TRELLIS.2”, a cutting-edge 4-billion parameter model for generating high-fidelity 3D assets from images. This model utilizes a novel “field-free” sparse voxel structure called O-Voxel, enabling the reconstruction of complex 3D topologies with sharp features and full PBR materials. It achieves efficient and scalable asset generation with 16× spatial compression, producing assets up to 1536³ in resolution. The model is open-source, with resources available on GitHub and a live demo on Hugging Face. Some users noted that TRELLIS.2 was released months ago and had been previously discussed, suggesting that the announcement might not be new to everyone. However, it appears to be news to a significant portion of the community, justifying the continued interest.
- The model, TRELLIS.2, was released four months ago, but it seems to be news to many in the community, indicating a lack of widespread awareness or coverage at the time of its initial release.
- A user attempted to run TRELLIS.2 on an AMD 7800XT GPU using ROCm, but encountered segmentation faults. The model has primarily been tested on NVIDIA GPUs with 24GB VRAM, suggesting potential compatibility issues with AMD hardware and ROCm dependencies.
- There was a recent pull request approved to add ROCm support, but users are experiencing difficulties due to dependency issues and the need for gated repository access. Despite these challenges, the model can process images and begin asset creation, indicating partial functionality.

3. Local LLM Usage and Challenges

I’m done with using local LLMs for coding (Activity: 1981): The Reddit post discusses the author’s dissatisfaction with local LLMs like Qwen 27B and Gemma 4 31B for coding tasks, particularly in comparison to Claude Code used at work. The main issues highlighted include poor decision-making and tool-calling, especially in tasks like Dockerization, where the models fail to follow logical steps or handle long-running processes effectively. The author also notes performance issues, such as slow response times and broken prompt caches, which hinder productivity. Despite attempts to guide the models with detailed instructions, the local LLMs did not meet expectations, leading the author to consider using cloud-based models like OpenRouter for more demanding tasks, while reserving local models for simpler automation and language tasks. Commenters suggest that the choice of harness significantly impacts performance, with some harnesses like Hermes potentially offering better handling of long-running processes. There is also a debate about the unrealistic expectations set by some community posts, which may exaggerate the ease of achieving successful outcomes with local LLMs.
- A key technical point raised is the importance of optimizing settings for local models like Claude Code to improve performance. A user shared a link to Unsloth’s documentation that details how to address issues like slow inference and ineffective caching, which are common frustrations when using local LLMs for coding tasks.
- Another insightful discussion revolves around the significance of the harness used with local models. A commenter noted that different harnesses can lead to vastly different outcomes even with the same model, emphasizing that the choice of harness is crucial. They mentioned that some harnesses, like the Hermes agent, have specific strengths and weaknesses, such as handling long-running processes and using log file outputs effectively, which can impact the perceived performance of local models.
- The debate also touches on the cost-effectiveness of local models versus centralized providers. While local models on consumer GPUs may not match the performance of models like Claude, they can offer cost savings. For instance, the Kimi K2.6 model is highlighted as a cost-effective alternative to Claude Opus, providing similar performance at a lower API cost. This suggests that while performance may lag, local models can still be financially viable for certain use cases.
Duality of r/LocalLLaMA (Activity: 575): The image highlights the contrasting opinions within the r/LocalLLaMA community regarding the use of local Large Language Models (LLMs) for coding. One post expresses frustration and dissatisfaction after weeks of trying local LLMs, while another post is optimistic, suggesting that local models have become viable for real work, as evidenced by testing on Terminal-Bench 2.0. This duality reflects the varied experiences and expectations users have when deploying local LLMs, often influenced by the model’s size and the user’s ability to optimize workflows. Commenters discuss the unrealistic expectations some users have when comparing local LLMs to large-scale models running on expensive hardware. They emphasize the importance of understanding the limitations of smaller models, like the 27 billion parameter Qwen 3.6, and the need for efficient workflow architecture to maximize their potential.
- Memexp-over9000 discusses the limitations and potential of using the Qwen 3.6 27B model, emphasizing that while it cannot compete with trillion-parameter models, it can produce comparable outputs if workflows are efficiently architected. The comment highlights the importance of using AI for ‘grunt work’ rather than creative tasks, suggesting that understanding and optimizing the model’s capabilities is crucial for effective use.
- FoxiPanda provides a detailed analysis of the variables affecting local model performance, such as harness configuration, system prompts, and model quantization. They note that different models require specific ‘glue’ in system prompts to address their unique issues, and that quantization levels (e.g., IQ2 vs. Q8) can significantly impact user experience. The comment underscores the importance of well-structured prompts and planning in achieving optimal results with local models.
- Scared-Tip7914 highlights the impact of model quantization on performance, noting that users often fail to specify the quantization level (e.g., Q2 vs. Q8) when discussing model capabilities. They share their experience with Qwen 3.5-35B, Q4, emphasizing its role as a complement to larger proprietary models for token-efficient execution rather than as a standalone solution. The comment suggests a strategic approach: planning with large models and executing with local ones.
A warning to newbies - A lesson on network security (Activity: 355): The post highlights a significant network security issue where 373 devices are publicly exposing LM Studio instances without requiring an API key, making them vulnerable to unauthorized access. The image shows a world map with countries shaded in red, indicating the number of exposed devices, with Thailand having the highest count at 194. The author emphasizes the importance of securing LLM platforms by not exposing them to the internet without proper security measures like using Tailscale or reverse proxies with authentication. One commenter appreciates the ethical hacking approach, while another notes the ability to remotely execute prompts on exposed devices, highlighting the potential risks. A third comment sarcastically suggests exploiting these unsecured devices for computational resources.
- DatMemeKing highlights a critical security vulnerability where they were able to remotely execute prompts on devices, suggesting a potential flaw in network configurations or software that allows unauthorized access. This underscores the importance of securing network ports and ensuring that remote execution capabilities are tightly controlled and monitored.
- AdultContemporaneous raises a question about whether the security issue pertains to running local LLMs on a computer with internet access or if it specifically affects those who attempt to remotely access their locally-hosted LLMs. They mention using an IDS/IPS with geo IP blocking, indicating a layered security approach to protect against unauthorized access.
- Illeazar clarifies that the security risk is primarily for users who have chosen to publicly forward ports on their router, which can expose their systems to external threats. This emphasizes the need for careful network configuration and the risks associated with exposing local services to the internet without proper security measures.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo, /r/aivideo

1. Claude and Opus Model Pricing and Access Issues

Anthropic just quietly locked Opus behind a paywall-within-a-paywall for Pro users in Claude Code (Activity: 1053): Anthropic has introduced a new pricing structure for its Claude Code users, requiring an additional payment to access the Opus models even for those on the Pro plan. This change was quietly noted in their support documentation, indicating that while the Pro plan costs $20/month, accessing the flagship Opus models requires further purchase. The default model available is Sonnet 4.5, and while Opus 4.5 is listed, it is locked behind this additional paywall. This move suggests a shift towards a metered model, raising concerns about transparency and cost implications for users. Some users express frustration over the lack of transparency and the additional costs, with comments suggesting dissatisfaction and anticipation for alternative models like Qwen 4 27b. Others note that they are already using Opus 4.7, indicating possible inconsistencies in the rollout or a limited test phase.
- ClaudeOfficial clarified that the information about Opus being locked behind a paywall is outdated. They mentioned that Opus 4.5 was rolled out for Pro plans in January, and the support article was never updated. They provided a link to the Wayback Machine to verify this information, indicating a lapse in documentation updates.
GitHub Copilot 9x price increase for Claude models (Activity: 803): GitHub Copilot is implementing a 900% price increase for Claude models starting in June, transitioning from fixed plans to usage-based billing. This change is detailed in GitHub’s documentation and their press release. The increase is part of a shift to API-based billing, which may significantly impact enterprise customers relying on Claude agents for production, as it could drastically affect unit economics due to increased inference costs. Commenters express concern over the lack of visibility into agent operations and token usage, which could exacerbate the financial impact of the price increase. The shift is seen as a strategic move by Anthropic to leverage their position in the market.
- Emerald-Bedrock44 highlights a critical issue with the 9x price increase for Claude models, emphasizing that such a drastic change can severely impact unit economics for companies using these models in production. The lack of visibility into token usage exacerbates the problem, as teams may not fully understand or control their inference costs, leading to potential financial strain.
- CricktyDickty notes that the shift from fixed, subsidized plans to API-based pricing represents a strategic move by Anthropic. This transition could be seen as a way for Anthropic to assert its market position, potentially increasing revenue but also placing a heavier financial burden on enterprise customers who were previously benefiting from lower, fixed costs.
- dotheemptyhouse points out that the price increase is not isolated to Claude models, mentioning a 6x increase in some competing models as well. This suggests a broader trend of rising costs across AI model providers, which could indicate a shift in the industry’s pricing strategies or a response to increased demand and operational costs.
Anthropic just quietly locked Opus behind a paywall-within-a-paywall for Pro users in Claude Code (Activity: 653): The image highlights a controversial change by Anthropic where the Opus model, part of the Claude Code suite, is now locked behind an additional paywall for Pro users. This means users who already pay $20/month for the Pro subscription must pay extra to access the Opus model, despite it being marketed as part of the Pro package. The default model available is Sonnet 4.5, and while Opus 4.5 is listed, it requires an additional purchase. This change was not widely announced, only noted in a support article, leading to user frustration over transparency and cost implications. One comment suggests the support article is outdated and references a rollout of Opus 4.5 in January, indicating a lack of updated communication from Anthropic. Another comment criticizes the Opus model for its high token usage, which quickly depletes user quotas, suggesting the additional cost may not be justified.
- ClaudeOfficial clarified that the support article was outdated and that Opus 4.5 had been included in Pro plans since January, as evidenced by the Wayback Machine. This suggests a communication lapse rather than a deliberate paywall change.
- Faangdevmanager highlighted a significant issue with Opus’s token consumption, noting that it uses a lot of tokens, which are expensive. This has been a pain point for users who found their quotas depleted quickly, indicating a need for more efficient token usage or clearer communication about costs.
- Academic-Proof3700 expressed frustration over several issues with the Pro subscription, including quality degradation, bugs, and the introduction of a new model that consumes more tokens without delivering proportional value. This reflects broader dissatisfaction with the service’s cost-effectiveness and transparency.

2. GPT 5.4 and 5.5 Performance and Benchmarks

Differences Between GPT 5.4 and GPT 5.5 on MineBench (Activity: 465): The post discusses the benchmarking of GPT 5.4 and GPT 5.5 using the MineBench framework, highlighting that GPT 5.5 shows marginal gains over GPT 5.4. The benchmarks suggest that GPT 5.5 achieves similar output quality with reduced computational resources, aligning with OpenAI’s claims of efficiency improvements. The cost for running GPT 5.5 was $19.98 with an average inference time of 624 seconds, compared to GPT 5.4’s approximate $25 cost. The differences between the Pro and standard models in the 5.5 family are minimal, indicating similar output quality. The benchmark involves creating 3D structures from block palettes, with GPT 5.5 demonstrating more detailed and intricate designs. Commenters noted the impressive detail in GPT 5.5’s outputs, such as modeling reflections on an astronaut’s visor, though some builds appeared noisier with random colored blocks. Overall, GPT 5.5’s designs were considered slightly better.
- WithoutReason1729 highlights a significant improvement in GPT 5.5’s visual modeling capabilities, noting that it can accurately model complex reflections, such as the Earth on an astronaut’s visor. This suggests enhanced rendering and spatial reasoning capabilities in the newer version.
- Kamimashita observes that GPT 5.5’s builds appear noisier with random colored blocks, yet the overall design quality is improved. This indicates a trade-off between detail and noise, suggesting that GPT 5.5 might be experimenting with more complex design patterns.
- FateOfMuffins discusses a notable 270 ELO increase from GPT 5.4 to 5.5, and an additional 220 ELO jump to 5.5 Pro. This substantial improvement in ELO ratings reflects enhanced performance and possibly more sophisticated algorithms in the newer versions, though it also raises questions about benchmark saturation and the need for increased difficulty.
GPT 5.5 is unbelievably wasteful with tokens (Activity: 14): The post discusses the high token consumption and associated costs of using GPT 5.5, particularly outside of its Codex application, with a single request reportedly costing $5. This highlights concerns about the model’s efficiency and cost-effectiveness in non-coding contexts. A comment suggests that the cost of using models like GPT 5.5 or Claude Opus 4.7:1m xhigh should be considered relative to the value they provide, implying that high costs might be justified by the benefits in certain applications.

3. ChatGPT Solving Mathematical Problems

Chat GPT 5.4 solved a 60+ years unsolved erdos problems in a single shot (Activity: 2265): The image depicts a mathematical proof related to an unsolved Erdős problem, highlighting inequalities involving sums over primitive sets. The claim is that Chat GPT 5.4 solved this problem in 80 minutes and 17 seconds, suggesting a significant advancement in AI’s capability to reason through complex mathematical problems. The proof involves constants and logarithmic expressions, indicating a sophisticated level of mathematical reasoning typically expected at a PhD level. This development challenges the notion that LLMs merely predict the next token without true reasoning capabilities. While the achievement is impressive, some commenters argue that the claim of reasoning better than 50 years of mathematicians is an overstatement. They acknowledge the potential of LLMs as powerful tools for mathematicians but note that these models still have limitations and cannot independently develop novel ideas.
- enilea points out that while the Erdos problems are numerous and many remain unsolved due to lack of attention, claiming that an LLM ‘reasoned better than 50 years of mathematicians’ is an overstatement. They acknowledge that LLMs are becoming powerful tools for mathematicians but emphasize that these models still have limitations and cannot independently develop novel ideas yet.
ChatGPT 5.4 Solved a 64-Year-Old Math Problem (Activity: 13896): The image depicts a mathematical proof related to an Erdős problem, specifically focusing on primitive sets and logarithmic inequalities. The post claims that ChatGPT 5.4 Pro was used by a 23-year-old to solve this 64-year-old problem in about 1 hour 20 minutes. The solution reportedly applied a known formula in a novel way to this problem, which had not been done before. The problem in question is actually Erdős 1196, not 1176, and the proof has been verified as legitimate, with notable mathematician Tao commenting on it. The AI’s success is attributed to the user guiding it with a different approach than previous partial solutions attempted by experts. Commenters highlight the significance of this achievement, noting that the AI’s solution is both short and elegant. They emphasize the importance of asking the right questions, as the user did not follow the same partial solutions as experts but instead used a familiar approach that led to the breakthrough.
- EmergencyFun9106 highlights that the problem solved is Erdos 1196, not 1176, and confirms the legitimacy of the proof, referencing a comment by Terence Tao here. The significance lies in the problem’s history of partial solutions, with the AI’s contribution being notably concise and elegant.
- yubario explains the AI’s success by contrasting it with previous attempts that followed partial solutions from experts, which led to dead ends. The breakthrough came from guiding the AI with a different approach, emphasizing the importance of asking the right questions to unlock solutions.
- MannOfSandd anticipates a vigorous response from the academic community, indicating the potential impact and scrutiny the AI’s solution will face from faculty and researchers.

AI Discords

Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.

Apr 28
not much happened today

Companies

Models

Topics

People