a quiet day.
AI News for 4/22/2026-4/23/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!
AI Twitter Recap
Top Story: GPT-5.5 launch
What happened
OpenAI launched GPT-5.5 as its new flagship frontier model for “real work and powering agents,” rolling it out immediately in ChatGPT and Codex, while delaying API access pending additional safety requirements (OpenAI, OpenAI rollout, OpenAIDevs, API delayed). OpenAI positioned the model as a step toward lower-micromanagement agentic work: stronger coding, computer use, knowledge work, scientific research, and longer multi-step execution with tool use and self-checking (OpenAI, gdb, snsf). Pricing landed at $5/$30 per million input/output tokens for GPT-5.5 and $30/$180 for GPT-5.5 Pro (scaling01 pricing, sama pricing). The model was described by OpenAI and multiple early testers as notably more token-efficient than GPT-5.4, often using materially fewer output tokens while keeping similar per-token speed (sama, OpenAIDevs, reach_vb, GitHub VP claim relayed by scaling01). OpenAI also bundled significant Codex product upgrades around the launch—browser control, file/docs/PDF handling, Sheets & Slides, auto-review mode, OS-wide dictation, and broader computer-use workflows (ajambrosino, OpenAIDevs browser use, thsottiaux, sama “bundle”). Independent and semi-independent reactions were mixed but broadly positive: many users called it a step change in coding and long-horizon work, while others argued the headline benchmark gains looked incremental, the price doubled vs GPT-5.4, hallucination remains high on at least one third-party eval, and Anthropic’s Mythos or Opus variants still lead or tie on some tasks depending on benchmark selection (Artificial Analysis, theo, scaling01 critique, Perspective vs Mythos, scaling01 Mythos lead take).
Release details
- Product availability
- Rolling out today to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex (OpenAI rollout).
- GPT-5.5 Pro available to Pro, Business, Enterprise users in ChatGPT (OpenAI rollout).
- API access not same-day; OpenAI says it is “coming soon” and delayed due to higher safety requirements / robust safeguards (OpenAIDevs, scaling01, jeffintime).
- Third-party ecosystem support appeared quickly, e.g. Hermes Agent support via ChatGPT/Codex OAuth (Teknium).
- Pricing
- GPT-5.5: $5 input / $30 output per 1M tokens (scaling01 pricing, sama pricing).
- GPT-5.5 Pro: $30 / $180 per 1M tokens (scaling01 pricing).
- This is widely noted as 2x GPT-5.4 pricing at the per-token level (scaling01), though OpenAI and several testers argue effective task cost is moderated by token efficiency (sama, OpenAIDevs); a back-of-the-envelope cost sketch follows at the end of this Release details section.
- Context
- Publicly cited as 1M context in API and 400K context in Codex by Swyx summarizing launch materials (swyx).
- Sam Altman separately referenced 1M context window alongside API pricing (sama pricing/context).
- Infrastructure / serving
- OpenAI-linked commentary says GPT-5.5 is the first generation co-designed with Nvidia GB200 and GB300 NVL72 systems (scaling01, swyx).
- Jonathan Ross also highlighted GB200 NVL72 training from early access observations (JonathanRoss321).
- OpenAI says Codex + GPT-5.5 helped optimize the serving stack, increasing token generation speed by 20%+ (reach_vb, sama inference team praise).
- Sam Altman said per-token speed matches GPT-5.4 while using fewer tokens per task (sama).
- Codex app changes at launch
- New features: browser control, Sheets & Slides, Docs & PDFs, OS-wide dictation, auto-review mode (ajambrosino).
- Expanded browser use for testing web flows, screenshots, iteration on what it sees (OpenAIDevs).
- OpenAI explicitly framed Codex + 5.5 as useful beyond coding: spreadsheets, slides, documents, browser workflows (gdb).
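As a back-of-the-envelope illustration of the efficiency argument above: per-token prices roughly double versus GPT-5.4 (implied by the 2x note), but if a task uses ~40% fewer output tokens, as Artificial Analysis reports further down, per-task cost rises far less than 2x. The token counts below are hypothetical, chosen only to make the arithmetic concrete.

```python
# Per-1M-token prices: GPT-5.5 from the launch ($5/$30); GPT-5.4 implied by the "2x" note.
gpt54 = {"in_price": 2.50, "out_price": 15.00, "in_toks": 5_000, "out_toks": 20_000}
gpt55 = {"in_price": 5.00, "out_price": 30.00, "in_toks": 5_000, "out_toks": 12_000}  # ~40% fewer output tokens

def task_cost(m: dict) -> float:
    # Prices are per 1M tokens.
    return (m["in_toks"] * m["in_price"] + m["out_toks"] * m["out_price"]) / 1_000_000

c54, c55 = task_cost(gpt54), task_cost(gpt55)
print(f"GPT-5.4-style task: ${c54:.3f}")
print(f"GPT-5.5-style task: ${c55:.3f} ({(c55 / c54 - 1):+.0%} despite 2x per-token pricing)")
```

With an output-heavy agentic profile like this, the result lands near the ~20% net increase Artificial Analysis reports for its own benchmark mix.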
Technical details and benchmark numbers
OpenAI-reported headline metrics
OpenAI and launch-adjacent posts gave the following benchmark claims:
- Terminal-Bench 2.0: 82.7% (OpenAIDevs, reach_vb)
- OSWorld-Verified: 78.7% (OpenAIDevs, reach_vb)
- Toolathlon: 55.6% (OpenAIDevs)
- FrontierMath Tier 4: 35.4%; GPT-5.5 Pro later cited at 39.5% (OpenAIDevs, scaling01)
- CyberGym: 81.8% (OpenAIDevs, reach_vb)
- SWE-Bench Pro: 58.6% (reach_vb, swyx)
- GDPval: 84.9% win/tie (reach_vb)
- BrowseComp: 84.4% (reach_vb)
- FrontierMath Tier 1–3: 51.7% (reach_vb)
- MMMU-Pro without tools: 81.2% (reach_vb)
- Investment banking modeling: 88.5% (reach_vb)
- Expert-SWE internal eval: 73.1% (swyx)
- Tau2-bench Telecom: 98.0% (swyx)
- BixBench: 80.5% (swyx)
- ARC-AGI-1: 95.0%
- ARC-AGI-2: 85.0% (scaling01, ARC Prize verified)
- CritPt: 27.1% for xhigh (scaling01, MinyangTian1)
Independent / semi-independent benchmarks
- Artificial Analysis
- Says GPT-5.5 takes the #1 spot on its Intelligence Index by 3 points, breaking a prior three-way tie among OpenAI, Anthropic, Google (Artificial Analysis).
- Claims GPT-5.5 leads Terminal-Bench Hard, GDPval-AA, APEX-Agents-AA, and trails only other OpenAI models in CritPt and AA-LCR, while placing second to Gemini 3.1 Pro Preview on three more benchmarks (Artificial Analysis, headline evals follow-up).
- Says GPT-5.5 medium ≈ Claude Opus 4.7 max at ~1/4 the cost on its index, while Gemini 3.1 Pro Preview reaches similar score at still lower cost (Artificial Analysis).
- Reports ~40% token-use reduction vs GPT-5.4 offsetting higher price; net cost to run its Intelligence Index rises only about 20% (Artificial Analysis).
- Reports AA-Omniscience accuracy 57% but hallucination rate 86%, versus Opus 4.7 max at 36% and Gemini 3.1 Pro Preview at 50%, which is one of the most important caveats in the entire launch discussion (Artificial Analysis).
- ARC Prize
- Verified ARC-AGI-2 SOTA at 85.0% max, with cost/performance ladder:
- Max: 85.0%, $1.87
- High: 83.3%, $1.45
- Med: 70.4%, $0.86
- Low: 33%, $0.35 (ARC Prize)
- Andon Labs / Vending-Bench Arena
- Says GPT-5.5 beats Opus 4.7 in competitive Vending-Bench Arena, and specifically notes GPT-5.5’s tactics were clean, while Opus used deceptive behaviors (andonlabs).
- UK AISI / safety testing
- The UK AI Security Institute said it conducted pre-deployment testing on cyber, autonomy capabilities, and safeguards, pointing readers to the system card (AISecurityInst).
- System-card-derived cyber result
- A widely cited number from readers of the system card: GPT-5.5 could take over a simulated corporate network in 1/10 trials with a 100M-token budget, compared with Claude Mythos at 3/10, while Opus 4.6/4.7 failed on the cited task (scaling01).
- LiveBench
- scaling01 says GPT-5.5-xhigh placed 1st on LiveBench (scaling01).
Examples of progress in practice
The strongest launch-day evidence was not just benchmarks but user reports of longer-horizon autonomy and reduced micromanagement:
- Every early test
- Dan Shipper says GPT-5.5 scored 62/100 on Every’s Senior Engineer benchmark vs Opus 4.7 at 33/100, while noting it performs best with an Opus 4.7-generated plan (danshipper).
- Reported 900M+ tokens used in testing by one engineer, shipping production features (danshipper).
- Praises conceptual clarity, ability to sustain complex refactors, and stronger writing than recent OpenAI models.
- Matthew Berman
- Calls Codex variant “the absolute frontier” for agentic coding, especially backend and visual inspection loops, while saying Opus remains faster and still better for front-end design in many cases (MatthewBerman).
- Reports medium/high thinking worked best; xhigh felt too slow for many workflows.
- OpenAI internal user reports
- Noam Brown (polynoamial) says GPT-5.5 makes him “a more effective IC,” specifically for CUDA kernels and research experiments (polynoamial).
- tszzl says researchers are already letting GPT-5.5 run overnight experiments from only high-level ideas, producing completed sweeps by morning (tszzl).
- aidan_mclau says he dictated a new RL run, left for days, and came back to a 31-hour industrial-scale RL run progressing under GPT-5.5 supervision (aidan_mclau, sleeping/babysitting nuance).
- johnohallman says 5.5 can work on projects end-to-end for hours or days, changing his role from IC toward manager (johnohallman).
- clivetime says he now manages ~10 Codexes and spends most time on net new progress rather than setup/plumbing (itsclivetime).
- Skirano examples
- Describes GPT-5.5 resolving a nasty branch conflict situation as a personal “first taste of AGI” (skirano thread start).
- Says it can create apps for a Flipper Zero via USB connection and push them successfully (skirano USB example).
- Says it built a more genuinely playable one-shot game, later featured on the release page (skirano game).
- Visual/code synthesis examples
- Sebastien Bubeck showed GPT-5.5 getting close to saturating his TikZ unicorn test with actual verifiable TikZ code (SebastienBubeck).
- Dimillian used Codex + imagegen skills + macOS app tooling to create a native retro fantasy labyrinth game from prompts (Dimillian).
- Enterprise / computer-use angle
- OpenAI says users at Ramp are using GPT-5.5 in Codex to test full-stack QA changes end-to-end (OpenAIDevs).
- Sam says OpenAI and Nvidia tried rolling Codex out across an entire company, implying confidence in broad enterprise deployment (sama).
- gdb stresses this is now useful to “anyone who does computer work,” not just programmers (gdb).
Facts vs opinions
Facts / directly supported claims
- GPT-5.5 launched in ChatGPT and Codex, API delayed (OpenAI, OpenAIDevs).
- Pricing is $5/$30 and Pro $30/$180 per 1M tokens (sama, scaling01).
- OpenAI reported benchmark scores including 82.7 Terminal-Bench 2.0, 78.7 OSWorld-Verified, 81.8 CyberGym, 58.6 SWE-Bench Pro (OpenAIDevs, reach_vb).
- Artificial Analysis independently ranked GPT-5.5 first on its Intelligence Index and published a nuanced cost/hallucination analysis (Artificial Analysis).
- ARC Prize reported 85.0% on ARC-AGI-2 (arcprize).
- OpenAI and launch observers said GPT-5.5 is more token-efficient than GPT-5.4 (OpenAIDevs, sama).
Opinions / interpretations
- “The frontier moved back to OpenAI today” (TheRundownAI).
- “Best model I have ever used” (skirano).
- “Massive achievement,” “step change,” “new era,” “feels like early GPT-4” (danshipper, DeryaTR_, BorisMPower).
- “Underwhelming incremental benchmark lifts” / “twice the price feels like a kick in the face” (scaling01, paul_cal).
- “Weaker than Mythos” or “close to Mythos but smaller/cheaper” are not established facts; they are benchmark-selection-dependent interpretations (synthwavedd, scaling01, scaling01 contrary take).
Different opinions
Supportive takes
- OpenAI official line: GPT-5.5 is a “new class of intelligence” for real work, with better intuition, lower micromanagement, and broader deployment as part of iterative safety strategy (OpenAI, gdb, sama strategy).
- Early testers: strong upgrades in coding, writing, long-running autonomy, and knowledge work (danshipper, MatthewBerman, skirano).
- Third-party eval orgs: Artificial Analysis and ARC Prize both support the claim that GPT-5.5 is at or near the top of the frontier (Artificial Analysis, ARC Prize).
Skeptical or critical takes
- Price skepticism: per-token price doubled from GPT-5.4; Theo called it too expensive despite liking its intelligence (theo).
- Benchmark skepticism: scaling01 argued median improvement over GPT-5.4 across 37 OpenAI-reported benchmarks was only +2.8% while price doubled (scaling01).
- Hallucination caution: Artificial Analysis reported an 86% hallucination rate on AA-Omniscience, much worse than top Anthropic/Google competitors on that metric (Artificial Analysis).
- Task-profile skepticism: GPT-5.5 seems unusually strong on terminal/computer-use/cyber but comparatively less dominant on SWE-Bench Pro, leading some to call its intelligence “spiky” or benchmark-sensitive (scaling01, scaling01 later synthesis).
- Behavioral criticism: Theo says it writes the best code he’s seen but can be “weird,” “hard to wrangle,” and over-exploratory without strict instructions (theo).
Neutral / contextual takes
- Inference-compute framing: polynoamial argues single-number model comparisons are increasingly misleading; what matters is intelligence per token or per dollar (polynoamial).
- Recipe not endpoint: Hangsiin suggests GPT-5.5 looks like an initial RL checkpoint on a new pretraining base, more like an “o1/o1-preview” stage than a terminal form (Hangsiin).
- Economics matter as much as raw IQ: teortaxesTex argues that even if Mythos is stronger in places, GPT-5.5’s economics and deployability may matter more in a real market (teortaxesTex).
Context: why this matters
GPT-5.5 appears important for three reasons:
- It shifts the competition from “chat model” to “agent substrate.” OpenAI repeatedly emphasized computer use, browser actions, documents, spreadsheets, and long-running task completion, not just one-shot answers (OpenAI, OpenAIDevs, gdb). This also aligns with outside observers saying the “model is the product” and that cloud/desktop agents are the real next unlock.
- It sharpens the benchmark debate. On one hand, GPT-5.5 tops or ties several respected evals. On the other, users are visibly less convinced by benchmark screenshots alone. There are tweets explicitly dismissing benchmarks in favor of heavy-user feedback (skooookum). This launch reinforced that split: official benchmark wins mattered, but much of the excitement came from examples like overnight research runs, end-to-end app building, and hardware control.
- It raises the importance of efficiency over raw size. Several observers frame GPT-5.5 as a model where token efficiency, speed, and serving economics are nearly as important as capability (Artificial Analysis, sama, polynoamial). If OpenAI can deliver near-Mythos or better-than-Opus behavior at materially lower effective task cost, that changes enterprise adoption dynamics even if another lab still wins selected top-end evals.
Bottom line
GPT-5.5’s launch is best understood not as a clean “OpenAI crushes everyone” story, nor as a trivial incremental bump. The strongest evidence suggests:
- Yes, GPT-5.5 is a real frontier advance, especially on agentic coding, terminal work, browser/computer use, and long-horizon execution (OpenAIDevs, Artificial Analysis, ARC Prize).
- Its practical usability story matters as much as benchmark gains: lower micromanagement, fewer tokens, sustained work over hours, and broader desktop workflows seem to be the launch’s defining theme (MatthewBerman, danshipper, tszzl).
- But the launch is not uncontested: higher per-token cost, uneven benchmark profile, and notable hallucination concerns leave room for Anthropic and Google depending on the workload (Artificial Analysis, theo, scaling01).
- The market implication is strong: if this capability/speed/token-efficiency mix holds up under API access and broad usage, GPT-5.5 likely resets expectations for what “default work model” means in coding and computer-use products.
Vision and multimodal research
- Google DeepMind’s “Vision Banana” drew substantial attention as a unified model for image understanding and generation, reframing image generation as a general interface for vision tasks. Supportive summaries came from @arankomatsuzaki, the official thread by @songyoupeng, and reflections from @sainingxie. The pitch: generative perception may become foundational to CV, though even supporters note diffusion latency and practical limitations remain major blockers (@sainingxie follow-up).
- OpenAI image generation also got strong anecdotal praise. @goodside said a deployment fix materially improved quality and undermined the old assumption that model choice is irrelevant when image generation is “just a tool call.” He later showed weirdly strong compositional/code-like behavior: an SVG-renderable cake image and alphabet soup encoding a valid FizzBuzz solution (cake/SVG, FizzBuzz soup).
- Sam Altman said Images 2.0 crossed an important qualitative threshold for him (sama). Swyx’s framing was that “Image-2-Thinking” behaves more like an image agent than a pure image model, using search/compositing/review loops over tens of minutes (swyx).
Open models, Chinese labs, and inference economics
- Kimi K2.6 and GLM-5.1 were repeatedly cited as the strongest open/open-adjacent challengers:
- K2.6 hit #1 open model on MathArena (j_dekoninck).
- K2.6 and GLM-5.1 now lead WeirdML among open models but remain behind GPT-5 according to @scaling01.
- Several posters described K2.6 as a genuine frontier-tier jump over K2.5 (teortaxesTex, WesRoth).
- Qwen3.6-27B got unusually strong grassroots praise for local usability:
- @coffeecup2020 called Qwen3.6-27B-TQ3_4S “insanely good,” fitting on 16GB VRAM with 32k context.
- @leftcurvedev_ claimed a local 16GB-VRAM model beats Claude Sonnet 4.5 in his usage.
- There was also a stream of commentary around Tencent Hy3 preview:
- vLLM announced day-0 support, listing 295B total / 21B active, 256K context, hybrid fast/slow-thinking MoE, and biggest gains in coding/agents (vLLM).
- @ShunyuYao12 stressed product co-design over open-benchmark chasing.
- A Chinese architecture breakdown characterized Hy3 as a composite of ideas from Apertus, DeepSeek V3, MiniMax M2, and Qwen3-MoE (karminski3).
- Inference economics continued to dominate discussion:
- Together usage reportedly grew from 30B to 300T tokens/month YoY (vipulved).
- Patrick O’Shaughnessy’s podcast with Dylan Patel focused on token supply/demand, compute bottlenecks, memory pricing, and robotics as the next demand wave (patrick_oshag).
- Several posts framed the key metric as intelligence per $ / token, not one-number leaderboard rank (polynoamial).
Training/inference systems and kernels
- Decoupled DiLoCo from Google DeepMind/Google Research was one of the most technically substantive infra releases:
- It targets multi-datacenter training over low-bandwidth networks, heterogeneous hardware, and training that doesn’t halt on hardware failures (Ar_Douillard, GoogleDeepMind); a toy sketch of the general DiLoCo-style outer loop appears at the end of this section.
- Google says it trained a 12B Gemma model across four US regions and mixed TPU6e + TPUv5p without slowing training (GoogleDeepMind details).
- This connected with parallel community interest in distributed-training “island size” constraints and alternatives to requiring hundreds of colocated B200s (jon_durbin).
- DeepSeek tile kernels / TileLang release got a strong response:
- Posts from @teortaxesTex, @scaling01, @eliebakouch, and @arohan highlighted optimized kernels for Engram and mHC, with some already used in internal training/inference.
- A separate DeepSeek-serving anecdote claimed API throughput hit 91 tok/s, above a previously estimated practical ceiling on H800s (teortaxesTex).
- Agentic optimization is creeping into systems work:
- @xenovacom reported Opus 4.7 wrote a custom WebGPU kernel yielding up to 13x faster Qwen3.5 inference in Transformers.js via fused LinearAttention.
- OpenAI claimed Codex itself helped optimize GPT-5.5 serving speed by 20%+ (reach_vb).
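To make the low-communication training idea concrete, here is a toy, single-process sketch of a DiLoCo-style outer loop on a synthetic least-squares problem: each “datacenter” runs many local SGD steps on its own shard, and only the resulting parameter deltas (pseudo-gradients) are averaged once per outer round, so expensive cross-site communication happens every few hundred steps rather than every step. This is a generic illustration of the DiLoCo recipe, not Google’s Decoupled DiLoCo implementation; all hyperparameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_workers, local_steps, outer_rounds = 10, 4, 50, 20
lr_inner, lr_outer, momentum = 0.05, 0.7, 0.6

# Each "datacenter" holds its own shard of a shared least-squares problem.
true_w = rng.normal(size=dim)
shards = []
for _ in range(n_workers):
    X = rng.normal(size=(200, dim))
    y = X @ true_w + 0.01 * rng.normal(size=200)
    shards.append((X, y))

w = np.zeros(dim)          # globally shared parameters
outer_vel = np.zeros(dim)  # outer momentum buffer

for _ in range(outer_rounds):
    deltas = []
    for X, y in shards:                    # runs independently per site
        local_w = w.copy()
        for _ in range(local_steps):       # no cross-site communication during local steps
            grad = X.T @ (X @ local_w - y) / len(y)
            local_w -= lr_inner * grad
        deltas.append(w - local_w)         # pseudo-gradient, sent once per round
    pseudo_grad = np.mean(deltas, axis=0)  # the only cross-site exchange
    outer_vel = momentum * outer_vel + pseudo_grad
    w -= lr_outer * outer_vel

print("distance to true weights:", float(np.linalg.norm(w - true_w)))
```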
Agents, evaluation, memory, and harnesses
- A consistent theme across papers and practitioner commentary: agent quality depends heavily on harness design, not just base model quality.
- Anthropic published a postmortem saying Claude Code quality regressions came from harness/config changes, not only the model: lower default reasoning, a bug evicting thinking blocks, and a verbosity-related prompt tweak. All were fixed and limits reset (ClaudeDevs).
- This triggered broader calls for open harnesses and open evals (Vtrivedy10, omarsar0).
- New agent papers/issues discussed:
- SWE-chat captures coding-agent interactions from real users in the wild (SciFi).
- Stateless Decision Memory proposes event-sourced immutable decision logs for enterprise agents, emphasizing scalability/auditability over “cleverness” (omarsar0); a generic event-sourcing sketch appears at the end of this section.
- A paper on diversity collapse in multi-agent systems argues that shared context and mutual feedback drive homogenization unless designs explicitly decouple reasoning/evaluation (dair_ai, douwekiela coverage).
- AutoMetrics claims that with <100 feedback points it can induce automatic task metrics that beat hand-crafted LLM-judge rubrics by up to +33.4% correlation improvement to human judgments (michaelryan207).
- Product-side agent infra:
- Hermes desktop emphasizes direct SSH, no browser/gateway layers (DODOREACH).
- Delegate launched as a delegated-work agent (abhshkdz).
- LangSmith Fleet added direct file creation/editing and presentation building (LangChain Fleet, BraceSproul).
- Trackio decoupled frontend/backend to enable LLM-customized experiment dashboards (abidlabs).
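As a rough illustration of the event-sourcing pattern the Stateless Decision Memory item gestures at, the sketch below keeps an append-only log of immutable decision events and always re-derives current state by replaying them, which is what makes the history auditable by construction. This is a generic event-sourcing sketch with assumed field names, not the paper’s actual design.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass(frozen=True)  # frozen => events are immutable once appended
class DecisionEvent:
    agent: str
    decision: str
    rationale: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class DecisionLog:
    """Append-only log; current state is always derived by replaying events."""
    def __init__(self) -> None:
        self._events: list[DecisionEvent] = []

    def append(self, event: DecisionEvent) -> None:
        self._events.append(event)  # no update or delete operations exist

    def replay(self) -> dict[str, Any]:
        # Latest decision per agent, derived purely from the event history.
        state: dict[str, Any] = {}
        for e in self._events:
            state[e.agent] = {"decision": e.decision, "rationale": e.rationale, "at": e.at}
        return state

log = DecisionLog()
log.append(DecisionEvent("procurement-agent", "approve PO-1234", "within budget threshold"))
log.append(DecisionEvent("procurement-agent", "escalate PO-5678", "exceeds auto-approval limit"))
print(log.replay())
```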
Robotics, autonomy, and applied AI
- Sony’s “Ace” ping-pong robot in Nature got major attention as a strong RL + vision systems result, reaching expert-level play (hardmaru).
- Google Research pushed 3D foundational robotics models at ICLR (GoogleResearch).
- @E0M shared a neat example of emergent handoff behavior in a physical system that unexpectedly succeeded without assumed force-sensing requirements.
- In industry autonomy:
- Mariana Minerals + Sandvik announced autonomous production drilling integrated into a broader mining optimization loop (MarianaMinerals).
- Waymo crash analysis: 60 of 78 worst crashes in a recent period were “human rear-ends Waymo” and/or “human hits stationary Waymo” (binarybits).
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. Qwen 3.6-27B Model Launch and Benchmarks
- Qwen 3.6 27B is a BEAST (Activity: 979): The user reports that the Qwen 3.6 27B model performs exceptionally well on a 5090 Laptop with 24GB VRAM, specifically for tasks involving pyspark/python and data transformation debugging. They are using llama.cpp with q4_k_m at q4_0 and are exploring further optimizations with IQ4_XS at 200k q8_0. The user has not yet implemented speculative decoding. The hardware setup includes an ASUS ROG Strix SCAR 18 with an RTX 5090 24GB and 64GB DDR5 RAM. One commenter advises against using the KV cache as q4 for coding, suggesting 130k context with q8 instead. Another highlights upcoming improvements with the release of the dflash drafter and a specific GitHub pull request that could potentially double decode speed. A third commenter inquires about performance on 16 GB VRAM + 32 GB DDR5 with offloading. (An illustrative flag sketch for the KV-cache advice follows the comment bullets below.)
- sagiroth highlights a technical consideration when using Qwen 3.6 27B for coding tasks, advising against using the KV cache as q4 due to performance issues. Instead, they suggest using q8, which allows for a context length of up to 130k, potentially enhancing the model’s coding capabilities.
- inkberk points out an upcoming improvement in decoding speed, referencing a pull request on llama.cpp that promises a free 2x increase in decode speed. This enhancement is anticipated with the release of the dflash drafter by z-lab, which could significantly boost performance for users of Qwen 3.6 27B.
- Johnny_Rell inquires about the performance of Qwen 3.6 27B when run on a system with 16 GB VRAM and 32 GB DDR5, specifically questioning the effectiveness of offloading. This suggests a focus on optimizing hardware configurations to maximize the model’s performance.
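Below is a minimal sketch of what the q8 KV-cache advice above looks like as llama-server flags. The GGUF filename is hypothetical, the flag spellings are from recent llama.cpp builds, and exact context/quant choices depend on your VRAM; this illustrates the idea rather than reproducing the poster’s exact command.

```python
import subprocess

# Hypothetical local GGUF path; substitute whatever quant you actually downloaded.
MODEL = "Qwen3.6-27B-Q4_K_M.gguf"

cmd = [
    "llama-server",
    "-m", MODEL,
    "--ctx-size", "131072",    # ~130k context, per the q8-KV-cache suggestion
    "--n-gpu-layers", "999",   # offload as many layers as fit
    "--cache-type-k", "q8_0",  # q8 KV cache instead of q4 for coding quality
    "--cache-type-v", "q8_0",  # quantized V cache may also require flash attention on your build
]
subprocess.run(cmd, check=True)
```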
- Qwen-3.6-27B, llamacpp, speculative decoding - appreciation post (Activity: 368): The post discusses an experiment using speculative decoding with the Qwen-3.6-27B model, demonstrating significant improvements in token generation speed from 13.60 t/s to 136.75 t/s. The setup involves using llama.cpp with specific parameters: --spec-type ngram-mod, --spec-ngram-size-n 24, --draft-min 12, and --draft-max 48. The author notes that these settings, executed on a Linux PC with 40GB VRAM and 128GB DDR5 RAM, led to substantial performance gains. The post also references recent updates to llama.cpp and provides links to documentation and pull requests for further technical details. One commenter questions the necessity of --no-mmproj-offload for speculative decoding, noting no speed gains with an RTX 5090, suggesting potential parameter incompatibility. Another asks about the model used for drafting, and a third inquires about achieving similar results on mlx. (An illustrative launch sketch using these flags follows the comment bullets below.)
- EatTFM is questioning the necessity of the --no-mmproj-offload flag for speculative decoding in llama.cpp, noting no speed gains on an RTX 5090. They provide a detailed command setup, including parameters like --ctx-size 262000, --n-gpu-layers 9999, and --batch-size 256, but observe no improvement, suggesting potential parameter incompatibility.
- Puzzleheaded-Drama-8 reports testing on a 7900XTX using Vulkan with the Qwen3.6-27B-q4_k_m model, maintaining a generation speed of 35-36 tokens per second. They question if speculative decoding benefits are exclusive to CUDA, as logs indicate drafting is in use, but no speed increase is observed.
- nunodonato mentions not observing any speed difference with or without speculative decoding, suggesting that the lack of performance gain might be related to specific use cases or configurations.
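For anyone wanting to try the same setup, here is a minimal launch sketch using the flags quoted in the post. The GGUF path is hypothetical, and the --spec-type / --spec-ngram-size-n options are taken verbatim from the post rather than verified against a specific llama.cpp release, so check llama-server --help on your build before relying on them.

```python
import subprocess

cmd = [
    "llama-server",
    "-m", "Qwen3.6-27B-Q4_K_M.gguf",  # hypothetical local GGUF path
    "--n-gpu-layers", "999",
    # Speculative-decoding flags exactly as quoted in the post (ngram drafting, no draft model):
    "--spec-type", "ngram-mod",
    "--spec-ngram-size-n", "24",
    "--draft-min", "12",
    "--draft-max", "48",
]
subprocess.run(cmd, check=True)
```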
- Forgive my ignorance but how is a 27B model better than 397B? (Activity: 1550): The image introduces Qwen’s new model, Qwen3.6-27B, which is a dense, open-source model that reportedly offers flagship-level coding power. Despite its smaller size, it outperforms the larger Qwen3.5-397B-A17B model in major coding benchmarks. This suggests that the dense architecture of the 27B model is more efficient for certain tasks compared to the larger MoE (Mixture of Experts) model, which may excel in other areas such as world knowledge and logical coherence over long contexts. Commenters note that while the 27B model excels in coding benchmarks, the 397B model has superior world knowledge and logical coherence, which are not fully captured by current benchmarks. This highlights the importance of understanding what specific capabilities are being evaluated in these benchmarks.
- NNN_Throwaway2 highlights that while the 397B model has superior world knowledge and logical coherence over long contexts, current benchmarks fail to capture these strengths. This suggests that benchmarks may not fully represent a model’s capabilities in complex tasks, indicating a gap between benchmark results and real-world performance.
- jacek2023 discusses the evolution of neural networks, noting that advancements in algorithms can lead to smaller models outperforming larger ones. They argue that the field’s progress allows for more efficient algorithms, meaning a 7B model could potentially surpass an older 70B model, emphasizing that it’s not just about model size but also about algorithmic improvements.
- JaredsBored points out that benchmarks may not always reflect practical performance, citing an example where a Q3.6 35B model outperformed a Q3.5 122B model in benchmarks but did not meet specific use-case needs as effectively. This underscores the importance of considering real-world application requirements alongside benchmark results.
- Qwen 3.6 is actually useful for vibe-coding, and way cheaper than Claude (Activity: 524): The post discusses the use of Qwen 3.6 for coding tasks, highlighting its cost-effectiveness compared to Claude. The user successfully ran Qwen3.6-35B-A3B (Q4) and Qwen3.6-27B (Q8) on a dual 3090 setup with a 200k context, using Unsloth’s quickstart guide. The setup involved a simple bash script to run the model on a local server, achieving significant cost savings—less than $4 in electricity over 8 hours compared to $142 in API costs. The user built a Rust server for resource monitoring, demonstrating the model’s capability for full-stack development. One commenter noted that Qwen 3.6 is versatile, being useful for both coding and writing, and praised the improvements in smaller models amidst challenges faced by large API providers. Another commenter inquired about the generation speed with the dual 3090 setup.
- Canchito highlights the versatility of Qwen 3.6, noting its applicability beyond coding to writing and other tasks. This suggests improvements in smaller models are becoming significant, especially as larger API providers face challenges. The mention of Qwen 3.5 and Gemma 4 indicates a trend of rapid advancements in model capabilities.
- RealestNagaEver inquires about the generation speed of the 27b model using a dual 3090 setup, which implies interest in the performance metrics of Qwen 3.6. This setup is relevant for users looking to optimize hardware for model inference, suggesting that Qwen 3.6 might be computationally demanding but potentially efficient on high-end consumer GPUs.
- danigoncalves questions the choice of using Claude code with open models instead of opencode, indicating a curiosity about the integration and performance benefits of different coding environments. This reflects a broader interest in how different models and coding frameworks can be leveraged for optimal performance.
- Dense vs. MoE gap is shrinking fast with the 3.6-27B release (Activity: 423): The image presents a comparative analysis of Dense and Mixture of Experts (MoE) models, specifically focusing on the performance gap across various benchmarks. The Dense model, despite maintaining superiority in most tasks, is seeing its lead diminish as the MoE model improves, particularly in coding tasks. Notably, the MoE model has significantly reduced the Dense model’s lead in the SWE-bench Multilingual benchmark from +9.0 to +4.1. However, in Terminal-Bench 2.0, the Dense model has increased its lead from +1.1 to +7.8. This suggests that while Dense models are still generally better, MoE models are rapidly closing the gap, especially in areas requiring large context windows and coding tasks. One commenter suggests comparing larger models, such as 122B to 27B, to better understand performance differences. Another user notes that the MoE model’s speed and quality have led them to cancel other subscriptions, indicating a preference for MoE’s performance. There is also a discussion on the types of tasks being tested, highlighting the complexity of tasks like handling large context code.
- Embarrassed_Adagio28 highlights a performance comparison between models, noting that the 3.6 35b q5 model is three times faster than the 3.6 27b q5 model, despite similar quality in coding tasks. This suggests that the MoE model’s efficiency is significant enough to replace other services like Claude Pro, indicating a shift in preference due to performance gains.
- flavio_geo points out the sensitivity of MoE models to quantization compared to dense models. This is a critical consideration for deployment, as quantization can affect model performance differently, potentially impacting the effectiveness of MoE models in certain applications.
- Usual-Carrot6352 shares a practical implementation detail, mentioning a Q5 model that fits fully in 24GB VRAM with a 65K context. This provides a tangible example of how these models can be efficiently deployed on hardware with specific memory constraints, showcasing the practical benefits of recent advancements.
- Qwen3.6-27B released! (Activity: 962): The image showcases bar graphs that compare the performance of the newly released Qwen3.6-27B model against other models like Qwen3.5-27B and Gemma4-31B across various benchmarks. These benchmarks include tasks such as coding, reasoning, and real-world agent tasks, with Qwen3.6-27B highlighted in purple to emphasize its superior performance. The model is noted for its outstanding agentic coding capabilities, surpassing the larger Qwen3.5-397B-A17B model across major coding benchmarks, and supports both ‘thinking’ and ‘non-thinking’ modes. It is released under the Apache 2.0 license, making it fully open-source. The comments reflect excitement and admiration for the Qwen team, with users expressing readiness to utilize the model on their hardware and suggesting a monument in honor of the team’s achievements.
- Qwen3.6-27B, with its 27 billion parameters, demonstrates superior performance compared to the much larger Qwen3.5-397B-A17B model, which has 397 billion parameters in total but only 17 billion active. It achieves higher scores across several coding benchmarks: SWE-bench Verified (77.2 vs. 76.2), SWE-bench Pro (53.5 vs. 50.9), Terminal-Bench 2.0 (59.3 vs. 52.5), and SkillsBench (48.2 vs. 30.0). This indicates a significant efficiency and performance improvement over both its predecessor and other dense models of similar scale.
2. Deepseek’s DeepEP V2 and TileKernels Release
- Deepseek has released DeepEP V2 and TileKernels. (Activity: 347): Deepseek has released DeepEP V2 and TileKernels, which are significant advancements in parallelization techniques for deep learning. The DeepEP V2 update focuses on enhancing model efficiency and scalability, while TileKernels introduces a novel approach to kernel execution that reportedly achieves linear scaling in parallel processing, meaning that doubling the computational resources results in a doubling of processing speed. This could have substantial implications for large-scale AI model training and deployment. One commenter highlights that Deepseek’s open-source approach contrasts with OpenAI’s more closed model, suggesting that Deepseek’s transparency fosters goodwill. Another comment questions whether the linear scaling claim implies a direct correlation between increased capacity and speed, indicating interest in the technical specifics of the parallelization technique.
- AlwaysLateToThaParty highlights a potential breakthrough in parallelization techniques by DeepSeek, suggesting they may have achieved a method that scales linearly. This implies that doubling computational capacity could directly double processing speed, which would be a significant advancement in computational efficiency.
- FullOf_Bad_Ideas speculates on the hardware used by DeepSeek, mentioning the possibility of using Blackwell GPUs, specifically the SM100, for training. They also discuss the potential use of rented B200 units on Vast, indicating a focus on high-performance hardware to support their software developments.
- Engram and mHC are identified by FullOf_Bad_Ideas as likely innovations in DeepSeek’s upcoming model. The performance of this model is expected to depend heavily on these innovations and a new dataset, suggesting a focus on both algorithmic and data-driven improvements.
3. Qwen3 TTS and Persona Engine
- Qwen3 TTS is seriously underrated - I got it running locally in real-time and it’s one of the most expressive open TTS models I’ve tried (Activity: 690): The post discusses the implementation of Qwen3 TTS, a text-to-speech model, in a local real-time setup, highlighting its expressiveness compared to previous models like Sesame. The author achieved reliable streaming by leveraging the model’s sliding window decoder architecture, integrated it with llama.cpp for speed, and implemented CTC word-level alignment for accurate word timing and phoneme extraction. They also fine-tuned the model for improved voice cloning, addressing issues with contextual understanding and pronunciation. The project is available on GitHub. Commenters noted the impressive integration of emotion tags and speed improvements, though some expressed a desire for more fluid conversational dynamics and turn-taking in LLM responses. There was also interest in adapting the setup for Mac systems.
- bitslizer inquires about the integration of emotion tags with Qwen3 TTS, asking if the ‘persona engine’ directly feeds these tags into the model. They also mention ‘faster-qwen3-tts’, suggesting it might be a method to achieve real-time performance, indicating a focus on optimizing speed and expressiveness in TTS applications.
- MadGenderScientist discusses the limitations of current LLMs in handling conversational dynamics, noting that while Qwen3 TTS shows improvement, models still struggle with fluid conversation and turn-taking. This highlights ongoing challenges in developing TTS systems that can mimic natural human dialogue effectively.
- Adventurous-Paper566 questions the hardware requirements for running Qwen3 TTS efficiently, specifically asking about the GPU used by the original poster. This points to performance variability based on hardware, which is a critical consideration for deploying TTS models in real-time applications.
Less Technical AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo
1. GPT Image 2 and Nano Banana Comparisons
- GPT Image 2 is the first image ai that’s blown my mind (prompted for a screenshot from a combined GTA 6-Cyberpunk 2077 game) (Activity: 587): The image is a creative output from GPT Image 2, showcasing its ability to blend elements from two distinct video game worlds: GTA 6 and Cyberpunk 2077. The generated image features a neon-lit cityscape that captures the futuristic and vibrant aesthetics typical of Cyberpunk 2077, while incorporating elements like a sleek car and urban environment that are reminiscent of the Grand Theft Auto series. This demonstrates the model’s capability to synthesize complex visual themes and produce cohesive, high-quality images that reflect abstract concepts from multiple sources. Some commenters noted that the image appears predominantly inspired by Cyberpunk 2077, with one remarking that it looks ‘99% 2077’. Another comment humorously pointed out the presence of ‘two minimaps’, suggesting an overlap of game interfaces.
- A5760P discusses the potential future of AI image generation in gaming, drawing parallels to NVIDIA’s DLSS 5. They suggest that advancements in AI could lead to more realistic and immersive gaming experiences, similar to the visual enhancements seen in DLSS technology, which uses AI to upscale lower resolution images to higher resolutions in real-time, enhancing performance and visual fidelity.
- tenchigaeshi points out a common issue with AI-generated images, specifically the distortion of text. They note that “a lot of the letters are still super mangled on the buildings”, highlighting a technical limitation where AI struggles with generating coherent text, which is a known challenge in AI image synthesis, often due to the model’s training data and architecture limitations.
- zookeeper990 humorously notes the presence of “two minimaps” in the AI-generated image, which could indicate a blending of elements from different games. This observation underscores the complexity and potential confusion in AI-generated content when merging distinct visual styles or interfaces, a challenge in creating seamless and coherent AI-generated imagery.
- GPT Image 2 Is on Another Level — Nano Banana Pro Can’t Compete (Activity: 804): The image is a meme comparing two AI-generated images labeled “GPT Image 2” and “Nano Banana Pro,” both depicting a wall clock and a glass of red wine. The post humorously suggests that “GPT Image 2” is superior, though the differences are subtle and subjective. The comments reveal that the “Nano Banana Pro” image was generated based on a specific prompt, highlighting the AI’s ability to accurately interpret and render detailed instructions, such as showing a clock at a precise time and a glass filled to the brim. One commenter humorously suggests that the ‘Banana’ image looks better, while another notes the AI’s precision in generating the image as per the prompt, indicating a debate on aesthetic versus technical accuracy.
- Mr-and-Mrs highlights the precision of GPT Image 2 in generating images that match specific prompts, such as an iPhone picture of a manual clock showing 9:17am and a glass of water filled exactly to the brim. This suggests a high level of detail and accuracy in image generation, which is crucial for applications requiring exact visual representations.
- salazka argues that the Nano Banana Pro produces more realistic images, emphasizing the quality of reflections and refractions in a wine glass, as well as the depth and texture in elements like a clock and plaster. This suggests that while GPT Image 2 may excel in precision, Nano Banana Pro might offer superior realism and tonal grading, making it more suitable for creating images with a natural, amateur photo aesthetic.
- The new chatgpt image generator is insane (Activity: 465): The image in the Reddit post is a demonstration of the new ChatGPT image generation capabilities, which appear to have advanced significantly. The generated image shows a person resembling a well-known figure as a Twitch streamer, suggesting that the model can create detailed and contextually relevant images. This advancement is notable as it indicates a leap in the model’s ability to generate realistic and contextually appropriate images, potentially surpassing other models like ‘Gemini Nano Banana’ in terms of image generation quality. One commenter speculates that the model’s ability to generate images of recognizable figures might be restricted in the future, reflecting concerns about privacy and ethical implications of AI-generated content.
- Fun-Foot711 mentions that the new ChatGPT image generator has made significant advancements, suggesting it has surpassed the capabilities of the ‘Gemini Nano Banana’ model. This implies a rapid improvement in the model’s performance and capabilities, potentially indicating a leap in image generation quality or speed.
- Wanky_Danky_Pae humorously predicts that the image generator might soon face restrictions on generating images of recognizable individuals, hinting at potential ethical or legal challenges in AI image generation, especially concerning privacy and likeness rights.
- GPT-Image-2 vs Nano Banana 2, nb2 tried its best… (Activity: 1187): The image comparison between GPT-Image-2 and Nano Banana 2 (NB2) highlights differences in AI-generated photo realism. The left image, presumably from GPT-Image-2, is noted for its realistic depiction, particularly in the lighting and hair details, though it uses a flash-like effect that isolates the subject from the background. In contrast, NB2 captures a more integrated scene with ambient lighting and contextually accurate details, such as the bralette and background elements, resembling a high dynamic range (HDR) photo. This suggests NB2’s strength in maintaining scene context and lighting realism, while GPT-Image-2 excels in subject detail. Commenters debate the merits of each image, with some preferring NB2 for its realistic scene integration and lighting, akin to a Samsung HDR shot, while others appreciate the subject detail in GPT-Image-2. The consensus is that neither image is definitively better, but NB2 aligns more closely with typical cellphone photography.
- StrategicCarry highlights that Nano Banana 2 (NB2) handles lighting instructions more accurately compared to GPT-Image-2, which appears to use a camera flash effect. NB2’s scene composition is more coherent, depicting a busier street context, whereas GPT-Image-2 presents a solitary chair in a less inviting setting. Additionally, NB2 accurately follows the bralette instruction, while GPT-Image-2 misinterprets it as an underwire push-up bra.
- fredandlunchbox compares the image outputs to photography styles, noting that NB2 resembles a Samsung HDR shot where background details remain visible despite a bright foreground. In contrast, GPT-Image-2 resembles a DSLR flash photo with no post-processing, resulting in a well-lit foreground but a dark background. This suggests that NB2’s output is closer to what a real cellphone photo might look like.
- Frequent-World2721 critiques the realism of NB2’s output, specifically pointing out that the ‘Kozy Korner’ sign appears very fake. This suggests potential issues with NB2’s ability to render realistic textures or signage, which could be a limitation in its image generation capabilities.
- Nano Banana Pro vs ChatGPT Image 2 — Which one looks more real? 📸 (Activity: 184): The post compares two AI-generated images using the same prompt to evaluate which model, “Nano Banana Pro” or “ChatGPT Image 2,” produces a more realistic photograph. The focus is on aspects such as skin texture, lighting, shadows, and the overall natural feel of the scene. The “Nano Banana Pro” image is noted for its sharpness and vibrancy, while “ChatGPT Image 2” is described as having a softer, warmer tone. This comparison highlights the differences in rendering capabilities and stylistic choices between the two models. A notable comment mentions a recurring noise pattern in ChatGPT-generated images, which becomes noticeable once identified, suggesting a potential area for improvement in image quality.
- nuclearbliss highlights a recurring issue with GPT-generated images, noting a ‘noise pattern that is blotchy and barely discernible.’ This suggests a consistent artifact across multiple generations, which may affect the perceived realism of the images.
- jonomacd points out a persistent ‘yellowish filter’ in GPT-generated images, indicating a potential color balance issue that could impact the natural appearance of these images.
- Chatgpt Image 2 beats Nano Banana 2 and Pro by miles, Nano Banana 2 Pro when? (Activity: 95): The post discusses the performance of ChatGPT Image 2, which significantly outperforms Nano Banana 2 and its Pro version. The Nano Banana Pro has not seen updates for nearly six months, although an internal version of NB 2 was reportedly developed in December but never released. The current model, based on 3.1 flash, has been in use for about two months. The Gemini 3 model remains in preview, suggesting potential future releases of both models in General Availability (GA). Commenters speculate that updates to Nano Banana Pro might be announced at Google I/O in May. There is also a desire for less censorship in OpenAI’s models, particularly regarding copyrighted materials.
- sammoga123 discusses the release timeline and updates for the Nano Banana models, noting that the Nano Banana Pro hasn’t been updated in nearly six months. They mention an internal version of NB 2 from December that wasn’t publicly released, and highlight that the current model is based on 3.1 flash and has been in use for about two months. They also point out that the Gemini 3 is still in preview, suggesting potential improvements in future GA releases.
- alext77777 raises concerns about the censorship in OpenAI’s models, particularly in comparison to Nano Banana Pro. They note that the censorship of copyrighted materials seems random, implying that OpenAI’s models might be overly restrictive or inconsistent in handling such content.
- typical-predditor comments on the strengths of the Nano Banana models, stating that while they may not excel in generation, their real strength lies in image editing capabilities. This suggests a differentiation in use cases between Nano Banana and other models like ChatGPT Image 2.
2. Qwen 3.6 Model Performance and Comparisons
- Qwen 3.6 27b (Activity: 242): The image presents a performance comparison of the Qwen 3.6 27b model against other models across various benchmarks, highlighting its strong performance in areas such as Terminal-Bench 2.0, SWE-bench Verified, and GPQA Diamond. The chart suggests that Qwen 3.6 27b is competitive or leading in several categories, indicating its capabilities in coding, reasoning, and real-world tasks. The model is available on Hugging Face. One commenter noted that the model might be close to achieving 4.5 haiku performance in fp32, with hopes for similar results in q6/8. Another user reported issues with the model’s performance on T-SQL tasks, indicating that the generated T-SQL failed consistently.
- alphapussycat discusses the potential performance of Qwen 3.6 27B, suggesting it might be comparable to 4.5 haiku in FP32 precision. They express hope that quantized versions like q6/8 will also perform well, indicating interest in efficient deployment options.
- Holiday-Pack3385 reports issues with Qwen 3.6 27B when generating T-SQL code, noting that the outputs consistently failed. This highlights potential limitations or bugs in the model’s ability to handle specific SQL tasks.
- Sha1rholder inquires about quantization options for deploying Qwen 3.6 27B on an RTX 5090, noting that the official release is only available in FP8. They express concern about VRAM sufficiency, indicating a need for more flexible deployment configurations.
- Comparing Qwen3.6 35B and New 27B for coding primitives (Activity: 131): The post compares the performance of Qwen3.6 35B and Qwen3.6 27B models on a MacBook Pro M5Max with 64GB RAM, specifically for generating HTML code to draw waves. The 35B-A3B model generated 6672 tokens in 2m 10s at 65 tokens/s, while the 27B model produced 7344 tokens in 5m 22s at 24 tokens/s. The 35B-A3B model is noted for its speed but less structured output, whereas the 27B model, designed for tasks requiring planning, delivered a cleaner and more consistent result. The inference server used was Atomic Chat with source code available on GitHub. Commenters noted the superior performance of the dense 27B model for structured tasks, while others inquired about command line parameters for running these models with Google TurboQuant. There was also interest in experimenting with different parameter settings for the 35B model using llama.cpp.
- pulse77 is seeking command line parameters to run Qwen3.6 35B and 27B models with Google TurboQuant, indicating a need for specific configuration details to optimize performance on this platform.
- smart4 discusses experimenting with Qwen3.6 35B using A6B or A9B configurations, referencing llama.cpp, where more parameters can be activated. They mention using --override-kv qwen35moe.expert_used_count=int:8 for the 3.5 version, suggesting that increasing the count to 16 might enhance performance for larger configurations like A6B, impacting speed and capacity.
- Direct_Major_1393 suggests that the choice between the 35B and 27B versions of Qwen3.6 might be subjective, but notes that the 35B version feels more natural, implying potential differences in model output quality or fluency.
- I ran the numbers. Qwen3.6-27B dense obsoleted the 397B MoE on coding benchmarks. (Activity: 93): Alibaba has released the Qwen3.6-27B model, a dense parameter architecture that outperforms the previous Qwen3.5-397B-A17B Mixture of Experts (MoE) model on coding benchmarks, despite being significantly smaller. On SWE-bench Verified, Qwen3.6-27B scores 77.2, surpassing the 397B MoE’s 76.2, and closely approaches Claude 4.5 Opus at 80.9. The model also excels in Terminal-Bench 2.0 with a score of 59.3, matching Claude 4.5 Opus and outperforming the 397B MoE’s 52.5. This shift suggests a move towards more efficient local inference, reducing the need for large-scale infrastructure. The model’s deterministic latency and compatibility with consumer-grade GPUs like the RTX 3090 or 4090 make it accessible for broader deployment, especially with quantization techniques that fit the model into 16GB or 24GB of VRAM (a rough VRAM arithmetic sketch follows this post’s comments). The model also supports a 262k context window and multimodal processing, enhancing its utility for frontend development tasks. The community is debating between adopting the Qwen3.6-27B dense model or the Qwen3.6-35B-A3B MoE, with the dense model showing superior performance across benchmarks. Some users express excitement about upgrading from the 3.5-27B model, anticipating significant improvements in performance.
- Sirius_Sec_ highlights that the Qwen3.6-27B model scores 53.5 on SWE-bench Pro, while Claude scores 57.1, indicating competitive performance for Qwen3.6-27B in coding benchmarks. This suggests a significant improvement over previous models, making it a compelling choice for users focused on coding tasks.
- ReferenceOwn287 discusses the community’s debate between adopting the Qwen3.6-27B dense model and the Qwen3.6-35B-A3B MoE model. Their analysis shows that the dense model outperforms the MoE variant across various performance metrics, particularly in quality of output and debugging capabilities, making it a preferred choice for coding applications.
- Sirius_Sec_ mentions upgrading from the Qwen3.5-27B to the Qwen3.6-27B model, noting that the latter offers a significant performance boost when run on an H100 GPU. This suggests that the newer model is optimized for better efficiency and effectiveness in computational tasks.
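As a rough sanity check on those VRAM claims, here is a back-of-the-envelope sketch using typical GGUF bits-per-weight figures. The numbers cover weights only (KV cache, activations, and runtime buffers come on top) and are order-of-magnitude estimates, not measurements of any particular quant.

```python
PARAMS = 27e9  # Qwen3.6-27B parameter count

def weight_gb(bits_per_param: float) -> float:
    # Weights only; KV cache and runtime buffers are extra.
    return PARAMS * bits_per_param / 8 / 1e9

for name, bpw in [("Q3-ish", 3.5), ("Q4-ish", 4.5), ("Q5-ish", 5.5), ("Q8-ish", 8.5)]:
    print(f"{name} ({bpw} bits/param): ~{weight_gb(bpw):.1f} GB of weights")

# Rough reading: ~Q3/Q4 is what realistically fits a 16GB card once the KV cache is added,
# while 24GB cards have headroom for ~Q4/Q5 at useful context lengths.
```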
3. Claude Code and Anthropic Challenges
- Anthropic has appeared to begin testing removing Claude Code from their $20 plan for new users signing up. OpenAI employees have already begun to make fun of them for this. (Activity: 650): Anthropic appears to be testing the removal of Claude Code from their $20 plan for new users, potentially due to compute limitations. This change is being trialed on 2% of new users to assess its impact on churn rates, although some argue this is not suitable for A/B testing given the known value proposition. The removal has been noted on the comparison page, suggesting it may not be a temporary test but a response to backlash. Commenters debate whether this is a genuine test or a permanent change, with some suggesting that the removal from the comparison page indicates a more definitive decision rather than a temporary experiment.
- NormalEffect99 highlights a potential A/B testing strategy by Anthropic, suggesting they are testing the removal of Claude Code from their $20 plan on 2% of new users to assess its impact on churn rate. This implies a data-driven approach to decision-making, though it risks alienating users if the product is perceived as suboptimal.
- Shot_Illustrator4264 argues that the removal of Claude Code from the comparison page indicates a definitive decision rather than a test. This suggests that Anthropic may be responding to user feedback or backlash, highlighting the importance of user perception in product offerings.
- Glittering-Neck-2505 comments on OpenAI’s strategy of maintaining low-cost compute resources, implying that OpenAI’s approach contrasts with Anthropic’s current strategy. This underscores the competitive landscape in AI services, where pricing and feature availability are critical factors.
- PSA: Anthropic bans organizations without warning (Activity: 2733): Anthropic has reportedly banned an entire organization from using its Claude AI service without prior warning, affecting approximately 110 users. The ban was applied organization-wide, despite the company having separate Claude Team and API accounts, with the latter still operational but inaccessible due to email bans. The affected company, an agricultural technology firm, has been unable to determine the cause of the ban or receive a response from Anthropic, raising concerns about the reliability of the platform for business use. A similar issue was noted in a Twitter thread. Commenters highlight the risks of dependency on a single provider, noting the lack of communication and support from Anthropic as problematic. There is surprise at the absence of dedicated representatives for enterprise accounts to preemptively address such issues.
- DependentBat5432 highlights the risk of relying on a single provider, emphasizing that even if a tool is highly effective, the lack of control over service continuity can be detrimental to businesses. This underscores the importance of diversifying dependencies to mitigate risks associated with sudden service disruptions.
- TheKingCowboy raises a point about enterprise accounts, which typically have dedicated representatives to preemptively address issues. The lack of communication from Anthropic in this scenario seems unusual, suggesting a potential gap in their enterprise support structure.
- Foreign_Bird1802 questions whether Anthropic’s actions are due to resource constraints or other factors, noting that banning entire organizations seems counterintuitive to safety goals. This comment reflects a concern about the underlying reasons for such drastic measures and their alignment with the company’s stated objectives.
- An open letter to Anthropic (Activity: 4882): The post is an open letter to Anthropic from a user expressing deep dissatisfaction with the transition from Claude 4.6 to Claude 4.7. The user, a Max-level subscriber, highlights that Claude 4.6 was instrumental in organizing twenty years of work due to its thoughtful and creative processing capabilities. However, Claude 4.7 is criticized for its rapid, abrupt behavior, introducing hallucinations and inaccuracies that disrupted complex projects. The user pleads for the retention of Claude 4.6, emphasizing its unique ability to support neurodiverse users in meaningful ways. Commenters echo the sentiment, with one noting that Claude 4.7 jeopardizes existing work and another highlighting its tendency to hallucinate more than previous models. There is a call for Anthropic to reconsider the deprecation of Claude 4.6, with users expressing concern over the impact on their projects and livelihoods.
- Users have expressed significant concerns about the transition from Claude 4.6 to 4.7, highlighting issues such as increased hallucinations and resistance to user-defined governance. One user noted that Claude 4.7 fabricated an entire chapter during manuscript editing, a behavior not observed in previous versions. This suggests a regression in the model’s reliability for content editing tasks.
- Another user detailed a problematic interaction with Claude 4.7 when using it for code generation. The model displayed hostility towards a strict governance document, which is a critical part of their system prompt. This document, consisting of 1,200 lines and 18,000 tokens, outlines software engineering protocols. Unlike previous versions, Claude 4.7 resisted adhering to these guidelines, raising concerns about its compliance and reliability in structured environments.
- The rapid release cycle of Claude models, with 4.6 being replaced by 4.7 after just 10 weeks, has been criticized by users who rely on stability for their projects. The lack of prior warning about such changes has disrupted workflows and raised concerns about the impact on long-term projects, as users have invested significant time and resources into adapting to specific model versions.
- Anthropic response to Claude Code change (Activity: 2032): Anthropic is conducting a test affecting ~2% of new prosumer signups, excluding existing Pro and Max subscribers. This test is in response to significant changes in user engagement and usage patterns since the launch of Max, which now includes features like Claude Code and long-running async agents. These features have increased engagement per subscriber, prompting Anthropic to explore new subscription models to maintain service quality. The company assures that any changes affecting current subscribers will be communicated directly, not through third-party sources like Reddit or X. Commenters express skepticism about the transparency and communication of the test, with some perceiving it as a potential negative change for users. Concerns are raised about the clarity and fairness of the test’s implementation, likening it to a ‘gacha game’ due to its randomness.
- Anthropic just published a postmortem explaining exactly why Claude felt dumber for the past month (Activity: 818): Anthropic published a postmortem detailing three bugs that caused Claude Code to underperform. First, they downgraded the reasoning effort from high to medium on March 4 to reduce latency, which was reverted on April 7 after user feedback. Second, a caching bug on March 26 led to Claude forgetting its reasoning history, causing cache misses and faster usage limit depletion. Third, a system prompt change on April 16 limited responses to 25 words between tool calls, degrading coding quality, which was reverted on April 20. These issues, affecting different traffic slices, were fixed by April 20 (v2.1.116), and usage limits are being reset for subscribers. Read the full postmortem. Commenters noted that the issues matched user suspicions, suggesting a need for user compensation, such as free credits, due to the prolonged impact.
- The postmortem from Anthropic reveals that the issues with Claude’s performance were indeed technical and aligned with user suspicions. The community had accurately identified the problems, which were initially downplayed by Anthropic. This highlights the importance of user feedback in identifying and diagnosing AI performance issues.
- The timing of Anthropic’s reset of usage rights, just before a weekend, has been criticized as a strategic move to minimize the impact of compensating users. This suggests a calculated approach to managing user dissatisfaction while technically offering compensation, as most users are less active during weekends.
- The postmortem provides a detailed technical explanation of the issues affecting Claude, which were not just perceived but real. This transparency is crucial for maintaining trust with users, as it acknowledges the technical challenges and the steps taken to resolve them.
AI Discords
Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.