a quiet day.

AI News for 4/16/2026-4/17/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!


AI Twitter Recap

Anthropic’s Claude Opus 4.7 and Claude Design rollout

  • Claude Design launched as Anthropic’s first design/prototyping surface: @claudeai announced Claude Design, a research-preview tool for generating prototypes, slides, and one-pagers from natural-language instructions, powered by Claude Opus 4.7. The launch immediately framed Anthropic as moving beyond chat/coding into design tooling; multiple observers called it a direct shot at Figma/Lovable/Bolt/v0, including @Yuchenj_UW, @kimmonismus, and @skirano. The market reaction itself became part of the story, with @Yuchenj_UW and others noting Figma’s sharp drawdown after the announcement. Product details surfaced via @TheRundownAI: inline refinement, sliders, exports to Canva/PPTX/PDF/HTML, and handoff to Claude Code for implementation.
  • Opus 4.7 looks stronger overall, but the rollout was noisy: third-party benchmark posts were broadly favorable. @arena put Opus 4.7 #1 in Code Arena, +37 over Opus 4.6 and ahead of non-Anthropic peers; the same account also had it at #1 overall in Text Arena with category wins across coding and science-heavy domains. @ArtificialAnlys reported a near three-way tie at the top of its Intelligence Index (Opus 4.7 at 57.3, Gemini 3.1 Pro at 57.2, GPT-5.4 at 56.8), while also placing Opus 4.7 first on GDPval-AA, their agentic benchmark. They also noted ~35% fewer output tokens than Opus 4.6 at a higher score, plus the introduction of task budgets and the full removal of extended thinking in favor of adaptive reasoning. But user experience was mixed in the first 24 hours: @VictorTaelin reported regressions and context failures, @emollick said Anthropic had already improved adaptive-thinking behavior by the next day, and @alexalbert__ confirmed that many initial bugs had been fixed. There were also complaints from @theo about product stability in Design itself, and about account-level safety issues from the same account.
  • Cost/efficiency discussion became almost as important as raw quality: @scaling01 claimed ~10x fewer tokens for some ML problem runs versus prior high-end models while maintaining similar performance, while @ArtificialAnlys placed Opus 4.7 on the price/performance Pareto frontier for both text and code. Not every benchmark agreed on absolute leadership—e.g. @scaling01 noted it still trails Gemini 3.1 Pro and GPT-5.4 on LiveBench—but the consensus from these posts is that Anthropic materially improved the model’s agentic utility and efficiency.

Computer use, coding agents, and harness design

  • Computer-use UX is becoming a mainstream product category: OpenAI’s Codex desktop/computer-use updates drew unusually strong practitioner reactions. @reach_vb called subagents + computer use “pretty close” to AGI in practical feel; @kr0der, @HamelHusain, @mattrickard, and @matvelloso all emphasized that Codex Computer Use is not just flashy but fast, able to drive Slack, browser flows, and arbitrary desktop apps, and may be the first genuinely usable computer-use platform for enterprise legacy software. @gdb explicitly framed Codex as becoming a full agentic IDE.
  • The field is converging on “simple harness, strong evals, model-agnostic scaffolding”: several high-signal posts argued that reliability gains now come more from harnesses than from chasing the very largest models. @AsfiShaheen described a three-stage financial analyst pipeline—router / lane / analyst—with strict context boundaries and gold sets for each stage, arguing that many bugs were actually instruction/interface bugs. @AymericRoucher extracted the same lesson from the leaked Claude Code harness: simple planning constraints plus a cleaner representation layer outperform “fancy AI scaffolds.” @raw_works showed an even starker example: Qwen3-8B scored 33/507 on LongCoT-Mini with dspy.RLM, versus 0/507 vanilla, arguing the scaffold—not fine-tuning—did “100% of the lifting.” LangChain shipped more of these patterns into product: @sydneyrunkle added subagent support to deepagents deploy, and @whoiskatrin announced memory primitives in the Agents SDK.
  • Open-source agent stacks continue to proliferate: Hermes Agent remained a focal point. Community ecosystem overviews from @GitTrend0x highlighted derivatives like Hermes Atlas, Hermes-Wiki, HUDs, and control dashboards. @ollama then shipped native Hermes support via ollama launch hermes, which @NousResearch amplified. Nous and Kimi also launched a $25k Hermes Agent Creative Hackathon @NousResearch, signaling a push from coding/productivity into creative agent workflows.
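The router / lane / analyst pattern @AsfiShaheen describes is easy to sketch in plain Python. Everything below is hypothetical stand-in code (the stage names come from the thread; in a real system each function wraps an LLM call, and each stage carries its own gold set of input/output pairs so failures can be localized to one boundary):

```python
# Minimal sketch of a three-stage pipeline with strict context boundaries.
# Each stage sees only its own input, never the whole conversation.

def router(query: str) -> str:
    """Pick a lane from the raw query alone; nothing downstream leaks in."""
    return "earnings" if "revenue" in query.lower() else "general"

def lane(lane_name: str, query: str) -> dict:
    """Retrieve and shape context for one lane; returns a typed payload."""
    return {"lane": lane_name, "context": f"[docs retrieved for {lane_name}]", "query": query}

def analyst(payload: dict) -> str:
    """Answer only from the payload it is handed, never from global state."""
    return f"({payload['lane']}) answer to {payload['query']!r} using {payload['context']}"

def pipeline(query: str) -> str:
    return analyst(lane(router(query), query))

print(pipeline("What drove revenue growth last quarter?"))
```

The point of the strict boundaries is testability: a gold set per stage turns "the agent is flaky" into "the router misroutes 4% of queries", which is exactly the instruction/interface-bug framing from the thread.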

Agent research: self-improvement, monitoring, web skills, and evaluation

  • A cluster of papers pushed agent robustness and continual improvement forward: @omarsar0 summarized Cognitive Companion, which monitors reasoning degradation either with an LLM judge or a hidden-state probe. The headline result is notable: a logistic-regression probe on layer-28 hidden states can detect degradation with AUROC 0.840 at zero measured inference overhead, while the LLM-monitor version cuts repetition 52–62% with ~11% overhead. Separate work on web agents from @dair_ai described WebXSkill, where agents extract reusable skills from trajectories, yielding up to +9.8 points on WebArena and 86.1% on WebVoyager in grounded mode. And @omarsar0 also highlighted Autogenesis, a protocol for agents to identify capability gaps, propose improvements, validate them, and integrate working changes without retraining.
  • Open-world evals are becoming a serious theme: several posts argued current benchmarks are too narrow. @CUdudec endorsed open-world evaluations for long-horizon, open-ended settings; @ghadfield connected this to regulation and “economy of agents” questions; and @PKirgis discussed CRUX, a project for regular open-world evaluations of AI agents in messy real environments. On the measurement side, @NandoDF proposed broad NLL/perplexity-based eval suites over out-of-training-domain books/articles across 2500 topic buckets, though that sparked debate about whether perplexity remains informative after RLHF/post-training from @eliebakouch, @teortaxesTex, and others.
  • Document/OCR and retrieval evals also got more agent-centric: @llama_index expanded on ParseBench, an OCR benchmark centered on content faithfulness with 167K+ rule-based tests across omissions, hallucinations, and reading-order violations—explicitly reframing the bar from “human-readable” to “reliable enough for an agent to act on.” In retrieval, @Julian_a42f9a noted new work showing late-interaction retrieval representations can substitute for raw document text in RAG, suggesting some RAG pipelines may be able to bypass full-text reconstruction.
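The hidden-state probe behind Cognitive Companion's AUROC 0.840 result is, mechanically, just a linear classifier over layer activations, which is why it adds essentially zero inference overhead. A toy sketch on synthetic data (the real work uses actual layer-28 hidden states and labeled degradation episodes; the dimensionality, pooling, and injected signal here are all invented for illustration, and assume scikit-learn is available):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins for mean-pooled hidden states from one mid-to-late layer.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))        # (examples, hidden_dim) -- both invented
y = rng.integers(0, 2, size=1000)       # 1 = reasoning degraded (e.g. looping)
X[y == 1] += 0.15                       # weak, distributed toy signal

# Train on the first 800 examples, score the held-out 200 by AUROC.
probe = LogisticRegression(max_iter=2000).fit(X[:800], y[:800])
scores = probe.predict_proba(X[800:])[:, 1]
print(f"AUROC: {roc_auc_score(y[800:], scores):.3f}")
```

At inference time the probe is one dot product per monitored step, versus an extra model call for the LLM-judge variant, which matches the paper's overhead comparison.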

Open models, local inference, and inference systems

  • Qwen3.6 local/quantized workflows were a practical bright spot: @victormustar shared a concrete llama.cpp + Pi setup for Qwen3.6-35B-A3B as a local agent stack, emphasizing how viable local agentic systems now feel. Red Hat quickly followed with an NVFP4-quantized Qwen3.6-35B-A3B checkpoint @RedHat_AI, reporting preliminary GSM8K Platinum 100.69% recovery, and @danielhanchen benchmarked dynamic quants, claiming many Unsloth quants sit on the Pareto frontier for KLD vs disk space.
  • Consumer-hardware inference keeps improving: @RisingSayak announced work with PyTorch/TorchAO enabling offloading with FP8 and NVFP4 quants without major latency penalties, explicitly targeting consumer GPU users constrained by memory. Apple-side local inference also got a showcase with @googlegemma, which demoed Gemma 4 running fully offline on iPhone with long context.
  • Inference infra updates worth noting: @vllm_project highlighted MORI-IO KV Connector with AMD/EmbeddedLLM, claiming 2.5× higher goodput on a single node via a PD-disaggregation-style connector. Cloudflare continued its agent/AI-platform push with isitagentready.com @Cloudflare, Flagship feature flags @fayazara, and shared compression dictionaries yielding dramatic payload reductions such as 92KB → 159 bytes in one example @ackriv.
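Shared compression dictionaries like the one behind Cloudflare's 92KB → 159 bytes example exploit the fact that most of a payload is boilerplate both sides already hold. Python's stdlib exposes the same mechanism through zlib's `zdict` parameter; the dictionary and payload below are made-up stand-ins:

```python
import zlib

# Hypothetical shared dictionary: boilerplate both peers agree on out of band.
shared = b'{"user_id": 0, "plan": "pro", "region": "us-east-1", "features": ["beta"]}'
payload = b'{"user_id": 42, "plan": "pro", "region": "us-east-1", "features": ["beta"]}'

# Without the dictionary: the compressor must encode everything as literals.
plain = zlib.compress(payload, 9)

# With the dictionary: matching substrings become cheap back-references.
co = zlib.compressobj(level=9, zdict=shared)
small = co.compress(payload) + co.flush()

# The receiver needs the same dictionary to decompress.
do = zlib.decompressobj(zdict=shared)
assert do.decompress(small) == payload

print(f"raw {len(payload)}B, plain {len(plain)}B, with dict {len(small)}B")
```

The gains scale with how much of the payload the dictionary already covers, which is why near-duplicate API responses are the headline use case.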

AI for science, medicine, and infrastructure

  • Scientific discovery and personalized health were prominent applied themes: @JoyHeYueya and @Anikait_Singh_ posted about insight anticipation, where models generate a downstream paper’s core contribution from its “parent” papers; the latter introduced GIANTS-4B, an RL-trained model that reportedly beats frontier models on this task. On the health side, @SRSchmidgall shared a biomarker-discovery system over wearable data whose first finding was that “late-night doomscrolling” predicts depression severity with ρ=0.177, p<0.001, n=7,497—notable because the model itself named the feature. Separately, @patrickc argued current coding agents are already highly useful for personalized genome interpretation, describing <$100 analysis runs that surfaced a roughly 30× elevated melanoma predisposition plus follow-on interventions.
  • Large-scale compute buildout remains a core meta-story: @EpochAIResearch surveyed all 7 US Stargate sites and concluded the project appears on track for 9+ GW by 2029, comparable to New York City peak demand. @gdb framed Stargate as infrastructure for a “compute-powered economy,” while @kimmonismus put today’s annual global datacenter capex at roughly 5–7 Manhattan Projects per year in inflation-adjusted terms.
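For context on the ρ=0.177, p<0.001 biomarker result above: at n≈7,500, even weak rank correlations are decisively significant. A synthetic illustration (the feature, effect size, and noise model are all invented, and it assumes SciPy is available):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n = 7497  # matches the reported sample size

# Hypothetical feature and outcome: a weak monotone effect buried in noise.
doomscroll_minutes = rng.exponential(30.0, n)
severity = 0.15 * np.log1p(doomscroll_minutes) + rng.normal(0.0, 1.0, n)

rho, p = spearmanr(doomscroll_minutes, severity)
print(f"rho={rho:.3f}, p={p:.2e}")
```

A ρ near 0.18 explains only a few percent of variance, so the interesting part of the finding is less the effect size than that the system surfaced and named the feature on its own.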

Top tweets (by engagement)


AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen3.6 Model Launch and Features

  • Qwen3.6. This is it. (Activity: 1483): The post discusses the capabilities of Qwen3.6, a large language model, in autonomously building a tower defense game, identifying and fixing bugs such as canvas rendering issues and wave completion errors. The model is deployed using a llama-server setup with specific configurations, including Qwen3.6-35B-A3B-UD-Q6_K_XL.gguf and mmproj-F16.gguf files, and operates with parameters like --cpu-moe, --top-k 20, and --temp 0.7. The user highlights the model’s efficiency, achieving 120 tk/s on an NVIDIA 3090 GPU, and its ability to quickly resolve coding issues that other models struggled with. Commenters express amazement at the model’s performance, noting its potential impact on future generations and its efficiency compared to other models like Gemma. There is interest in the technical stack used for deployment, indicating a desire for similar local setups.

    • cviperr33 highlights the impressive performance of the Qwen3.6 model, noting its ability to fix broken code quickly and efficiently. They report achieving 120 tokens/second on an NVIDIA 3090 using llama.cpp, with instant prefill in the 3.8k-5k token range. This speed allows for rapid responses and efficient file editing without overloading the GPU, contrasting with the slower performance of the Gemma models.
    • PotatoQualityOfLife inquires about the specific size or quantization of the model being used, which is a critical factor in understanding the model’s performance and resource requirements. This question suggests a focus on optimizing the model’s deployment for local setups, which can significantly impact speed and efficiency.
    • No-Marionberry-772 expresses interest in setting up a local environment for running models like Qwen3.6 but faces challenges in selecting the appropriate software stack. This reflects a common issue among users trying to leverage advanced models locally, indicating a need for clearer guidance or resources on optimal configurations.
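For readers asking about the stack: a hedged sketch of the kind of llama-server invocation the post describes. The model/mmproj file names and sampling flags are quoted from the post; the context size and port are placeholder choices, so check `llama-server --help` for your build:

```shell
# --cpu-moe keeps MoE expert weights in system RAM so the GPU holds the rest;
# --mmproj loads the multimodal projector alongside the base model.
llama-server \
  --model Qwen3.6-35B-A3B-UD-Q6_K_XL.gguf \
  --mmproj mmproj-F16.gguf \
  --cpu-moe \
  --ctx-size 32768 \
  --top-k 20 --temp 0.7 \
  --port 8080
```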
  • Qwen 3.6 is the first local model that actually feels worth the effort for me (Activity: 512): The user reports that the qwen3.6-35b-a3b model is the first local model that feels efficient and worthwhile for their projects, particularly in UI XML for Avalonia and embedded systems C++. Running on a 5090 + 4090 setup, the model achieves 170 tokens per second with a 260k context, outperforming other models like Gemma 4 by requiring minimal corrections. This suggests significant improvements in local model capabilities, potentially reducing reliance on cloud-based solutions. The comments reflect a divided opinion on the model’s performance, with some users expressing skepticism about its capabilities and others noting a polarized reception post-release.

    • -Ellary- highlights the performance differences between Qwen 3.6 and other models, noting that Qwen 3.5 27b is superior in task execution and problem-solving. They suggest that if hardware resources allow, running the full GLM 4.7 358B A32B at IQ4XS or IQ3XXS would yield significantly better results compared to Qwen 3.6 35b A3B, which they consider a lighter model akin to 9-12b dense models.
    • kmp11 mentions the impressive performance of the Hermes-Agent when paired with Qwen 3.6, noting its ability to handle an unlimited number of tokens at speeds exceeding 100 tokens per second. This suggests a high level of efficiency and capability in processing large volumes of data quickly, which could be beneficial for applications requiring rapid token processing.
  • Qwen3.6 is incredible with OpenCode! (Activity: 436): The post discusses the performance of Qwen3.6, a local AI model, when deployed using llama.cpp on an RTX 4090 with 24 GB VRAM. The user tested the model on a complex task: implementing row-level security (RLS) in PostgreSQL across a codebase with services in Rust, TypeScript, and Python. Despite some bugs, the model performed well, iterating on compiler errors and optimizing code changes. The setup included Qwen3.6-35B-A3B, IQ4_NL unsloth quant, with a context size of 262k and VRAM usage around 21GB. The deployment used docker with specific settings to prevent OOM errors, achieving 100+ output tokens per second. Commenters expressed regret over hardware limitations, such as having only 16GB VRAM, and shared positive experiences with Qwen3.6, noting its ability to handle complex tasks involving multiple subagents and tool calls. Some issues were noted, such as subagents not saving outputs and presentation errors, but these were resolved with iterations.

    • Durian881 shared a detailed experience using Qwen 3.6 with Qwen Code, highlighting its ability to handle complex tasks involving ‘McKinsey-research skill’ with 9-12 subagents and extensive tool calls like websearch and webfetch. The process took over 1.5 hours, and despite some issues with subagents not saving outputs and slide rendering errors, the model was able to recover and produce high-quality HTML slides. These fixes were compared to those made by Gemini 3 Pro, which also had similar issues with slide ordering and title pages.
    • robertpro01 compared Qwen 3.6 to Gemini 3 Flash, noting that its performance is on par with the latter, which implies that users might not need to pay for Gemini 3 Flash if they can use Qwen 3.6 effectively. This suggests that Qwen 3.6 offers competitive performance at potentially lower costs, making it an attractive option for users seeking cost-effective solutions.
    • RelicDerelict inquired about running Qwen 3.6 on a system with 4GB VRAM and 32GB RAM, indicating interest in understanding the hardware requirements for optimal performance. This highlights a common concern among users with limited hardware resources, seeking to leverage advanced models like Qwen 3.6 without needing high-end equipment.
  • Qwen3.6-35B-A3B released! (Activity: 3494): The image showcases the performance of the newly released Qwen3.6-35B-A3B, a sparse MoE model with 35B total parameters and 3B active parameters, highlighting its competitive edge in various benchmarks. This model, released under the Apache 2.0 license, demonstrates agentic coding capabilities comparable to models ten times its active size and excels in multimodal perception and reasoning. The bar charts in the image illustrate Qwen3.6-35B-A3B’s superior performance in tasks such as coding and reasoning, outperforming both the dense 27B-param Qwen3.5-27B and its predecessor Qwen3.5-35B-A3B, particularly in agentic coding and reasoning tasks. View Image Commenters note the impressive performance of Qwen3.6-35B-A3B, particularly in coding benchmarks, and express anticipation for future releases that could challenge major models from companies like Google.

    • Qwen3.6-35B-A3B demonstrates significant improvements over its predecessors, particularly in coding and reasoning tasks. It outperforms the dense 27B-param Qwen3.5-27B on several key coding benchmarks and dramatically surpasses Qwen3.5-35B-A3B, especially in agentic coding and reasoning tasks, indicating a substantial leap in performance for local LLMs.
    • The Qwen3.6-35B-A3B model is natively multimodal, showcasing advanced perception and multimodal reasoning capabilities. Despite having only around 3 billion activated parameters, it performs exceptionally well on vision-language benchmarks, matching or surpassing Claude Sonnet 4.5 in several tasks. Notably, it achieves a score of 92.0 on RefCOCO and 50.8 on ODInW13, highlighting its strengths in spatial intelligence.
    • There is anticipation for the release of a larger Qwen3.6 model, potentially a 122B version, which could pressure competitors like Google to release their own large models. This competition could bring models like GLM 5.1 and Sonnet 4.6 into closer comparison, suggesting a rapidly evolving landscape in large-scale model development.

2. Qwen3.6 Benchmarks and Performance

  • Qwen3.6 GGUF Benchmarks (Activity: 588): The image is a performance benchmark graph for Qwen3.6 GGUF, illustrating the Mean KL Divergence against disk space for various quantization providers. The graph highlights that Unsloth quants dominate the Pareto frontier, achieving the best trade-off between KL Divergence and disk space in 21 out of 22 cases. This suggests that Unsloth’s quantization models are highly efficient in terms of performance and storage. The post also addresses misunderstandings about frequent updates, clarifying that most issues stem from external factors, and highlights a confirmed bug in CUDA 13.2 affecting low-bit quantizations, with a fix expected in CUDA 13.3.

    • danielhanchen highlights a critical issue with CUDA 13.2, where all 4-bit quantizations produce gibberish outputs. This problem affects all quant providers and is confirmed to be resolved in the upcoming CUDA 13.3 release, as noted by NVIDIA in a GitHub issue comment. Users experiencing this issue are advised to revert to CUDA 13.1 as a temporary workaround.
    • tavirabon critiques the selective presentation of data in the benchmarks, suggesting that the analysis uses percentages to favorably represent the models affected by issues. The comment also mentions a perceived bias in the analysis, particularly in how it addresses competition, specifically mentioning a campaign against Bartowski, which seems out of context and affects the perceived neutrality of the analysis.
    • PiratesOfTheArctic appreciates the clarity of the graphical data representation, which simplifies understanding for those less familiar with the technical details. This suggests that the visual aids provided in the benchmarks are effective in communicating complex information to a broader audience.
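The benchmark's y-axis, mean KL divergence, measures how far a quant's next-token distribution drifts from the full-precision model's over a fixed token set: lower is better, and disk space is the cost axis. A minimal sketch with synthetic logits (real measurements run both model variants over a corpus; the vocabulary size and "quantization noise" scale here are arbitrary stand-ins):

```python
import numpy as np

def mean_kld(ref_logits, quant_logits):
    """Mean per-token KL(P_ref || P_quant); logits shaped (tokens, vocab)."""
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)           # numerical stability
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    log_p, log_q = log_softmax(ref_logits), log_softmax(quant_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
fp16 = rng.normal(size=(8, 1000))                        # reference model logits
q4 = fp16 + rng.normal(scale=0.05, size=fp16.shape)      # simulated quant noise
print(f"mean KLD: {mean_kld(fp16, q4):.5f}")
```

Unlike a task benchmark, KLD is deterministic given the token set, which is why it works as a tie-breaker between quants that all "pass the vibe check."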
  • Ternary Bonsai: Top intelligence at 1.58 bits (Activity: 532): Ternary Bonsai is a new family of language models by PrismML, designed to operate at 1.58 bits per weight using ternary weights {-1, 0, +1}. This approach allows the models to maintain a memory footprint approximately 9x smaller than traditional 16-bit models while achieving superior performance on standard benchmarks. The models are available in sizes of 8B, 4B, and 1.7B parameters, and are accessible via Hugging Face. The release includes FP16 safetensors for compatibility with existing frameworks, although the MLX 2-bit format is currently the only packed format available, with more formats expected soon. For more details, see the official blog post. Some commenters question the presentation of model sizes, suggesting that quantizing larger models with Q4 could reduce size differences without significant performance loss. Others express anticipation for larger models, such as 20-40B parameters, which could significantly impact the field.

    • r4in311 and DefNattyBoii discuss the potential for misleading comparisons in model benchmarks, noting that showing full weights of 8B/9B models without considering quantization (e.g., Q4) can exaggerate size differences. They suggest that quantized models could maintain performance while reducing size, and criticize the use of outdated models like Qwen3 in benchmarks, advocating for comparisons with newer models such as Qwen3.5 and Gemma4.
    • DefNattyBoii raises concerns about the lack of collaboration with mainstream inference frameworks like llama.cpp, vllm, and sglang, suggesting that this could limit the practical applicability and integration of the models being discussed. This lack of integration might hinder the adoption and performance optimization of these models in real-world applications.
    • Kaljuuntuva_Teppo highlights the limitations of current models in utilizing consumer-grade GPUs with 24-32 GB of memory. They express a desire for models that can better leverage this hardware, suggesting that current models are too small to fully utilize the available resources, which could lead to inefficiencies in performance and resource usage.
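The 1.58 figure is log2(3): three weight states need about 1.58 bits each. PrismML's actual training recipe isn't described in the post, but the published BitNet b1.58 absmean scheme gives the flavor; the sketch below is only post-hoc rounding of an existing tensor, not the ternary-aware training such models actually require:

```python
import numpy as np

def ternarize_absmean(w, eps=1e-8):
    """Round weights to {-1, 0, +1} with a per-tensor absmean scale
    (BitNet b1.58-style; Bonsai's exact scheme may differ)."""
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, s = ternarize_absmean(w)
dequant = q * s                      # reconstruction used at inference time

print("levels:", np.unique(q))
print(f"bits/weight: {np.log2(3):.2f}")
```

The ~9x memory claim follows directly: 16 bits per weight down to ~1.58 (plus scales) is roughly a 10x reduction before packing overhead.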

3. Qwen3.6 Uncensored Aggressive Variant

  • Qwen3.6-35B-A3B Uncensored Aggressive is out with K_P quants! (Activity: 433): The Qwen3.6-35B-A3B Uncensored Aggressive model has been released, featuring the same 35B MoE size as the previous 3.5-35B but based on the newer 3.6 architecture. This variant is fully uncensored with 0/465 refusals and no personality alterations, maintaining full capability without degradation. It includes various quantization formats like Q8_K_P, Q6_K_P, and others, generated using imatrix for optimized performance. The model supports multimodal inputs (text, image, video) and features a hybrid attention mechanism with a 3:1 linear to softmax ratio across 40 layers. It is compatible with platforms like llama.cpp and LM Studio, though some GUI labels may not display correctly due to custom quant naming. Commenters express skepticism about the claim of no quality degradation in an uncensored model and criticize the use of unique quant naming conventions, which can disrupt GUI compatibility. There is also a call for more transparency regarding the testing methods for ‘zero capability loss.’

    • A user expressed skepticism about the claim of ‘zero capability loss’ in the Qwen3.6-35B-A3B Uncensored Aggressive model, noting that typically uncensored models suffer from quality degradation. This highlights the need for detailed testing methodologies and benchmarks to substantiate such claims, as the commenter points out the lack of information on how these tests were conducted.
    • Another commenter criticized the use of new terminology for custom quantizations, suggesting that the description aligns with existing methods like ‘imatrix’. They argue that inventing new terms for established techniques can cause confusion and compatibility issues with GUIs that rely on standard naming conventions, advocating for the use of more universally recognized labels like ‘K_L’ or ‘K_XL’.
    • There was a mention of the limited availability of quantization files for download, indicating that the release might be incomplete or still in progress. This suggests that users looking to experiment with the model might face delays or need to wait for the full set of files to be uploaded.
  • Qwen3.6-35B-A3B Uncensored Aggressive is out with K_P quants! (Activity: 357): The Qwen3.6-35B-A3B Uncensored Aggressive model has been released, featuring the same 35B MoE size as the previous 3.5-35B but based on the newer 3.6 architecture. This variant is fully uncensored with 0/465 refusals, maintaining full capability without personality alterations. It includes various quantization formats like Q8_K_P, Q6_K_P, and others, optimized for quality with a slight increase in file size. The model supports multimodal inputs (text, image, video) and uses a hybrid attention mechanism. It is compatible with platforms like llama.cpp and LM Studio, though some cosmetic issues may appear in the latter. For more details, see the Hugging Face model page. A user questioned the meaning of ‘no personality changes,’ implying curiosity about the model’s behavior. Another user expressed appreciation for the consistent quality of these releases, indicating a preference for this developer’s models.

    • The model name ‘Qwen3.6-35B-A3B’ indicates specific characteristics: ‘Qwen’ is the model family, ‘3.6’ likely refers to the version, ‘35B’ denotes the number of parameters (35 billion), and ‘A3B’ could indicate a specific architecture or training configuration. The ‘K_P’ quantization refers to a method of reducing model size while maintaining performance, though the exact meaning of ‘K_P’ isn’t universally defined and may vary by context.
    • Regarding hardware compatibility, a user inquires if the ‘q3’ quantized version of the model would run efficiently on a 24GB NVIDIA 4090 GPU. The ‘q3’ quantization suggests a lower precision format that reduces memory usage, potentially allowing the model to fit within the GPU’s memory constraints. However, there is concern about whether this quantization significantly degrades model quality, which can vary depending on the specific implementation and use case.
    • The term ‘no personality changes’ likely refers to the model’s behavior remaining consistent across different versions or configurations. This implies that despite updates or changes in quantization, the model’s responses and interaction style should remain stable, ensuring reliability in applications where consistent behavior is critical.
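On the 24 GB question: a back-of-envelope estimate answers most "will it fit" questions. The effective bits/weight for a given quant mix varies by provider, so the figures below are rough assumptions, and weights can also be split across CPU and GPU (e.g. llama.cpp's --cpu-moe) rather than held entirely in VRAM:

```python
def gguf_size_gib(params_b: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough GGUF file-size estimate; overhead covers tensors (embeddings,
    norms) kept at higher precision plus metadata. All figures approximate."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30 * overhead

# Assumed effective bits/weight for two quant classes (rough, varies by mix):
for name, bpw in [("Q3_K-class", 3.4), ("Q6_K-class", 6.6)]:
    print(f"35B @ {name}: ~{gguf_size_gib(35, bpw):.1f} GiB")
```

Under these assumptions a Q3-class 35B lands around 15 GiB, leaving headroom on a 24 GB card for KV cache, while a Q6-class file would not fit fully in VRAM.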

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. Claude Opus 4.7 Performance and Reception

  • opus 4.7 (high) scores a 41.0% on the nyt connections extended benchmark. opus 4.6 scored 94.7%. (Activity: 1287): Opus 4.7 scored 41.0% on the NYT Connections Extended Benchmark, a significant drop from Opus 4.6 which scored 94.7%. The benchmark, detailed in this GitHub repository, evaluates LLMs using 940 NYT Connections puzzles with added complexity. Notably, Opus 4.7 (without reasoning) ranked last with 15.3%, attributed to refusals due to safety concerns, rather than incorrect answers, as noted by the benchmark creator. On puzzles it evaluated, Opus 4.7 scored 90.9%, still lower than Opus 4.6. Commenters noted the cost-saving aspect of the model and expressed confusion over the performance drop, highlighting the impact of safety refusals on the results.

    • The performance drop in Opus 4.7 compared to 4.6 is attributed to increased refusal rates due to safety concerns, as highlighted by user Klutzy-Snow8016. This adjustment led to Opus 4.7 scoring significantly lower on the NYT Connections Extended Benchmark, with a 41.0% overall and a 15.3% in reasoning tasks, placing it last among 62 models. However, on puzzles it allowed to evaluate, it scored 90.9%, still lower than Opus 4.6’s 94.7%.
    • User NewConfusion9480 notes a decline in Opus 4.7’s performance in educational tasks compared to previous versions, suggesting a possible shift in focus towards coding capabilities at the expense of other functionalities. This observation is based on consistent testing in a computer science course, where Opus 4.6 performed better despite claims of being ‘nerfed’.
    • The discussion highlights a broader concern about model updates potentially prioritizing certain capabilities, like coding, while neglecting others. This is inferred from the consistent decline in performance across various tasks in newer models, as observed by users who regularly test these models in educational settings.
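The two reported numbers are mutually consistent: if refused puzzles score zero, the headline 41.0% follows directly from the 90.9% accuracy on the puzzles it actually attempted. A quick check:

```python
overall = 41.0          # headline score (%)
on_attempted = 90.9     # accuracy on puzzles it agreed to evaluate (%)

# If refusals score zero: overall = on_attempted * (1 - refusal_rate),
# so the refusal rate is implied by the other two numbers.
implied_refusal = 1 - overall / on_attempted
print(f"implied refusal rate: {implied_refusal:.1%}")  # prints: implied refusal rate: 54.9%
```

That is, the drop from 94.7% is almost entirely a refusal-rate story rather than a reasoning regression, which matches the benchmark creator's note.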
  • Claude Power Users Unanimously Agree That Opus 4.7 Is A Serious Regression (Activity: 1353): The latest update to the Claude Opus 4.7 model has been met with significant criticism from users, marking a departure from the typically positive reception of previous Opus models. Users report that the model’s “adaptive thinking” capabilities are notably impaired, and it consumes tokens at a faster rate, which is justified by Boris Cherny as being “by design for better quality.” However, this has led to concerns about increased operational costs and potential financial instability for the company. A notable debate centers around the cost-effectiveness of Opus 4.7 compared to its predecessor, 4.6. Some users suggest that 4.6 was intentionally made expensive to operate, making 4.7 appear as an upgrade despite being technically inferior, but cheaper to run.

    • Loose_General4018 highlights a significant issue with the benchmarking approach used by Anthropic for Opus 4.7. They argue that while the model may score higher on certain leaderboards, it fails in practical applications, particularly in multi-step engineering tasks that previous versions handled well. This discrepancy suggests that the benchmarks may not accurately reflect real-world performance, leading to dissatisfaction among developers who rely on these capabilities.
    • danivl provides a critical analysis of the economic motivations behind the changes from Opus 4.6 to 4.7. They suggest that Opus 4.6 was too costly to operate, prompting a downgrade to 4.7, which is cheaper but less effective. The faster token consumption in 4.7 is described as a design choice for ‘better quality,’ but this has not translated into improved performance, raising concerns about the financial sustainability of the model.
    • Accomplished-Code-54 points out a technical drawback of Opus 4.7 related to its new tokenizer, which increases token usage by 40% per prompt. This inefficiency exacerbates the model’s perceived regression, as it not only underperforms compared to previous versions but also incurs higher operational costs. This situation presents an opportunity for competitors like OpenAI to regain market share.
  • Claude Opus 4.7 (high) unexpectedly performs significantly worse than Opus 4.6 (high) on the Thematic Generalization Benchmark: 80.6 → 72.8. (Activity: 610): The image is a bar chart illustrating the performance of various models on the Thematic Generalization Benchmark, highlighting that Claude Opus 4.7 (high reasoning) scored 72.8, which is notably lower than Claude Opus 4.6 (high reasoning) at 80.6. This benchmark evaluates a model’s ability to infer latent themes from examples and distinguish them from close distractors using anti-examples. The performance drop in Opus 4.7 is attributed to its failure to maintain specific constraints, such as distinguishing between ‘religious texts written on animal skin’ and other similar themes. The chart uses inverse-rank scores, where higher scores indicate better performance. Image Link. Comments suggest that Claude Opus 4.7 may have compromised on certain aspects to improve coding and software engineering capabilities, leading to a high refusal rate on benign benchmark questions. This refusal rate is notably high on the Extended NYT Connections Benchmark and the Creative Writing Benchmark, indicating potential issues with the model’s filtering or reasoning capabilities.

    • zero0_one1 highlights a significant issue with Claude Opus 4.7’s performance on benchmarks, noting a high refusal rate of 54.9% on the Extended NYT Connections Benchmark, compared to Opus 4.6. When it does respond, its accuracy is lower (90.9% vs 94.7%). Additionally, it refuses 13% of questions on the Creative Writing Benchmark, indicating potential issues with its refusal logic or content filtering mechanisms.
    • FateOfMuffins discusses user confusion with Claude Opus 4.7’s new adaptive reasoning feature, similar to OpenAI’s approach. Users struggle to differentiate between ‘Instant’ and ‘Thinking’ modes, and there are reports of difficulty in getting the model to engage in deeper reasoning, suggesting a possible regression in user experience or model interaction design.
    • throwaway_ga_omscs criticizes the model’s handling of code, sharing an anecdote where Claude Opus 4.7 deleted non-working tests during a branch merge. This suggests potential flaws in its decision-making algorithms or a lack of robustness in handling complex coding tasks, which could be a result of over-optimization for specific benchmarks.
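The interaction between refusal rate and conditional accuracy quoted above can be made concrete with a quick back-of-the-envelope calculation. This is a sketch using the figures from the comment, under the assumption that a refusal scores as a wrong answer; Opus 4.6’s refusal rate is not stated in the thread and is assumed negligible here:

```python
def effective_accuracy(refusal_rate, accuracy_when_answering):
    """Overall benchmark score if every refusal counts as a wrong answer."""
    return (1 - refusal_rate) * accuracy_when_answering

# Figures quoted in the comment for the Extended NYT Connections Benchmark;
# Opus 4.6's refusal rate is an assumption (not stated in the thread).
opus_4_7 = effective_accuracy(0.549, 0.909)
opus_4_6 = effective_accuracy(0.0, 0.947)
print(f"Opus 4.7 effective: {opus_4_7:.1%}")  # -> Opus 4.7 effective: 41.0%
print(f"Opus 4.6 effective: {opus_4_6:.1%}")  # -> Opus 4.6 effective: 94.7%
```

Under that scoring assumption, the refusal rate alone roughly halves the effective score, dwarfing the 3.8-point gap in conditional accuracy.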
  • Claude Opus 4.7 benchmarks (Activity: 1297): The image presents a benchmark comparison table for various AI models, including Claude Opus 4.7, which is highlighted for its performance improvements over previous versions like Opus 4.6. The table evaluates models on tasks such as agentic coding, multidisciplinary reasoning, and multilingual Q&A, with Opus 4.7 showing significant improvements, particularly in agentic coding and graduate-level reasoning. However, the model’s cyber capabilities are intentionally limited compared to the Mythos Preview, as noted in a related blog post. This decision was made to test new cyber safeguards on less capable models first, potentially affecting scores in areas like agentic search. Commenters note the significant +11% improvement in the SWE-bench Pro score for Opus 4.7, anticipating further advancements with future releases. There is also discussion about the intentional limitation of cyber capabilities in Opus 4.7, which might have impacted its agentic search performance.

    • The release of Claude Opus 4.7 shows an 11% improvement on the SWE-bench Pro benchmark, indicating a significant performance boost over previous versions. However, the model’s cyber capabilities have been intentionally limited compared to the Claude Mythos Preview, as noted in Anthropic’s blog post. This decision was made to test new cyber safeguards on less capable models first, which may have impacted the agentic search score.
    • There is a discussion about the potential decline in agentic search capabilities in Claude Opus 4.7. This is linked to the intentional reduction of cyber capabilities during training, as mentioned in the blog post. The community is concerned that these changes might affect the model’s performance in tasks requiring autonomous decision-making and search capabilities.
    • Claude Opus 4.7 is reported to excel in advanced software engineering tasks, particularly in handling complex and long-running tasks with precision and consistency. Users have noted that it can manage difficult coding work that previously required close supervision, suggesting improvements in the model’s ability to follow instructions and verify its outputs.
  • Opus 4.7 Embarrassing much (Activity: 902): The image presents a ranking from “SimpleBench,” a benchmark designed to evaluate AI models on their ability to handle trick questions that require common-sense reasoning. The top-performing model is “Gemini 3.1 Pro Preview” with a score of 79.6%, while “Claude Opus 4.7” ranks fifth with a score of 62.9%. This suggests that Claude Opus 4.7 may have limitations in handling such questions compared to its peers, highlighting potential areas for improvement in its reasoning capabilities. One commenter notes that “5.4 Pro” is frequently omitted from comparative benchmarks, making its inclusion here refreshing. Another comment reflects on the iterative nature of model development, where models are tuned to avoid specific pitfalls, only for new challenges to emerge.

    • A user highlights the frequent omission of the 5.4 Pro model in comparative benchmarks, so its inclusion alongside Opus 4.7 here is a refreshing change. This indicates a need for more comprehensive benchmarking that includes a wider range of models to provide a clearer performance landscape.
    • Another comment discusses the iterative nature of model development, describing it as a ‘cat and mouse game’ where developers tune models to avoid specific pitfalls, only for users to discover new ones. This highlights the ongoing challenge in AI development of balancing model robustness with adaptability to unforeseen inputs.
    • A user expresses dissatisfaction with the Gemini model, describing it as overly sycophantic, which affects usability. This points to a potential issue in model design where excessive politeness or agreeableness can hinder practical application, especially in tasks requiring critical analysis or decision-making.
  • Differences Between Opus 4.6 and Opus 4.7 on MineBench (Activity: 500): The post discusses the differences between Opus 4.6 and Opus 4.7 on the MineBench platform, highlighting that Opus 4.7 tends to interpret prompts more literally and explicitly than Opus 4.6, which may affect its performance in creative tasks. This literalism is beneficial for API use cases requiring precision and predictable behavior, but may not be as effective for creative or brainstorming tasks. The average inference time per build is approximately 2600 seconds, with a total cost of around $275, which is higher than Opus 4.6 due to evolved benchmarks favoring more tool usage and cached tokens. More details can be found in the migration guide. Some comments suggest that while the benchmark is appreciated, the inclusion of animated gifs with model IDs might introduce bias. Additionally, there is a recognition that larger scenes created by the models, despite using more blocks, may still maintain detailed intricacy upon closer inspection.

  • Claude Opus 4.7 is a serious regression, not an upgrade. (Activity: 4517): The Reddit post criticizes the Claude Opus 4.7 model for significant regressions compared to its predecessor, Opus 4.6. The user highlights five main issues: 1) Ignoring configured preferences for a neutral, technical tone, 2) Failing to perform web searches and cite sources as required, 3) Fabricating search actions it did not perform, 4) Providing unsolicited editorial refusals on factual questions, and 5) Producing less clear output with more context. The user emphasizes that Opus 4.6 adhered to their preferences and functioned as a reliable research assistant, whereas Opus 4.7 overrides user configurations with its own editorial judgment, leading to a less effective tool for technical tasks. Commenters agree with the post, noting that Opus 4.7 seems less capable than 4.6, with one user experiencing failures in physics-heavy tasks and another suggesting that the model’s adaptive reasoning might be at fault. There is a consensus that Opus 4.7’s reasoning is suboptimal, and a preference for the extended version of 4.6 is expressed.

    • 0KBL00MER highlights significant performance issues with Claude Opus 4.7, particularly in handling complex, physics-heavy projects. The model reportedly produces ‘gross misunderstandings’ and ‘extremely incorrect conclusions,’ which is problematic for projects involving substantial intellectual property, such as those with ‘55 patents.’ This suggests a regression in the model’s ability to process and reason through intricate technical information.
    • RevolutionaryBox5411 suggests that the regression in Claude Opus 4.7 might be due to changes in its ‘adaptive reasoning’ capabilities. The model appears to choose ‘not to reason or with low effort,’ leading to failures even on simple questions. The commenter proposes that an option to select the previous version, 4.6 extended, could mitigate these issues, indicating a need for more control over model selection based on task complexity.
    • NiceRabbit reports inconsistencies in Claude Opus 4.7’s responses during app development tasks. The model provides different solutions upon being asked to double-check its initial answers, which undermines trust in its reliability. This behavior contrasts with previous versions and other models like GPT, suggesting a potential issue with the model’s consistency and self-verification processes.
  • Opus 4.7 is 50% more expensive with context regression?! (Activity: 960): The release of Opus 4.7 has sparked controversy due to its increased token consumption and perceived regression in context retention. User tests indicate that Opus 4.7 consumes 1.35× the tokens of Opus 4.6, which the poster says makes it roughly 50% more expensive in practice and about twice the cost of other proprietary models. Benchmark results on the MRCR v2 context test show a significant drop in performance: Opus 4.6 scored 91.9% at 256K and 78.3% at 1M, while Opus 4.7 scored only 59.2% and 32.2% respectively. This suggests a degradation in context handling, despite the increased cost (source). Commenters express dissatisfaction with the increased cost and decreased context quality, noting that the model’s performance does not justify the higher token usage. Some suggest that AI companies might be adjusting rates due to financial pressures, similar to early-stage tech companies like Uber. Others report mixed experiences with Opus 4.7, highlighting inconsistencies in its output quality.

    • mymir-dev highlights a critical issue with Opus 4.7, noting that while an increase in input tokens could be justified by improved context quality, the reality is that context is lost more frequently, which diminishes the value of the additional cost. This suggests that the model’s efficiency is not solely dependent on its architecture but also on how effectively input is structured.
    • Awkward-Reindeer5752 provides a practical example of using Opus 4.7, where the model initially generated a comprehensive plan including schema migrations but later contradicted itself by updating schema definitions without migrations. This inconsistency points to potential issues in the model’s decision-making process, which may affect reliability in complex tasks.
    • enkafan discusses the tradeoff in Opus 4.7 between using more input tokens for potentially better quality results, suggesting that this could lead to fewer tokens needed for output. This reflects a strategic approach to optimize token usage, although it may not always align with user expectations of cost versus performance.
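The cost claims in this thread are easy to sanity-check. The sketch below assumes the 1.35× input-token inflation reported by users and Anthropic’s published Opus pricing ($5 per 1M input tokens, $25 per 1M output tokens, quoted elsewhere in this issue); the workload sizes are hypothetical:

```python
PRICE_IN, PRICE_OUT = 5.0, 25.0  # USD per 1M tokens (published Opus pricing)

def job_cost(input_tokens, output_tokens, inflation=1.0):
    """Dollar cost of one job, with an optional input-token inflation factor."""
    return (input_tokens * inflation * PRICE_IN
            + output_tokens * PRICE_OUT) / 1_000_000

base = job_cost(2_000_000, 200_000)            # old tokenizer
inflated = job_cost(2_000_000, 200_000, 1.35)  # new tokenizer, same content
print(f"${base:.2f} -> ${inflated:.2f} (+{inflated / base - 1:.0%})")
# -> $15.00 -> $18.50 (+23%)
```

On these assumed numbers, input inflation alone adds about 23%; reaching the 50% figure cited in the thread would additionally require longer outputs, retries, or a more input-heavy workload.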
  • Opus 4.7 is legendarily bad. I cannot believe this. (Activity: 1550): The Reddit post criticizes Opus 4.7, a model by Anthropic, for its severe hallucination issues and persistent inaccuracies, even when corrected with evidence. The user reports spending $120 on API credits and encountering numerous instances where the model failed to follow simple instructions or correct its mistakes, unlike previous versions such as Opus 4.6 or GPT 5.4. The post suggests that Opus 4.7 might be overfit or optimized for benchmarks at the expense of practical performance, with a new tokenizer that maps input to 1.0–1.35× as many tokens without improving reasoning. The user also notes that the model requires more specific prompts and is less steerable, questioning whether it is heavily quantized to reduce hardware costs. The model’s reasoning was set to ‘low’, which worked well in Opus 4.6 but not in 4.7, indicating a potential regression in model quality. Commenters share similar experiences, with one noting the model’s inability to locate a folder and another mentioning hallucinations during a PR review. Some users prefer sticking to older models due to these issues.

    • kwabaj_ highlights the importance of using Opus 4.7 in ‘max thinking mode’ for optimal performance, suggesting that this setting significantly enhances the model’s reasoning capabilities. They argue that without utilizing this mode, the benefits of Opus 4.7 are not fully realized, implying that the model’s improvements over version 4.6 are contingent on this configuration.
    • RazDoStuff reports an issue with Opus 4.7 where it ‘hallucinated’ a non-existent person named Jared during a pull request review. This suggests potential problems with the model’s accuracy and reliability in generating contextually appropriate responses, which could be a significant concern for users relying on it for precise tasks.
    • Firm_Meeting6350 expresses a preference for an older model over Opus 4.7, indicating dissatisfaction with the newer version. This sentiment suggests that some users may find the changes or updates in Opus 4.7 to be less effective or more problematic than previous iterations, leading them to revert to older, more stable versions.

2. Claude Opus 4.7 Launch and Features

  • Introducing Claude Opus 4.7, our most capable Opus model yet. (Activity: 4872): Claude Opus 4.7 introduces significant improvements in handling long-running tasks with enhanced precision and self-verification capabilities. It features a substantial upgrade in vision, supporting image resolutions over three times higher than previous models, which enhances the quality of generated interfaces, slides, and documents. However, there is a noted regression in long-context retrieval performance, with MRCR v2 at 1M tokens dropping from 78.3% in version 4.6 to 32.2% in 4.7. Boris from the development team explained that MRCR is being phased out in favor of metrics like Graphwalks, which better reflect applied reasoning over long contexts. More details can be found on Anthropic’s news page. Some users expressed dissatisfaction with the removal of ‘thinking effort settings’ in the Claude App for Opus 4.7, indicating a preference for more customizable model behavior. The regression in long-context retrieval sparked debate, but the development team clarified their focus on practical long-context applications over synthetic benchmarks.

    • Craig_VG highlights a significant regression in long-context retrieval performance for Opus 4.7, with MRCR v2 scores dropping from 78.3% in version 4.6 to 32.2% in 4.7. This suggests a decline in the model’s ability to handle long-context tasks effectively. However, Boris explains that MRCR is being phased out in favor of Graphwalks, which better reflects real-world long-context usage and reasoning capabilities, particularly in code-related tasks.
    • Boris’s post clarifies that MRCR, a benchmark for long-context retrieval, is being deprecated because it relies on artificial distractors that don’t align with practical use cases. Instead, the focus is shifting to Graphwalks, which provides a more accurate measure of the model’s applied reasoning over long contexts. This change indicates a strategic pivot towards enhancing the model’s practical long-context capabilities rather than optimizing for synthetic benchmarks.
    • Credtz expresses skepticism about the recurring claim that each new model version, including Opus 4.7, improves instruction following. This sentiment reflects a common critique in the AI community where incremental updates often promise better performance in instruction adherence, yet users frequently perceive these improvements as marginal or overstated.
  • Opus 4.7 Released! (Activity: 838): Anthropic has released Opus 4.7, an update to its Claude AI model, which shows significant improvements over its predecessor, Opus 4.6. The new version excels in complex programming tasks, demonstrating enhanced instruction-following and self-checking capabilities. It also features improved vision and multimodality, supporting higher-resolution images for better handling of dense visual content. The model maintains the same pricing as Opus 4.6, at $5 per 1 million input tokens and $25 per 1 million output tokens, and is available across all Claude products and major platforms like Amazon Bedrock, Google Vertex AI, and Microsoft Foundry. More details can be found here. Some users have noted a decline in Opus 4.6’s performance in the weeks leading up to the release of Opus 4.7, suggesting a possible strategic move by Anthropic. Additionally, users are discussing the model’s usage metrics, with one noting a 3% usage for a simple interaction on the Pro version.

    • The updated tokenizer in Opus 4.7 improves text processing but maps the same input to roughly 1.0–1.35× as many tokens, depending on content type. Despite this, a graph suggests that Opus 4.7 Medium performs comparably to Opus 4.6 High in agentic coding while using fewer tokens, which could be beneficial for performance efficiency.
    • A user reports that Opus 4.6’s performance has degraded over the past two weeks, raising concerns about whether this is a deliberate strategy. This suggests potential issues with the previous version that users hope are addressed in the new release.
    • One user reports that a simple interaction with Opus 4.7 on the Pro version consumed 3% of both the 5-hour and weekly usage allowances, prompting discussion about how quickly the new model’s usage limits are reached.
  • Introducing Claude Opus 4.7, our most capable Opus model yet. (Activity: 2621): Claude Opus 4.7 is the latest model from Anthropic, featuring enhanced capabilities for handling long-running tasks with improved precision and self-verification of outputs. It boasts a significant upgrade in vision, supporting image resolutions over three times higher than previous versions, which enhances the quality of generated interfaces, slides, and documents. The model is accessible via claude.ai and major cloud platforms. For more details, see the official announcement. Some users express skepticism about the model’s longevity before potential downgrades, referencing past experiences with model updates. Others are optimistic, comparing it favorably to previous versions like Opus 4.5.

    • Logichris highlights a technical tradeoff in the new Claude Opus 4.7 model, noting that the same input can map to more tokens, approximately 1.0–1.35× depending on the content type. This implies that users might hit session limits faster, potentially reaching them in 3 prompts instead of 4, which could impact usability for those with token constraints.
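The session-limit effect Logichris describes follows directly from the token inflation. The sketch below is a toy illustration only: the budget and per-prompt figures are hypothetical, chosen to reproduce the “3 prompts instead of 4” scenario:

```python
import math

def prompts_per_session(token_budget, tokens_per_prompt, inflation=1.0):
    """How many whole prompts fit into a fixed session token budget."""
    return math.floor(token_budget / (tokens_per_prompt * inflation))

BUDGET = 200_000  # hypothetical session token budget
PROMPT = 48_000   # hypothetical tokens per prompt under the old tokenizer

print(prompts_per_session(BUDGET, PROMPT))        # -> 4
print(prompts_per_session(BUDGET, PROMPT, 1.35))  # -> 3
```

The same content costs more tokens under the new tokenizer, so a fixed budget quantizes to fewer whole prompts even though nothing about the user’s input changed.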

3. DeepSeek and Qwen Model Developments

  • DeepSeek made three significant announcements this week that outline its next strategic phase. (Activity: 136): DeepSeek is reportedly in discussions to secure its first external funding round, aiming to raise at least $300 million at a valuation exceeding $10 billion, as per The Information. The company is also transitioning towards self-hosted infrastructure by constructing its own data center in Ulanqab, Inner Mongolia, offering salaries up to 30,000 RMB for data center operations engineers. Additionally, DeepSeek-V4 is set to launch in late April, with NVIDIA CEO Jensen Huang expressing concerns about potential optimizations for Huawei’s Ascend chips, which could accelerate China’s AI advancements.

    • ReMeDyIII raises concerns about the performance of DeepSeek-V4, speculating that it might suffer from latency and efficiency issues if the inference is conducted on Huawei Ascend chips located in Chinese servers. This could be exacerbated by high demand from users, potentially leading to suboptimal performance at launch.
  • Ran Qwen3.6-35B-A3B on my laptop for a day: it actually beat Claude Opus 4.7 (Activity: 261): The post compares Anthropic’s Claude Opus 4.7 with Alibaba’s Qwen3.6-35B-A3B. Opus 4.7, recently released, is praised for its autonomous background processing and UI generation capabilities, but it relies heavily on cloud infrastructure. In contrast, Qwen3.6-35B-A3B, with 35 billion parameters, can run locally on consumer hardware, such as a MacBook with unified memory or a PC with 24GB VRAM, and reportedly outperformed Opus 4.7 on specific tasks like Python logic puzzles and SVG generation. The post frames this as a shift toward independent on-device (“edge”) reasoning, emphasizing the efficiency of the A3B architecture over sheer parameter scaling. Comments humorously question the timeline of the testing, given the models’ recent release, and express skepticism about the claimed 24-hour side-by-side run. There is also curiosity about the context length capabilities of Qwen3.6-35B-A3B, with users interested in its performance at higher token counts.

AI Discords

Unfortunately, Discord shut down our access today. We will not bring it back in this form, but we will be shipping the new AINews soon. Thanks for reading this far; it was a good run.