a quiet day.
AI News for 4/23/2026-4/24/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!
AI Twitter Recap
Top Story: DeepSeek V4
What happened
DeepSeek released DeepSeek-V4 Pro and DeepSeek-V4 Flash, its first major architecture refresh since V3 and first clear two-tier lineup, with 1M-token context, hybrid reasoning/non-reasoning modes, an MIT license, and a technical report detailed enough that multiple researchers called it one of the most important or best-written model papers of the year. Across the reactions, the factual consensus is that V4 materially advances open-weight long-context and agentic coding performance while remaining somewhat behind the top closed frontier models overall. Independent benchmarkers place V4 Pro around the #2 open-weights tier, roughly near Kimi K2.6 / GLM-5.1 / strong Claude Sonnet-class to Opus-ish depending on benchmark and mode, with especially strong long-context and agentic performance; opinions diverge on how close it is to GPT-5.x / Opus 4.7 and on whether this is "democratizing" progress or an architecture so complex that few open labs can realistically reproduce it. Key sources include deep-dive commentary from @ArtificialAnlys, @scaling01, @nrehiew_, @ben_burtenshaw, @TheZachMueller, @ZhihuFrontier, and infra/vendor posts from @vllm_project, @NVIDIAAI, and @Togethercompute.
Core facts and technical details
The most concrete technical claims repeated across the discussion:
- Two models
- V4 Pro: 1.6T total parameters / 49B active
- V4 Flash: 284B total / 13B active
- Reported by @ArtificialAnlys, @teortaxesTex, @baseten, @NVIDIAAI
- Context
- 1M tokens, up from 128K in V3.2 per @ArtificialAnlys
- Multiple posters frame this as the headline achievement: "solid ultra-long context" @teortaxesTex
- Training scale
- 32T–33T tokens cited repeatedly
- @nrehiew_ notes 32T tokens over 1.6T parameters, i.e. roughly 20 tokens/parameter
- @teortaxesTex cites 33T
- @nrehiew_ estimates pretraining compute at ~1e25 FLOPs
- Reasoning / modes
- DeepSeek exposes three reasoning modes per @Togethercompute
- Hybrid "thinking/non-thinking" positioning noted by @ArtificialAnlys
- Long-context architecture
- Several threads summarize a new hybrid attention system:
- shared KV vectors
- compressed KV streams
- sparse attention over compressed tokens
- local/sliding-window attention for nearby context
- @ZhihuFrontier gives the most compact public summary:
- 2× KV reduction via shared key-value vectors
- c4a → 4× compression
- c128a → 128× compression
- top-k sparse attention on compressed tokens
- 128-token sliding window
- 1M context KV cache = 9.62 GiB/sequence (bf16)
- 8.7× smaller than DeepSeek V3.2's 83.9 GiB
- FP4 index cache + FP8 attention cache gives another ~2× reduction
- @ben_burtenshaw condenses this to "10× smaller KV cache"
- @TheZachMueller describes CSA + HCA layer patterns, with alternating layers and V4 Flash using sliding-window layers instead of HCA in some places
- Quantization / checkpoint format
- @LambdaAPI: checkpoint is mixed FP4 + FP8
- MoE expert weights in FP4
- attention / norm / router in FP8
- claim: the full model fits on a single 8×B200 node
- Inference hardware / serving
- @NVIDIAAI: on Blackwell Ultra, V4 Pro can deliver 150+ TPS/user interactivity for agentic workflows
- @NVIDIAAI: published day-0 V4 Pro performance pareto using vLLM
- @SemiAnalysis_: day-0 support and benchmarking across H200, MI355, B200, B300, GB200/300
- @Prince_Canuma: DeepSeek4-Flash on 256GB Mac
- @Prince_Canuma: MLX quants published
- @simonw asks about smaller-RAM Mac viability, implying community interest but incomplete support story
- @QuixiAI reminds users that many local stacks still lack tensor parallel, relevant because V4-class models strongly stress inference infra
- License / availability / pricing
- MIT license per @ArtificialAnlys
- first-party API plus rapid third-party availability via @Togethercompute, @baseten, @NousResearch, @Teknium
- V4 Pro pricing: $1.74 / $3.48 per 1M input/output tokens
- V4 Flash pricing: $0.14 / $0.28
- cache-hit pricing also given by @ArtificialAnlys
- @scaling01 views the pricing as a glimpse of future "Mythos-level" cheap coding models
- Reuters quote posted by @scaling01: DeepSeek said Pro pricing could fall sharply once Huawei Ascend 950 supernodes are deployed at scale in H2
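Several of the ratios quoted above can be checked with simple arithmetic. The sketch below uses only figures reported in the posts. The 6·N·D rule of thumb (about 6 FLOPs per active parameter per training token) is an assumption on our part, since the post giving the ~1e25 estimate does not state its method, and all function and variable names are illustrative.

```python
# Back-of-envelope checks on figures quoted in the posts above.
# Names are illustrative; inputs are the reported numbers.

TOTAL_PARAMS = {"V4 Pro": 1.6e12, "V4 Flash": 284e9}
ACTIVE_PARAMS = {"V4 Pro": 49e9, "V4 Flash": 13e9}
TRAIN_TOKENS = 32e12  # lower end of the cited 32T-33T range


def active_fraction(model: str) -> float:
    """Share of total parameters that are active per token."""
    return ACTIVE_PARAMS[model] / TOTAL_PARAMS[model]


def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens per total parameter."""
    return tokens / params


def approx_train_flops(active_params: float, tokens: float) -> float:
    """ASSUMPTION: the common 6 * N * D rule of thumb, with N = active params."""
    return 6 * active_params * tokens


def ratio(a: float, b: float) -> float:
    """How many times larger a is than b."""
    return a / b


if __name__ == "__main__":
    for model in TOTAL_PARAMS:
        print(f"{model}: {active_fraction(model):.1%} of parameters active")
    print("tokens/param (Pro):", tokens_per_param(TRAIN_TOKENS, TOTAL_PARAMS["V4 Pro"]))
    flops = approx_train_flops(ACTIVE_PARAMS["V4 Pro"], TRAIN_TOKENS)
    print(f"approx. pretraining FLOPs (Pro): {flops:.2e}")
    print(f"KV cache at 1M context, V3.2 vs V4: {ratio(83.9, 9.62):.1f}x")
    print(f"Pro vs Flash input price: {ratio(1.74, 0.14):.1f}x")
```

Under these assumptions the compute estimate lands just under 1e25 FLOPs, consistent with the quoted figure, and the KV-cache ratio reproduces the 8.7× claim.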
Independent evaluations and where V4 lands
The most useful independent benchmark synthesis came from @ArtificialAnlys:
- V4 Pro Max: 52 on Artificial Analysis Intelligence Index
- up 10 points from V3.2 at 42
- becomes #2 open weights reasoning model, behind Kimi K2.6 (54)
- V4 Flash Max: 47
- positioned around strong mid/high open models, "Claude Sonnet 4.6 max level intelligence"
- GDPval-AA (agentic real-world work):
- V4 Pro: 1554, leading open-weight models
- ahead of Kimi K2.6 (1484), GLM-5.1 (1535), MiniMax-M2.7 (1514)
- AA-Omniscience
- V4 Pro: -10, an 11-point improvement over V3.2
- but still paired with 94% hallucination rate
- V4 Flash: 96% hallucination rate
- Cost to run AA Index
- V4 Pro: $1,071
- V4 Flash: $113
- Output tokens used on AA Index
- V4 Pro: 190M
- V4 Flash: 240M
- This is a major caveat: cheap per-token pricing does not imply cheap total task cost if the model spills huge token volumes
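The token-volume caveat can be made concrete by multiplying the reported output-token counts by the first-party output prices quoted earlier. This is a rough sketch: it assumes list prices apply to every output token and ignores input tokens and cache hits, so it only estimates the output-side share of the reported totals. The dictionary and function names are illustrative.

```python
# Output-side cost of running the AA Index, from figures quoted above.
PRICE_OUT_PER_M = {"V4 Pro": 3.48, "V4 Flash": 0.28}  # USD per 1M output tokens
OUTPUT_TOKENS_M = {"V4 Pro": 190, "V4 Flash": 240}    # millions of output tokens
REPORTED_TOTAL = {"V4 Pro": 1071, "V4 Flash": 113}    # USD, as reported


def output_cost(model: str) -> float:
    """Output tokens (in millions) times the per-million output price."""
    return PRICE_OUT_PER_M[model] * OUTPUT_TOKENS_M[model]


def output_share(model: str) -> float:
    """Fraction of the reported total explained by output tokens alone."""
    return output_cost(model) / REPORTED_TOTAL[model]


if __name__ == "__main__":
    for model in PRICE_OUT_PER_M:
        print(f"{model}: ${output_cost(model):,.1f} output cost, "
              f"{output_share(model):.0%} of reported total")
```

On these assumptions, output tokens alone account for roughly $661 of Pro's $1,071 and $67 of Flash's $113, so Flash's higher token count is more than offset by its lower price.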
Additional eval perspectives:
- @arena:
- #2 open in Text Arena overall at debut
- category wins/placements:
- #1 Medical & Healthcare
- #15 Creative Writing
- #18 Multi-Turn
- thinking variant:
- #8 Math
- #9 Life/Physical/Social Science
- @arena emphasizes the Pro vs Flash tradeoff:
- Pro ranks ~30 places higher
- costs 12× more
- Flash is still competitive in Chinese, medicine, math
- @scaling01:
- "~Opus 4.5 estimate holds for now, at least on SimpleBench"
- @scaling01:
- V4 is "definitely better than GLM-5.1 but not quite Opus 4.7, GPT-5.4 or Gemini 3.1 Pro"
- @scaling01 lists what scores would confirm <6 month gap:
- ARC-AGI-1 ~75%
- ARC-AGI-2 ~35%
- GSO ~26%
- METR 4.5–5 hours
- WeirdML ~63%
- @TheZachMueller:
- on his evals, Flash@max ≈ Pro@high on reasoning
- Pro focuses more on knowledge (SimpleQA)
- @VictorTaelin:
- after fixing benchmark bugs and letting long-running models run longer, DeepSeek and Kimi improved materially
- @mbusigin:
- a simple negative early impression with no detail
- @petergostev:
- on BullshitBench, which measures refusal/pushback behavior rather than capability, GPT-5.5 underperformed; included here because many readers are comparing V4 in an eval-skeptical environment
Facts vs opinions
Facts / relatively well-supported claims
- V4 Pro / Flash were released with the specs above, MIT-licensed, 1M context, and open technical documentation: @ArtificialAnlys, @TheZachMueller
- The architecture introduces a new long-context attention system with dramatic KV-cache reduction: @ZhihuFrontier, @ben_burtenshaw
- Independent benchmarkers broadly place V4 Pro near the very top of open weights but below the best proprietary models overall: @ArtificialAnlys, @arena, @scaling01
- DeepSeek V4 is heavily token-intensive in some evaluations: @ArtificialAnlys
- The checkpoint uses FP4/FP8 mixed precision and can fit on an 8×B200 node: @LambdaAPI
- Rapid ecosystem support arrived via vLLM and other providers day 0: @vllm_project, @SemiAnalysis_
Opinions / interpretation
- "V4 is ~4–5 months behind the frontier" from @scaling01 is an informed estimate, not a measured fact
- "Top three open" vs "only open model close to frontier" debate from @teortaxesTex is partly about benchmark trust and framing
- "Strongest pretrained model we have" from @teortaxesTex is an opinion hinging on scale + architecture, not direct benchmark supremacy
- "Most significant AI paper of the year" from @Dorialexander is enthusiasm, not consensus
- "This is what research should look like" from @scaling01 speaks to transparency/style rather than only capability
- "Not exactly a democratizing technology" from @teortaxesTex is a strong architectural/political interpretation
Different opinions and fault lines
1) Is V4 near frontier, or clearly behind?
More favorable
- @scaling01: puts it at roughly GPT-5.2 / Opus 4.5+ tier
- @scaling01: SimpleBench supports ~Opus 4.5
- @teortaxesTex: argues it is the strongest pretraining base among opens and implies people are underestimating what post-training can do
More skeptical
- @scaling01: below Opus 4.7 / GPT-5.4 / Gemini 3.1 Pro
- @scaling01: the gap may widen again because closed labs have bigger models, better science/law/medicine coverage, faster inference with GB200s
- @mbusigin: early impressions "not great"
- @teortaxesTex: says polished models like K2.6 and GLM 5.1 may still feel better in coding despite lower intrinsic capacity
2) Is V4's real contribution model quality, or long-context systems design?
A big split in reactions is that many technical readers think the long-context architecture matters more than the raw benchmark position.
- @teortaxesTex: "They've completed their quest: Solid Ultra-Long Context"
- @ben_burtenshaw: first open model where long context and agentic post-training "meet"
- @scaling01: expects other open labs to adopt pieces of the architecture
- @Dorialexander: frames Huawei/sovereignty constraints as an opportunity to reshape hardware and memory/interconnect design
- @jukan05: reads the paper as evidence that NVIDIA's hardware roadmap is unusually well aligned to where MoE/long-context models are going
3) Is V4 "open democratization," or too hard to copy?
This was one of the sharpest strategic disagreements.
- @teortaxesTex: says V4 is "not exactly a democratizing technology" because the architecture is too difficult for most labs to replicate
- @teortaxesTex: suggests even DeepSeek may not want to do this exact architecture again without refactoring
- @stochasticchasm: notes the sheer hyperparameter complexity is daunting
- Against that, @Prince_Canuma shows that the ecosystem is already compressing and adapting Flash for localish Apple Silicon use, softening the "not democratizing" claim on the inference side if not the training side
4) Are people underrating Flash?
Several reactions suggest Flash may be more important than Pro for practical adoption.
- @arena: Flash shifts the price/performance frontier
- @TheZachMueller: Flash@max ≈ Pro@high on reasoning tasks
- @teortaxesTex: benchmarks may underweight "legit 1M context for pennies"
- @Prince_Canuma: Flash runs on 256GB Mac
- @baseten and @Togethercompute emphasize long-document analysis and agentic use cases where Flash's economics matter
China, chips, Huawei, and sovereignty context
DeepSeek V4 was not discussed as a pure model release; it was treated as evidence in the larger US–China compute and sovereignty debate.
- @scaling01: Chinese labs are already in or near "takeoff" in the sense that their models help build better models, though still shifted 5+ months behind
- @scaling01: thinks chip bans are likely to widen the gap in broad domains over time
- @teortaxesTex: disputes simplistic Huawei-dismissal and notes mixed Chinese sentiment toward Huawei
- @ogawa_tter: points to analysis of Ascend 950 / A3 clusters and V4 deployment plans
- @Dorialexander: argues the sovereignty play around Huawei may reshape hardware architecture
- @scaling01: cites DeepSeek saying prices could drop sharply once Ascend 950 supernodes scale in H2
- @jukan05: interprets V4 as validating NVIDIA's Blackwell/Rubin/HBM/interconnect strategy
- @NVIDIAAI: unsurprisingly highlights Blackwell day-0 performance, but this is vendor framing rather than independent proof of strategic superiority
There is also a more ideological thread:
- @teortaxesTex argues that Western discourse often misreads Chinese labs as purely state proxies or distillation shops, and instead sees them as serious mission-driven actors. This is interpretive, but it helps explain why the release drew such emotionally charged geopolitical reactions.
Distillation, training data, and data quality
A recurring undercurrent: does V4 mainly reflect architectural innovation, or can critics dismiss it as "distillation"?
- @yacineMTB speculates that some complaints about Chinese distillation may partly come from people discovering they're outperformed
- @cloneofsimo: "Very interesting… given they distilled claude"
- @kalomaze: jokes about DeepSeek training on DeepSeek reasoning traces
- On the more substantive side, @teortaxesTex says DeepSeek's writing quality, especially in Chinese, reflects a long-standing obsession with data cleanliness, and cites job listings
- @nrehiew_ notes the report still lacks much detail on pretraining data beyond standard categories
- Overall, factual public evidence in this tweet set supports âDeepSeek trains at large scale with strong data work,â but not any strong claim about the degree of external distillation beyond speculation
Architecture lineage and prior art
Several researchers pointed out that V4 did not emerge from nowhere.
- @jaseweston: says DeepSeek uses hash routing from a 2021 ParlAI approach
- @suchenzang: criticizes routing-induced outliers, with a jab at hashing
- @teortaxesTex: notes Mixtral-style MoE was a reasonable earlier hack, but claims DSMoE changed things
- @art_zucker broadly attacks MoEs as a dead end
- @gabriberton counters that MoEs are provably effective despite inelegance
- @stochasticchasm is even more positive: âMoEs are amazingâ
This matters because V4 was read not just as a stronger checkpoint, but as a possible new design point for open long-context MoEs.
Why the technical report itself mattered
A striking amount of praise was directed not just at the model but at the paper/report quality.
- @scaling01: "the technical paper is a big deal"
- @Dorialexander: "most significant AI paper of the year"
- @morqon: "one of the best I've ever read"
- @scaling01: "this is what research should look like"
- @TheZachMueller, @iamgrigorev, @nrehiew_: all signal unusually high effort to digest and test the report
For expert readers, this is important because many frontier releases now arrive with sparse technical disclosure. V4's report appears to have reset expectations for what a serious open release can look like.
Practical limitations and caveats
Despite the enthusiasm, several caveats recur:
- Still behind closed frontier in aggregate capability
- especially sciences/law/medicine and broad "general domains" per @scaling01
- Reasoning RL may be undercooked
- @scaling01: reasoning efficiency not much changed vs V3.2 Speciale
- Serving remains hard
- @scaling01: many labs serve at only 20–30 tok/s and limited concurrency; running evals can take a day
- @ClementDelangue: acknowledges concurrency bottlenecks on HF
- High token usage
- major practical caveat from @ArtificialAnlys
- API controls
- @stochasticchasm: notes DeepSeek API appears not to allow sampler control
- Adoptability
- @teortaxesTex: too complex for many labs to copy cleanly
Broader implications
Three implications stand out.
- Open-weight long-context is no longer just marketing.
  V4's strongest contribution may be proving that 1M context can be made operationally credible in an open-weight model, with concrete KV-cache engineering and open inference support. This is why multiple posters focused less on benchmark deltas and more on systems design: @ben_burtenshaw, @ZhihuFrontier, @scaling01.
- China's top labs remain competitive in open models, even if not fully closing the closed-model gap.
  The benchmark picture across @ArtificialAnlys, @arena, and @scaling01 suggests Chinese labs now dominate much of the open-weight top tier: Kimi, GLM, DeepSeek, and soon MiMo.
- The bar for "open" is rising from checkpoint release to full-stack co-design.
  V4 was instantly discussed alongside vLLM, Blackwell, MLX quants, Mac viability, Ascend clusters, and cache/memory architectures. In other words, "the model" is increasingly inseparable from the inference substrate.
Infrastructure, inference, and local/open ecosystem
- Hugging Face launched ML Intern, an open-source CLI "AI intern" for ML work that can research papers, write code, run experiments, use HF datasets/jobs, search GitHub, and iterate up to 300 steps, per @MillieMarconnni. Related sentiment: HF's $9 Pro tier is unusually strong value per @getpy.
- Meta said it will add tens of millions of AWS Graviton cores to its compute portfolio to scale Meta AI and agentic systems for billions of users, per @AIatMeta.
- Local/open coding stack momentum stayed strong:
- @julien_c: Qwen3.6-27B via llama.cpp on a MacBook Pro feels close to latest Opus for many coding tasks
- @p0: free CLI agent built with Pi + Ollama + Gemma 4 + Parallel web search MCP
- @Prince_Canuma: DeepSeek V4 quants incoming
- @QuixiAI: reminder that llama.cpp / Ollama / LM Studio do not support tensor parallel, pushing serious multi-GPU serving users toward vLLM
- Nous/Hermes shipped heavily:
- Hermes Agent v0.11.0 introduced a rewritten React TUI, dashboard plugin, theming, more inference providers, image backends, and QQBot support, per @WesRoth
- Hermes got broad praise and rapid support for both DeepSeek V4 and GPT-5.5, via @mr_r0b0t, @Teknium
- @JulianGoldieSEO and @LoicBerthelot compared Hermes favorably to OpenClaw on learning loops, memory, model support, deployment flexibility, and security
- A native Linux sandbox backend for Deep Agents using bubblewrap + cgroups v2 was released by @nu_b_kh
Research papers and benchmarks
- On-policy distillation token selection:
- @TheTuringPost highlights a paper showing only some tokens carry most learning signal; using ~50% of tokens can match or beat full training and cut memory by ~47%, while even <10% focused on confident-wrong tokens nearly matches full training.
- Google Research pushed several ICLR demos:
- MesaNet, a transformer alternative / linear sequence layer optimized for in-context learning under fixed memory, via @GoogleResearch
- robotics/3D reasoning and efficient transformer work via @GoogleResearch
- "reasoning can lead to honesty" demo via @GoogleResearch
- MIT Hyperloop Transformers mix looped and normal transformer blocks, using ~50% fewer parameters while beating regular transformers at 240M / 1B / 2B, per @TheTuringPost.
- "Learning mechanics" tries to synthesize a theory of deep learning dynamics, via @learning_mech.
- Tool/agent systems papers:
- Tool Attention Is All You Need claims 95% tool-token reduction (47.3k → 2.4k/turn) with dynamic gating and lazy schema loading, per @omarsar0
- StructMem for long-horizon structured memory highlighted by @dair_ai
- HorizonBench targets long-horizon personalization with shifting user preferences, via @StellaLisy
- Clarifying questions for software engineering:
- @gneubig shared work on a model trained specifically to ask clarifying questions, improving results with fewer questions.
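The headline Tool Attention number above is easy to verify from the per-turn token counts. The sketch below recomputes the reduction and, purely as an illustration, extrapolates the savings over a hypothetical 20-turn session; the session length is our assumption, not a figure from the paper, and the function names are illustrative.

```python
# Recompute the quoted tool-token reduction (47.3k -> 2.4k tokens per turn).
BEFORE_TOKENS = 47_300
AFTER_TOKENS = 2_400
HYPOTHETICAL_TURNS = 20  # ASSUMPTION: illustrative session length only


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction, in percent."""
    return (before - after) / before * 100.0


def tokens_saved(before: int, after: int, turns: int) -> int:
    """Total tool tokens avoided over a session of `turns` turns."""
    return (before - after) * turns


if __name__ == "__main__":
    print(f"reduction: {percent_reduction(BEFORE_TOKENS, AFTER_TOKENS):.1f}%")
    print("saved over session:",
          tokens_saved(BEFORE_TOKENS, AFTER_TOKENS, HYPOTHETICAL_TURNS))
```

The per-turn figures give a reduction of about 94.9%, which rounds to the 95% claimed.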
GPT-5.5 rollout and coding agents
- OpenAI rolled GPT-5.5 and GPT-5.5 Pro into API and ecosystem products with a 1M context window, per @OpenAI, @OpenAIDevs.
- Distribution was immediate across Cursor, GitHub Copilot, Codex/OpenAI API, OpenRouter, Perplexity, Devin, Droid, Fleet, Deep Agents:
- @cursor_ai: GPT-5.5 is top on CursorBench at 72.8%
- @cline: #1 on Terminal-Bench at 82.7
- @OpenAIDevs: Perplexity Computer saw 56% fewer tokens on complex tasks
- @scaling01: GPT-5.5 medium became strongest non-thinking model on LisanBench with 45.6% fewer tokens than GPT-5.4 medium and higher scores
- User feedback clustered around better coding quality and token efficiency, despite mixed feelings about some evals:
- @almmaasoglu: best code they've read from an LLM; less verbose, less defensive
- @KentonVarda: caught a deep Cap'n Proto RPC corner case from a 6-year-old comment
- @willdepue: underwhelmed by evals, impressed in Codex on complex technical projects
- @omarsar0: smooth switch from Claude Code to Codex/GPT-5.5 thanks to better "effort calibration"
- Cursor also shipped /multitask async subagents and multi-root workspaces, via @cursor_ai.
- There is growing market emphasis on limits and economics rather than tiny quality gaps:
- @nrehiew_ argues usage caps now matter more than small frontier deltas
- @HamelHusain says Codex's subscription structure makes it hard not to use
Industry moves, funding, and policy
- Google reportedly plans to invest up to $40B in Anthropic, reported by @FT and echoed by @zerohedge. Reactions centered on how large Anthropic's compute commitment may now be.
- Cohere and Aleph Alpha announced a Canada/Germany sovereign AI partnership, framed as enterprise-grade and privacy/security focused by @cohere, @aidangomez, @nickfrosst.
- ComfyUI raised $30M at a $500M valuation, while keeping core/open-local positioning, via @yoland_yan.
- Mechanize announced $9.1M raised at a $500M post-money valuation, via @MechanizeWork.
- Arcee AI hired Cody Blakeney as Head of Research, emphasizing open-weight American frontier models, via @code_star.
- Safety / governance:
- OpenAI announced a Bio Bug Bounty for GPT-5.5, per @OpenAINewsroom
- Anthropic launched Project Deal, a marketplace where Claude negotiated on behalf of employees, and highlighted model-quality asymmetry and policy challenges, via @AnthropicAI
Creative AI and multimodal
- GPT Image 2 + Seedance 2 workflows kept drawing attention:
- @_OAK200 and @awesome_visuals showed high-fidelity image→video pipelines
- @BoyuanChen0 said 2K/4K images are already available via experimental API and active fixes are underway
- Kling announced native 4K output and a $25k short film contest, via @Kling_ai.
- Some evaluative nuance:
- @goodside noted GPT Images 2.0 could render a valid-looking Rubik's Cube state, which is surprisingly hard
- @venturetwins framed recent image/video gains as a major step toward personalized game-like content generation
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. Deepseek V4 and Related Releases
- Deepseek V4 AGI comfirmed (Activity: 1138): The image is a meme and does not contain any technical content. The title "Deepseek V4 AGI confirmed" suggests a humorous or exaggerated claim about an AI model, possibly referencing advancements in artificial general intelligence (AGI). The comments are likewise satirical, mentioning uncensored datasets and military applications, which indicates humor or skepticism rather than a serious technical discussion.
- UserXtheUnknown discusses a test scenario with Deepseek V4, highlighting its tendency to overthink problems. The model interprets constraints like "using only one knife" as mandatory rather than optional, which affects its problem-solving approach. This reflects a nuanced understanding of task constraints, but also indicates potential areas for improvement in handling implicit instructions.
- Deepseek V4 Flash and Non-Flash Out on HuggingFace (Activity: 1393): DeepSeek V4 has been released on HuggingFace, featuring two models: DeepSeek-V4-Pro with 1.6T parameters (of which 49B are activated) and DeepSeek-V4-Flash with 284B parameters (with 13B activated). Both models support a context length of one million tokens, which is significant for handling extensive sequences. The models are released under the MIT license, allowing for broad use and modification. A notable comment highlights the challenge of hardware limitations, particularly RAM, when working with such large models. Another comment suggests the potential benefit of a 0.01bit quantization to manage the model size more effectively.
- The DeepSeek-V4 models are notable for their massive parameter sizes, with the Pro version having 1.6 trillion parameters (49 billion activated) and the Flash version having 284 billion parameters (13 billion activated). Both models support an extensive context length of one million tokens, which is significant for handling large-scale data inputs and complex tasks.
- A user expressed interest in a 0.01-bit quantization of the DeepSeek-V4 models, which suggests a focus on reducing the model size and computational requirements while maintaining performance. Quantization is a common technique to optimize models for deployment on hardware with limited resources.
- The mention of the MIT license indicates that DeepSeek-V4 is open-source, allowing for broad use and modification by the community. This licensing choice can facilitate collaboration and innovation, as developers can freely integrate and adapt the models into their own projects.
- Buried lede: Deepseek v4 Flash is incredibly inexpensive from the official API for its weight category (Activity: 404): The image provides a comparison between two models, "deepseek-v4-flash" and "deepseek-v4-pro," highlighting that the "deepseek-v4-flash" model is significantly more affordable in terms of input and output token costs. Despite its affordability, the model supports advanced features like JSON output, tool calls, and chat prefix completion in both non-thinking and thinking modes. The discussion around the image suggests that while the "deepseek-v4-flash" is marketed as inexpensive, some users argue that it is actually overpriced compared to previous versions when considering parameter scaling, with the "V3.2" model being cheaper per parameter. Commenters discuss the impact of GPU shortages on current pricing, suggesting that prices may decrease as GPU production increases. There is also debate about the pricing strategy, with some users noting that the new model is more expensive per parameter compared to older versions.
- DistanceSolar1449 highlights a pricing comparison between DeepSeek V3.2 and V4 Flash, noting that V3.2 was priced at $0.26/0.38 for input/output at 671b, whereas V4 Flash is $0.14/$0.28 at 284b. This suggests that V4 Flash is actually more expensive if pricing were to scale linearly with parameters, challenging the notion of its cost-effectiveness.
- jwpbe provides a comparative analysis of DeepSeek V4 Flash's API cost, stating that at 14 cents in / 28 cents out, it is significantly cheaper than competitors like Minimax 2.7, which is 3x the cost, and Qwen's equivalent, which is even higher. They also mention that Trinity Thinking Large is twice as expensive, indicating that V4 Flash offers a competitive pricing advantage in the market.
- Worried-Squirrel2023 discusses the strategic implications of Huawei's silicon developments, suggesting that DeepSeek's pricing strategy involves trading NVIDIA margins for Ascend supply. They predict that once the 950 supernodes scale, DeepSeek could potentially undercut competitors in the open weights tier, leveraging Huawei's advancements to optimize costs.
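DistanceSolar1449's per-parameter argument can be reproduced directly from the prices in the thread. The sketch below normalizes each price by total parameter count. Treating price as linear in total parameters is the commenter's framing rather than an established pricing model, and the helper names are illustrative.

```python
# Prices in USD per 1M tokens and total parameters in billions, as quoted in the thread.
MODELS = {
    "V3.2": {"price_in": 0.26, "price_out": 0.38, "params_b": 671},
    "V4 Flash": {"price_in": 0.14, "price_out": 0.28, "params_b": 284},
}


def price_per_billion_params(model: str, key: str) -> float:
    """Price divided by total parameter count (in billions)."""
    return MODELS[model][key] / MODELS[model]["params_b"]


def relative_cost(key: str) -> float:
    """V4 Flash price per parameter relative to V3.2 (>1 means Flash is pricier)."""
    return price_per_billion_params("V4 Flash", key) / price_per_billion_params("V3.2", key)


if __name__ == "__main__":
    print(f"input, per parameter:  {relative_cost('price_in'):.2f}x V3.2")
    print(f"output, per parameter: {relative_cost('price_out'):.2f}x V3.2")
```

By this measure V4 Flash costs about 1.3× as much per parameter on input and about 1.7× on output, even though its absolute prices are lower, which is how both the "cheap" and "overpriced" readings can coexist.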
- Deepseek has released DeepEP V2 and TileKernels. (Activity: 396): Deepseek has released DeepEP V2 and TileKernels, which are significant advancements in AI model optimization and parallelization. DeepEP V2 focuses on enhancing model efficiency and accuracy, while TileKernels introduces a novel parallelization technique that reportedly scales linearly, meaning that doubling computational capacity results in a doubling of processing speed. This release is open-sourced, fostering transparency and collaboration in AI research. For more details, see the DeepEP V2 pull request and the TileKernels repository. One commenter highlights that Deepseek is fulfilling a role that OpenAI was expected to play by advancing research and sharing findings openly, which builds goodwill despite proprietary technologies. Another commenter questions if the parallelization technique indeed scales linearly, suggesting a significant technical breakthrough if true.
- DeepEP V2 and TileKernels by DeepSeek are noted for their potential advancements in parallelization techniques. A user speculates that these techniques might achieve linear scaling, meaning that doubling computational capacity could directly double processing speed. This could represent a significant efficiency improvement in model training and inference.
- There is speculation about DeepSeek's hardware usage, particularly regarding the SM100 and Blackwell GPUs. One commenter suggests that DeepSeek might be using Blackwell GPUs for training, possibly through rented B200 units on Vast.ai. This hardware choice could influence the performance and capabilities of their models.
- The potential innovations in DeepSeek's next model, possibly named v4, are highlighted. The focus is on the integration of Engram and mHC technologies, which are expected to play a crucial role in the model's performance. The success of these innovations will likely depend on the new dataset DeepSeek has developed.
2. Qwen 3.6 Model Performance and Benchmarks
- This is where we are right now, LocalLLaMA (Activity: 1755): The image depicts a MacBook Pro running a Qwen3.6 27B model via Llama.cpp, showcasing the capability of executing complex AI models locally, even in airplane mode. This highlights the potential for local AI models to enhance efficiency, security, privacy, and sovereignty by operating independently of cloud services. The post underscores the technological advancement in making powerful AI models accessible on personal devices, emphasizing the importance of local execution for privacy and control. Commenters express skepticism about the overstatement of the Qwen3.6-27B model's capabilities, suggesting that while it is impressive for its size, it does not match the performance of more advanced models like Sonnet or Opus. There is concern that exaggerated claims could lead to user disappointment and backlash against the broader LLM community.
- ttkciar highlights the potential for user disappointment with the Qwen3.6-27B model, noting that while it's impressive for its size and suitable for agentic code generation, it doesn't match the capabilities of more advanced models like Sonnet or Opus. The concern is that overhyping its abilities could lead to backlash against the broader LLM community, not just the individual making the claims.
- sooki10 agrees that while the model is impressive for local coding tasks, comparing it to more advanced models like Opus is misleading and could undermine the credibility of the claims being made. This suggests a need for more accurate benchmarking and communication about the model's capabilities to manage user expectations effectively.
- Melodic_Reality_646 points out the disparity in resources, comparing the use of a high-end 128GB RAM M5 Max system to a more accessible setup. This highlights the importance of considering hardware limitations when evaluating model performance, as not all users have access to such powerful systems, which can skew perceptions of a model's capabilities.
-
DS4-Flash vs Qwen3.6 (Activity: 470): The image presents a benchmark comparison between DS4-Flash Max and Qwen3.6 models, specifically the
35B-A3Band27Bversions. The chart highlights that DS4-Flash Max generally outperforms the Qwen models across various categories, particularly excelling in âLiveCodeBenchâ and âHLEâ benchmarks. This suggests that DS4-Flash Max may have superior capabilities in coding and reasoning tasks. The discussion in the comments hints at the potential for larger models like a122Bversion of Qwen3.6, and emphasizes the significance of the1M token contextfeature, which could impact performance in other benchmarks like âomniscenseâ. Commenters note that despite DS4-Flash Maxâs larger size, its performance is only slightly better than Qwen3.6, raising questions about efficiency versus scale. The1M token contextis highlighted as a significant feature that could influence future benchmark results.- Rascazzione highlights the significant increase in context length with Qwen 3.6, noting its ability to handle a 1 million token context. This is a substantial improvement over previous models and could have significant implications for tasks requiring extensive context handling, such as document summarization or complex dialogue systems.
- LinkSea8324 points out the size difference between the models, with DS4-Flash at 284 billion parameters compared to Qwen 3.6's 27 billion. This raises questions about the efficiency and performance trade-offs between model size and capability, especially in terms of computational resources and inference speed.
- madsheepPL discusses the non-linear nature of benchmark improvements, suggesting that even if a model appears only slightly better in benchmarks, the practical implications can be more significant. They emphasize that improvements in scores are not directly proportional and can have varying impacts on real-world applications.
-
Qwen 3.6 27B Makes Huge Gains in Agency on Artificial Analysis - Ties with Sonnet 4.6 (Activity: 964): Qwen 3.6 27B has achieved parity with Sonnet 4.6 on the Agentic Index from Artificial Analysis, surpassing models like Gemini 3.1 Pro Preview, GPT 5.2 and 5.3, and MiniMax 2.7. The model shows improvements across all indices, although the gains in the Coding Index are less pronounced due to its reliance on benchmarks like Terminal Bench Hard and SciCode, which are considered unconventional. The focus of training appears to be on agentic applications for OpenClaw/Hermes, highlighting the potential of smaller models to approach frontier capabilities. Anticipation is building for the upcoming Qwen 3.6 122B model. Commenters express excitement about the potential of smaller models like Qwen 3.6 27B, noting the significant improvements and potential for future versions. However, there is skepticism about the extent of these gains, suggesting that some improvements might be due to "benchmaxxing" rather than inherent model capabilities.
- Iory1998 highlights the impressive performance of the Qwen 3.6 27B model, noting that it surpasses a 670B model from the previous year. They mention running the Q8 version at 170K with KV cache at FP16 on an RTX 3090 and RTX 5070ti, utilizing 40GB of VRAM, which underscores the model's efficiency and power.
- AngeloKappos discusses the narrowing benchmark gap, sharing their experience running the Qwen3-30b-a3b model on an M2 chip. They note its capability to handle multi-step tool calls effectively, suggesting that if the 27B dense model performs this well, the upcoming 122B model could pose challenges for API providers due to its potential performance.
- Velocita84 raises a point about potential "benchmaxxing" in the reported performance gains of the Qwen 3.6 27B model, implying that some of the improvements might be attributed to optimized benchmarking rather than inherent model capabilities. This suggests a need for scrutiny in evaluating model performance claims.
-
Compared QWEN 3.6 35B with QWEN 3.6 27B for coding primitives (Activity: 491): The post compares two versions of the QWEN 3.6 model, specifically the 35B and 27B parameter versions, on a MacBook Pro M5 MAX with 64GB RAM. The 35B model achieves 72 TPS (tokens per second), while the 27B model achieves 18 TPS. Despite the slower speed, the 27B model produces more precise and correct results for coding tasks, whereas the 35B model is faster but less accurate. The test involved generating a single HTML file to simulate a moving car with a parallax effect, using no external libraries. The models were hosted using Atomic.Chat, with source code available on GitHub. One comment highlights the output of the Qwen 3.6 27B FP8 model using opencode, taking approximately 52 seconds. Another comment provides a visual comparison with the Qwen 3.5 27B Q3 model, suggesting differences in output quality.
- The user "sacrelege" shared a performance result for the Qwen 3.6 27B model using FP8 precision, noting that it took approximately 52 seconds to complete a task with "opencode". This suggests a focus on optimizing model performance through precision adjustments, which can significantly impact computational efficiency and speed.
- User "nikhilprasanth" provided a visual comparison for the Qwen 3.5 27B Q3 model, indicating a potential interest in comparing different versions and quantization levels of the Qwen models. This highlights the importance of understanding how different model configurations can affect performance and output quality.
- "Technical-Earth-3254" inquired about the quantization methods used in the tests, which is crucial for understanding the trade-offs between model size, speed, and accuracy. Quantization can greatly influence the efficiency of large models like Qwen, especially in resource-constrained environments.
-
Qwen 3.6 27B is a BEAST (Activity: 1239): The post discusses the performance of the Qwen 3.6 27B model on a high-end laptop with an RTX 5090 GPU and 24GB VRAM, highlighting its effectiveness for pyspark/python and data transformation debugging tasks. The user employs llama.cpp with q4_k_m at q4_0 and is exploring further optimizations with IQ4_XS at 200k q8_0. The user has not yet implemented speculative decoding. The setup includes an ASUS ROG Strix SCAR 18 with 64GB DDR5 RAM. Comments suggest avoiding a q4 KV cache for coding and recommend q8 for a 130k context. Another comment anticipates performance improvements with upcoming releases from z-lab and a specific GitHub pull request that promises a 2x decode speed increase. There is also curiosity about the model's performance on systems with 16GB VRAM and 32GB DDR5 RAM with offloading.
- sagiroth highlights a technical consideration when using Qwen 3.6 27B for coding tasks, advising against using the KV cache as q4 due to limitations, and instead suggests using q8 to achieve a 130k context window, which can significantly enhance performance for large context tasks.
- inkberk points out an upcoming improvement in decoding speed, referencing pull request #22105 on the llama.cpp repository. This update, along with the anticipated release of the "dflash drafter" by z-lab, promises a potential 2x increase in decode speed, which could greatly benefit users in terms of efficiency.
- Johnny_Rell inquires about the performance of Qwen 3.6 27B on a system with 16 GB VRAM and 32 GB DDR5, specifically regarding the effectiveness of offloading. This suggests a focus on optimizing resource allocation to handle the model's demands, which is crucial for running large models efficiently on consumer-grade hardware.
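The q4 versus q8 KV cache trade-off discussed in this thread can be made concrete with a rough size estimate. The sketch below uses the standard "keys plus values per layer per position" count and the ggml block sizes for f16, q8_0, and q4_0 (32 elements per block, with one fp16 scale for the quantized types). The layer count, KV head count, and head dimension are hypothetical placeholders, not the published Qwen 3.6 27B configuration, so the printed sizes are illustrative only.

```python
# Bytes per block of 32 elements for common llama.cpp KV cache types.
# f16 stores 2 bytes per element; q8_0 and q4_0 add one fp16 scale per block.
BLOCK_ELEMS = 32
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Estimate KV cache size: keys plus values for every layer and position."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return elems * BLOCK_BYTES[cache_type] // BLOCK_ELEMS


if __name__ == "__main__":
    # HYPOTHETICAL dimensions, chosen only to illustrate the scaling.
    dims = dict(n_layers=64, n_kv_heads=8, head_dim=128)
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(130_000, cache_type=cache_type, **dims)
        print(f"{cache_type}: {size / 2**30:.1f} GiB at 130k context")
```

Under this estimate, q8_0 needs a little over half the memory of f16 and q4_0 a little over a quarter, which is why q4 is tempting on a 24GB card even though commenters report it hurts coding quality.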
3. Local AI Model Implementations and Innovations
-
Been using PI Coding Agent with local Qwen3.6 35b for a while now and its actually insane (Activity: 656): The post discusses the use of the PI Coding Agent with the Qwen3.6 35b a3b q4_k_xl model for real-world projects, highlighting the effectiveness of a custom "plan-first" skill file. This file enforces a structured workflow by requiring a TODO.md approval before any code execution, ensuring tasks are completed in a planned and orderly manner. The model is run locally, demonstrating significant advancements in local model capabilities. The skill file includes phases for project analysis, clarifying questions, TODO.md creation, revision loops, and task execution, emphasizing a disciplined approach to coding tasks. The setup achieves 15-30 tokens per second on an 8GB VRAM and 32GB RAM laptop, showcasing the model's efficiency on modest hardware setups. Commenters share similar setups, with one using a Macbook Pro M4 Pro with 48GB RAM, noting the model's speed and intelligence, leading to the cancellation of IDE and Claude subscriptions. Another user highlights the availability of "plan mode" as an extension in official examples, indicating community interest and adoption.
- SoAp9035 shares their configuration for running the Qwen3.6-35B model using llama.cpp, highlighting specific parameters such as --temp 0.6, --top-p 0.95, and --top-k 20. They achieve a performance of 15-30 tokens per second on a setup with 8GB VRAM and 32GB RAM, indicating efficient use of resources for local model inference.
- ibishitl mentions using a similar setup with a Macbook Pro M4 Pro and 48GB RAM, noting the system's speed and intelligence in task completion. They have replaced their IDE and Claude subscriptions, suggesting that the local setup with Qwen3.6-35B is both cost-effective and capable enough to meet their needs.
- audiophile_vin discusses using the Qwen3.6 27B model locally and finds it impressive. They reference an extension called "Plan mode" available in the official examples on GitHub, which can enhance the functionality of the coding agent. This highlights the flexibility and expandability of the local setup.
-
Qwen-3.6-27B, llamacpp, speculative decoding - appreciation post (Activity: 402): The post discusses an experiment using speculative decoding with the Qwen-3.6-27B model, demonstrating significant improvements in token generation speed from 13.60 t/s to 136.75 t/s. The user attributes this to specific settings in the llama-server command, particularly the use of --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 12 --draft-max 48. The setup includes a Linux PC with 40GB VRAM and 128GB DDR5 RAM, utilizing RTX3090 and RTX4060ti GPUs. The user notes recent changes in llama.cpp and provides links to documentation and a pull request for further reading. Commenters discuss the necessity of the --no-mmproj-offload parameter for speculative decoding, with some not observing speed gains on different hardware setups. There is also curiosity about which model was used for drafting and skepticism about the speed improvements in different use cases.
- EatTFM is questioning the necessity of the --no-mmproj-offload flag for speculative decoding, noting no speed gains on an RTX5090 with their current setup. They provide a detailed command line configuration for llama.cpp using the Qwen-3.6-27B model, highlighting parameters like --spec-type ngram-mod and --spec-ngram-size-n 24. They suspect an incompatibility with another parameter might be the issue.
- kiwibonga points out a limitation of using n-grams in speculative decoding, specifically mentioning that it "doesn't work for coding" and can "break tool calls." This suggests that while n-grams might be beneficial for certain text generation tasks, they may introduce issues in contexts requiring precise tool integration or code generation.
- nunodonato shares their experience, noting no observable speed difference with speculative decoding in their use case. This implies that the benefits of speculative decoding might be context-dependent, potentially varying with different hardware setups or specific model configurations.
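For readers unfamiliar with n-gram drafting, the general idea can be sketched in a few lines: instead of running a separate draft model, the last n tokens of the context are looked up earlier in the same context, and whatever followed them is proposed as the draft. The target model then verifies the draft and keeps only the matching prefix. This is a simplified illustration of the technique, not llama.cpp's ngram-mod implementation; the function names and the toy token lists are hypothetical.

```python
def ngram_draft(tokens, n, max_draft):
    """Propose draft tokens by reusing what followed the most recent
    earlier occurrence of the last n tokens in the context."""
    if len(tokens) <= n:
        return []
    key = tokens[-n:]
    # Scan backwards, skipping the trailing occurrence of the key itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            return tokens[start + n:start + n + max_draft]
    return []


def accepted_prefix_len(draft, target):
    """Count how many leading draft tokens match the target model's tokens."""
    count = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        count += 1
    return count


if __name__ == "__main__":
    context = [1, 2, 3, 4, 1, 2]          # toy token ids
    draft = ngram_draft(context, n=2, max_draft=3)
    print("draft:", draft)
    # Pretend the target model would actually have produced [3, 4, 9].
    print("accepted:", accepted_prefix_len(draft, [3, 4, 9]))
```

The sketch also suggests why results vary between commenters: drafts are only accepted when the output repeats earlier context, so repetitive text benefits while novel output gains little.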
-
just wanted to share (Activity: 1336): The user has developed a distributed AI system named "Chappie" using a cluster of four Mac Mini M4 Pros, each contributing to a unified node cluster with 256GB of unified memory, 56 CPU cores, 80 GPU cores, and 64 Neural Engine cores. The system utilizes Exo for pooling nodes into a distributed inference cluster and employs a Qdrant vector database for memory sharing and replication. Chappie autonomously generates questions, reads arXiv papers, and develops new skills based on its findings. It features a sub-agent framework for task distribution and a "council" of reviewer models to ensure quality control of its outputs. The AI's architecture includes a mix of models such as Qwen 3.6 35B, Qwen 3.6 27B, and others for various tasks, with a focus on autonomous exploration rather than being a mere tool or assistant.
- bionicdna highlights a technical improvement by suggesting the use of RDMA over Thunderbolt for clustering, which Apple now supports. This could potentially enhance performance compared to using 10G Ethernet, as RDMA (Remote Direct Memory Access) allows for faster data transfer by enabling direct memory access from the memory of one computer into that of another without involving either one's operating system.
Less Technical AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo
1. GPT-5.5 Launch and Benchmarks
-
Introducing GPT-5.5 (Activity: 1407): OpenAI has released GPT-5.5, which is priced at $5 per 1 million input tokens and $30 per 1 million output tokens, doubling the cost of its predecessor, GPT-5.4. The model is optimized for tasks like coding and knowledge work, offering state-of-the-art accuracy in complex workflows with low latency and token usage. It includes advanced safeguards to prevent misuse and is available to Plus, Pro, Business, and Enterprise users, with API access to follow. For more details, see the original article. There is skepticism about the effectiveness of the new safeguards, as indicated by the comment, "We are releasing GPT-5.5 with our strongest set of safeguards to date" 🫪 oh boy, suggesting doubts about their robustness.
- MapForward6096 highlights the pricing structure for GPT-5.5, noting it costs $5 per 1 million input tokens and $30 per 1 million output tokens, which is double the price of GPT-5.4. This suggests a significant increase in cost for users, potentially impacting budget allocations for projects relying on this model.
- spryes criticizes GPT-5.5's performance on the SWE-Bench Pro benchmark, where it scored 58.6%, compared to Mythos, which achieved 78%. This comparison indicates that GPT-5.5 may not be as competitive in certain technical benchmarks, raising questions about its efficacy relative to other models.
- mph99999 expresses disappointment with GPT-5.5, describing it as a "micro step forward" rather than the significant advancement expected. This sentiment suggests that the improvements in GPT-5.5 may not meet the expectations set by previous announcements or marketing, particularly in terms of innovation or performance enhancements.
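To see what the per-token pricing means for a single request, here is a small cost calculator. The GPT-5.5 prices come from the post; the GPT-5.4 prices are an assumption derived from the "double the price" remark and were not stated directly. The model keys and the function name are illustrative, not API identifiers.

```python
# USD per 1 million tokens. GPT-5.5 prices are from the post; the GPT-5.4
# prices are an ASSUMPTION inferred from the "double the price" remark.
PRICES = {
    "gpt-5.5": {"input": 5.0, "output": 30.0},
    "gpt-5.4": {"input": 2.5, "output": 15.0},
}


def request_cost(model, input_tokens, output_tokens):
    """Return the USD cost of one request at list price."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    for model in PRICES:
        cost = request_cost(model, input_tokens=200_000, output_tokens=50_000)
        print(f"{model}: ${cost:.2f}")
```

A per-token price comparison does not capture the claim that GPT-5.5 uses fewer tokens per task, so the effective cost difference for a given workload may be smaller than the list prices suggest.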
-
GPT-5.5 benchmark results have been released (Activity: 779): The image presents a comparative analysis of AI models' performance on various benchmarks, highlighting GPT-5.5 and its variants. GPT-5.5 shows improved performance over its predecessor, GPT-5.4, and other models like Claude Opus 4.7 and Gemini 3.1 Pro. Notably, GPT-5.5 Pro achieves a 90.1% score in the BrowseComp benchmark, indicating significant advancements in browsing capabilities. However, the SWE-Bench Pro results are less impressive, with only a marginal increase from 57.6% to 58.6%, compared to Mythos's 77.8%. Commenters note the marginal improvements in some benchmarks, particularly criticizing the small increase in the SWE-Bench Pro score and suggesting that the results were selectively highlighted to favor GPT-5.5. There is also a sentiment against prematurely judging models based solely on benchmark scores without practical usage.
- MapForward6096 and spryes highlight that GPT-5.5 shows only a marginal improvement in the SWE-Bench Pro benchmark, increasing from 57.6% to 58.6%, while the Mythos model achieves a significantly higher score of 77.8%. This suggests that GPT-5.5 may not be competitive in this specific benchmark compared to Mythos.
- TuteliniTuteloni points out a potentially overlooked advantage of GPT-5.5: it delivers better results with significantly fewer tokens. This efficiency in token usage could be a critical factor for applications where computational resources or processing time are limited, offering a practical benefit despite the modest benchmark improvements.
- BrennusSokol expresses skepticism about GPT-5.5, questioning whether it represents a significant advancement or just an incremental update. This reflects a desire within the community for a more substantial leap in AI capabilities, rather than minor improvements.
-
Chat GPT 5.5 got launched and we got some really bold words by Sam Altman. Thoughts? (Activity: 784): The image is a tweet from Sam Altman discussing the launch of GPT-5.5, emphasizing the importance of iterative deployment for rapid improvements and democratizing AI to ensure equal access. Altman highlights the platform's focus on cybersecurity and its ability to support a wide range of users, including companies and entrepreneurs. The new version reportedly uses fewer tokens and operates with lower latency, which could enhance performance and accessibility. The comments reflect a mix of skepticism and support, with some users expressing distrust towards overly positive messaging, while others show enthusiasm for the advancements.
-
thoughts on GPT 5.5 (Activity: 1414): The image is a meme that humorously comments on the release of a new version, likely GPT 5.5, by sarcastically celebrating the increase in version number. The playful tone reflects excitement about the "number business," suggesting a light-hearted take on version updates. Commenters express a desire for improved voice mode in GPT 5.5 and compare it favorably to Claude, indicating that users are looking for specific enhancements and are generally positive about the update.
- One_Internal_6567 highlights that GPT-5.5 Pro is significantly better than its predecessors, noting a visible improvement from version 5.2 to 5.4. This suggests a consistent enhancement in performance and capabilities across these iterations, which may include better handling of complex queries or more efficient processing.
- hardworkinglatinx compares GPT-5.5 favorably against Claude, implying that GPT-5.5 offers superior performance or features. This could involve aspects like response accuracy, speed, or the ability to handle diverse topics more effectively.
- blownaway4 expresses a positive view of GPT-5.5, describing it as "great." While lacking specific technical details, this sentiment may reflect general satisfaction with the model's improvements or new features introduced in this version.
-
ChatGPT 5.5 🔥🔥🔥 (Activity: 1359): The image is a humorous depiction of a conversation with ChatGPT 5.5, where the AI suggests walking instead of driving to a car wash 50 meters away. This showcases the model's ability to provide practical advice based on context, emphasizing energy efficiency and convenience. The conversation highlights the AI's reasoning capabilities, as it considers factors like unnecessary engine starts and the hassle of moving the car for such a short distance. This reflects improvements in the model's contextual understanding and decision-making processes. One commenter notes that the AI's response quality varies with its "thinking" mode, suggesting that extended thinking leads to more accurate responses. Another comment humorously suggests that the question's prevalence on the internet might have influenced the AI's training data.
- Successful-Earth678 discusses the impact of "extended thinking" mode on ChatGPT's performance, noting that when the model is set to think longer, it consistently provides correct answers. This suggests that the model's accuracy can be improved by allowing more processing time, highlighting a potential trade-off between speed and accuracy in AI responses.
- Portatort suggests that the widespread availability of certain questions on the internet may influence ChatGPT's training data, potentially affecting its ability to answer those questions accurately. This raises questions about the model's exposure to common queries and how it impacts its learning and response generation.
- ---0celot--- provides a detailed, practical response from ChatGPT regarding a decision-making scenario about whether to walk or drive a short distance. The response includes considerations for practicality, safety, and environmental conditions, demonstrating the model's ability to offer nuanced advice based on context.
2. DeepSeek V4 Release and Benchmarks
-
DeepSeek V4 has released (Activity: 1407): DeepSeek V4, released on HuggingFace, incorporates the innovative manifold-constrained hyper-connections (MHC) technique, which was detailed in a recent paper. This approach enhances model performance by optimizing the connections within the neural network's manifold space, potentially offering superior results at a competitive price point. One commenter highlights the model's impressive performance relative to its cost, suggesting it offers significant value. Another notes the implementation of the MHC technique as a noteworthy advancement.
- FaceDeer highlights that DeepSeek V4 implements the "manifold-constrained hyper-connections" technique, which was detailed in a recent paper. This approach likely contributes to the model's enhanced performance, as it optimizes the neural network's architecture by constraining connections within a manifold, potentially improving both efficiency and accuracy.
- InterstellarReddit points out the impressive cost-to-performance ratio of DeepSeek V4, suggesting that if the reported statistics hold true, the model could significantly disrupt the American market. This implies that DeepSeek V4 offers substantial computational power or accuracy improvements at a lower cost compared to competitors, making it a competitive choice for businesses and researchers.
- cryyingboy notes DeepSeek's consistent delivery of new models, contrasting it with competitors who may focus more on marketing or theoretical discussions. This suggests that DeepSeek's strategy of frequent, tangible updates could be a key factor in its market success, potentially leading to faster adoption and integration into various applications.
-
DeepSeek V4 Benchmarks! (Activity: 466): The image presents a benchmark comparison of various models, including DS-V4-Pro Max and DS-V4-Flash Max, across categories like "Reasoning Effort," "Knowledge & Reasoning," "Long Context," and "Agentic." The benchmarks used include MMLU-Pro, SimpleQA-Verified, and Codeforces, highlighting each model's strengths and weaknesses. Notably, the DS-V4-Flash Max is praised for its cost-effectiveness, performing comparably to Gemini 3 Flash on artificial analysis tasks but at a significantly lower cost, estimated at about 50 cents per month for typical usage scenarios. Commenters note that while the V4 models excel in coding tasks, they lack image analysis capabilities. The DS-V4-Flash Max is highlighted as a cost-effective option, offering competitive performance at a fraction of the cost of other models.
- Dangerous-Sport-2347 highlights that the DeepSeek V4 Flash model is particularly cost-effective, performing comparably to Gemini 3 Flash in artificial analysis tasks but at a significantly lower cost, approximately 5 times less. This makes it a competitive option for users focused on cost-efficiency, especially for those engaging in frequent AI searches and coding tasks, estimating a monthly API cost of around 50 cents for moderate usage.
-
DeepSeek V4 dropped 1.6T params and 1M context without Nvidia GPUs. Here's the data. (Activity: 470): DeepSeek-V4 introduces a 1.6 trillion-parameter model with a 1 million-token context window, operating without Nvidia GPUs, using Huawei Ascend 950PR silicon. The model features two tiers: V4-Pro with 49B active parameters and V4-Flash with 13B active parameters. It employs Engram Conditional Memory for efficient context management, reducing inference overhead by 85%. The API pricing is projected between $0.14 and $0.28 per million tokens, significantly undercutting competitors. The model's architecture leverages parameter sparsity and native memory retrieval, challenging the Nvidia GPU monopoly and potentially transforming AI economics. Commenters note potential further price reductions and skepticism about the impact on Nvidia's market position. There are also observations about inconsistencies in the model's self-identification and knowledge cutoff, indicating possible issues with model updates.
- Neo_Shadow_Entity highlights a potential issue with DeepSeek V4's self-identification and knowledge cutoff. The model still identifies as DeepSeek-V3 and seems to have a knowledge cutoff at 2025, leading to confusion when discussing events or versions beyond that year. This suggests that the model's internal data or update mechanisms might not be fully synchronized with its latest version, causing it to misinterpret or hallucinate information about DeepSeek V4 from 2026.
- smflx points out a misunderstanding regarding the term "Engram" in the context of DeepSeek V4. Contrary to some expectations, "Engram" is not related to KV-cache but rather to the model's weights. The commenter notes that the Huggingface page lacks a description of "Engram," indicating a need for further investigation to understand its role or presence in the model.
- Wickywire emphasizes the significance of DeepSeek V4's pricing strategy, noting that the model offers substantial capacity at competitive price points. This pricing could significantly alter the landscape for AI users, particularly in environments like Openclaw, where cost-effective, high-capacity models can provide a competitive edge.
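The "parameter sparsity" point can be quantified from the figures quoted in this issue: 1.6T total and 49B active for V4 Pro, 284B total and 13B active for V4 Flash. The sketch below computes the share of weights used per token; it is plain arithmetic on the reported numbers, with hypothetical dictionary keys and function name, and says nothing about how the routing is implemented.

```python
# Parameter counts as reported in this issue (total vs. active per token).
MODELS = {
    "V4-Pro": {"total": 1.6e12, "active": 49e9},
    "V4-Flash": {"total": 284e9, "active": 13e9},
}


def active_fraction(total, active):
    """Share of the model's parameters that participate in each token."""
    return active / total


if __name__ == "__main__":
    for name, params in MODELS.items():
        share = active_fraction(**params)
        print(f"{name}: {share:.2%} of parameters active per token")
```

Per-token compute scales roughly with the active count while memory footprint scales with the total, which is why a 1.6T model can be priced aggressively yet still needs very large serving hardware.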
-
Deepseek-v4 flash and v4 pro (Activity: 549): The image provides a detailed comparison between two AI models, deepseek-v4-flash and deepseek-v4-pro, highlighting their features and pricing. Key differences include the context length and maximum output capabilities, with the v4-pro offering enhanced features like JSON output and tool calls. The pricing structure for input and output tokens is also compared, indicating a cost-benefit analysis for potential users. A notable point from the comments is the deprecation of the deepseek reasoner in favor of the v4 flash thinking mode, which affects performance but still maintains competitive capabilities.
- The discussion highlights that the Deepseek Reasoner is being deprecated in favor of the Deepseek v4 Flash model, which is noted for its impressive performance despite being a "flash" model. Users are surprised by its capability, as it performs almost on par with the previous Deepseek Reasoner, albeit with some caveats. This transition is likely a factor in the recent performance improvements observed in the API, as the Flash model is significantly smaller than its predecessor, Deepseek v3.
- There is a mention of increased costs associated with the Deepseek v4 Pro model, suggesting a shift in the pricing strategy that may affect users who previously enjoyed a balance of quality and affordability. This change implies that while performance may have improved, the financial barrier to access these models has also increased, potentially limiting accessibility for some users.
- The comments also touch on the broader strategic moves by Deepseek, such as joining forces with other entities, which could be influencing these changes in model deployment and pricing. This could indicate a shift in the company's focus towards more integrated or collaborative approaches in AI development.
3. Claude Code Issues and Updates
-
Anthropic just published a postmortem explaining exactly why Claude felt dumber for the past month (Activity: 3991): Anthropic published a postmortem detailing three bugs that caused a perceived degradation in Claude Code's performance. The first bug involved a silent downgrade of reasoning effort from high to medium on March 4, which was reverted on April 7. The second bug, a caching issue from March 26, led to Claude forgetting its reasoning history, causing cache misses and faster usage limit depletion. The third bug, a system prompt change on April 16, limited responses to 25 words between tool calls, affecting coding quality, and was reverted on April 20. These issues, affecting different traffic slices, were fixed by April 20 (v2.1.116), and usage limits are being reset for subscribers. Commenters noted that the issues matched user suspicions, suggesting a disconnect between user feedback and company acknowledgment. The transparency of the postmortem was appreciated, though some users expressed frustration over the initial lack of communication.
- Direct-Attention8597 provides a direct link to the postmortem by Anthropic, which details the technical issues that led to Claude's perceived performance drop. The postmortem is a valuable resource for understanding the specific engineering challenges and resolutions implemented by Anthropic.
- Jack_Dnlz highlights a strategic decision by Anthropic to reset usage limits just before the weekend, suggesting it minimizes the impact on users since many are less active during this time. This implies a calculated approach to managing user experience and resource allocation, potentially reducing the immediate load on their systems.
- Sufficient-Farmer243 comments on the community's ability to diagnose the issues with Claude before official confirmation, suggesting that user feedback and observations were accurate. This highlights the importance of community insights in identifying and understanding AI performance issues.
-
Usage Reset due to Claude Code quality issues (Activity: 615): The image is a tweet from ClaudeDevs explaining a reset of usage limits due to quality issues with Claude Code. After user reports, they investigated and published a post-mortem on three identified issues, which have been fixed in version 2.1.116+. As a result, usage limits have been reset for all subscribers. Some users noted the reset was unusual, with varying remaining time limits, and expressed hope that the fixes would address cache misses and unusual usage limit burn issues.
- YatzyNanimous highlights concerns about cache misses and unusual usage limit burn issues with Claude, suggesting that the reset might address these technical problems. Cache misses can lead to inefficient data retrieval, impacting performance, while unexpected usage limit burns could indicate underlying resource management issues.
- dwight-is-right notes the release of GPT 5.5 and mentions recent open weight releases like Kimi 2.6, GLM 5.1, and qwen 3.6. These releases are significant as they reportedly reduce the performance gap between different AI models, suggesting a competitive landscape where improvements in one model prompt advancements in others.
- The discussion touches on the technical implications of AI model updates and resets, with a focus on how these changes might affect performance and resource allocation. The mention of specific model versions and their impact on the competitive AI field underscores the rapid pace of development and the importance of staying updated with the latest releases.
-
Claude limits no longer round to the nearest hour (Activity: 494): The image highlights a change in the way the AI service Claude manages its usage limits, moving from rounding to the nearest hour to a more precise minute-based system. This adjustment likely addresses user behavior where individuals would send a message just before the hour to maximize their usage limit. The notification also suggests an option to upgrade to a Pro version, indicating a tiered service model. One comment suggests that the previous system was flawed by treating limits as "hourly buckets," which could lead to inefficient usage. Another comment humorously points out the frustration of hitting usage limits quickly, emphasizing the need for better management of message limits.
- jake_that_dude suggests that the issue with Claude's limits is conceptualized as an "hourly bucket," which can lead to inefficient usage. For longer tasks, it's recommended to split work into smaller chats and include detailed handoff notes with state, blockers, and next steps to avoid wasting limits on context churn rather than productive output.
- idiotiesystemique emphasizes the importance of managing chat sessions effectively by opening new chats and creating handover files. This approach can help in maintaining continuity and efficiency, especially when dealing with complex or extended interactions.
- KronosDeret mentions a change in the “fuel management plugin,” implying a technical update or modification that could affect how resources or limits are managed within the system. This could be relevant for users needing to adapt to new configurations or settings.
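To make the difference between hour-rounded and minute-based limits concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the 5-hour `WINDOW`, the function names, and the rounding rule are hypothetical, and the post does not describe how Claude actually computes reset times.

```python
from datetime import datetime, timedelta

# Assumed window length, for illustration only; not taken from the post.
WINDOW = timedelta(hours=5)


def round_to_nearest_hour(t: datetime) -> datetime:
    """Round a timestamp to the nearest hour (30 minutes or more rounds up)."""
    floored = t.replace(minute=0, second=0, microsecond=0)
    if t - floored >= timedelta(minutes=30):
        return floored + timedelta(hours=1)
    return floored


def reset_time_hourly(first_message: datetime) -> datetime:
    """Hypothetical old behavior: the window start is snapped to the nearest hour."""
    return round_to_nearest_hour(first_message) + WINDOW


def reset_time_exact(first_message: datetime) -> datetime:
    """Hypothetical new behavior: the window is anchored to the actual minute."""
    return first_message.replace(second=0, microsecond=0) + WINDOW


if __name__ == "__main__":
    for hour, minute in [(9, 20), (9, 50)]:
        sent = datetime(2026, 4, 24, hour, minute)
        print(sent.time(), "hourly:", reset_time_hourly(sent).time(),
              "exact:", reset_time_exact(sent).time())
```

Under these assumptions, the hour-rounded reset can land up to 30 minutes before or after the minute-based one, which is the kind of gap that rewards timing a first message around the hour boundary.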
-
Claude reset limits for everyone (Activity: 2094): The image depicts a dashboard for a service, likely related to AI or machine learning usage, showing that usage limits have been reset to 0% for all categories, including “Current session,” “All models,” and “Claude Design.” This reset suggests a change in the service’s usage policy or a temporary reset of limits, which could be related to a new feature or update, such as the rumored launch of GPT-5.5. The reset is beneficial for users who were nearing their usage limits, as noted in the comments. One comment humorously suggests that the billing system is “vibes-based,” implying unpredictability or inconsistency in how limits are managed. Another comment notes that the reset is advantageous for users who were close to their limits, but also mentions that limits seem to be consumed faster post-reset, indicating potential changes in usage tracking or model efficiency.
- National-Data-3928 highlights a significant issue with the reset of usage limits, noting that they are burning through their limits faster than before. This suggests potential changes in the underlying usage tracking or billing algorithm, which could impact users who rely heavily on the service.
- DispensingLCQP expresses frustration with the reset timing, which unexpectedly shifted their usage cycle from Thursday to Friday. This change disrupts their planned usage pattern, particularly affecting those who schedule their usage around specific days. The comment also criticizes Opus 4.7 for its performance in creative writing tasks, indicating dissatisfaction with its capabilities compared to other models.
AI Discords
Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.