a quiet day.
AI News for 5/4/2026-5/5/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!
AI Twitter Recap
Agent Products, Harnesses, and the Shift Beyond “Just the Model”
- The product surface is moving up-stack: A recurring theme was that model quality alone is no longer the moat; the winning product is increasingly model + harness + workflow + UI + memory + economics. @gdb put it bluntly: “the model alone is no longer the product,” while @dzhng argued top-tier products need model <> harness <> product symbiosis. The same pattern shows up in practice: @signulll framed ambient AI and agentic AI as the new seam of computing interfaces, and @teortaxesTex noted that harness research still risks converging on “replicate Claude Code” instead of exploring broader interfaces.
- Coding-agent product differentiation is becoming concrete: OpenAI shipped another substantial Codex update via “codex thursday no. 6” with appshots, /goal improvements, remote computer use while locked, annotation mode, plugin sharing, and analytics. @gdb separately highlighted Appshots, while users reported meaningful workflow shifts: @gdb said it’s hard to remember coding before Codex, and @reach_vb said they haven’t opened an IDE in over a month. But product rough edges remain: @theo praised T3 Code’s remote feature as ahead of alternatives, then contrasted it with buggy remote workflows in Codex in a follow-up post. On the Claude side, @ClaudeDevs expanded auto mode to the Pro plan and added Sonnet 4.6 support; @_mohansolo also had to clarify and patch IDE support in Antigravity 2.0 after user backlash.
Model Performance, Cost Curves, and Frontier Competition
- DeepSeek’s pricing move was the biggest market signal: @deepseek_ai made the 75% DeepSeek-V4-Pro discount permanent, triggering strong reactions because it materially changes the cost/performance frontier. @ArtificialAnlys quantified first-party pricing at $0.435/M input, $0.87/M output, $0.0036/M cached input, estimating a blended ~$0.18/M and placing V4 Pro on the Pareto frontier for intelligence vs run cost. They estimate running their Intelligence Index on V4 Pro costs ~3x less than Gemini 3.1 Pro Preview, ~12x less than GPT-5.5, and ~19x less than Claude Opus 4.7. Community reaction centered on DeepSeek’s push toward “intelligence too cheap to meter,” as @scaling01 put it. @Yuchenj_UW and @kimmonismus both emphasized the magnitude of the cut.
- Gemini Flash improved, but usage feedback was mixed: @OfficialLoganK reported Gemini 3.5 Flash making major progress over 3.1 Pro on GDPval, claiming Flash is now “competing at the frontier,” and @Designarena placed it 16th overall on Design Arena, a 16-position jump from Gemini 3 Flash Preview. But several builders pushed back on usefulness vs benchmark gains: @Alezander907 saw only slight browser-agent improvement at higher cost, @giffmana argued this isn’t “Flash progress” if the brand still implies cheapness, and @jeremyphoward said the model feels optimized to max evals rather than cooperate with humans. That aligns with broader eval skepticism from @HamelHusain, who argued current tooling underweights qualitative, HITL judgment.
- Qwen and Chinese frontier models keep compressing the race: The official @Alibaba_Qwen teasers and a long third-party review from @ZhihuFrontier portrayed Qwen3.7-Max as a meaningful step up, especially in instruction following, context reliability, and stability, while still suffering from verbosity and high token usage. Elsewhere, @scaling01 claimed recent ALE-Bench runs show Chinese models like Kimi-K2.6, DeepSeek-V4, GLM-5.1 outperforming several Western releases in that setting. @ArtificialAnlys also reported Cursor Composer 2.5 as 3–18x cheaper than Opus 4.7 and 5–32x cheaper than GPT-5.5 on Coding Agent benchmarks, with notably lower token use.
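To make the blended-price arithmetic in the DeepSeek item concrete, here is a minimal sketch. The three per-token prices are the ones quoted above. The traffic mix (share of uncached input, output, and cached input tokens) is a hypothetical assumption, because the weighting behind the ~$0.18/M estimate is not stated in this issue. The function name `blended_price` is also illustrative.

```python
def blended_price(prices, mix):
    """Weighted average price in USD per million tokens.

    prices: USD per million tokens, keyed by token type.
    mix: fraction of total tokens per type; fractions must sum to 1.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix fractions must sum to 1")
    return sum(prices[kind] * share for kind, share in mix.items())


# Prices quoted above for DeepSeek-V4-Pro (USD per million tokens).
V4_PRO = {"input": 0.435, "output": 0.87, "cached_input": 0.0036}

# Hypothetical traffic mix, NOT the weighting used by Artificial Analysis.
ASSUMED_MIX = {"input": 0.20, "output": 0.10, "cached_input": 0.70}

if __name__ == "__main__":
    price = blended_price(V4_PRO, ASSUMED_MIX)
    print(f"blended: ${price:.4f} per million tokens")
```

Under this assumed cache-heavy mix the result is about $0.177/M, which shows how strongly the cached-input share drives the blended figure. A different mix would give a different number.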
Protocols, Infra, and Agent Runtime Tooling
- MCP’s new release candidate is a substantive protocol simplification: @dsp_ announced the MCP 2026-07-28 release candidate, with the key change that the protocol is now stateless: no handshake, no session ID, and any request can hit any server instance. The RC also introduces first-class extensions like MCP Apps and Tasks, plus auth hardening and a clearer deprecation policy. For infra teams, statelessness is a big operational shift: easier scaling, simpler load balancing, fewer sticky-session concerns.
- Sandboxes and managed execution are becoming first-class primitives: @_philschmid demoed Gemini Managed Agents + Interactions API to give an agent a secure hosted Linux sandbox with memory and code execution. @CoreWeave launched CoreWeave Sandboxes in public preview for RL, agent tool use, and model eval, while @cnakazawa released Cloudsail for per-task Cloudflare sandboxes with shell, Codex, and GitHub access without exposing tokens. At the orchestration layer, @skypilot_org argued RL doesn’t work on Slurm because modern RL is a multi-service system with heterogeneous hardware and recovery needs.
- Open-source harnesses and memory layers are proliferating: @NVIDIAAI open-sourced AI-Q agent skills for portable deep-research pipelines that can plug into arbitrary harnesses. @Teknium added Bitwarden support for key management in Hermes and later restored 256K context for Grok Build v0.1 in Hermes. @shannholmberg described a shared-memory “gBrain” layer under Hermes agents, with typed folders and read-first access for specialist agents. @aakashadesara updated CTOP to support Devin and a CLI for listing, searching, and killing agent sessions.
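The operational point about statelessness in the MCP item can be illustrated with a toy example. This is a sketch of the general idea only: the request shape, the `add` method, and the `Instance` class are hypothetical and are not the MCP wire format or any MCP SDK API. The handler reads everything it needs from the request itself, so a plain round-robin balancer can send any request to any replica.

```python
import itertools


def handle(request: dict) -> dict:
    """Pure handler: the reply depends only on the request contents."""
    if request["method"] == "add":
        a, b = request["params"]
        return {"id": request["id"], "result": a + b}
    return {"id": request["id"], "error": "unknown method"}


class Instance:
    """One server replica. It keeps no per-client session table."""

    def __init__(self, name: str):
        self.name = name

    def serve(self, request: dict) -> dict:
        return handle(request)


def round_robin(instances):
    """Yield replicas in rotation, as a simple load balancer would."""
    return itertools.cycle(instances)


if __name__ == "__main__":
    balancer = round_robin([Instance("a"), Instance("b"), Instance("c")])
    request = {"id": 1, "method": "add", "params": [2, 3]}
    for _ in range(3):
        replica = next(balancer)
        print(replica.name, replica.serve(request))
```

Because no replica holds session state, the same request gets the same reply wherever it lands, which is what removes the need for sticky sessions.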
Research: RL, Distillation, Architectures, and Evaluation
- RL post-training and reward design are under active reconsideration: @RyanBoldi introduced Vector Policy Optimization (VPO), arguing scalar reward collapse during RL can sabotage test-time scaling. VPO instead optimizes vector-valued rewards, improving search performance even on the original scalar objective. @lateinteraction framed this as a way to train LLMs for more diverse environments and goals, while @FeiziSoheil connected it to broader moves toward structured feedback instead of a single reward number. Separately, @jsuarez teased a solution to a long-standing RL problem involving extreme sparsity, with initial sweeps showing SOTA on one internal environment.
- Agent compilation/distillation is emerging as a serious economic idea: @dair_ai highlighted a paper showing a full agentic workflow—multi-step calls, tool use, scratchpads, decision structure—can be distilled into weights and run at ~100x lower inference cost while preserving near-frontier quality. This is one of the clearest technical arguments yet for compiling expensive runtime agent loops into cheaper deployable models.
- Architecture work remains lively beyond vanilla transformers: @ChunyuanDeng introduced LT2, a linear-time looped transformer combining sparse and linear attention to make looping practical, along with a distilled Ouro-hybrid-1.4B. @ZyphraAI shared work extending Equilibrium Propagation beyond energy-based models toward biologically realistic neurons. On MoE, @Jianlin_S proposed Moving Quantile Balancing for sequence-level load balancing without a loss penalty. Meanwhile @allen_ai launched ArtifactLinker, which predicts which benchmarks a model is likely to set SOTA on before running them—a useful meta-eval tool amid growing benchmark sprawl.
- Math and reasoning capability discourse shifted again: @cozyblaze265065 reported 99.46% on a multi-digit multiplication experiment using gpt-5.5 with medium reasoning and no tools, and @teortaxesTex noted modern LLMs can now do 100-digit multiplication without tools. That’s not a complete theory of reasoning, but it further weakens old “autoregression can’t do arithmetic” talking points.
Multimodal Systems: Video, Speech, World Models, and Imaging
- Google’s I/O stack pushed toward persistent agents and world simulators: @Google introduced Gemini Spark, a 24/7 personal AI agent for recurring tasks, skills, and workflows. @GoogleDeepMind also launched Project Genie + Street View, letting users turn real U.S. locations into interactive worlds; follow-up posts confirm rollout to Google AI Ultra subscribers via Google Labs. The multimodal side was reinforced by @Google announcing Gemini Omni for conversational video creation/editing and custom avatars, while @emollick emphasized the significance of a fully multimodal system that can natively edit video.
- Runway and image/video tooling keep raising editability: @runwayml released Aleph 2.0, supporting multishot sequences up to 30s at 1080p with targeted edits that preserve the rest of the scene. @CuriousRefuge highlighted SeeDance 2 Stitcher for seamlessly extending AI-generated cinematic clips using Omni-generated continuations.
- Speech and image generation saw notable jumps: @ArtificialAnlys ranked Cartesia Sonic-3.5 as the new #1 TTS model on their Speech Arena, citing an Elo of 1218, support for 42 languages, and strong naturalness/transcript following. Cartesia claims 82ms end-to-end first audio in production. In image generation, @wildmindai flagged Tencent’s Z-Image 6B as a pixel-space generator with no VAE, 1K resolution, and a transfer framework for converting Flux/SD models; related ecosystem work included Pixal3D demos from @victormustar and training support for Z-Image L2P 1k in AI Toolkit from @ostrisai.
Security, Cyber, and Policy Pressure
- Cybersecurity is quickly becoming a proving ground for advanced agents: @AnthropicAI said Project Glasswing and partners found more than ten thousand high- or critical-severity vulnerabilities in essential software within a month, and explicitly warned the industry will need to adapt to the volume of vulnerabilities that models like Claude Mythos Preview can find. Security productization is following: @perplexity_ai open-sourced Bumblebee, a read-only scanner for macOS/Linux to detect risky packages, extensions, and AI tool configs; @AravSrinivas said enterprise deployment will require agentic sandboxes plus continuous security engineering.
- US immigration policy changes triggered sharp backlash from AI leaders: Several high-engagement posts argued a proposed rule forcing green-card applicants to apply from outside the US would directly damage the AI talent pipeline. See @Nick_Davidov, @AndrewYNg, @theo, @garrytan, and @togelius. The common argument: the rule punishes legal high-skill immigrants, undermines startups and research, and harms US competitiveness in AI.
Top tweets (by engagement)
- @deepseek_ai on making the V4-Pro discount permanent: the clearest single market signal in this batch on LLM inference economics.
- @gdb on “the model alone is no longer the product”: a concise articulation of the current agent/harness product thesis.
- @AnthropicAI on Glasswing finding 10,000+ high- or critical-severity vulnerabilities: one of the strongest data points for AI-driven cyber capability moving into production.
- @dsp_ on MCP 2026-07-28 RC: an important protocol update, with stateless MCP plus first-class extensions.
- @GoogleDeepMind on Project Genie + Street View: a notable step toward consumer-facing world models.
- @cursor_ai on opening the Cursor SDK for custom agents: relevant for teams building on top of coding-agent infrastructure.
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. Qwen 3.7 Launch and Qwen 3.6 Local Performance
- Waiting for Qwen 3.7 open weight… The new King has arrived… (Activity: 1217): The image is a benchmark/marketing comparison from the Qwen3.7 blog positioning Qwen3.7-Max as a leading frontier model across agentic coding, software engineering, MCP/tool-use, reasoning, and knowledge evaluations versus Qwen3.6-Plus, DS-V4-Pro Max, GLM-5.1, Kimi K2.6, and Claude Opus-4.6 Max. The technical significance is that the slide frames Qwen3.7-Max as highly competitive with or ahead of Claude-class models on many benchmarks, though Claude Opus-4.6 Max still appears to lead on some tasks such as ClawEval and CoWorkBench. Commenters note that this is the Max model, not necessarily representative of smaller/open-weight releases, and speculate about a potential 3.7-122B-A17B MXFP4 model with 512k context for local hardware such as Strix Halo. The main debate is skepticism around open weights: commenters point out that Qwen has historically not open-weighted the Max series, so the title’s “waiting for open weight” framing may be unrealistic. Others caution not to expect a hypothetical 27B model to match the shown Max-tier benchmark results.
  - Several commenters distinguish Qwen Max from likely open-weight releases, noting that “Qwen has never open-weighted the Max series” and warning not to expect a smaller 27B variant to match Max-level benchmark performance. The implied technical takeaway is that any public/open-weight Qwen 3.7 release may use a different architecture/scale than the benchmarked flagship model.
  - One technical wishlist centers on a hypothetical Qwen 3.7 122B-A17B MTP MXFP4 model with 512k context, which commenters argue would be well-suited to Strix Halo-class local hardware. Another user references Qwen 3.5 397B-A17B NVFP4, claiming it fits on 4x RTX 6000 Pro GPUs with enough memory headroom for roughly 10 concurrent 200k-token sessions, positioning it as a potential “Opus at home” if Qwen 3.7 matches reported benchmarks.
  - A commenter argues that open-weight frontier releases may be less likely because highly capable local models can undermine provider monetization. They claim Qwen’s strategy has shifted from disruption toward monetized frontier competition, which could affect whether large MoE models like 397B-A17B are released openly.
- Qwen3.6 35Ba3 has changed my workflows and even how I use my computer (Activity: 567): The post describes a local-agent workflow using Qwen3.6 35B a3 via pi, where the user converts repeatable procedures into “skills” generated/documented by Codex, then reuses them for VPS DevOps, docling PDF→EPUB conversion, Playwright testing, code tickets, and OS-level shell tasks. A concrete example: WhatsApp audio → transcription in AnythingLLM → content.md → locally generated landing page, then a plan.md ticket queue executed by a “manager” pi process spawning fresh-context sub-agents with pi -p @plan.md "Check the first Ticket with Status UNDONE and do it", marking tickets DONE, committing via git, and finally deploying via a VPS skill. Commenters focused on operational concerns: what hardware can run this setup, whether the agent is sandboxed/trustworthy with OS access, and how hard pi is to adopt compared with other agentic tools such as Hermes.
  - A user reports running unsloth/Qwen3.6-35B-A3B-MTP-GGUF via Unsloth Studio on an MS-02 with a 24GB RTX Pro 4000 Blackwell SFF GPU, consistently seeing >100 tokens/s. They compare performance to “unoptimized GGUFs” on a Mac Studio M2, using the MS-02 as a small remote GPU server for the Mac workstation, and note that future MLX support in Unsloth could improve Mac-side performance. Screenshot: preview.redd.it.
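The plan.md ticket queue described in the Qwen3.6 workflow post can be sketched as a small loop. This is an illustration under stated assumptions, not the poster's setup: the ticket line format (`- [UNDONE] title`) is hypothetical because the post does not show the file, and `worker` is a stand-in callable for whatever spawns a fresh-context sub-agent (the post uses `pi`, which this sketch does not invoke).

```python
import re

# Assumed ticket line format: "- [UNDONE] Build landing page"
TICKET = re.compile(r"^- \[(UNDONE|DONE)\] (.+)$")


def next_undone(lines):
    """Return (index, title) of the first UNDONE ticket, or None."""
    for i, line in enumerate(lines):
        match = TICKET.match(line)
        if match and match.group(1) == "UNDONE":
            return i, match.group(2)
    return None


def run_queue(plan_text, worker):
    """Hand tickets to `worker` one at a time; mark each DONE on success."""
    lines = plan_text.splitlines()
    while (ticket := next_undone(lines)) is not None:
        index, title = ticket
        if not worker(title):
            break  # stop on failure and leave the ticket UNDONE
        lines[index] = f"- [DONE] {title}"
    return "\n".join(lines)


if __name__ == "__main__":
    plan = "# Plan\n- [DONE] Set up repo\n- [UNDONE] Write content\n- [UNDONE] Deploy"
    print(run_queue(plan, lambda title: True))
```

Keeping the queue state in the file rather than in the manager's context is what lets each sub-agent start fresh: it only needs the file and the instruction to take the first UNDONE ticket.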
- 110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp (Activity: 565): The post benchmarks Qwen3.6-35B-A3B MTP using byteshape’s IQ4_XS 4.19 bpw GGUF on an RTX 4070 Super 12GB + Ryzen 7 9700X, comparing upstream llama.cpp vs ik_llama.cpp with --ctx-size 131072, q8_0 KV cache, MTP draft max 3, and p_min=0.75. Using the same mtp-bench.py workload, upstream llama.cpp averaged 89.76 tok/s with aggregate MTP accept rate 0.9393, while ik_llama.cpp averaged 110.24 tok/s over 16.64s, a claimed 23% throughput gain, despite lower aggregate accept rate 0.8749 in the updated results. The OP attributes practical fit to --fit/--fit-margin 1664 on ik_llama.cpp, with OOM mitigation by raising --fit-margin to 1792 or 2048, and notes that running the display on an iGPU frees essentially all 12GB VRAM for inference. Commenters focused on reproducibility: they requested the full upstream llama.cpp command and noted that several MTP-related PRs had merged recently, so benchmark timing may depend strongly on build date. One technical workaround suggested for single-GPU CachyOS/KDE users is a software-rendered Plasma Wayland session using LIBGL_ALWAYS_SOFTWARE=1 and GALLIUM_DRIVER=llvmpipe, reducing idle VRAM from roughly >1024MB to 126MB at the cost of slow/disabled compositor effects.
  - A CachyOS/KDE Wayland user described a VRAM-saving workaround for single-GPU systems: create a custom SDDM session that forces KDE Plasma to render via CPU using LIBGL_ALWAYS_SOFTWARE=1, GALLIUM_DRIVER=llvmpipe, and KWIN_COMPOSE=Q. They reported KDE Wayland idle VRAM dropping from >1024 MB to ~126 MB, freeing nearly a gigabyte of VRAM for running the 35B model, at the cost of disabled or very slow compositor animations.
  - Several commenters focused on whether the reported 110 tok/s comes from ik_llama.cpp having better MTP/speculative decoding behavior than upstream llama.cpp. One noted that ik_llama.cpp’s acceptance rate was reportedly never below 0.790, while llama.cpp dropped as low as 0.477, asking for the exact llama.cpp command/settings and noting that multiple MTP-related PRs had landed in llama.cpp within the previous 24 hours.
  - A commenter asked about the IQ4_XS quantization used for Qwen3.6 35B A3B, noting it appears to be the lowest-memory Q4 quant and requesting details on both model quality/intelligence impact and the final VRAM/RAM split. This highlights the key tradeoff for 12 GB VRAM runs: fitting the model via aggressive quantization versus maintaining reasoning quality and avoiding excessive CPU/RAM offload bottlenecks.
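The headline numbers in the ik_llama.cpp benchmark post can be checked with a few lines of arithmetic. The figures below are the ones reported in the post; the helper name `relative_gain` is just for illustration, and the VRAM figure treats the reported ">1024 MB" as exactly 1024 MB, so it is a lower bound.

```python
def relative_gain(baseline, candidate):
    """Fractional improvement of candidate over baseline."""
    return (candidate - baseline) / baseline


UPSTREAM_TOK_S = 89.76   # upstream llama.cpp, as reported
IK_TOK_S = 110.24        # ik_llama.cpp, as reported

gain = relative_gain(UPSTREAM_TOK_S, IK_TOK_S)

# Idle VRAM before and after the software-rendered Plasma session (MB).
IDLE_BEFORE_MB = 1024    # reported as ">1024", so this is a floor
IDLE_AFTER_MB = 126
freed_mb = IDLE_BEFORE_MB - IDLE_AFTER_MB

if __name__ == "__main__":
    print(f"throughput gain: {gain:.1%}")      # 22.8%
    print(f"VRAM freed: at least {freed_mb} MB")  # 898 MB
```

The computed 22.8% is consistent with the post's rounded "23%" claim, and the roughly 898 MB saved matches the "nearly a gigabyte" description.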
2. Open-Source AI Funding and Legal Pressure
- Heretic has been served a legal notice by Meta, Inc. (Activity: 2705): The Heretic Free Software Project says it received an email legal notice from a provider representing Meta Platforms, Inc. and has removed derivatives of Meta’s Llama model weights from Heretic-controlled repositories. The project also announced an official German-hosted Codeberg mirror and says it is working on “technological measures” to preserve access to Heretic-created models without relying on a single hosting provider; the post sarcastically cites Llama as “among the 200 best” models, “trailing only 168 other models” on the LM Arena leaderboard. Top comments focused on the post’s sarcasm, especially the “168 other models” leaderboard jab, and criticized Meta’s enforcement given allegations that Meta used torrented books or copyrighted material in model training.
  - A commenter highlights the legal-response wording that contextualizes Meta’s Llama family against current open/model competition: it is described as ranking within the top 200 on LM Arena, but behind 168 models from 23 competitors. The technical implication raised is that Meta’s naming-enforcement posture is being contrasted with Llama’s relative benchmark standing and a perceived slowdown in recent model releases.
- DeepSeek is pushing forward with $10.29 billion financing round, with Liang Wenfeng committing to continue developing open-source AI models rather than pursuing short-term commercialization goals (Activity: 797): DeepSeek is reportedly advancing a $10.29B financing round, with founder Liang Wenfeng reiterating an AGI-oriented roadmap and a commitment to continue releasing/opening AI models rather than prioritizing near-term commercialization, per Bloomberg. Commenters framed this as a strategic bet that model advantages have short half-lives and that open research can accelerate iteration faster than closed talent/model moats. Top comments argued that local inference users are a small minority, so releasing weights would not materially hurt SaaS/API revenue for labs like OpenAI, Anthropic, Google, or Mistral; any architectural lead was estimated to have roughly a ~1 year shelf life. Another commenter said open models are already “good enough” for coding assistance around GLM 5.1-level capability, and the next frontier is compressing similar capability into smaller, faster, more efficient models.
  - Commenters argued that model weights have a short technical/commercial shelf life: architectural advantages may last only ~1 year, while local inference users are a tiny minority compared with hosted API users. The claim was that OpenAI, Anthropic, Google, Mistral, etc. could release weights without materially harming revenue, because most users lack the hardware/interest to run even a 9B model locally.
  - One technical thread framed current open models as reaching “good enough” capability for coding assistance, citing GLM 5.1 as a threshold model. The remaining priority, according to the comment, is not raw intelligence but distillation/compression: preserving that coding capability in smaller, faster, and more efficient deployable models.
  - A commenter pointed to DeepSeek’s own report saying they are working on adding multimodal capabilities: DeepSeek_V4.pdf. The notable technical angle was that DeepSeek is continuing model expansion despite GPU/export-sanction constraints, suggesting continued progress under limited hardware access.
Less Technical AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo
1. Claude Code Workflows and Anthropic Agent Training
- Claude Code dropped /workflows (Activity: 1074): The image is a simple Claude-branded announcement graphic for /workflows in Claude Code, tied to the post’s claim that Anthropic briefly exposed a new workflow system in Claude Code 2.1.147 before removing it from the changelog. The claimed technical significance is replacing an LLM-based orchestrator with a workflow.js code-driven controller: structured phases, parallel fan-out, conditionals/loops/budgets, retries, background execution, and reduced context-window “token tax” by passing sub-agent outputs between phases instead of through the main chat context. Image: https://i.redd.it/6tuq1a2i3p2h1.png. Commenters were skeptical that this is a fundamentally new multi-agent pattern, pointing to existing Claude Code agent teams. Others dismissed it as a low-priority feature compared with wanting a newer/better model such as “Opus 4.5.”
  - A commenter linked Anthropic’s existing Claude Code “agent teams” docs (https://code.claude.com/docs/en/agent-teams), noting that the described /workflows pattern (“one main agent (an LLM) decides what sub-agents to spawn, holds every intermediate result, and plans the next step”) overlaps with already documented multi-agent orchestration concepts.
  - The reported /workflows feature appears to have been transient: one commenter says it was visible in the changelog earlier but Anthropic has since taken it down, providing a screenshot mirror of the removed changelog entry (https://preview.redd.it/720w663mcp2h1.png?width=2056&format=png&auto=webp&s=d7afca73806dd159eff3141db0f61de5a37526a8).
  - One user compared the feature to their own custom orchestration stack built around skills + YAML + a JavaScript CLI, implying /workflows may formalize a pattern developers are already implementing manually for repeatable Claude Code task pipelines.
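The code-driven controller idea described in the /workflows post (structured phases, parallel fan-out, retries, and passing only phase outputs forward) can be sketched generically. This is not Anthropic's workflow.js format, which the post does not show. It is a Python illustration of the pattern, and `analyze` and `summarize` are hypothetical stand-ins for sub-agent calls.

```python
import asyncio


async def with_retries(task, arg, attempts=3):
    """Run an async task, retrying on exception up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return await task(arg)
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error


async def run_workflow(items, analyze, summarize):
    """Phase 1 fans out over items in parallel; phase 2 reduces results."""
    # Fan-out: one sub-agent call per item, all running concurrently.
    partials = await asyncio.gather(
        *(with_retries(analyze, item) for item in items)
    )
    # Only the phase outputs move forward, not each sub-agent's transcript.
    return await summarize(partials)


if __name__ == "__main__":
    async def analyze(n):      # stand-in for a sub-agent call
        return n * 2

    async def summarize(parts):  # stand-in for a reducing sub-agent
        return sum(parts)

    print(asyncio.run(run_workflow([1, 2, 3], analyze, summarize)))
```

The control flow here lives in ordinary code rather than in an LLM's context, which is the source of the claimed "token tax" reduction: the orchestrator never has to read intermediate transcripts to decide what runs next.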
- Anthropic officially launched 13+ FREE AI courses with certificates (Including Agentic AI and Claude Code!) (Activity: 2547): Anthropic is offering a free official training catalog via its Skilljar-based academy, reachable from Anthropic Learn, with certificates for courses covering Claude, Claude Code, Claude API, MCP / agentic workflows, and deployment tracks for Amazon Bedrock and Google Cloud Vertex AI. The technically notable content called out is the MCP material, including advanced topics around STDIO and StreamableHTTP transports, plus Claude Code modules for codebase editing, test execution, and “Plan Mode.” A separate free CodeSignal track, “Developing Claude Agents,” is mentioned for interactive Python/TypeScript labs and certificates. Commenters confirm the Skilljar courses are legitimate because they are linked from Anthropic’s official site, and one user who completed 10/15 courses specifically recommends the MCP and advanced MCP modules as “worth the squeeze.”
  - Several commenters confirmed the Skilljar courses are legitimate Anthropic training materials, noting the course portal is linked from anthropic.com/learn rather than being a third-party scam or repost.
  - One user who completed 10/15 courses specifically highlighted the MCP and MCP Advanced Topics modules as worthwhile, citing practical coverage of STDIO and StreamableHTTP transport protocols for Model Context Protocol integrations.
  - A few users noted the catalog is not newly launched and has been available for months; one commenter who completed two courses described them as “quite basic”, suggesting the material may be more introductory than advanced for experienced AI developers.
2. Z-Image 6B, Gemini 3.5 Flash and OpenAI Math Updates
- Tencent released Z-Image 6B with pixel space gen. No VAE & 1k Resolution. (Activity: 899): The image is a sample collage for Tencent/Z-Image 6B / L2P, illustrating 1024px-class pixel-space image generation across portraits, animals, fantasy scenes, vehicles, and stylized compositions, with the key technical claim being generation without a VAE. The post links the project page at nju-pcalab.github.io/projects/L2P and a commenter points to model files on Hugging Face: zhen-nan/L2P. Commenters mainly focused on the architectural trend (“Everyone going for No-VAE now huh”) and questioned practical quality with “Is it any good?” rather than providing benchmarks or detailed evaluations.
  - A commenter points to the model files on Hugging Face: zhen-nan/L2P at https://huggingface.co/zhen-nan/L2P/tree/main, relevant for readers wanting to inspect/download Tencent’s Z-Image 6B release and its claimed pixel-space generation / no-VAE setup.
  - Several comments highlight the broader technical trend toward No-VAE / pixel-space image generation, with one user noting “Everyone going for No-VAE now huh”. This is notable because avoiding a VAE changes the compression/latent bottleneck tradeoff and may affect reconstruction fidelity, memory cost, and native high-resolution generation such as the post’s claimed 1k resolution.
  - One commenter raises a comparison to Lodestone, asking whether Tencent’s approach learned from Lodestone’s no/low-latent direction or whether Lodestone could learn from Z-Image. The thread does not provide benchmark data, but the technical comparison suggests interest in converging open-weight architectures for direct pixel-space diffusion/flow generation.
- Google’s latest creation: Gemini 3.5 Flash vs all (Activity: 1503): The post reports a simple arithmetic failure in Google Gemini 3.5 Flash via the Gemini app: for the prompt 300+140=460 / “Is this correct? Breakdown?”, the shared Gemini run allegedly accepts the incorrect sum, while comparison runs were linked for Claude, Grok, and ChatGPT. Commenters reproduced the issue and attributed it to Gemini app inference settings: “Standard”/default thinking behaves like minimum or no reasoning, while Extended thinking or AI Studio with higher thinking settings reportedly returns the correct 300 + 140 = 440. The main debate is that this is less evidence about the base model’s capability and more about product-level serving configuration: commenters argue the Gemini app is “nerfed” relative to AI Studio, especially under default/minimum thinking settings. The OP frames the result as embarrassing given claimed SOTA/finance-agent rankings, while others suggest benchmark performance may not reflect low-effort app defaults.
  - Users reported that the apparent failure depends heavily on Gemini’s thinking level: switching to Extended thinking fixes the answer, while Standard was characterized as effectively “doesn’t think at all.” Another commenter reproduced the same output via a screenshot (preview image) and claimed the Gemini app defaults to something like minimum thinking, whereas AI Studio with even Low thinking avoids the mistake.
  - A technical comparison was raised around tool-calling behavior: one commenter argued Gemini’s weakness is not necessarily raw reasoning but tool-routing logic, noting that ChatGPT would likely delegate the task to Python rather than solve it purely in-model. This implies benchmark results may depend on whether the model is allowed to invoke tools and how reliably it decides to use them.
- Math grad student friend says we’re cooked (Activity: 825): The image is a tweet screenshot relaying a math grad student’s alarmed reaction to a claimed recent Erdős proof, framed by the post title “Math grad student friend says we’re cooked.” It does not provide technical details of the proof, theorem statement, model, benchmark, or verification process; its significance is contextual/social: a mathematician characterizes the result as previously “completely unapproachable” and says OpenAI’s announcement was “exceedingly tacky and in bad taste.” Comment discussion is mostly non-technical and meme-driven, pivoting to jokes about “OnlyFans but for nerds.” One commenter questions what “exceedingly tacky and in bad taste” means, but there is no substantive debate about the mathematics or AI capability claim.
  - A commenter argues that the perceived safety of “creative and intellectual” work has weakened as AI systems have begun to show capability in mathematics, theorem proving, and research-level reasoning. The technical takeaway is that automation risk may not correlate cleanly with whether a task is repetitive; instead, advanced reasoning benchmarks and formal proof systems are increasingly relevant to assessing AI impact.
AI Discords
Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.