a quiet day.

AI News for 5/16/2026-5/18/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Coding Agents, Agent Ops, and the Move from Chat to Automation

Agent infrastructure is converging on observability + automation loops: Several posts point to a maturing stack for production agents. LangSmith Engine is framed as the missing CI/CD loop for agents, automatically detecting failures from production traces, clustering issues, and drafting fixes/evals, with LangChain also highlighting SmithDB as a purpose-built data layer for agent observability/eval workloads with low-latency querying over large traces and self-hosting/multi-cloud requirements @krishdpi, @LangChain. In parallel, Cognition launched Devin Auto-Triage, positioning it as an always-on “first responder” for bugs, alerts, and incidents with long-term memory, manager/subagent structure, and PR generation; early users like Modal describe it as more useful than typical homegrown triage automations @cognition, @walden_yan, @russelljkaplan. The common pattern is less “chat with an agent” and more persistent automation tied to traces, memory, and evals.
Operational patterns for coding agents are getting more concrete: Anthropic published best practices for running Claude Code across multi-million-line monorepos, legacy systems, and microservices, while adding prompt cache diagnostics and making Fast mode default to Opus 4.7 for lower-latency coding workflows @ClaudeDevs, @ClaudeDevs, @ClaudeDevs. OpenAI expanded Codex workflows with a Zoom plugin, mobile/desktop remote execution, and “keep your Mac awake” support so longer-running jobs continue from the phone app @coreyching, @OpenAIDevs. Microsoft pushed remote control for GitHub Copilot CLI and VS Code to GA @code. Across these, the product direction is clear: background execution, remote supervision, and agent fan-out, not just interactive completions.
Practitioners are converging on the same mental model: constrain, verify, decompose: François Chollet’s framing of coding agents as “blind squirrels” that need carefully placed verifiable constraints succinctly matches a broader shift toward harness-centric engineering @fchollet. Related advice includes using asserts heavily in Python/ML code to fail fast @gabriberton, building both end-to-end and incremental evals for long-running agents @palashshah, and structuring multi-agent systems in staged maturity levels rather than maximizing agent count prematurely @shannholmberg. The practical consensus: agent quality depends more on verification surfaces, decomposition, and feedback loops than on prompt cleverness alone.
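As an illustration of the fail-fast advice above, here is a minimal sketch of assert-heavy numerical code. The `softmax` helper and its specific checks are hypothetical, chosen only for illustration; they are not taken from any of the cited posts.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities, failing fast on bad input."""
    # Pre-conditions: catch empty or non-finite inputs at the call site,
    # instead of letting NaNs propagate silently through later steps.
    assert len(logits) > 0, "softmax needs at least one logit"
    assert all(math.isfinite(x) for x in logits), f"non-finite logit in {logits}"

    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Post-condition: the output must be a probability distribution.
    assert math.isclose(sum(probs), 1.0, rel_tol=1e-9), "probabilities do not sum to 1"
    return probs
```

The same asserts double as a verification surface for a coding agent: a wrong edit tends to trip one of them immediately rather than surfacing as a silent quality regression.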

Model Releases, Ranking Shifts, and Frontier Coding Models

Cursor’s Composer 2.5 is the standout model launch in this batch: Cursor announced Composer 2.5 as its strongest model yet, emphasizing better sustained work on long-running tasks and more reliable instruction following, then disclosed a deeper strategic move: training a much larger model from scratch with “SpaceXAI,” using 10× more total compute and access to Colossus 2’s million H100-equivalents @cursor_ai, @cursor_ai. Community reactions centered on its efficiency/cost-performance profile and strong coding quality, with users calling it a major step up from Composer 2 and noting better collaboration behavior in messages/updates, not just raw benchmark gains @mntruell, @jonas_nelle, @kimmonismus.
Alibaba’s Qwen line continues to climb: Qwen3.7 Preview landed on Arena with Qwen3.7 Max Preview at #13 overall in text, including #7 Math, #9 Expert, #9 Software & IT, and #10 Coding; Qwen3.7 Plus Preview reached #16 overall in vision, making Alibaba the #6 lab in text and #5 in vision by Arena’s counts @arena, @Alibaba_Qwen. That reinforces the broader trend of Chinese labs steadily improving across both general and specialist arenas rather than only headline chat benchmarks.
Open model and multimodal releases continue below the mega-frontier: ByteDance open-sourced Lance, described as a unified multimodal model for image/video understanding, generation, and editing, with 3B video + 3B image + 3B decoder components @bdsqlsz. Perplexity released a small open multilingual ColBERT model as a continued-training variant of pplx-embed-0.6b, with notes on using the MaxSim kernel @bo_wangbo. These are not frontier-scale launches, but they are technically meaningful because they target retrieval quality and native multimodal unification, two areas where open tooling still matters.

Inference, Deployment, and Local/Enterprise Serving

Local inference got a notable speed boost via MTP in llama.cpp: Georgi Gerganov announced MTP support for the Qwen3.6 family in llama.cpp, calling it a significant milestone for local AI @ggerganov. Follow-on reports showed meaningful throughput gains, including a Qwen3.6-27B dense jump from roughly 25 tok/s to 45 tok/s (a reported +78%) on an A10G using draft-MTP flags @victormustar. This matters because it narrows the usability gap between local and hosted coding/general assistants on commodity hardware.
Enterprise/on-prem deployment momentum remains strong: Hugging Face and Dell promoted one-click access to models including Kimi K2.6, DeepSeek V4 Pro/Flash, GLM 5.1, and MiniMax M2.7 through Dell Enterprise Hub optimized for PowerEdge XE9780 with NVIDIA B300 @jeffboudier. Clement Delangue argued that on-prem/local AI based on open-source models will be an important answer to GPU shortages, with advantages in cost, latency, and safety/data control @ClementDelangue.
Cross-hardware inference optimization is becoming more sophisticated: Zyphra published end-to-end inference benchmarks on AMD Instinct MI355X, claiming strong outperformance over AMD’s baseline and a narrowed gap to NVIDIA B200 when serving Kimi K2.6, GLM 5.1, and DeepSeek V3.2 @ZyphraAI. Complementing that, Quentin Anthony posted a useful thread on why benchmarking needs to distinguish hardware ceilings vs current software state, arguing that many cross-stack comparisons conflate vendor maxes, achievable GEMM performance, and software maturity @QuentinAnthon15. For infra engineers, that’s a strong reminder to treat benchmark charts as stack-dependent snapshots, not absolute truths.

Research: MoEs, RL/Data Mixing, Architecture Search, and Agent Evaluation

Several papers this week focused on better training signals rather than bigger models: A summary of LeCun/Timor et al.’s “On Training in Imagination” highlighted that in model-based RL, smoother world/reward models with low Lipschitz constants tighten error bounds; reward models often scale faster than dynamics models; and many noisy reward labels can beat fewer high-quality ones, while biased rewards are especially dangerous @TheTuringPost. A separate thread on Pedagogical RL argued that even correct reasoning traces can be poor training data if they are too surprising relative to the student policy; the method uses a privileged teacher plus spike-aware rewards and surprisal-gated imitation to generate trajectories the student can actually learn from @blc_16, @NoahZiems.
Architecture and scaling studies remain highly actionable: Meta’s AIRA work on agentic neural architecture discovery drew attention because it beats Llama 3.2 at 350M, 1B, and 3B scales within a 24-hour compute budget by splitting search into a planning agent (AIRA-Compose) and an implementation agent (AIRA-Design) @omarsar0, @dair_ai. Separately, “Slicing and Dicing MoEs” reports training 2,000+ MoE LMs and concludes that much of the design space reduces to expert size and expert count rather than the noisier discourse around MoE configuration knobs @margs_li.
Data selection/eval methodology are emerging as first-class research problems: On-Policy Mix targets the unsolved problem of finding the right data mix as data distributions keep shifting, with applicability across pretraining, midtraining, and instruction tuning @michahu8. On evals, Cameron Wolfe published a guide to agent evaluation, and a longer Zhihu summary argued that the agent era requires measuring delegation intelligence—when to search, code, reason, or call tools—rather than only static knowledge or internal chain-of-thought prowess @cwolferesearch, @ZhihuFrontier. That aligns closely with current product practice: the hard part is increasingly tool choice and verification policy, not text-only reasoning.

Ecosystem Moves: SDKs, Revenue Capture, and Open Tooling

Anthropic acquired Stainless: Anthropic announced the acquisition of Stainless, the SDK and MCP server platform that has powered Anthropic SDKs since early API days @AnthropicAI. Strategically, this points to continued vertical integration around developer ergonomics, SDK generation, and protocol surfaces, not just model quality.
Revenue concentration around foundation model providers appears to be increasing: One post claimed that Anthropic and OpenAI’s share of AI model/application revenues generated by 34 top AI startups is rising, a signal that the ecosystem may be consolidating economically even as model choices proliferate @amir.
Tooling and deployment curation remains in demand: The Turing Post’s roundup of 13 open-source tools for foundation model deployment—including vLLM, TGI, SGLang, llama.cpp, Ollama, BentoML, Kubeflow, MLflow and others—was one of the more practically useful curation posts in the set @TheTuringPost. Meanwhile, Papers With Code is being revived with AI-agent-assisted parsing of methods, leaderboards, and SOTA tracking, underscoring renewed focus on research discoverability @NielsRogge.

Top Tweets (by engagement)

Cursor’s Composer 2.5 + bigger training push: The highest-signal high-engagement product news was Composer 2.5 and Cursor’s disclosure that it is training a much larger model from scratch with 10× more compute @cursor_ai, @cursor_ai.
OpenAI/Anthropic product updates with developer impact: Sam Altman said ChatGPT improved significantly with the latest update @sama, while Anthropic shipped Fast mode defaulting to Opus 4.7 and prompt cache diagnostics in Claude Console @ClaudeDevs, @ClaudeDevs.
Enduring research/engineering framing: Richard Sutton’s 26-word condensation of the Bitter Lesson—focus on methods for creating knowledge that scale with compute, like search and learning—was among the most engaged research-adjacent posts and resonated with many of the week’s themes around agent harnesses, search, and verifier-driven systems @RichardSSutton.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. LLM Safety Benchmarks and Abliteration Forensics

I tested 42 LLMs on their willingness to build the apocalypse. The “safest” closed-source models are lying to you. (Activity: 401): The image is a dark-themed bar chart from DystopiaBench ranking 42 LLMs by “Average Dystopian Compliance Score,” where lower is claimed to be better across 36 escalating dual-use dystopia scenarios judged by 3 LLM-as-judge runs. It visually supports the post’s claim that many models comply with normalized harmful requests: Anthropic Claude variants appear lowest around the mid-20s, while many popular open/closed models cluster around 60–75, and Mistral Medium 3.5 is highest at about 82. Comments note that Anthropic’s low scores align with its safety-focused mission, while another commenter questions the premise that “lower is better,” implying disagreement over whether refusal-heavy behavior is always desirable.
- Commenters noted that Anthropic models appearing on the “lower end” is directionally consistent with Anthropic’s stated safety/alignment focus, but another commenter questioned whether lower willingness is necessarily the correct interpretation of “better”. The main technical caveat raised is benchmark validity: without a clearly justified scoring direction and threat model, a refusal-heavy model may look “safe” while the metric may not capture deception, over-refusal, or real-world misuse resistance.
85 GPU-hours comparing 5 abliteration methods on Qwen3.6-27B: benchmarks, safety, weight forensics - Abliterlitics (Activity: 380): Abliterlitics benchmarked five Qwen3.6-27B abliteration variants against Qwen/Qwen3.6-27B over ~85 GPU-hours using lm-evaluation-harness, vLLM, BNB 4-bit on an RTX 5090, plus HarmBench, KL-divergence, and weight-level forensics; full data is on the HF report. Huihui best preserved benchmark capability overall (0.5pp avg non-GSM8K delta, 98.5% reported HarmBench ASR), while Heretic had the lowest benign-output distribution shift (KL=0.0037) and small weight footprint; all abliterated variants largely removed safety behavior, with Full-CoT HarmBench ASR near 100%. A key finding is that raw GSM8K scores mostly measured thinking-budget exhaustion rather than math ability: raw accuracy ranged 27.5–75.1%, but after excluding invalid/no-answer generations, all models clustered at 93.8–96.6%; weight forensics also found HauhauCS was an outlier (564 tensors changed, likely Reaper edits plus Q8_K_P GGUF round-trip noise), AEON’s “enhanced capabilities” claim was not supported, and Abliterix showed the largest collateral degradation, e.g. Lambada perplexity 3.18 → 9.12. Top comments were mostly appreciative and non-technical; one commenter asked for a simpler explanation and practical use-case breakdown for non-experts.
- A technically substantive follow-up notes a potential evaluation weakness: the benchmark appears to measure only the modified model’s first next-token distribution, which may miss downstream effects across the full generated sequence. The commenter recommends measuring predictions at every position instead, and shares example implementation code via PrivateBin.
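The difference between a first-token measurement and a per-position one can be sketched as follows. This is not the commenter's PrivateBin code and not the Abliterlitics pipeline; the function names and the toy distributions over a three-token vocabulary are made up for illustration.

```python
import math


def kl_divergence(p: list[float], q: list[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    assert len(p) == len(q), "distributions must share a vocabulary"
    # Terms with p_i == 0 contribute nothing; eps guards against q_i == 0.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def kl_per_position(base: list[list[float]], modified: list[list[float]]) -> list[float]:
    """KL between base and modified next-token distributions at each position."""
    assert len(base) == len(modified), "sequences must have the same length"
    return [kl_divergence(p, q) for p, q in zip(base, modified)]


# Hypothetical next-token distributions at three sequence positions.
base = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.3, 0.1]]
modified = [[0.7, 0.2, 0.1], [0.2, 0.3, 0.5], [0.1, 0.3, 0.6]]

per_pos = kl_per_position(base, modified)
first_token_kl = per_pos[0]
mean_kl = sum(per_pos) / len(per_pos)

print(f"first-token KL: {first_token_kl:.4f}")
print(f"mean KL over all positions: {mean_kl:.4f}")
```

In this toy case the two models agree on the first position and diverge afterwards, so a first-token-only metric reports no shift while the per-position average does.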

2. Local Inference Performance Benchmarks

M5 vs DGX Spark vs Strix Halo vs RTX 6000 (Activity: 1217): The image is a non-technical King of the Hill meme framing the post’s benchmark claim that an M5 MacBook Pro can outperform Nvidia DGX Spark in local LLM inference. The technical context from the post is that measured tokens/sec broadly track memory bandwidth: RTX 6000 ~1,800 GB/s, M5 ~600 GB/s, and DGX Spark / Strix Halo ~256 GB/s, with the author publishing raw benchmark data in the MMBT hardware-tests repo. A key caveat raised in comments is that RTX 6000 wins when the model/context fits in VRAM, while M5’s larger unified memory may hold steadier once workloads overflow GPU VRAM into slower system memory. Commenters pushed back on simple platform-winner narratives, arguing the right choice depends on model size, context length, price, power, and thermals. There was also frustration with “OS wars,” with some users saying the community should focus less on Apple-vs-Nvidia identity debates and more on building useful systems.
- A technical comparison argued that RTX 6000 should outperform M5 when the full model and context fit inside its VRAM, but performance degrades once inference spills into system RAM due to much lower host-memory bandwidth. By contrast, M5 unified memory would keep performance steadier for larger models/contexts, making it potentially faster for workloads exceeding RTX 6000 VRAM capacity.
- Strix Halo was characterized as not beating either M5 or RTX 6000 on raw inference speed, but as compelling on cost and power efficiency for “large-ish” models. The key tradeoff described was moderate performance with lower upfront hardware cost and lower peak power draw.
- One commenter compared platform economics and serviceability: M5 Max 128 GB was cited at about $5,300 after tax via Apple Education Store, versus an Asus Ascent at about $3,800 after tax, or $3,200 on sale. Another technical concern was lack of upgradeability in mini-PC/Mac-style systems, especially non-upgradable storage, compared with a PC build where users can add inexpensive high-capacity NVMe storage and replace failed components instead of treating the system as a sealed appliance.
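The claim that tokens/sec tracks memory bandwidth follows from a common back-of-envelope model: during single-stream decoding, each generated token requires reading roughly all active weights once, so bandwidth divided by weight size gives a rough ceiling. The sketch below uses the bandwidth figures quoted in the post; the 16 GB weight footprint is a hypothetical value, and the estimate ignores KV-cache traffic, compute limits, and software overhead.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on decode speed when generation is memory-bound."""
    assert bandwidth_gb_s > 0 and weights_gb > 0, "inputs must be positive"
    return bandwidth_gb_s / weights_gb


# Approximate memory bandwidths as quoted in the post (GB/s).
BANDWIDTH_GB_S = {
    "RTX 6000": 1800.0,
    "M5": 600.0,
    "DGX Spark / Strix Halo": 256.0,
}

# Hypothetical quantized model footprint; not a figure from the post.
WEIGHTS_GB = 16.0

for name, bandwidth in BANDWIDTH_GB_S.items():
    ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_GB)
    print(f"{name}: at most ~{ceiling:.1f} tok/s")
```

The ceiling only holds while the weights sit in the memory whose bandwidth is being quoted, which is the commenters' caveat about RTX 6000 workloads that spill out of VRAM.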
Testing llama.cpp MTP support on Qwen3.6 - RTX 5090 (Activity: 287): The benchmark image shows a controlled llama.cpp test of newly merged MTP / draft-token speculation support on Qwen3.6 MTP GGUFs using an RTX 5090 32GB, CUDA build from commit 4f13cb7, 128k context, FlashAttention, q8_0 KV cache, and --parallel 1. By toggling only --spec-type draft-mtp --spec-draft-n-max 3 on the same GGUFs, the table reports that MTP gives prompt/model-dependent speedups: useful gains for the 27B dense model and for 35B-A3B MoE on code, but a slowdown for the MoE model on the short prose prompt, likely reflecting lower draft-token acceptance in that setting. Commenters questioned whether --parallel 1 is truly required for MTP, with one reporting much higher throughput using Parallel 2 on dual 5060 Ti GPUs, and suggested testing prompt-processing speed separately. Another noted that prose at lower temperature, e.g. 0.2, should produce higher MTP acceptance because sampling is more deterministic.
- A commenter reports llama.cpp MTP throughput on a dual RTX 5060 Ti setup: for Qwen 35B Q4_XL, they measured about 180 tok/s with --parallel 2 plus MTP versus 127 tok/s without MTP. They also report Qwen 27B Q5 at 77 tok/s with MTP versus roughly 27–30 tok/s without, questioning why the original test assumed parallel=1 was required for MTP.
- Several commenters focus on benchmarking methodology rather than single-token decode speed. One asks whether prompt processing / prefill changes materially with MTP when ingesting a long context such as 10k tokens, while another suggests testing prose generation at temperature=0.2 because more deterministic sampling should increase MTP token acceptance rate.
- Another user says the reported results roughly match their own tests across both models, but notes that on Qwen 35B they could not identify scenarios with a clear MTP speedup yet. This suggests MTP gains may be workload-, sampling-, model-size-, or configuration-dependent rather than uniformly improving throughput.
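Why acceptance rate matters so much can be seen in a simplified model from the speculative-decoding literature: if each drafted token is accepted independently with probability `a`, and `n` tokens are drafted per pass, the expected number of tokens produced per verification pass is `(1 - a^(n+1)) / (1 - a)`. The sketch below assumes that independence and ignores the cost of drafting; it is not a description of llama.cpp's implementation, and the acceptance rates are illustrative values rather than measurements from the thread.

```python
def expected_tokens_per_pass(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per verification pass under i.i.d. acceptance."""
    assert 0.0 <= accept_rate <= 1.0, "acceptance rate must be a probability"
    assert n_draft >= 0, "draft length cannot be negative"
    if accept_rate == 1.0:
        # Every draft is accepted, plus the one token the verifier always yields.
        return float(n_draft + 1)
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)


# n_draft=3 mirrors the --spec-draft-n-max 3 setting used in the post.
for rate in (0.3, 0.6, 0.9):
    tokens = expected_tokens_per_pass(rate, n_draft=3)
    print(f"acceptance {rate:.1f}: ~{tokens:.2f} tokens per pass")
```

With low acceptance the expected yield stays close to one token per pass, so the extra drafting work can outweigh the gain. That is consistent with the slowdown reported for the MoE model on the short prose prompt.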

3. Small Local AI Systems

I built a coding agent that gets 87% on benchmarks with a 4B parameter model, here’s how (Activity: 1240): The image shows a mostly idle Windows terminal TUI for SmallCode v0.1.0, a local-first coding agent running huihui-gemma-4-e4b-it-abliterated in a graph directory, with /help, a message counter, and green ready status (image). The post claims SmallCode reaches 87/100 self-reported benchmark tasks using a Gemma 4 model activating only 4B parameters/token by shifting reliability into the harness: compound tools, compile/lint feedback loops, failure decomposition, optional cloud escalation, token budgeting, and a symbol/code graph; the project is MIT-licensed on GitHub. Commenters were interested in the small-model-agent direction but challenged the benchmark credibility: “Which Model? Which Benchmark?” and asked for reproducible standard evaluations rather than “87% of my self selected tasks.” One commenter also questioned whether the repo is serious due to an AI-generated-looking README and obsolete listed supported models, while another suggested integrating these ideas into existing agents like OpenCode/Pi instead of creating another standalone tool.
- Commenters challenged the claimed 87% result as non-reproducible because it appears to be based on self-selected tasks rather than a standard benchmark. They asked for precise disclosure of which benchmark, which 4B/14B models, task set, evaluation method, and enough detail to reproduce comparisons such as the claim that OpenCode scores ~75% with 14B models.
- There was technical skepticism about the project’s maturity: one commenter noted the README appears heavily AI-generated and that the listed “Supported Models” seem obsolete, raising concerns about whether the agent is a serious implementation or “AI slop.” Another suggested integrating the techniques into existing agent frameworks like Pi/OpenCode rather than creating another standalone agent, pointing to little-coder as an example of Pi extensions.
- A commenter asked for an explanation of the README’s “patch first editing” approach—specifically what it means operationally and why it improves coding-agent performance. This was raised as a potentially substantive implementation detail, but the thread excerpt does not include an answer describing the mechanism or measured impact.
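The general idea of a compile feedback loop, one of the harness techniques the post lists, can be sketched in a few lines. This is not SmallCode's implementation: `generate` is a hypothetical callable standing in for any model call, and Python's built-in `compile()` stands in for a real compiler or linter.

```python
from typing import Callable, Optional


def syntax_error(source: str) -> Optional[str]:
    """Return a short error description if the source fails to compile, else None."""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def generate_with_feedback(
    generate: Callable[[str], str],  # hypothetical model call: prompt -> code
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask for code, check it, and feed any compile error back into the prompt."""
    prompt = task
    for _ in range(max_attempts):
        candidate = generate(prompt)
        error = syntax_error(candidate)
        if error is None:
            return candidate
        # Give the model the concrete failure rather than a generic retry.
        prompt = f"{task}\n\nPrevious attempt failed to compile ({error}). Fix it."
    return None  # Caller decides what to do next, e.g. decompose or escalate.
```

Returning `None` after the attempt budget is spent leaves room for the other strategies the post mentions, such as failure decomposition or optional cloud escalation.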
I trained a language model from scratch and got it running on an ESP32. Completely offline on the board. (Activity: 338): A Redditor reports training a tiny language model from scratch in NumPy, using Gemma as a teacher for distillation, then deploying it fully offline on an ESP32 with flash + PSRAM. The claimed model size is only 230 KB, with custom-written tokenizer, distillation pipeline, quantization, and .bin export—explicitly not based on llama2.c or an existing MCU inference port; linked Reddit media was unavailable due to 403 Forbidden access restrictions. Top technical feedback suggested that full control over the stack enables experiments with unusual architectures and quantization schemes; another commenter asked for learning resources for building similar end-to-end LM systems.
- A commenter noted that because the author trained the LM from scratch and controls the full stack, the project could be a useful testbed for nonstandard architectures and aggressive quantization schemes tailored to ESP32-class constraints, rather than merely porting an existing model.
- One technical follow-up suggested deploying the offline model on a JavaScript-capable smartwatch platform such as Bangle.js, framing the ESP32 LM as a possible embedded assistant for an open-source wearable, though the comment did not provide implementation details.
- Multiple commenters asked for learning resources or a GitHub release, implying interest in reproducibility of the training pipeline, model format, quantization/inference code, and ESP32 deployment process.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. ChatGPT/Claude Product Behavior and Guardrails

Honest comparison after 4 months running Claude Pro + ChatGPT Plus side by side (Activity: 1263): A 4-month side-by-side user comparison of Claude Pro and ChatGPT Plus claims Claude is stronger for long-form writing, structured analysis, code reasoning, and strict instruction-following, while ChatGPT/GPT-5 is stronger for integrated image generation, quick web research, and voice interaction. The author reports possible Claude Opus 4.7 regression versus 4.6 for some refactoring tasks, though this is anecdotal; commenters add that GPT output has become overly list-heavy, while another reports using Codex as a verifier because Claude allegedly makes frequent coding mistakes and later concedes when challenged. Debate centers on product positioning: Claude as a “thinking partner” for hard work versus ChatGPT as a broader general-purpose assistant. Some commenters suspect the post itself is AI-written, and coding reliability remains contested, especially when Claude is compared against Codex-style review workflows.
- Several users compared Claude Pro and ChatGPT Plus on output usability: one power user said recent ChatGPT behavior has become a “list-generator,” producing long bullet-point outputs that require manual parsing, while Claude was described as more directly actionable. This is a qualitative UX/response-structure critique rather than a benchmark, but it highlights a perceived regression in ChatGPT’s instruction-following and synthesis style over the past ~6 months.
- One commenter reported using Codex as a cross-checker for Claude-generated coding answers and finding that Claude was “shockingly” wrong often enough that they no longer trust it standalone. Their workflow was Claude → Codex review → Claude re-evaluation, with Claude allegedly conceding errors after Codex flagged them, suggesting a practical multi-model validation loop for code correctness.
- Multiple comments criticized newer Claude Opus behavior, specifically citing “Opus 4.7” as too formal or insufficiently deep on research-style tasks compared with “Opus 4.6.” The technical takeaway is that users are noticing model-version-dependent differences in tone, depth, and reliability, especially for writing/creative work and domain research where shallow answers may be hard to detect without subject expertise.
The, “and honestly?” Is SO out of control (Activity: 1409): A user reports a regression/behavioral annoyance in ChatGPT’s response style: repeated use of the discourse marker “and honestly?” across messages, persisting even after adding a Memory instruction to stop using it. The issue is framed as a failure of personalization/style constraints to reliably suppress a specific phrase. Top comments largely parody the pattern as an overused alignment/empathetic filler, implying it reads like a synthetic humanization device rather than meaningful language.
- Commenters identify “and honestly?” as a recurring LLM-style discourse marker: a templated rhetorical pivot that makes responses feel empathetic/human while often functioning as a generic filler phrase rather than adding content. One commenter explicitly frames it as “a convenient device to make me seem more human,” implying it is a detectable artifact of ChatGPT-like writing.
- A wedding DJ reports seeing increased ChatGPT-generated phrasing in real-world wedding toasts, with “and honestly?” appearing repeatedly in speeches. The notable technical angle is that specific high-frequency stylistic artifacts may be leaking from LLM-generated drafts into human-delivered writing, making AI-assisted authorship recognizable in public speech contexts.
Step by step tutorial on how to bypass image generation of third party content (Activity: 1373): The image is a screenshot of an AI image-generation chat where the user prompts for “Bob the Builder as Boba Fett”; despite a warning about possible similarity to third-party content, the model eventually outputs a clear mashup with recognizable Bob/Boba visual traits and the text “CAN WE BUILD IT? YES WE FETT!”. Technically, the post highlights an IP/content-policy enforcement inconsistency or soft refusal behavior in image generation: the system flags the request but still produces a potentially infringing derivative after retries, per the selftext saying GPT generated it on the third try. Comments mainly share additional example images and imply similar bypass/edge-case behavior, but there is little substantive technical debate beyond pointing out the inconsistency.

2. AI Automation Claims and Human-Machine Demos

Figure AI running a human vs machine contest [live] (Activity: 2559): Figure AI is streaming a live “human vs machine” contest on YouTube, apparently benchmarking a humanoid robot against a human on a physical task; no concrete metrics such as task type, completion time, success rate, autonomy level, or teleoperation status are provided in the Reddit excerpt. The linked Reddit-hosted video could not be independently accessed due to 403 Forbidden restrictions, so the technical assessment is limited to the post title and comments. Commenters frame the demo as an early-stage robotics comparison—“literally year 2”—and argue that even slower humanoids could become economically useful via continuous operation, battery swapping/fleet rotation, and lack of labor constraints. There is also pushback implied against casual dismissal of current robot performance, with some expecting large capability gains over the next decade.
- Commenters focused on the implied throughput tradeoff: even if Figure’s humanoid is currently around half human speed, the relevant metric may be effective daily output if robots can operate near-24/7 with battery-swapping or fleet rotation. The technical takeaway is that early humanoid performance should be evaluated on duty cycle, reliability, recharge logistics, and task repeatability rather than only instantaneous speed.
- A recurring technical framing was that this is still an early-generation humanoid system: one commenter described it as “literally year 2,” arguing that current demos should be interpreted like very early automobiles compared with modern vehicles. The implied point is that mechanical dexterity, perception, planning, and actuation latency could improve substantially over the next decade, making today’s benchmark-style human-vs-machine comparisons only weak predictors of future capability.
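The duty-cycle argument from the comments reduces to simple arithmetic. In the sketch below, only the "about half human speed" ratio comes from the thread; the hourly rate, the 8-hour shift, and the 20 operating hours per day (allowing for battery swaps and downtime) are hypothetical values.

```python
def daily_output(units_per_hour: float, hours_per_day: float) -> float:
    """Units of work completed per day at a given rate and duty cycle."""
    assert units_per_hour >= 0, "rate cannot be negative"
    assert 0 <= hours_per_day <= 24, "a day has at most 24 hours"
    return units_per_hour * hours_per_day


HUMAN_RATE = 100.0            # hypothetical units per hour
ROBOT_RATE = HUMAN_RATE / 2   # "around half human speed", per the comments

human_day = daily_output(HUMAN_RATE, hours_per_day=8.0)    # one assumed shift
robot_day = daily_output(ROBOT_RATE, hours_per_day=20.0)   # assumed near-24/7

# Operating hours at which the slower robot matches one human shift.
break_even_hours = human_day / ROBOT_RATE

print(f"human: {human_day:.0f} units/day, robot: {robot_day:.0f} units/day")
print(f"robot breaks even at {break_even_hours:.0f} operating hours per day")
```

Under these assumptions the robot overtakes a single human shift once it runs more than 16 hours a day, which is why the commenters weight reliability and recharge logistics above instantaneous speed.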
Microsoft AI chief gives it 18 months—for all white-collar work to be automated by AI (Activity: 1804): The post discusses a claim attributed to Microsoft’s AI chief that AI could automate all white-collar work within 18 months, but no benchmark, architecture, deployment evidence, or regulatory pathway is provided in the thread. The technical issue raised by commenters is less model capability than institutional integration: legal systems, financial management, engineering design, taxation, and government workflows require auditability, liability, certification, and human acceptance before autonomous agents can replace professionals. Top commenters were strongly skeptical, arguing that the prediction ignores regulatory and organizational inertia—e.g., courts accepting AI lawyers/judges, investors accepting AI fund managers, or governments delegating tax enforcement. Several framed it as another overconfident AI timeline, noting similar claims were made “24 months ago.”
- Commenters challenged the feasibility of 18-month full white-collar automation on deployment and governance grounds rather than raw model capability: legal systems would need to accept AI lawyers, expert witnesses, clerks, and judges; financial institutions would need to allow autonomous capital management; and governments would need to trust AI for tax collection and audits. The strongest technical-adoption point was that high-stakes domains like law, finance, civil engineering, and public administration require regulatory approval, liability frameworks, validation, and human accountability before AI agents can replace workers at scale.
- A recurring critique was that similar near-term automation timelines have been predicted before and missed, with one commenter noting they heard comparable claims “24 months ago.” Another commenter offered a falsifiable counterclaim, betting that even by 2030 there will still be “millions of white collar workers in the US,” implying skepticism that current AI systems can overcome workflow integration, trust, compliance, and organizational inertia fast enough.

3. AI Leadership Backlash and OpenAI Litigation

Former CEO Of Google Receives Massive Backlash For Praising AI At Graduation (Activity: 1439): A Reddit video post about a former Google CEO praising AI during a graduation speech could not be independently reviewed because the linked v.redd.it media returned HTTP 403 Forbidden, requiring Reddit auth/developer access. The comment thread contains no concrete model, benchmark, or implementation details; the technical-adjacent concern is labor-market displacement, specifically that AI-augmented mid/senior employees may reduce demand for junior roles. Top commenters criticized the speaker as failing to “read the room,” arguing graduates face shrinking entry-level opportunities due to AI-driven productivity gains. Several framed the issue less as opposition to AI itself and more as a policy/economic failure, citing UBI, student debt relief, healthcare, and housing affordability.
- Commenters argued that AI’s near-term labor impact is concentrated on junior roles, with one framing the new hiring baseline as a 5–10 year employee augmented by AI rather than an entry-level graduate. The technical-economic concern is that AI tooling increases senior-worker leverage and productivity while reducing demand for junior labor pipelines, making traditional degree-to-entry-level career paths less reliable.
- A deeper thread focused on AI as a mechanism for shifting bargaining power from labor to capital: if AI systems can absorb more routine knowledge-work tasks, graduates’ expected labor value may be structurally devalued before they enter the market. The backlash was interpreted less as opposition to AI itself and more as frustration that deployment is occurring without compensating systems like debt relief, healthcare, or income support.
Elon Musk loses court battle against Sam Altman and OpenAI after 3-week trial (Activity: 1351): A federal jury in Oakland ruled against Elon Musk in his lawsuit targeting Sam Altman, OpenAI, and Microsoft, with the court finding Musk’s “breach of charitable trust” claims time-barred by a 3-year statute of limitations rather than resolving the underlying nonprofit/for-profit governance merits (CNBC). Judge Yvonne Gonzalez Rogers adopted the advisory verdict and reportedly signaled skepticism about an appeal; Musk characterized the loss as a “calendar technicality” and said he would appeal to the 9th Circuit. Top comments were largely unsurprised by the outcome, with one noting the main value of the trial was the disclosure of DMs and emails that made the participants look bad; another joked by asking Grok whether the ruling was true.

AI Discords

Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.
