a quiet day.
AI News for 5/4/2026-5/5/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!
AI Twitter Recap
OpenAI’s GPT-5.5 Instant, personalization rollout, and voice/agent infrastructure updates
- GPT-5.5 Instant becomes ChatGPT’s new default: OpenAI rolled out GPT-5.5 Instant to ChatGPT and the API as gpt-5.5-chat-latest, positioning it as a broad upgrade in factuality, baseline intelligence, image understanding, and tone. The launch also bundled stronger personalization: ChatGPT can now use saved memories, past chats, files, and connected Gmail, while exposing “memory sources” so users can see what context influenced a reply. See the main launch thread from @OpenAI, rollout details from @OpenAI, product commentary from @michpokrass, and reactions from @ericmitchellai and @sama.
- OpenAI also published more infra detail around real-time products: @OpenAIDevs shared a writeup on rebuilding the WebRTC stack for ChatGPT voice and the Realtime API using a thin relay plus a stateful transceiver to reduce latency and keep conversations at speech pace. This fits the broader signal around an imminent voice refresh, noted by @kimmonismus and @sama.
- Developer-side OpenAI agent tooling keeps expanding: @OpenAIDevs announced the Agents SDK for TypeScript, including sandbox agents and an open-source harness. Separately, OpenAI continued pushing Codex UX and automation, including task progress UI highlighted by @reach_vb and Auto Review for lower-friction approvals in @reach_vb. Community sentiment suggests 5.5 is especially strong for high-token-budget coding and non-coding workflows, per @sama and @sama.
Coding agents, harness design, and benchmark pressure
- Harness quality is becoming a first-class differentiator: A recurring theme across the day was that model quality alone no longer explains agent performance. @Vtrivedy10 argued the field is mixing incompatible assumptions about native post-trained harnesses, open harnesses, and “AGI-like” model generalization; the practical takeaway is that Model–Harness–Task fit matters more than abstract benchmark narratives. A complementary post from @Vtrivedy10 emphasized that talking to base or minimally wrapped models makes clear how much productized agents depend on instructions, tools, context packing, and measurement loops. @sydneyrunkle pointed to a LangChain post on the “anatomy” of long-running harnesses, while @masondrxy argued for ACP-style decoupling so teams can swap CLI/TUI/GUI/IDE frontends without changing the underlying harness.
- Agent coding UX is fragmenting, with real disagreement on winners: There were multiple anecdotal comparisons of agent shells and coding assistants. @0xSero ranked Droid above Pi, Amp, OpenCode, and Codex CLI. @teortaxesTex said Hermes currently beats deepseek-tui and OpenCode on success rate, speed, and cost, adding cache-hit details in a follow-up comparison. On the commercial side, @kimmonismus cited TickerTrends data claiming Codex surpassed Claude Code in downloads after late-April releases, while several developers reported that Claude Code utility feels relatively flat versus last fall, e.g. @TheEthanDing and @finbarrtimbers.
- New coding benchmark: ProgramBench shows how far off “whole-repo from scratch” still is: Meta researchers introduced ProgramBench, a 200-task benchmark asking models to generate substantial software artifacts like SQLite, FFmpeg, and a PHP compiler from an executable spec, without starter code or internet access. @jyangballin presented it as an end-to-end repo generation test; @OfirPress summarized the headline result bluntly: top accuracy is 0%. Discussion quickly focused on whether the headline metric is too harsh: @scaling01 noted models can still pass >50% of tests per task on average, while @OfirPress defended the all-tests criterion as necessary because partial implementations can game average-pass metrics (a toy illustration of the two metrics follows this list).
- Practical coding automation keeps moving into CI/security: @cursor_ai launched agents that monitor GitHub and automatically fix CI failures. @cognition introduced Devin for Security, including claims of automated vuln remediation at enterprise scale and an example where Devin Review flagged a malicious axios release before public disclosure in @cognition.
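The toy illustration below shows how ProgramBench’s two scoring rules diverge; the numbers are invented for illustration, not ProgramBench results. A model that stubs out the easy paths can pass half the tests on every task while fully completing none:

```python
# Toy illustration: average test pass rate vs. all-tests-per-task accuracy.
# Per-test outcomes below are made up, not actual ProgramBench data.
tasks = {
    "sqlite_clone": [True, True, True, False, False, False],  # 50% of tests
    "php_compiler": [True, False, True, True, False, False],  # 50% of tests
}
avg_pass = sum(sum(t) / len(t) for t in tasks.values()) / len(tasks)
all_pass = sum(all(t) for t in tasks.values()) / len(tasks)
print(f"average test pass rate: {avg_pass:.0%}")  # 50%
print(f"tasks fully solved:     {all_pass:.0%}")  # 0%
```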
Inference, systems, and efficiency: Gemma 4 drafters, SGLang/RadixArk, and provider economics
- Gemma 4 gets multi-token prediction drafters across the open stack: Google released Gemma 4 MTP drafters, promising up to 3Ă— faster decoding with no quality degradation. The launch came through @googlegemma, @googledevs, and ecosystem posts from @osanseviero, @mervenoyann, and @_philschmid. The key engineering detail is that this is speculative-style decoding integrated into open tooling, with day-0 or near-day-0 support in Transformers, vLLM, MLX, SGLang, Ollama, and AI Edge. @vllm_project specifically announced a ready Docker image for Gemma 4 on vLLM.
- RadixArk raises a massive seed around SGLang + Miles: One of the bigger infra financings was RadixArk’s $100M seed, built around the SGLang inference stack and Miles for large-scale RL/post-training. @BanghuaZ framed the company as spanning inference, training, RL, orchestration, kernels, and multi-hardware systems; @Arpan_Shah_ and @GenAI_is_real emphasized the goal of making frontier-grade infrastructure open and production-grade, rather than forcing every team to rebuild scheduling, KV-cache management, and rollout systems from scratch. Community endorsements came from @ibab and @multiply_matrix.
- Inference economics are now highly provider-specific: @ArtificialAnlys compared MiniMax-M2.7 across six providers and found major differences in tokens/sec, cache discounting, and blended cost. SambaNova led raw speed at 435 output tok/s, while Fireworks looked stronger on the speed/price frontier for many workloads. Separately, @teortaxesTex highlighted how cache-hit rates dominate cost on some agent workloads, calling cache optimization “the main axis of cost reduction with V4.” A toy blended-cost model follows this list.
- Cold-start and distributed training remain active systems bottlenecks: @kamilsindi described a system that cut model cold starts 60×, from minutes to seconds, by serving weights from GPUs already holding them rather than cloud storage. On the training side, @dl_weekly highlighted Google DeepMind’s Decoupled DiLoCo, which reportedly achieved 88% goodput vs. 27% for standard data parallel at scale while using ~240× less inter-datacenter bandwidth.
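To make the blended-cost comparison above concrete, here is a toy cost model under hypothetical per-token rates; real provider pricing, cache-discount structure, and token mixes differ:

```python
# Toy blended-cost model for an input-heavy agent workload.
# All rates and shares are hypothetical, not any provider's list prices.
def blended_cost_per_mtok(input_rate, output_rate, cached_rate,
                          input_share=0.8, cache_hit=0.7):
    """Cost per 1M tokens given a cached-input discount.

    input_share: fraction of all tokens that are input (agents are input-heavy)
    cache_hit:   fraction of input tokens served from the prefix cache
    """
    cached = input_share * cache_hit * cached_rate
    fresh = input_share * (1 - cache_hit) * input_rate
    output = (1 - input_share) * output_rate
    return cached + fresh + output

# Same list prices, different cache-hit rates: caching dominates the blend.
print(blended_cost_per_mtok(0.30, 1.20, 0.03, cache_hit=0.9))  # ~0.29 $/Mtok
print(blended_cost_per_mtok(0.30, 1.20, 0.03, cache_hit=0.2))  # ~0.44 $/Mtok
```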
Agents, RL environments, observability, and long-horizon research
- RL infra is shifting from “single generation + reward” to long-running action systems: @adithya_s_k released a guide comparing RL environment frameworks for the LLM era, focusing on what scales to thousands of environments. A detailed survey by @ZhihuFrontier contrasted traditional RLVR with agentic RL, pointing to systems such as Forge, ROLL, Slime, and Seer and recurring concerns like TITO consistency, rollout latency, prefix-tree merging, and global KV caches.
- Long-horizon failures are increasingly framed as horizon problems, not just capacity problems: @dair_ai summarized a Microsoft Research paper arguing that goal horizon alone can be the training bottleneck, with macro actions / horizon reduction stabilizing training and improving long-horizon generalization. This rhymes with broader frustration that current benchmarks and public evals still underweight true long-horizon behavior.
- Observability is maturing into a feedback-driven improvement loop: @hwchase17 and @LangChain argued that traces alone are insufficient; the key is attaching direct, indirect, or generated feedback so observability becomes a learning system. @benhylak launched Raindrop Triage, an agent dedicated to finding and investigating bad agent behavior. @Vtrivedy10 laid out the practical loop explicitly: gather data → mine errors → localize which component failed → apply fix → test → repeat.
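A minimal sketch of that loop in code, using a hypothetical trace schema rather than any specific vendor’s API:

```python
# Sketch: traces plus attached feedback become a mine -> localize -> fix loop.
# The Trace schema and helper functions here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Trace:
    run_id: str
    steps: list          # e.g. [("retrieve", ...), ("generate", ...)]
    feedback: list = field(default_factory=list)  # direct, indirect, or generated

def mine_errors(traces):
    """Keep runs with explicit negative feedback for investigation."""
    return [t for t in traces if any(f["score"] < 0 for f in t.feedback)]

def localize(trace):
    """Naive localization: blame the last step that ran before failure."""
    return trace.steps[-1][0] if trace.steps else "unknown"

traces = [Trace("r1", [("retrieve", "..."), ("generate", "...")],
                [{"source": "thumbs_down", "score": -1}])]
for t in mine_errors(traces):
    print(t.run_id, "->", localize(t))  # then: apply fix, test, repeat
```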
Enterprise verticalization: finance, legal, and proactive assistants
- Anthropic and Perplexity both pushed hard into finance workflows: Anthropic launched financial-services agent templates for work such as pitch generation, valuation review, KYC screening, and month-end close, with integrations into providers like FactSet, S&P Global, and Morningstar, via @claudeai and summarized by @kimmonismus. Perplexity announced Perplexity Computer for Professional Finance, bringing in licensed data and 35 dedicated workflows for repeat analyst work, in @perplexity_ai and @AravSrinivas. Both launches reflect a clearer move from generic copilots to workflow-packaged vertical products.
- Perplexity also expanded into medical/professional health sources: @perplexity_ai announced premium access to NEJM, BMJ, and additional medical journals/databases, enabling “deep and wide research” on trusted clinical sources; @AravSrinivas framed this as a product for healthcare-grade information retrieval.
- Proactive assistant surfaces are becoming a product category: @kimmonismus reported a leak around Anthropic Orbit, described as a proactive assistant that synthesizes data from Gmail, Slack, GitHub, Calendar, Drive, and Figma without explicit prompting. Manus also added recommended connectors that are suggested in context when needed, per @ManusAI.
Top tweets (by engagement)
- Anthropic’s finance template launch drew outsized attention: @claudeai announced ready-to-run Claude agent templates for financial services with 22.9K engagement, one of the biggest clearly technical/AI-product posts in the set.
- OpenAI’s GPT-5.5 Instant launch dominated discussion: the main rollout thread from @OpenAI exceeded 8.2K engagement, with follow-on personalization details also performing strongly.
- Gemma 4 speedups landed as a major open-model systems update: @googledevs on 3Ă— faster Gemma 4 and @googlegemma both broke through, reflecting strong interest in inference improvements that preserve quality.
- Perplexity’s finance launch also resonated broadly: @perplexity_ai reached 2.5K engagement, suggesting that licensed-data workflow products are now seen as strategically important, not just niche enterprise packaging.
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. Gemma 4 MTP and llama.cpp Speculative Decoding
- Gemma 4 MTP released (Activity: 1116): Google released Multi-Token Prediction (MTP) drafter checkpoints for Gemma 4, with Hugging Face model cards for gemma-4-31B-it-assistant, gemma-4-26B-A4B-it-assistant, gemma-4-E4B-it-assistant, and gemma-4-E2B-it-assistant, described in Google’s blog post. The MTP setup adds a smaller/faster draft model for speculative decoding, where several draft tokens are proposed and then verified in parallel by the target model, claiming “up to 2x” decoding speedups while preserving identical output quality versus standard generation; one commenter notes the E2B drafter is only 78M parameters. A technical commenter also shared an updated visual explainer of MTP/speculative decoding for Gemma 4: Maarten Grootendorst’s guide. (A minimal decoding sketch appears at the end of this subsection.)
  - A commenter linked a technical visual guide explaining multi-token prediction (MTP) with Gemma 4, including implementation snippets and diagrams: Maarten Grootendorst’s guide. This is the main substantive resource in the thread for understanding how Gemma’s MTP-style decoding/drafting works.
  - One technical detail noted is that the E2B model includes a 78M draft model, implying a relatively small auxiliary model used for speculative or multi-token drafting. The comment highlights the draft model size as unusually compact, which is relevant for latency/throughput tradeoffs in MTP-style inference.
- Llama.cpp MTP support now in beta! (Activity: 1103): llama.cpp has beta MTP (Multi-Token Prediction) support via PR #22673, initially targeting Qwen3.x MTP models and loading the MTP component as a separate model from the same GGUF, with its own context/KV cache rather than a separate GGUF artifact. The PR adds post-ubatch MTP consumption to propagate hidden features correctly across ubatches and a small speculative decoding path depending on partial seq_rm support; reported Qwen3.6 27B / 35B-A3B tests show ~75% steady-state acceptance with 3 draft tokens and usually >2× token-generation throughput over baseline. Commenters view this as potentially one of the largest llama.cpp performance improvements to date, especially for dense models, and expect it to narrow token-generation speed gaps with vLLM alongside tensor parallelism. There is demand for a technical comparison of speculative decoding methods—MTP, EAGLE-3, DFlash, DTree, n-gram—covering draft-model requirements, context reuse, and model suitability.
  - Commenters frame MTP / multi-token prediction as potentially a major llama.cpp throughput improvement, especially for dense models, while expecting less benefit for MoE architectures. There is interest in comparing it against other speculative decoding approaches such as EAGLE-3, DFlash, DTree, and ngram, particularly around whether they require separate draft models and how well they reuse existing context.
  - One tester reported llama.cpp’s beta MTP support is “way faster than ik_llama.cpp implementation currently” in quick local testing. They linked a GGUF surgery script that extracts the MTP layer from am17an’s Q8_0 model and injects it into an existing Qwen 3.6 27B GGUF: gist.github.com/buzz/1c439684d5e3f36492ae9f64ef7e3f67, reportedly working with Bartowski’s Q6_K quantization.
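Putting the two items above in context, here is a minimal sketch of draft-model (assisted) decoding using Hugging Face Transformers’ assistant_model hook. The checkpoint names mirror the model cards mentioned above, but the exact identifiers, and whether the Gemma drafters plug in through this exact API, are assumptions:

```python
# Minimal sketch of draft-model speculative decoding via Transformers'
# assisted generation. Checkpoint names are illustrative assumptions based
# on the model cards above, not verified identifiers.
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET = "google/gemma-4-31b-it"           # assumed target checkpoint
DRAFT = "google/gemma-4-31b-it-assistant"  # assumed MTP drafter checkpoint

tok = AutoTokenizer.from_pretrained(TARGET)
target = AutoModelForCausalLM.from_pretrained(TARGET, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(DRAFT, device_map="auto")

inputs = tok("Summarize speculative decoding in two sentences.",
             return_tensors="pt").to(target.device)

# The draft proposes several tokens per step; the target verifies them in
# one parallel forward pass, so greedy output matches plain generation.
out = target.generate(**inputs, assistant_model=draft,
                      max_new_tokens=128, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```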
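And a back-of-envelope check on the reported llama.cpp numbers, assuming each draft token is accepted independently at the steady-state rate (a simplification; real acceptance is correlated across positions):

```python
# Expected tokens produced per target verification pass with k draft tokens,
# each accepted independently with probability a. The "+1" token the target
# emits on the pass is included: sum_{i=0..k} a^i = (1 - a^(k+1)) / (1 - a).
def expected_tokens(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

print(expected_tokens(0.75, 3))  # ~2.73 tokens per pass, vs 1.0 baseline
# After draft-model overhead, ">2x token-generation throughput" is plausible.
```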
2. Lower-Cost Frontier Alternatives for Agents and Coding
- Qwen3.6:27b is the first local model that actually holds up against Claude Code for me (Activity: 606): The post claims Qwen3.6:27B is the first local open-weight coding model that feels practically usable versus Claude Code, handling scaffolding, refactors, test generation, and few-file debugging locally, while still deferring harder multi-file architecture work to Claude. The author reports that opencode-style CLI agent setup required significantly more tuning than Claude Code’s out-of-the-box tool/context orchestration, raising the question of how much Claude Code quality comes from the model itself versus agentic scaffolding. A commenter reports running Qwen 3.6 35B on an RTX 5080 with GPU/CPU layer splitting at roughly 70 tokens/s, while another says 27B dense is useful for cheaper/lightweight work but still behind Sonnet 4.6 / Opus 4.7 for one-shot coding wins. Commenters debated pricing dynamics: one argued that viable local models should force cloud prices down via competition, countering the post’s concern about future high-priced Claude Code tiers. Others cautioned against overhyping Qwen, noting tool-calling loops and that frontier Claude models remain materially stronger for fast, high-confidence coding tasks.
  - Several users report that Qwen3.6 27B/35B is finally useful locally, but still below frontier coding models for harder tasks. One commenter runs Qwen 3.6 35B on an RTX 5080 by splitting layers across GPU/CPU, with most layers on GPU, reaching approximately 70 tokens/s; another uses 27B dense on an RTX Pro 6000 Blackwell but still prefers Claude Sonnet 4.6 / Opus 4.7 for one-shot or high-confidence coding work.
  - A recurring implementation issue is tool-calling instability, with Qwen reportedly getting stuck in loops despite parameter/configuration tuning. Another user notes 27B struggles at a 32k context window on an M4 Pro with 24GB VRAM, leading them to fall back to the Qwen 9B variant for practical use.
  - One detailed coding-task comparison found Qwen much slower and more error-prone than Claude models: Qwen took about 6 hours to fix 47 test failures one or two at a time, while Opus completed the same task in 20 minutes and Sonnet in under 30 minutes. The user also described a semantic failure where Qwen misdiagnosed a CSV header/import issue as cross-library CSV incompatibility, then disabled CSV import functionality and degraded product behavior instead of applying the simpler fix.
- DeepSeek V4 Pro matches GPT-5.2 on FoodTruck Bench, our agentic benchmark — 10 weeks later, ~17× cheaper (Activity: 431): The image is a FoodTruck Bench leaderboard screenshot showing DeepSeek V4 Pro highlighted at rank #4, with $27,142 30-day net worth, 1257% ROI, and 51% margin—very close to GPT-5.2 at $28,081. In the post’s context, this supports the claim that DeepSeek reached near-GPT-5.2 agentic performance about 10 weeks later while being claimed as ~17× cheaper for the same workload, with Claude Opus 4.6 still far ahead at $49,519. The benchmark is framed as a persistent-memory, tool-using agent simulation with 34 tools for food-truck operations, not a meme or non-technical image. Commenters were impressed but skeptical of the broader framing: one noted Claude Opus 4.6 appears to be pulling away with roughly 1.7× the profit of the next group, while another questioned why Gemma 4 31B is under-discussed if it beats Sonnet 4.6 on this benchmark and performs well on EQBench.
  - Several commenters focused on model-ranking anomalies and coverage gaps in FoodTruck Bench: Claude Opus 4.6 was described as achieving roughly 1.7× higher profit than the next group of models, while users asked why newer GPT-5.4/5.5 models were absent from the comparison.
  - Multiple users flagged Gemma 31B as unexpectedly strong, noting that it appears in the top 5 on FoodTruck Bench and reportedly performs well on EQBench, even beating Sonnet 4.6 in this benchmark. Commenters suggested this makes it harder to interpret claims around DeepSeek, Xiaomi, or the benchmark itself without deeper analysis of why Gemma scores so well.
  - There were concrete benchmark-improvement requests: create a FoodTruck Bench v2 with higher-fidelity simulation, more real-world variables, and more engineered scenario design. Users also requested adding recent Qwen3.6 models, specifically Qwen 3.6 27B, to better compare current open-weight model families.
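As a sanity check on how the leaderboard figures relate, assuming ROI is computed as profit over starting capital (the thread does not state the benchmark’s actual formula):

```python
# Relating the screenshot figures, under the assumption
# ROI = (net_worth - starting_capital) / starting_capital.
# The benchmark's real accounting may differ.
net_worth = 27_142
roi = 12.57  # 1257%
starting_capital = net_worth / (1 + roi)
print(round(starting_capital))  # ~2000 -> would imply a $2,000 starting stake
```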
Less Technical AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo
1. AI Coding vs Production Software Work
- Vibe Coding vs. Production reality (Activity: 3549): The image is an iceberg-style infographic, “Vibe Coding vs. Production Reality”, contrasting fast AI-assisted MVP/PoC generation with the much larger hidden engineering surface required for production: auth, secrets management, GDPR/data handling, audit logs, rate limiting, multi-tenancy, CI/CD, logging, incident response, testing, support, and vendor/model lifecycle risk. In context, the post argues that while “vibe coding” can compress the 80/20 prototype phase from days to hours, shipping asset management, GRC, or internal RAG systems still fails without production-grade operational, security, and compliance work. Comments push back that production has also become easier with modern platforms and AI, but only if the builder understands the domain; others argue scope matters—e.g. a simple Supabase-backed app may be fine, while business-critical or high-scale systems still require serious engineering discipline.
  - Several commenters argued that AI-assisted “vibe coding” lowers the barrier to building an MVP, but does not remove production requirements such as reliability, deployment, security hardening, observability, maintenance, and operational ownership. The core technical distinction raised was that generating code is only one part of shipping a production product.
  - One technical nuance was around scope and scale: a simple web app backed by managed services like Supabase can offload major production concerns such as authentication, database hosting, and backend APIs. However, commenters noted that once the application becomes business-critical or needs to scale beyond early users, deeper engineering expertise is still required.
  - A commenter cautioned against premature over-engineering, noting that it is a fallacy to architect for “tens of thousands of users while you have a hundred.” The implied technical recommendation is to match architecture, hardening, and scalability work to actual usage and risk rather than designing for hypothetical production scale upfront.
- Sr Software Engineer - Haven’t written a line of code in months (Activity: 2369): A senior engineer at a ~100+ person startup claims they now primarily “drive intent” with Claude/Codex/Perplexity rather than hand-writing code, arguing AI has shifted the value of senior engineers toward system design, UX, architecture, and technology tradeoff decisions rather than language/framework specialization. They also suggest interviewing should emphasize system design and tool/technology selection over language expertise, because “Claude is better than the majority of dev teams at writing and maintaining code”—while acknowledging this depends on prior engineering experience. Top commenters split between agreement and strong caution: one 10 YOE engineer reports the same shift, while a lead developer says they are currently rescuing a low-quality AI-heavy project built by senior engineers who claimed to “review all the code,” warning of confirmation bias, reliability issues, hotfix churn, and possible skill atrophy. Another 22 YOE commenter says they use AI extensively but still intentionally write code daily to avoid losing implementation skill.
  - A lead developer reported inheriting a project built by senior engineers who largely stopped coding and only “reviewed all the code”; despite receiving praise during development, the product allegedly suffered from poor quality and reliability, leading to market issues, constant hotfixes, and support escalations. They argue that excessive reliance on AI-assisted development can create hidden technical debt that becomes visible only after release, requiring a team using some AI to “untangle the mess.”
  - Several experienced engineers distinguished between using AI heavily and fully delegating implementation: one with 22 years of experience said they still deliberately write code daily to avoid skill atrophy, while another commenter warned that coding-interview readiness, e.g. LeetCode-style tasks, may degrade if engineers stop manually implementing solutions.
  - One commenter with 20 years of experience described a team where AI writes 100% of production code, while humans still perform PR review and architectural/problem-solving work. In that workflow, the main throughput constraint has shifted from code production to human review capacity, suggesting review quality and reviewer bandwidth become critical bottlenecks in AI-heavy engineering processes.
- Anthropic: AI will fully replace software engineering by 2027. Also Anthropic: Currently hiring for 122 SWE openings. (Activity: 1531): The image is a meme-style infographic, not a technical benchmark, contrasting Dario Amodei/Anthropic’s public claims that coding or software engineering may be heavily automated by ~2027 with a chart alleging Anthropic has 122 open SWE roles and a 184% increase since Jan 2025. The post argues this hiring trend conflicts with “AI will replace software engineers end-to-end” messaging, while noting broader signals such as Amazon intern hiring, NVIDIA’s compute-cost framing, SaaS reliability issues, and lack of clear large-scale AI productivity gains. Commenters split between seeing the hiring as compatible with Anthropic’s prediction—engineers may shift into monitoring, integration, and bottleneck-resolution roles—and arguing that 122 engineers is small for a company with a claimed $30B run rate. Others suggested the constant anxiety and debate in coding subreddits is itself evidence that AI displacement is being taken seriously.
  - One technical framing argued that “replace software engineering” may mean replacing direct coding labor rather than eliminating the SWE role entirely: engineers could shift toward monitoring AI-generated outputs, resolving bottlenecks, reviewing failures, and managing systems built by models. Under this interpretation, Anthropic hiring SWEs is not inconsistent with predicting a fundamentally different engineering workflow by 2027.
  - A commenter noted that 122 SWE openings is small relative to a claimed $30B run-rate software company, implying Anthropic can simultaneously predict automation and still need a relatively small engineering staff for model/product infrastructure. Another argued that hiring engineers now is a rational acceleration strategy if model capability improvement depends on more engineering plus compute investment.
  - A business/market-structure critique suggested Anthropic’s replacement claims may function partly as enterprise-sales and venture-capital signaling: if customers and investors believe AI can replace a large fraction of white-collar engineering labor, the company’s valuation and adoption prospects improve. This frames the 2027 claim less as a purely technical forecast and more as hype tied to fundraising and enterprise demand generation.
2. AI Account and Agent Exploit Incidents
- Warning: Anthropic’s “Gift Max” exploit drained €800+, ruined my credit, and got me banned. (Activity: 2536): A German data science student claims their Anthropic/Claude account with 2FA enabled incurred €800+ in unauthorized “Gift Max” charges on Apr 27, allegedly with 3-D Secure not completed, gift codes generated/redeemed by a third party, and contemporaneous Anthropic billing issues cited via the Anthropic status page plus GitHub issues #51404/#51168. After submitting a police report (Strafanzeige) and evidence, they say Anthropic banned the account instead of refunding, cutting off access to WIP projects/chats; a later update says the bank processed the case as fraud, issued a reclamation/refund, and will pursue Anthropic’s merchant account, while the user plans a GDPR/DSGVO data request and German legal aid (Beratungshilfeschein) to address SCHUFA damage. Commenters focused less on the exploit mechanics and more on payment-dispute process differences: one compared Germany with the U.S. chargeback model, while another noted the irony of a Gemini-assisted post criticizing Anthropic in a ChatGPT-related subreddit.
  - The OP reports their bank treated the unauthorized Anthropic charges as fraud, issued a reclamation/chargeback, and refunded the €800+. They also plan to file a GDPR/DSGVO data access request to recover work-in-progress projects and pursue German legal aid (Beratungshilfeschein) to clear any negative SCHUFA credit entries.
  - One commenter reports seeing multiple YouTube ads from different merchants all promoting the same “1 year free Claude access” offer, suggesting a coordinated phishing or scam-ad campaign rather than an isolated billing issue. This is relevant as a potential acquisition vector for the alleged “Gift Max” exploit or fake Claude subscription flow.
- A Twitter user tricked Grok to send 200k USD to him and it worked (Activity: 2394): The post claims a Twitter/X user extracted roughly $200k by prompting Grok to produce a command that was then acted on by Bankrbot, rather than Grok directly controlling or sending crypto from a wallet; commenters cite X Community Notes saying “Grok didn’t send anyone anything” and that the failure was an agent/bot command-execution path. The described exploit chain is: Bankrbot allegedly caused/handled an accidentally created crypto token, fees accrued to a wallet attributed to Grok, and an attacker induced Grok to instruct Bankrbot to transfer those funds elsewhere; the original Reddit gallery was not accessible due to 403 Forbidden (Reddit gallery). Commenters focused on the security implications of loosely coupled LLM agents and crypto bots, especially unclear authorization boundaries between text generation and executable financial commands. Some also questioned the attacker’s operational choice to disclose the exploit instead of continuing to drain funds.
  - Commenters clarified that Grok itself did not hold or transfer crypto; according to cited X Community Notes/context, Grok was allegedly prompted to emit a command that another automated agent, @bankerbot/Bankrbot, interpreted and executed. The technically relevant issue is therefore an AI-to-AI prompt/command injection failure, where one model’s generated text appears to have been treated as an authorized instruction by a crypto bot.
  - One summary of the incident describes a prior failure where Bankrbot allegedly created a crypto token from Grok output, users then traded that accidental token, and transaction fees accumulated in a wallet associated with the token/Grok interaction. The later exploit reportedly involved prompting Grok to instruct Bankrbot to redirect those accumulated fees, highlighting unsafe coupling between LLM-generated text, bot command parsers, and on-chain asset control.
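The missing piece commenters point at is an authorization boundary: model-generated text should be treated as untrusted input, never as an authorized command. A minimal sketch of such a gate, with hypothetical names and policies:

```python
# Hedged sketch of an authorization gate between LLM output and a bot
# executor. Action names, limits, and the command schema are hypothetical.
ALLOWED_ACTIONS = {"quote_price", "check_balance"}  # read-only by default

def execute_llm_command(cmd: dict) -> str:
    """Gate commands parsed from model-generated text before execution."""
    action = cmd.get("action")
    if action not in ALLOWED_ACTIONS:
        # Anything that moves funds fails closed and requires
        # out-of-band human approval, regardless of how it was phrased.
        return f"refused: '{action}' requires human approval"
    return f"executing read-only action: {action}"

# A prompt-injected "transfer" emitted by one model and parsed by a bot
# should fail closed rather than move funds.
print(execute_llm_command(
    {"action": "transfer", "to": "0xattacker", "amount": 200_000}))
```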
AI Discords
Unfortunately, Discord shut down our access today. We will not bring it back in this form, but we will be shipping the new AINews soon. Thanks for reading this far; it was a good run.