a quiet day.*

AI News for 3/10/2026-3/11/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!


AI Twitter Recap

NVIDIA’s Nemotron 3 Super Release and the Open-Model Efficiency Push

  • Nemotron 3 Super was the clearest technical release of the day: a 120B parameter / ~12B active open model with 1M context, a hybrid Mamba-Transformer / SSM Latent MoE architecture, and explicit support for agentic workloads. NVIDIA positioned it as unusually open — weights, data, recipe, infra details — and performance-focused for Blackwell-era deployment, with claims of up to 2.2x faster inference than GPT-OSS-120B in FP4 and large throughput gains over prior Nemotron releases (announcement via @ctnzr, tech perspective via @kuchaev, Wired reporting on NVIDIA’s broader open-model investment).
  • Third-party reactions converged on the same theme: strong capability-per-active-parameter and unusually high serving speed. @ArtificialAnlys scored it 36 on the AA Intelligence Index, ahead of gpt-oss-120b (33) but behind Qwen3.5-122B-A10B (42), while noting ~10% higher throughput per GPU than GPT-OSS-120B and launch-day serving speeds of up to 484 tok/s. Community and infra support landed immediately across vLLM, llama.cpp, Ollama, Together, Baseten, W&B Inference, LangChain, and Unsloth GGUFs.
  • The most interesting technical discussion was about why it is fast. @ctnzr highlighted native multi-token prediction (MTP) as a key inference optimization: provisional multi-token guesses get verified on subsequent passes, exploiting otherwise-unused GPU compute at small batch sizes. @bnjmn_marie also quantified a major KV-cache advantage versus Qwen3.5-122B: roughly 8,192 bytes/token in BF16 for Nemotron’s attention KV term versus 24,576 bytes/token for Qwen3.5-122B, making long-context serving materially lighter.
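The KV-cache comparison above is simple arithmetic: per token, each full-attention layer stores one key and one value vector of `kv_heads * head_dim` elements at the dtype's width. A minimal sketch, with hypothetical layer/head counts chosen purely to land on the figures quoted in the tweet (the real configurations of either model are not given in the thread):

```python
def kv_bytes_per_token(attn_layers, kv_heads, head_dim, dtype_bytes=2):
    """KV-cache bytes stored per token: one K and one V vector
    (kv_heads * head_dim values each) for every attention layer."""
    return 2 * attn_layers * kv_heads * head_dim * dtype_bytes

# Hypothetical configs picked only to reproduce the quoted BF16 figures;
# actual layer/head counts for both models may differ.
nemotron_like = kv_bytes_per_token(attn_layers=8, kv_heads=2, head_dim=128)
qwen_like = kv_bytes_per_token(attn_layers=24, kv_heads=2, head_dim=128)
print(nemotron_like, qwen_like)  # 8192 24576
```

Under this reading, the gap comes almost entirely from how few full-attention layers a hybrid Mamba/SSM stack needs to keep KV state for, which is what makes long-context serving lighter.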

Agent Infrastructure, Orchestration, and the “Bigger IDE” Thesis

  • The strongest product trend was a shift from “chat with a model” to persistent agent runtimes and orchestration layers. @karpathy argued the “age of the IDE is over” framing is wrong; instead, “we’re going to need a bigger IDE” where the unit of work becomes an agent rather than a file, and later extended that into the notion of legible, forkable agentic orgs with real-time observability and control (follow-up, org legibility thread).
  • Multiple launches fit that framing. Perplexity announced Personal Computer, an always-on local/cloud hybrid that runs on a Mac mini, works across local files/apps/sessions, and can be controlled remotely (launch, waitlist). It also expanded Computer for Enterprise, describing orchestration across 20 specialized models and 400+ apps (enterprise launch, API platform update). Separately, Replit Agent 4 pitched a more collaborative, canvas-like workflow with parallel agents for apps, sites, and slides (launch), while Base44 Superagents emphasized “batteries included” integrations with Gmail, Slack, Stripe, CRM, and more for nontechnical users (launch).
  • The engineering discussion is increasingly around the harness, not just the model. @Vtrivedy10 described a fast-moving design space where improved models unlocked product experiences that were previously too brittle, with a self-improving loop of evals/metrics → autonomous harness edits → hill climbing. LangChain added autonomous context compression to Deep Agents so models can compact at task boundaries instead of hard token thresholds (announcement), while @OpenAIDevs published a technical writeup on computer access for agents, covering execution loops, filesystem context, network access, and guardrails.

Anthropic, Claude-Centric Workflows, and Early RSI Anxiety

  • A major meta-story was Anthropic’s institutional framing of powerful AI. The company launched The Anthropic Institute, led by Jack Clark in a new Head of Public Benefit role, with a mandate spanning ML engineering, economics, and social science to shape the public conversation around advanced AI (launch, leadership note, Jack Clark on role change).
  • At the same time, several tweets amplified concerns that Anthropic may be seeing early recursive-self-improvement dynamics internally. The most substantive references came indirectly via discussion of a TIME article: @kimmonismus summarized claims that 70–90% of the code used in developing future models is now written by Claude, model release cadence has compressed from months to weeks, and some researchers think fully automated AI research could be as little as a year away. @Hangsiin highlighted one especially striking line: Claude being 427x faster than human overseers at some internal tasks, with nested parallel usage patterns already common.
  • This narrative had an immediate practical counterpoint: operational dependence on Claude Code. A login/auth outage triggered visible developer pain, with @Yuchenj_UW joking that Silicon Valley productivity fell 90%, @dejavucoder reporting inability to log in, and @HamelHusain describing fallback to token-based access. The outage even prompted @karpathy to note his autoresearch labs got wiped out in the OAuth outage, framing future frontier-model service interruptions as potential “intelligence brownouts.”

Research on Agent Evals, Retrieval, Post-Training, and Self-Improvement

  • Several papers focused on what looks like the next bottleneck: measuring and improving agent systems, rather than just base-model quality. @karinanguyen_ released PostTrainBench v1.0, a benchmark for whether frontier agents can post-train language models in a simplified setting, explicitly aimed at tracking progress toward AI R&D automation / recursive self-improvement. One notable ablation from the thread: for GPT-5.1 Codex Max, medium reasoning effort beat high, because extra tokens caused context compaction and hurt performance (ablation details).
  • On the agent-learning side, @omarsar0 highlighted EvoSkill, where an executor/proposer/skill-builder triad discovers and refines reusable skills from failures; on OfficeQA it reportedly improved Claude Code + Opus 4.5 from 60.6% to 67.9% exact match. @dair_ai shared AgentIR, a reasoning-aware retriever that jointly embeds an agent’s reasoning trace with its query; they report 68% accuracy on BrowseComp-Plus, versus 52% for larger conventional embedding models and 37% for BM25.
  • There was also renewed emphasis on agent reliability as a security problem even without adversaries. @random_walker argued many AI-agent failures arise from unreliability rather than explicit attacks, pointing to a Princeton response to NIST on the need to define, measure, and mitigate that failure mode. Combined with the growing emphasis on eval craft — e.g. @gabriberton calling eval creation the most useful skill in the age of code agents — the center of gravity keeps shifting toward measurement, harnesses, and production feedback loops.

Multimodal Models, Embeddings, and Physical/Visual AI

  • On the multimodal side, Google’s Gemini Embedding 2 drew practical pricing analysis rather than benchmark talk. @osanseviero summarized the release: embeddings for text, images, video, audio, PDFs, plus Matryoshka embeddings for lower-dimensional storage. @neural_avb offered the most useful deployment note: text pricing appears high relative to competitors, suggesting the model is best reserved for multimodal retrieval; video embedding costs can explode unless clients aggressively lower FPS before upload.
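The FPS warning is worth making concrete: if a provider bills video embeddings per sampled frame (or per frame-derived token), cost scales linearly with the frame rate the client uploads. A small sketch of the billable-frame count, assuming per-frame sampling (pricing specifics are the provider's, not shown here):

```python
def frames_uploaded(duration_s, fps):
    """Frames sent for embedding when a provider samples per frame;
    per-frame (or per-token) cost then scales linearly with FPS."""
    return int(duration_s * fps)

# A 10-minute clip: native 30 FPS vs. downsampled to 1 FPS is a 30x
# difference in billable frames before any per-frame rate applies.
print(frames_uploaded(600, 30), frames_uploaded(600, 1))  # 18000 600
```

Hence the deployment advice: drop FPS client-side before upload, since most retrieval use cases need nowhere near native frame rates.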
  • Qwen3.5’s multimodal architecture also got a detailed community breakdown from @ZhihuFrontier: a hybrid attention stack mixing Gated DeltaNet linear attention and Gated full attention, with a 397B A17B MoE variant and 27B dense variant, 262k native context extensible toward 1M, and MTP in training. That thread is useful mostly as a compact survey of where attention innovation is going: hybrid linear/full attention, GQA, DSA, and MoE routing are now core design axes.
  • In vision/physical AI, Reka Edge launched as a production-focused VLM for physical AI, claiming 3x fewer input tokens and 65% faster throughput than leading 8B models across image/video understanding, object detection, and tool use (launch). Google also shared two healthcare deployments: an AI system that identified 25% of interval breast cancers missed by standard screening (Google) and a real-world study of AMIE for conversational clinical reasoning that found it safe, feasible, and well-received by patients (Google Research).

Top tweets (by engagement)

  • Perplexity’s “Personal Computer”: always-on local/cloud agent on a Mac mini with remote control and local app/file access (launch).
  • Anthropic Institute / Jack Clark’s new role: Anthropic formalizes a public-benefit and public-discourse effort around powerful AI (Anthropic, @jackclarkSF).
  • Replit Agent 4: collaborative, multi-agent canvas for shipping apps/sites/slides (announcement).
  • NVIDIA Nemotron 3 Super: open 120B/12B-active hybrid model with 1M context and day-0 ecosystem support (@ctnzr).
  • Claude Code outage as infra risk: frontier-model auth failure visibly disrupting real engineering workflows (@karpathy, @Yuchenj_UW).

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen Model Releases and Benchmarks

  • M5 Max just arrived - benchmarks incoming (Activity: 2188): The post benchmarks the M5 Max 128GB 14” laptop with the mlx_lm tool across Qwen3.5-122B-A10B-4bit, Qwen3-Coder-Next-8bit, Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-6bit, and gpt-oss-120b-MXFP4-Q8, reporting tokens-per-second and peak memory usage at several prompt sizes. The author initially hit issues with BatchGenerator but resolved them with a fresh Python environment and stream_generate. Results vary widely: peak memory ranges from 25.319 GB to 92.605 GB and generation speeds from 14.225 to 87.873 tokens per second. Commenters are eager for the full numbers, with particular interest in the Qwen 3.5 27B MLX models; another humorously notes the anticipation.

    • The benchmarks for the M5 Max 128GB 14” using mlx_lm.generate show varying performance across different models and configurations. For instance, the Qwen3.5-122B-A10B-4bit model achieves a prompt throughput of 1,239.7 t/s at 16K context with a peak memory usage of 73.8 GB. In contrast, the Qwen3-Coder-Next-8bit model reaches 1,887.2 t/s at 32K context, but with higher memory consumption at 89.7 GB.
    • The Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-6bit model shows a significant drop in generation throughput, with only 14.9 t/s at 32K context and a peak memory usage of 30.0 GB. The drop is expected: as a dense model it activates all ~27B parameters per token, whereas the A10B MoE models activate only a small fraction, so it trades a smaller memory footprint for much lower generation speed.
    • The gpt-oss-120b-MXFP4-Q8 model demonstrates impressive performance with a prompt throughput of 2,710.5 t/s at 16K context and a relatively low peak memory usage of 64.9 GB. This indicates that the model is optimized for high throughput while maintaining efficient memory usage, making it suitable for applications requiring fast processing speeds.
  • Qwen3.5-35B-A3B Uncensored (Aggressive) — GGUF Release (Activity: 1019): The release of Qwen3.5-35B-A3B Aggressive on Hugging Face is notable for its uncensored nature, maintaining the original model’s capabilities without refusals (0/465 refusals). This model features 35B parameters with ~3B active, utilizing a mixture of experts (MoE) with 256 experts and 8+1 active per token. It supports multimodal inputs (text, image, video) and employs hybrid attention mechanisms (Gated DeltaNet + softmax in a 3:1 ratio). The model includes various quantization formats like BF16, Q8_0, and Q6_K, and is optimized for vision support with mmproj. Recommended sampling parameters include temp=1.0, top_k=20, and presence_penalty=1.5. Users are advised to use the --jinja flag with llama.cpp for optimal performance. The community appreciates the release, with users expressing gratitude for the developer’s efforts and anticipation for trying the model once all components, like Q4_K_M, are available.

    • Velocita84 raises a critical point about the need for evaluating Kullback-Leibler Divergence (KLD) to substantiate claims of ‘no capability loss’ in the Qwen3.5-35B-A3B model. This metric is essential for quantifying the difference between the probability distributions of the original and modified models, ensuring that the aggressive uncensoring does not degrade performance.
    • Iory1998 highlights concerns about potential quality degradation, particularly in handling long context scenarios. This is a common issue with large language models where modifications, such as aggressive uncensoring, might impact the model’s ability to maintain coherence and accuracy over extended text inputs. The commenter questions how the modified model compares to the original in these aspects.
    • No-Statistician-374 mentions the anticipation for the Q4_K_M version of the model, indicating a community interest in different quantization formats. This suggests that users are keen on exploring various configurations to optimize performance and resource usage, reflecting the technical community’s focus on balancing model size and computational efficiency.
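The KLD check Velocita84 asks for is straightforward to sketch: run the original and the uncensored model over the same prompts and compare their next-token distributions; values near zero mean the modification left predictions intact. A toy version over a 4-token vocabulary (a real evaluation would average over many prompts and the full-vocabulary softmax outputs):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats between two next-token distributions;
    ~0 means the modified model's predictions match the original's."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy distributions (illustrative only, not measured from either model).
original = [0.70, 0.20, 0.05, 0.05]
modified = [0.65, 0.25, 0.05, 0.05]
print(round(kl_divergence(original, modified), 4))  # 0.0072
```

Averaging this quantity over a held-out prompt set is the standard way quantization and abliteration efforts substantiate "no capability loss" claims.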

2. Open-Source TTS and Speech Models

  • Fish Audio Releases S2: open-source, controllable and expressive TTS model (Activity: 362): Fish Audio has released S2, a new open-source TTS model that allows for highly expressive and controllable voice synthesis using natural language emotion tags such as [whispers sweetly] or [laughing nervously]. The model supports over 80 languages, enables multi-speaker dialogue generation in a single pass, and achieves a 100ms time-to-first-audio. S2 reportedly surpasses closed-source models from Google and OpenAI in the Audio Turing Test and EmergentTTS-Eval. The model and code are available on Hugging Face and GitHub, though commercial use requires a separate license. There is debate over the model’s open-source status, as the license restricts commercial use without a separate agreement. The founder acknowledged a premature launch and provided additional resources for users, including a GitHub repository and performance benchmarks.

    • The S2 model by Fish Audio is not fully open-source as it requires a separate license for commercial use, despite being available for research and non-commercial purposes. The model is accessible on Hugging Face and the code is available on GitHub, though it is still being refined.
    • The founder of Fish Audio mentioned that the S2 model can achieve approximately 130 tokens per second on an H200 setup using the fish-speech repository, with potential for higher concurrency through SGLang. This suggests significant performance capabilities for users interested in high-throughput TTS applications.
    • The S2 model supports a wide range of languages with high-quality output and includes expressive tags like [angry] or [laughing], which enhance its utility for generating nuanced speech. This feature set makes it particularly valuable for users requiring high-quality non-English TTS solutions.

3. LocalLLaMA and Model Execution Experiences

  • I regret ever finding LocalLLaMA (Activity: 1408): The post humorously describes a slide from using AI as a study aid into full-blown local-LLM hobbyism via r/LocalLLaMA. The user details a progression from simple tasks to complex setups involving MI50 GPUs, quantization, and custom matrices, and mentions waiting on releases like GLM flash and new Qwen models, indicating a deep dive into optimizing local AI performance. The post reflects a shift from practical application to a hobbyist’s obsession with local AI technologies. A commenter from a major AI company notes that local AI is not widely appreciated outside engineering circles, comparing its potential impact to Linux in computing. Another commenter views the obsession with local AI as a positive addiction, emphasizing the value of knowledge.

  • 1 million LocalLLaMAs (Activity: 430): The image highlights the growth of the ‘LocalLlama’ subreddit, which focuses on discussions about locally hostable AI models. The subreddit, created in March 2023, has now amassed 1 million members; reaching that milestone roughly three years after creation reflects a strong and sustained interest in local AI hosting. One comment reflects on the community’s resilience, mentioning past challenges with moderation instability, while another expresses a preference for alternative AI lore, suggesting some members are seeking different thematic directions.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. AI Model and Benchmark Developments

  • Anthropic: Recursive Self Improvement Is Here. The Most Disruptive Company In The World. (Activity: 1141): Anthropic is reportedly accelerating AI development with its model, Claude, writing 70% to 90% of the code for future models, suggesting a shift towards recursive self-improvement. Evan Hubinger from Anthropic claims this phenomenon is already present, with the potential for fully automated AI research within a year. The release of Claude 3.7 Sonnet was delayed by 10 days due to safety concerns, highlighting the company’s cautious approach. Dario Amodei warns of AI’s potential to displace half of entry-level white-collar jobs within five years, urging transparency about these impacts. Anthropic’s stance on AI deployment in military contexts and its criticism of political influences on AI policy are also noted in the article. Some commenters question the criticism of safety delays, arguing that with 90% of code not written by humans, thorough testing is essential to ensure safety. The debate reflects concerns about balancing rapid AI development with ethical and safety considerations.

    • Substantial-Elk4531 raises a critical point about the necessity of testing AI models for safety, especially when a significant portion of the code is not human-written. This highlights the importance of rigorous testing protocols to ensure the safety and reliability of AI systems, which can be complex and unpredictable due to their autonomous nature.
    • BiasHyperion784 discusses the timeline for Anthropic’s infrastructure upgrades, noting that by Q3 2027, new hardware (‘rubin ultra’) will be deployed, enhancing compute capabilities. This suggests that current improvements in training times are crucial as they will be amplified by the upcoming bespoke hardware, indicating a strategic focus on scaling computational power to advance AI capabilities.
    • Unethical_Gopher_236 references historical concerns about AI safety, drawing a parallel to when Ilya Sutskever deemed GPT-2 too dangerous for release. This comment underscores ongoing debates in the AI community about balancing innovation with safety, reflecting on past instances where safety concerns have delayed AI model releases.
  • Andrej Karpathy’s Newest Development - Autonomously Improving Agentic Swarm Is Now Operational (Activity: 1125): Andrej Karpathy has developed an autonomously improving agentic swarm that significantly enhances neural network training. The system autonomously made around 700 changes, with 20 of these changes leading to an 11% improvement in the “Time to GPT-2” metric, reducing the time from 2.02 hours to 1.80 hours. This marks a significant milestone as it demonstrates an AI system effectively executing the full research loop of “try → measure → think → try again” without human intervention. The project is part of Karpathy’s tiny LLM initiative, and further details can be found on GitHub. Commenters are impressed by the AI’s ability to autonomously optimize processes, drawing parallels to similar improvements in other AI applications like RAG pipelines. The development is seen as a potential step towards AI singularity, where AI systems can independently improve and optimize themselves.

    • SECONDLANDING highlights a significant achievement where an AI agent autonomously improved a model’s training efficiency by approximately 11%, reducing the time from 2.02 hours to 1.80 hours to reach GPT-2 level performance. This was achieved through a self-directed research loop involving iterative testing and optimization, marking a notable instance of AI surpassing manual tuning efforts. GitHub link.
    • Worldly_Expression43 shares a parallel experience with Opus 4.6, which autonomously optimized a Retrieval-Augmented Generation (RAG) pipeline using pgvector. The AI evaluated multiple chunking strategies, ultimately achieving a 3x speed improvement over the original vector database method. This underscores the potential of AI in self-benchmarking and optimization, leading to significant performance gains.
    • TumbleweedPuzzled293 raises concerns about the alignment and control of autonomously improving AI swarms. While the technology is exciting, the lack of clear strategies for maintaining alignment as these systems evolve and modify themselves presents a potential risk, highlighting the dual nature of such advancements as both promising and perilous.
  • An EpochAI Frontier Math open problem may have been solved for the first time by GPT5.4 (Activity: 646): GPT-5.4 has reportedly solved an open problem from the EpochAI Frontier Math collection, which consists of unsolved mathematical problems that have resisted solutions from professional mathematicians. This achievement, if verified, marks a significant milestone as it suggests AI’s potential to contribute to open research problems, potentially leading to more substantial breakthroughs in the future. The problem solved is described as ‘moderately interesting,’ indicating a meaningful advancement in AI’s capability to tackle complex mathematical challenges. EpochAI’s open problems are designed to advance human mathematical knowledge through AI solutions. Commenters highlight the significance of AI solving open research problems, suggesting it could lead to larger breakthroughs. The involvement of Archivara lends credibility to the claim, though some skepticism remains about the problem’s relative difficulty.

    • ImmuneHack highlights the significance of AI potentially solving open research problems, emphasizing that while the problem may not be the most challenging in mathematics, the ability of AI to contribute to open research is a critical milestone. This could indicate a trajectory towards more significant breakthroughs, suggesting a shift in AI’s role in mathematical research.
    • FundusAnimae points out the involvement of a credible source, Archivara, in the discussion, which lends legitimacy to the claim. The problem solved is described as ‘moderately interesting,’ indicating it is a notable achievement but not at the pinnacle of mathematical challenges. This underscores the importance of AI’s growing capability in tackling complex problems.
    • socoolandawesome provides an update that an Epoch researcher believes the solution is correct but awaits confirmation from the problem’s author. This highlights the ongoing process of validation in mathematical research, where peer review and author confirmation are crucial steps in establishing the correctness of a solution.
  • Yann LeCun unveils his new startup Advanced Machine Intelligence (AMI Labs) — and raises $1.03B (Activity: 997): Yann LeCun has co-founded a new startup, Advanced Machine Intelligence (AMI Labs), with Alexandre LeBrun. The company has raised $1.03 billion to develop world models using LeCun’s JEPA architecture, which aims to model physical reality rather than just text, addressing limitations of current LLMs like hallucination issues. This initiative is positioned as a long-term research endeavor, with no immediate product or revenue expectations, and involves a notable team including Saining Xie, Pascale Fung, and Michael Rabbat. The project is backed by major investors such as NVIDIA, Samsung, and Bezos Expeditions, and will release its code and papers as open source. TechCrunch Commenters express optimism about LeCun’s new venture, highlighting his reputation for providing honest perspectives on AI. There is anticipation for the potential impact of AMI Labs’ research on the AI field.

    • Yann LeCun’s new startup, Advanced Machine Intelligence (AMI Labs), has raised $1.03 billion and is reportedly seeking a $5 billion valuation. The company is focusing on developing world models, which are sophisticated AI systems capable of understanding and predicting complex real-world phenomena. This ambitious goal aligns with LeCun’s reputation for pushing the boundaries of AI research and development.
    • Beyond Alexandre LeBrun as CEO, the “leadership team” circulating in the comment threads (‘LeFunde’ as CFO, ‘LeTune’ as head of post-training, ‘LeMune’ as Head of Growth, ‘LePrune’ for inference efficiency) is a running joke on LeCun’s surname rather than a list of real hires.
  • How I topped the Open LLM Leaderboard using 2x 4090 GPUs - Research notes in Blog form (Activity: 234): The post describes a novel approach to improving the performance of large language models (LLMs) by duplicating a specific block of 7 middle layers in the Qwen2-72B model, without altering any weights. This method led to significant performance gains across all Open LLM Leaderboard benchmarks, maintaining top positions since 2026. The author suggests that pre-training creates discrete functional circuits within the model’s layer stack, which only function effectively when preserved as a whole. This work was conducted using 2x RTX 4090 GPUs, demonstrating that substantial advancements can be achieved without extensive computational resources. The author is now experimenting with current models like GLM-4.7 and Qwen3.5 on a dual GH200 rig, with plans to release code and new models soon. More details are available in the full technical write-up. Commenters discuss the potential of looping layer circuits instead of duplicating them, suggesting that training models to recognize when to stop looping could yield meaningful results. There is curiosity about whether these circuits are discrete or overlapping, and whether attention patterns or activation statistics were analyzed before and after duplication. The approach aligns with some mechanistic interpretability work, and the idea of achieving significant results on a pretrained model is seen as impressive and refreshing.

    • The blog post discusses an unconventional approach to Transformer architecture, where layers are rearranged, such as feeding layer 60’s output into layer 10, which the model never encountered during training. This suggests that Transformer layers are more interchangeable than previously thought, with internal representations being homogenous enough to handle out-of-order hidden states. This flexibility implies that the architecture can function without strict layer order, challenging traditional views on model training and architecture design.
    • A commenter suggests the potential for looping layer circuits instead of duplicating them, proposing training the model to determine when to stop looping. This approach could involve training for loop/continue/halt commands to achieve meaningful early exits. The idea is to leverage existing circuits for efficiency, potentially offering a ‘free upgrade’ to models by enhancing their reasoning capabilities without extensive retraining. This concept aligns with some existing models that incorporate similar mechanisms from the start.
    • Another commenter highlights the significance of the observation that useful circuits exist in small layer blocks, which aligns with mechanical interpretability research. They express surprise at the effectiveness of duplicating blocks without altering weights and inquire about attention patterns or activation statistics before and after duplication. This curiosity extends to whether these duplicated blocks behave consistently across different LLM architectures like Qwen or GLM, suggesting a potential area for further research.
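Mechanically, the trick described above is just repeating a contiguous slice of the layer stack with no weight changes. A sketch over a generic sequence of layers (the post used a 7-layer middle block of Qwen2-72B; the indices and stack size here are illustrative):

```python
def duplicate_block(layers, start, length):
    """Return a new stack in which layers[start:start+length] runs twice
    in a row; no weights are modified, only the execution order."""
    block = layers[start:start + length]
    return layers[:start + length] + block + layers[start + length:]

# Toy 12-layer stack with a 3-layer block (indices 4-6) duplicated:
stack = list(range(12))
print(duplicate_block(stack, 4, 3))
# [0, 1, 2, 3, 4, 5, 6, 4, 5, 6, 7, 8, 9, 10, 11]
```

With a real model the same slice-and-concatenate would apply to the module list holding the transformer blocks, after which the forward pass simply visits the repeated blocks twice in sequence.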
  • Benchmarking Model Performance: Launch Day vs. Current API Generations (Activity: 227): The image compares outputs from the Gemini 3.1 Pro model on two different dates, highlighting a perceived degradation in quality over time. The left image, from February 19, 2026, shows a more detailed Ferrari, while the right image, from May 10, 2026, appears simpler. This suggests potential issues with model updates or API changes affecting output quality. However, the comparison’s validity is questioned due to the stochastic nature of LLM inference, which requires multiple runs to draw reliable conclusions. Commenters highlight the stochastic nature of LLMs, suggesting that a single comparison is insufficient to assess model performance changes. There is also skepticism about the date mentioned, implying a possible error or misunderstanding.

    • DifficultSelection highlights the importance of multiple runs in benchmarking LLMs, noting that inference is stochastic and requires around 30 runs per date to draw meaningful conclusions. This underscores the probabilistic nature of LLM outputs, which can vary significantly between runs.
    • Cet-Id emphasizes a common misunderstanding about LLMs, pointing out that many users fail to grasp their probabilistic nature. This suggests that variability in outputs is inherent and should be accounted for in performance evaluations.
    • sankalp_pateriya references an image showing a date discrepancy, suggesting a potential error or manipulation in the benchmarking post. This raises questions about the validity of the data presented, as the image indicates a future date, 10th May 2026, which is inconsistent with current timelines.
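The “~30 runs” point reduces to basic statistics: with run-to-run variance, you need repeated samples and non-overlapping confidence intervals before claiming a real regression between dates. A minimal sketch (the scores below are invented, not from the post):

```python
import statistics

def mean_and_se(scores):
    """Mean and standard error of repeated benchmark runs; compare
    mean +/- ~2*SE intervals across dates before claiming a change."""
    m = statistics.mean(scores)
    se = statistics.stdev(scores) / len(scores) ** 0.5
    return m, se

# Invented quality scores from 5 repeated generations on each date.
feb_runs = [7.1, 6.8, 7.4, 7.0, 6.9]
may_runs = [6.9, 7.2, 6.7, 7.1, 6.8]
print(mean_and_se(feb_runs))
print(mean_and_se(may_runs))
```

If the two intervals overlap, as they would for samples like these, a single side-by-side image comparison tells you nothing about model drift.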

2. AI in Creative and Video Production

  • Been quietly building a faceless YouTube channel using Claude and I’m embarrassingly close to monetisation (Activity: 2938): The Reddit post describes a workflow for creating a faceless YouTube channel using AI tools, specifically Claude for scripting, ElevenLabs for voiceover, Magic Hour for video generation, and CapCut for editing. The user reports nearing monetization on YouTube, highlighting the use of AI to generate content that sounds human-like. The process is described as unsophisticated but effective for the user’s needs, with no claims of significant financial success yet. Comments express strong disapproval of AI-generated content on YouTube, labeling it as ‘dead Internet content’ and ‘AI slop.’ There is skepticism about the monetization potential of such content, citing YouTube’s history of banning similar channels.

  • I’ve made $70k from AI Videos since August 2025 AMA (Activity: 224): The Reddit post discusses a video producer’s pivot to AI video production in August 2025, resulting in $70k earnings. The author emphasizes the importance of joining AI video communities like Skool for networking and learning, and highlights the demand for AI skills in production houses. They share strategies such as creating high-quality videos to showcase to decision-makers and exploring user-generated content (UGC) markets. The post also mentions using simple prompts for AI models like Nano Banana and Kling to generate diverse content efficiently. Commenters are interested in practical details such as the author’s portfolio and client acquisition strategies, and inquire about comprehensive platforms for accessing AI models, suggesting a focus on practical implementation and resource optimization.

    • Ant12-3 asks about the best-value all-in-one platform for accessing AI models, citing Higgsfield as an example. The question turns on the trade-off between consolidated platforms (ease of use, lower cost) and separate accounts per tool (flexibility, earlier access to cutting-edge models).
    • advertisingdave, a Higgsfield user, asks whether the author also uses Flow, inviting a comparison of the two tools’ features, performance, and use cases in AI video production workflows.
    • TheFreakmode, a TV editor, asks whether the work for production companies involves creating full stories or individual shots, probing how AI tools are integrated into traditional production processes and what such companies actually seek from AI-generated content.

3. Claude and AI Tools in Everyday Use

  • Stop paying $1,000+ for “AI Bootcamps”. Anthropic (makers of Claude) just dropped a 100% free academy. (Activity: 1679): Anthropic has launched a free online academy offering courses on AI, specifically focusing on their AI model, Claude. The courses cover practical applications such as integrating Claude with platforms like Amazon Bedrock and Google Cloud’s Vertex AI, and are tailored for diverse audiences including educators and nonprofit professionals. This initiative aims to provide accessible education on AI fluency and ethical collaboration, countering the trend of expensive AI bootcamps. Some commenters note that the academy has been available since mid-2025, suggesting that the announcement is not new. There is also skepticism about the value of expensive AI bootcamps, questioning who would pay $1000 for such programs.

  • Claude helped me get a traffic light reprogrammed in my town (Activity: 3301): The image depicts an email exchange where a user named Lenny successfully used Claude, an AI language model, to translate a layman’s request into technical language suitable for a signal engineer. This resulted in a modification to the traffic signal programming at a specific intersection in Essex, improving traffic flow by allowing an additional 2-3 vehicles to pass each cycle. The exchange highlights the practical application of AI in facilitating communication between the public and technical experts, leading to tangible improvements in infrastructure. Commenters expressed surprise and appreciation for the quick response time from the engineering team, noting the effectiveness of AI in real-world applications.

  • ChatGPT vs Gemini vs Claude vs Perplexity: I gave them $1k each to trade stocks. After 9 weeks, ChatGPT went from frozen in cash to +21% (one stock doubled) (Activity: 1345): In a 9-week experiment, four AI models—ChatGPT, Gemini, Claude, and Perplexity—were each given $1,000 to trade stocks autonomously using Alpaca APIs. ChatGPT led with a +21.1% return, primarily due to a strategic all-in on healthcare stocks, notably IOVA which doubled, and ACHC which rose 52%. Perplexity maintained a +1.1% return by holding mostly cash, while Gemini and Claude underperformed with -6.6% and -11.5% respectively, due to high-risk trades and frequent stop-outs. The S&P 500 fell -1.5% in the same period, highlighting ChatGPT’s significant outperformance. The experiment is documented on GitHub and further details are available on Substack. Commenters suggest the results might be accidental and propose using multiple instances of each model to validate findings, though this would require significant financial resources. Another suggestion was to include a random control, like throwing darts, to benchmark AI performance against chance.

    • TripIndividual9928 highlights the significance of understanding each model’s risk tolerance and trading behavior. ChatGPT’s strategy of holding cash and then investing heavily aligns with a known quant finance strategy called conviction-based position sizing, which involves waiting for high-conviction opportunities. In contrast, Claude’s frequent trading reflects a common retail trading mistake of over-trading, which often leads to poorer returns, as supported by behavioral finance research.
    • vegt121 suggests a more robust experimental design by using multiple instances of each model to trade stocks, which would provide a more statistically significant analysis of performance. This approach would require substantial financial resources, approximately $400k, to implement across 100 accounts for each model, but it could yield more reliable insights into the models’ trading capabilities.
    • TripIndividual9928 also proposes an enhancement to the experiment by having each model generate a reasoning memo before making trade decisions. This would allow for an analysis of the quality of pre-trade reasoning compared to actual returns, potentially revealing whether models that provide better analytical insights also achieve better trading outcomes.
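vegt121’s replication proposal can be sanity-checked with a quick Monte Carlo sketch. The per-model return distributions below are hypothetical assumptions that loosely echo the single-run results from the post, not data from the experiment: with 100 accounts per model, the standard error of each model’s mean return shrinks by a factor of 10 versus one account, and the commenter’s ~$400k figure follows from 4 models x 100 accounts x $1,000.

```python
import random
import statistics

random.seed(0)

def simulate_accounts(mean_return, stdev, n_accounts=100):
    """Simulate final returns (%) for n independent $1k accounts
    run by the same model. mean/stdev are pure assumptions."""
    return [random.gauss(mean_return, stdev) for _ in range(n_accounts)]

# Hypothetical distributions loosely echoing the post's single-run results.
model_a = simulate_accounts(21.0, 15.0)  # concentrated bets: high mean, high variance
model_b = simulate_accounts(1.0, 3.0)    # mostly cash: low mean, low variance

# Standard error of the mean falls with sqrt(n), so 100 accounts shrink
# the per-model uncertainty 10x versus a single account.
se_a = statistics.stdev(model_a) / len(model_a) ** 0.5
se_b = statistics.stdev(model_b) / len(model_b) ** 0.5
gap = statistics.mean(model_a) - statistics.mean(model_b)
combined_se = (se_a ** 2 + se_b ** 2) ** 0.5
print(f"mean gap: {gap:.1f} pts, standard error of gap: {combined_se:.2f}")

# Funding check for the commenter's proposal: 4 models x 100 accounts x $1k.
print(f"capital required: ${4 * 100 * 1000:,}")
```

Under these assumed spreads, a ~20-point gap dwarfs the pooled standard error, so the design would distinguish the strategies; a single account per model cannot, which is the commenter’s point.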

AI Discords

Unfortunately, Discord shut down our access today. We will not bring it back in this form, but we will be shipping the new AINews soon. Thanks for reading this far; it was a good run.