a quiet day.

AI News for 4/28/2026-4/29/2026. We checked 12 subreddits and 544 Twitters; no Discords this time (see the note at the end of this issue). AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!


AI Twitter Recap

Coding Agents Become Platforms: Codex, Cursor SDK, and VS Code Harness Upgrades

  • OpenAI is turning Codex from a coding tool into a general work surface: the strongest product signal today was not just usage enthusiasm, but the steady expansion of capabilities around persistent context, tools, integrations, and team rollout. OpenAI highlighted Codex for broader knowledge-work tasks like research synthesis, spreadsheets, and decision tracking in addition to code (OpenAI, follow-up, follow-up); launched Codex-only seats with $0 seat fee for eligible Business/Enterprise customers through end of June (OpenAIDevs); and added integrations like Supabase (coreyching) and a Figma plugin that turns implementation plans into FigJam boards (OpenAIDevs). Community posts also pointed to app-server usage, and richer agent workflows (gdb, aiDotEngineer).
  • Performance work is shifting from model latency to agent-loop systems engineering: OpenAI said moving Codex-style workflows to WebSocket mode on the Responses API keeps state warm across tool calls and cuts repeated work, yielding up to 40% faster agentic workflows (OpenAIDevs, reach_vb, pierceboggan); a generic sketch of the persistent-connection pattern appears after this list. VS Code shipped a parallel stack of harness improvements: semantic indexing across workspaces, cross-repo search, chat session insights, skill context, remote control for Copilot CLI, and a prompt/agent evaluation extension aimed at refining prompts, skills, and instructions (pierceboggan, pierceboggan, code). The throughline is that coding-agent UX is now dominated by memory, retrieval, harness quality, and tool orchestration—not just raw model intelligence.
  • Cursor is making an explicit platform play: the new Cursor SDK exposes the same runtime, harness, and models that power Cursor for use in CI/CD, automations, and embedded agents inside products (cursor_ai, starter projects, customer examples). This is notable because it shifts Cursor from seat-based IDE product toward programmable agent infrastructure, a framing captured well by @kimmonismus. Taken together with Codex app-server and VS Code harness work, the category is clearly converging on headless agent runtimes + programmable harnesses + usage-based economics.
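
To make the "state stays warm" point concrete, here is a generic, minimal sketch of a persistent-connection agent loop in Python. It is illustrative only: the endpoint URL, message schema, and tool-call fields are hypothetical placeholders, not OpenAI’s actual Responses API WebSocket protocol. The point is simply that one long-lived connection lets each turn carry only the incremental tool result instead of re-sending the full conversation every time.

```python
# Illustrative only: a generic persistent-connection agent loop. The endpoint URL,
# message schema, and tool-call fields below are hypothetical placeholders, not
# OpenAI's actual Responses API WebSocket protocol.
import asyncio
import json

import websockets  # pip install websockets

AGENT_WS_URL = "wss://example.invalid/agent"  # hypothetical endpoint


def run_tool(name: str, args: dict) -> str:
    """Hypothetical local tool executor."""
    return f"ran {name} with {args}"


async def agent_loop(task: str) -> str:
    # One long-lived connection: the server keeps conversation/tool state warm,
    # so each turn only carries the incremental message instead of full history.
    async with websockets.connect(AGENT_WS_URL) as ws:
        await ws.send(json.dumps({"type": "task", "content": task}))
        while True:
            event = json.loads(await ws.recv())
            if event["type"] == "tool_call":
                result = run_tool(event["name"], event.get("args", {}))
                await ws.send(json.dumps({"type": "tool_result", "content": result}))
            elif event["type"] == "final":
                return event["content"]


if __name__ == "__main__":
    print(asyncio.run(agent_loop("dockerize this repo")))
```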

Agent Harness Engineering, LangGraph/Deep Agents, and Production AgentOps

  • Harnesses are emerging as a first-class optimization layer: multiple posts converged on the idea that model quality alone is insufficient; the harness around the model often determines production performance. The clearest research example was Agentic Harness Engineering, which makes harness evolution observable via revertible components, condensed execution evidence, and falsifiable predictions. Reported gains: Terminal-Bench 2 pass@1 from 69.7% to 77.0% in ten iterations, beating a human-designed Codex-CLI baseline at 71.9%, while also transferring across model families and reducing token use on SWE-bench Verified by 12% (omarsar0). Related work on HALO describes recursively self-improving agents using trace analysis to patch harness failures, claiming AppWorld improvement from 73.7 to 89.5 on Sonnet 4.6 (samhogan).
  • LangChain’s Deep Agents product line is leaning into model-specific harness tuning and deployability: new Harness Profiles let teams version per-model prompts, tools, and middleware, with built-in profiles for OpenAI, Anthropic, and Google models (LangChain_OSS, LangChain, Vtrivedy10); a hypothetical sketch of the profile idea appears after this list. LangChain also pushed DeepAgents Deploy, a low-code deployment path using a small set of markdown/config files and LangSmith-backed tracing (hwchase17). The broader message from LangChain staff was consistent: open harnesses, open evals, and OSS-friendly model mixes matter because closed models are becoming too expensive for many agent workloads (hwchase17, Vtrivedy10).
  • Cloudflare continued to flesh out its “agents as software” stack with ideas like execution ladders and, more concretely, making agents able to become Cloudflare customers—create accounts, register domains, start paid plans, and get tokens for deployment (threepointone, Cloudflare). This is a meaningful sign that vendors are starting to expose business workflows directly to agents rather than treating them as passive copilots.
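
For readers unfamiliar with the harness-profile idea, here is a hypothetical sketch of what "versioned per-model prompts, tools, and middleware" can look like in code. This is not LangChain’s Deep Agents API; every name below is invented for illustration.

```python
# Hypothetical illustration of the "harness profile" idea: versioned, per-model-family
# bundles of prompt, tools, and middleware. This is NOT LangChain's Deep Agents API;
# all names below are made up for illustration.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HarnessProfile:
    model_family: str                 # e.g. "openai", "anthropic", "google"
    version: str                      # profiles are versioned so changes are revertible
    system_prompt: str
    tools: List[str] = field(default_factory=list)
    middleware: List[Callable] = field(default_factory=list)  # e.g. context condensers


PROFILES = {
    ("openai", "v3"): HarnessProfile(
        model_family="openai",
        version="v3",
        system_prompt="Plan first, then call tools; keep diffs minimal.",
        tools=["shell", "edit_file", "search"],
    ),
    ("anthropic", "v2"): HarnessProfile(
        model_family="anthropic",
        version="v2",
        system_prompt="Think step by step; prefer small, verifiable edits.",
        tools=["shell", "edit_file"],
    ),
}


def select_profile(model_family: str, version: str) -> HarnessProfile:
    """Pick the harness bundle to pair with a given model at request time."""
    return PROFILES[(model_family, version)]


if __name__ == "__main__":
    print(select_profile("openai", "v3").tools)
```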

Model Releases and Benchmarks: Mistral Medium 3.5, Granite 4.1, Ling-2.6, and Open-Model Price Pressure

  • Mistral Medium 3.5 was the day’s most debated model release. Early commentary pegged it as a dense 128B model (scaling01), with Unsloth describing it as a vision reasoning model that can run locally on roughly 64GB RAM and publishing GGUFs/guidance (UnslothAI). Reaction split sharply: some criticized its 128K context, architecture choices, and pricing versus large Chinese open MoEs (eliebakouch, scaling01), while others argued Mistral is making a deliberate enterprise reliability/instruction-following bet rather than chasing raw benchmark spectacle (kimmonismus).
  • IBM Granite 4.1 added three new open-weight, Apache 2.0 non-reasoning models—30B, 8B, 3B—with a strong emphasis on openness and token efficiency (ArtificialAnlys). The standout claim is that Granite 4.1 8B used only 4M output tokens on the Artificial Analysis Intelligence Index, versus 78M for Qwen3.5 9B, while scoring 61 on the AA Openness Index. Intelligence lags stronger peers, but the family looks aimed squarely at enterprise/edge deployments where cost and transparency matter more than leaderboard position.
  • Open-weight competitive pressure continues to intensify: Ant OSS’s Ling-2.6-flash was cited as ~107B MoE, MIT-licensed, with 61.2 SWE-bench Verified and strong math scores (nathanhabib1011); Ling-2.6-1T also landed with day-0 vLLM support (vllm_project). Meanwhile, Tencent Hunyuan open-sourced Hy-MT1.5-1.8B-1.25bit, a 440MB, fully offline translation model for phones covering 33 languages, 1,056 translation directions, and claiming parity with commercial APIs / 235B-scale models on standard MT benchmarks via aggressive 1.25-bit quantization (TencentHunyuan). On the market side, multiple posts underscored how rapidly pricing is falling for capable open models, e.g. Qwen 3.5 Plus at $3/M output tokens (MatthewBerman) and MiMo-V2.5 Pro shifting the Pareto frontier in Code Arena at $1/$3 per M tokens (arena).

Inference, Kernels, and MoE Systems: FlashQLA, vLLM on Blackwell, torch.compile, and GLM-5 Serving

  • Qwen’s FlashQLA is a notable long-context kernel release: Alibaba introduced FlashQLA, high-performance linear attention kernels on TileLang, reporting 2–3× forward and 2× backward speedups, especially for small models, long-context workloads, and tensor-parallel setups. The design centers on gate-driven automatic intra-card CP (context parallelism), algebraic reformulation, and fused warp-specialized kernels (Alibaba_Qwen, benchmark thread). It is explicitly positioned for agentic AI on personal devices, which fits a broader trend of long-context optimization migrating from cloud-only infra to edge-friendly runtimes.
  • vLLM and Blackwell co-design is landing real throughput wins: vLLM reported #1 output speed on Artificial Analysis for DeepSeek V3.2 at 230 tok/s, 0.96s TTFT and also strong results on Qwen 3.5 397B using DigitalOcean serverless inference on NVIDIA HGX B300, with optimizations including NVFP4 quantization, EAGLE3 + MTP speculative decoding, and per-model kernel fusion (vllm_project). SemiAnalysis separately highlighted gains from vLLM 0.20.0 and MegaMoE kernels for DeepSeek v4 Pro on GB200 (SemiAnalysis_). This is one of the clearer examples of hardware/software/model co-design translating into publicly visible latency numbers.
  • More engineers are sharing the “middle layer” details between models and GPUs: a useful thread on torch.compile broke down Dynamo → pre-grad → AOT autograd → post-grad → Inductor, including where to inject custom FX passes for inference optimizations (maharshii). John Carmack posted a reminder that GPU library performance remains extremely path-dependent and notchy, noting a 10× regression in torch.linalg.solve_ex when going from 511×511 to 512×512, apparently due to a different internal path with CudaMalloc/Free (ID_AA_Carmack, follow-up). Zhipu AI also published a good serving postmortem on GLM-5, detailing KV cache race conditions, HiCache synchronization bugs, and LayerSplit, which reportedly improved prefill throughput by up to 132% for long-context coding-agent serving (Zai_org).
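
As a concrete entry point into that pipeline, the snippet below shows the earliest documented interception spot: a custom torch.compile backend, which receives the FX GraphModule that Dynamo captured before AOT Autograd and Inductor run. This is a minimal sketch, not the thread’s code; deeper pre-/post-grad pass hooks live in Inductor’s config, and their exact names vary across PyTorch versions.

```python
# Minimal sketch of the earliest interception point in the torch.compile pipeline:
# a custom Dynamo backend receives the captured FX GraphModule before AOT Autograd
# and Inductor run. (Inductor also exposes pre-/post-grad pass hooks via
# torch._inductor.config, but their names vary across PyTorch versions.)
import torch


def inspecting_backend(gm: torch.fx.GraphModule, example_inputs):
    gm.graph.print_tabular()  # show the FX nodes Dynamo captured
    # A custom FX pass would rewrite gm.graph here, then call gm.recompile().
    return gm.forward         # fall back to eager execution of the captured graph


@torch.compile(backend=inspecting_backend)
def fn(x, y):
    return torch.relu(x @ y) + y


if __name__ == "__main__":
    fn(torch.randn(64, 64), torch.randn(64, 64))
```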

Research Signals: Knowledge Probes, Web-Agent Benchmarks, Multimodal/Science Infrastructure

  • Incompressible Knowledge Probes (IKP) is one of the more provocative research threads: @bojie_li claims that factual knowledge accuracy over 1,400 questions / 188 models / 27 vendors gives a strong log-linear signal of model size (R² = 0.917 on open-weight models from 135M to 1.6T params). The paper argues factual capacity does not compress over time the way some “reasoning compresses” narratives suggest, and uses the fitted curve to estimate closed-model sizes. Whether one buys the estimates or not, the work is valuable as a reminder that black-box evals still leak architecture-scale information; a toy version of the log-linear fit appears after this list.
  • Web-agent evaluation is maturing beyond pass/fail: the new Odysseys benchmark introduces 200 long-horizon live-internet tasks, rubric-based evaluation instead of binary success, and a trajectory efficiency metric. Best model success is reported at only 44.5%, with efficiency still extremely low at 1.15% (rsalakhu, dan_fried). That fits the broader industry push toward agent benchmarks that better reflect multi-step browsing, spreadsheeting, and orchestration work rather than short synthetic tasks.
  • AI-for-science and multimodal infrastructure saw meaningful ecosystem launches: Hugging Face introduced Hugging Science, a curated home for open science datasets/models/challenges including 78GB genomics, 11TB PDE simulations, 100M cell profiles, 9T DNA base pairs, and more (cgeorgiaw). Anthropic released BioMysteryBench, reporting that recent Claude models solved about 30% of hard biological data-analysis problems that stumped experts (AnthropicAI). On the multimodal side, Vista4D introduced video “reshooting” from new camera trajectories using a persistent 4D scene representation (micahgoldblum), and Sakana’s KAME proposed a tandem “speak while thinking” architecture for speech-to-speech systems by combining a low-latency frontend model with asynchronous backend-LLM oracle signals (SakanaAILabs).
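
To illustrate the IKP methodology (not reproduce its results), here is a toy version of the log-linear fit with made-up data points: fit factual accuracy against log10(parameter count) for open models, report R², then invert the fit to estimate a closed model’s size from its probe accuracy.

```python
# Sketch of the IKP-style analysis described above: fit factual accuracy against
# log10(parameter count) for open-weight models, then invert the fit to estimate a
# closed model's size from its accuracy. The data points here are made up.
import numpy as np

params_b = np.array([0.135, 1.0, 8.0, 70.0, 400.0, 1600.0])   # model sizes in billions (synthetic)
accuracy = np.array([0.08, 0.18, 0.33, 0.47, 0.61, 0.72])     # factual-probe accuracy (synthetic)

x = np.log10(params_b)
slope, intercept = np.polyfit(x, accuracy, 1)                 # accuracy ≈ slope*log10(P) + intercept

pred = slope * x + intercept
ss_res = np.sum((accuracy - pred) ** 2)
ss_tot = np.sum((accuracy - accuracy.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"fit: acc = {slope:.3f}*log10(P_B) + {intercept:.3f}, R^2 = {r2:.3f}")

# Invert the fit: estimate the size of a closed model that scores 0.55 on the probe.
closed_acc = 0.55
est_params_b = 10 ** ((closed_acc - intercept) / slope)
print(f"estimated size for accuracy {closed_acc}: ~{est_params_b:.0f}B parameters")
```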

Top Tweets (by engagement)

  • Cursor SDK launch: programmable agent runtime/harness/models for CI, automations, and embedded products (cursor_ai).
  • Codex momentum / platform expansion: OpenAI pushing Codex beyond coding into broader work automation, plus team rollout and integrations (OpenAI, OpenAIDevs).
  • Google productization signal: Gemini can now generate downloadable Docs, Sheets, Slides, PDFs, and more directly from chat (sundarpichai, GeminiApp).
  • Q1 business signal: Google reported Cloud +63% YoY, strong Gemini momentum, and all-time-high Search queries, an important data point for the “AI monetization” thesis (sundarpichai).
  • Deep technical long-form: Dwarkesh’s chalkboard session with Reiner Pope on inferring training/serving strategies from prices, equations, and systems constraints (dwarkesh_sp).

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Mistral Medium 3.5 Model Launch and Features

  • mistralai/Mistral-Medium-3.5-128B · Hugging Face (Activity: 921): Mistral Medium 3.5 is a dense 128B parameter model with a 256k context window, designed for instruction-following, reasoning, and coding tasks. It features configurable reasoning effort, multimodal input capabilities, and strong performance across various benchmarks, surpassing previous models like Devstral. The model is released with open weights under a Modified MIT License and supports multiple languages and system prompts. For optimal performance, it is recommended to use the vLLM library for inference; a minimal vLLM sketch appears below, after the comments. More details can be found here. One commenter is testing the model on a Strix Halo with a q4 quantization, reporting token generation speeds and expressing interest in the model’s dense architecture. Another comment highlights the model’s niche as a dense 128B parameter model, comparing it to Qwen 27B.

    • IvGranite shared performance metrics for the Mistral-Medium-3.5-128B model using a q4 quantization on a Strix Halo setup. The results showed a generation speed of 46.70 tokens per second and a prompt processing speed of 3.26 tokens per second, with a total duration of 4.84 seconds for one of the tests. This indicates a relatively high throughput for a dense model of this size.
    • Grumd and reto-wyss discussed the niche of dense models, with grumd noting the uniqueness of a 128B dense model. Reto-wyss compared it to the Qwen 27B model, questioning which is denser, highlighting the competitive landscape in model density and performance.
    • The discussion around dense models like the Mistral-Medium-3.5-128B reflects interest in balancing model size with performance efficiency. The mention of 128B as a ‘chonker’ by artisticMink underscores the challenges and intrigue in handling such large-scale models, especially in terms of computational resources and speed.
  • Mistral Medium 3.5 Launched (Activity: 326): Mistral Medium 3.5 has been launched as a 128B dense model, notable for its integration of instruction-following, reasoning, and coding capabilities. The model is available with open weights under a modified MIT license, which restricts commercial use without a license fee. This model supports asynchronous coding tasks in the cloud, enabling parallel session execution, and introduces a new Work mode in Le Chat for complex workflows. More details can be found on Hugging Face and Mistral’s announcement. There is debate over the licensing terms, with some users arguing that calling it a ‘modified MIT license’ is misleading, as it imposes commercial restrictions not typical of the MIT license. The model’s parameter count and capabilities are also discussed, with some users noting the significant computational resources implied by the 128B dense architecture.

    • The Mistral Medium 3.5 model is a dense 128 billion parameter model, which is significant given the industry’s broader shift toward sparse architectures. It represents continued investment in dense models, as noted by Septerium, who highlights the importance of developing them despite the focus on sparse MoE designs.
    • Long_comment_san discusses the benchmarks of the Mistral Medium 3.5, noting that while it may not be state-of-the-art (SOTA), it is crucial for the future of dense models. They argue that dense models in the 80 billion+ parameter range are essential workhorses and foresee a future where ultra-sparse mixture of experts (MOE) models and super-dense models coexist, with the latter reaching up to 200 billion parameters.
    • ClearApartment2627 raises a licensing issue, criticizing the use of a ‘modified MIT license’ for the Mistral Medium 3.5. They argue that calling it a modified MIT license is misleading, as the conditions for commercial use differ significantly from the traditional MIT license, particularly for companies with revenues exceeding $20 million per month.
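
Since the model card reportedly recommends vLLM, here is a minimal offline-inference sketch using vLLM’s public Python API. The model ID comes from the post; the tensor-parallel and context-length settings are assumptions, and a dense 128B model generally needs multiple GPUs or a quantized variant.

```python
# Minimal vLLM offline-inference sketch, following the card's recommendation above.
# The model ID comes from the post; the parallelism and context settings are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Medium-3.5-128B",
    tensor_parallel_size=8,   # assumption: shard the dense 128B model across 8 GPUs
    max_model_len=32768,      # cap context to control KV-cache memory
)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
outputs = llm.generate(["Explain what a dense 128B model trades off versus an MoE."], params)
print(outputs[0].outputs[0].text)
```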

2. Qwen 3.6 Model Evaluations and Features

  • Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation (Activity: 995): The image provides a benchmark comparison of the Qwen 3.6 27B model across three quantization variants: BF16, Q4_K_M, and Q8_0 GGUF, evaluated using llama-cpp-python with Neo AI Engineer. The benchmarks include HumanEval for code generation, HellaSwag for commonsense reasoning, and BFCL for function calling. The Q4_K_M variant stands out for its practical performance, offering 1.45x faster throughput than BF16, with 48% less peak RAM usage and a 68.8% smaller model size, while maintaining nearly identical function calling scores. However, Q8_0, despite slightly better HumanEval scores, was less efficient in terms of RAM and speed compared to Q4_K_M. The evaluation setup included GGUF via llama-cpp-python, with a context size of 32768 and checkpointed evaluation runs. Some commenters appreciated the detailed comparison across quantization variants, while others questioned the accuracy of the results, noting the absence of error bars and suggesting potential sampling errors. There were also concerns about the unexpectedly low HumanEval scores for Qwen 3.6 27B compared to older models like Gemma 3 4B and Llama3-8b.

    • audioen raises concerns about the lack of error bars in the measurements, suggesting that the unexpected ordering of Q4_K_M over Q8_0 might be due to sampling error. This highlights the importance of statistical rigor in benchmarking to ensure reliable comparisons between quantization methods; a quick back-of-envelope error-bar sketch appears after this list.
    • One_Key_8127 points out discrepancies in the reported HumanEval scores, noting that older models like Gemma 3 4B and Llama3-8b outperform Qwen 3.6 27B, which should theoretically score much higher. This suggests potential issues with the evaluation setup or data, as Qwen 3.6 27B is expected to achieve scores of 85% or more, not around 50%.
    • spaceman_ questions the integrity of the Q8_0 model’s results, speculating that the quantization of the KV cache might have affected performance. They express interest in the full code used for the evaluation, as it could reveal whether the KV cache was indeed quantized, which might explain the unexpected results.
  • Qwen Introduced FlashQLA (Activity: 407): FlashQLA is a new high-performance linear attention kernel designed for agentic AI on personal devices, offering a 2–3× forward speedup and a 2×+ backward speedup. Built on TileLang, it features gate-driven automatic intra-card context parallelism (CP), hardware-friendly algebraic reformulation, and TileLang fused warp-specialized kernels. The approach splits the GDN flow into two kernels optimized for CP and backward efficiency, which, despite extra memory I/O overhead at large batch sizes, enhances real-world performance on edge devices and long-context workloads. The backward pass is notably optimized with a 16-stage warp-specialized pipeline, achieving 2×+ kernel-level speedups. More details can be found in their blog and code repository. One comment humorously references the abbreviation of ‘cyberpunk,’ while another suggests the technology is suitable for those with high-end hardware like the H100. There is also interest in forward and backward benchmark results across common configurations.

    • ResearchCrafty1804 discusses benchmark results for FlashQLA, highlighting both forward and backward performance across common configurations. This suggests a focus on evaluating the model’s efficiency in different computational scenarios, which is crucial for understanding its practical applications and limitations.
    • pmttyji provides a detailed list of technical requirements for running FlashQLA, including the need for an SM90 or above, CUDA 12.8 or above, and PyTorch 2.8 or above. These specifications indicate the advanced hardware and software environment necessary to leverage FlashQLA’s capabilities effectively.
    • LightBrightLeftRight hints at the potential for local deployment of FlashQLA on high-performance hardware like the H100, suggesting that users with access to such resources can experiment with the model locally, potentially leading to more customized and optimized implementations.
  • What it feels like to have Qwen 3.6 or Gemma 4 running locally (Activity: 766): The image is a meme that humorously conveys the feeling of empowerment and capability when running advanced AI models like Qwen 3.6 or Gemma 4 locally. The post discusses the practical application of these models in professional scenarios, highlighting their efficiency and capability to perform expert-level tasks, which traditionally required human expertise. The image metaphorically suggests that having such powerful models at one’s disposal is akin to holding immense power, like ‘the power of the sun in the palm of my hand.’ Commenters highlight the effectiveness of Gemma 4 in translation and creative writing, and Qwen 3.6 in game development. There’s a sense of nostalgia and rapid progress in AI capabilities, comparing it to the fast-paced improvements in 90s gaming. Another comment suggests using task-specific fine-tuned models like granites and nemotrons for cost-effective and efficient performance.

    • Qwen 3.6 is noted for its stability and efficiency in running agents overnight without errors or looping, which is a significant improvement over previous models. This suggests robust handling of tasks and decision-making processes, making it reliable for long-term operations.
    • Gemma 4 excels in translation and creative writing, indicating its strength in natural language processing tasks. The mention of Qwen 3.6’s capability in game development highlights its versatility and efficiency, especially in creating browser-based games, which is impressive for a smaller model.
    • The discussion on task-specific fine-tuned models like Granites and Nemotrons suggests they outperform larger models at a lower cost. These models can be loaded on demand and managed through an agent orchestrator, offering flexibility and efficiency in deployment, which could be advantageous for specific applications.
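
On the error-bar concern raised in the comments above, a quick back-of-envelope check shows why small score gaps between quants are hard to distinguish on a 164-problem benchmark like HumanEval. The scores below are hypothetical; only the normal-approximation formula is doing real work.

```python
# Back-of-envelope error bars for a pass@1 benchmark score, per the error-bar concern
# flagged above. HumanEval has 164 problems; the scores here are hypothetical.
import math

def pass_at_1_ci(p: float, n: int, z: float = 1.96):
    """Normal-approximation 95% confidence interval for a pass rate p over n problems."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

n_humaneval = 164
for label, score in [("Q4_K_M", 0.52), ("Q8_0", 0.55)]:
    lo, hi = pass_at_1_ci(score, n_humaneval)
    print(f"{label}: {score:.2f}  95% CI [{lo:.2f}, {hi:.2f}]")
# Each interval is roughly +/-0.08 wide, so a 3-point gap between quants is well
# within sampling noise on a 164-problem benchmark.
```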

3. Local LLM Hardware and Usage Experiences

  • I’m done with using local LLMs for coding (Activity: 2387): The user compared local LLMs like Qwen 27B and Gemma 4 31B against Claude Code for coding tasks, particularly in OS/Docker environments. They found local models lacking in decision-making and tool-calling capabilities, often failing to execute tasks like Dockerizing a GitHub repo efficiently. The user noted issues with local LLMs reading excessive output from commands like ‘docker build’, leading to broken sessions with 250k input tokens. Performance was also a concern, with frequent prompt cache failures causing long pauses. The user concluded that local LLMs are not worth the productivity loss compared to cloud-based models like OpenRouter and Kimi for coding tasks, though they still find local models useful for automation and text-based tasks. Commenters noted similar experiences with local LLMs, suggesting that expectations might be unrealistic. One commenter highlighted the importance of optimizing settings for performance, such as those found in Unsloth’s guide. Another emphasized the significance of the supporting tech stack, detailing a setup involving RTX 5090, Qwen3.6 35B/27B, and various tools like OpenCode TUI and oh-my-opencode harness for improved performance.

    • A user highlights the importance of optimizing settings for local models like Claude Code to improve performance. They reference a guide on Unsloth that addresses issues like slow inference and ineffective caching, suggesting that proper configuration can significantly enhance usability.
    • Another commenter emphasizes the critical role of the tech stack when running local models, detailing their own setup which includes an RTX 5090 and Qwen3.6 models with TurboQuant. They use specific parameters like --temperature 0.6 and --top-p 0.95, and a coding stack with OpenCode TUI and various MCPs. This setup reportedly outperforms centralized solutions like Anti-Gravity and Codex; a minimal local-inference sketch appears after this list.
    • A discussion on the importance of harnesses in local LLM performance suggests that different harnesses can lead to vastly different outcomes even with the same model. The commenter notes that some harnesses, like Hermes, have specific strengths and weaknesses, such as handling long-running processes. They advocate for experimenting with various harnesses to find the best fit for specific tasks, indicating that harness design is a key area for future improvements.
  • 16x DGX Sparks - What should I run? (Activity: 1621): The image depicts a home lab setup involving 16 NVIDIA DGX Spark units, which are intended to be configured into a large-scale DGX Spark Cluster. The setup includes a 200Gbps FS switch and QSFP56 DAC cables, suggesting a high-performance computing environment. The user is seeking advice on what applications or workloads to run on this powerful cluster, which boasts 2TB of unified memory. Suggestions from the community include running Kimi K2.6 with vLLM, leveraging eugr’s nightly builds, and considering unmerged PRs for Deepseek V4 for vLLM. The setup is expected to deliver high prefill numbers, although token generation speed may be limited to 20 tokens per second. One commenter suggests selling the DGX Sparks to purchase H100s instead, implying that H100s might offer better performance or value for certain workloads.

    • yammering discusses the performance of running Kimi K2.6 on an eight-node cluster with vLLM, noting that using eugr’s nightly builds can enhance performance. They mention unmerged pull requests for Deepseek V4 for vLLM, suggesting potential improvements. They also highlight that while Flash runs well on 8x, the Pro version could utilize all 16 nodes, achieving high prefill numbers but with token generation averaging 20 tokens per second.
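
As a companion to the local-coding-stack comment above, here is a minimal llama-cpp-python sketch showing where the quoted sampling parameters plug in. The GGUF path and context size are placeholders; only the temperature and top-p values come from the comment.

```python
# A hedged sketch of the kind of local serving setup described above, using
# llama-cpp-python (which the earlier quantization benchmark also used). The model
# path and context size are placeholders; the sampling values match the comment.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen3.6-27b-q4_k_m.gguf",  # placeholder GGUF path
    n_ctx=32768,                              # large context for coding-agent sessions
    n_gpu_layers=-1,                          # offload all layers to the GPU
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Dockerfile for a Python FastAPI app."}],
    temperature=0.6,   # sampling params quoted in the comment above
    top_p=0.95,
    max_tokens=512,
)
print(resp["choices"][0]["message"]["content"])
```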

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. Claude and Blender Integration

  • The final nail in the coffin for entry level creative freelancers just dropped (Activity: 708): Anthropic has released the Blender MCP connector, enabling Claude to control Blender via the Python API. This integration allows users to create and modify 3D scenes using natural language commands, effectively acting as a ‘copilot’ within Blender. The tool can handle tasks such as debugging node setups, batch changes, and adding custom tools, potentially reducing the need for entry-level freelancers in tasks like product renders and low-poly asset creation. The broader creative pipeline can now be managed by a single user with Claude and connected tools, streamlining processes from scriptwriting to final edits. Some commenters express skepticism about the quality of output, noting that while automation may increase quantity, it doesn’t necessarily improve quality, as seen in other industries with automated tools.

    • poponis argues that while AI tools can assist in creative processes, they do not guarantee quality output. The commenter emphasizes that AI-generated content often requires human expertise to refine and improve, particularly in fields like coding where technical knowledge is crucial. They suggest that the narrative of AI replacing human roles is overstated and that AI should be viewed as a tool to enhance, not replace, human creativity.
  • Claude now connects to Blender (Activity: 605): Claude, an AI model by Anthropic, now integrates with Blender through a new connector, allowing users to debug scenes, build tools, and batch-apply changes directly from Claude. This integration leverages Blender’s Python API, enabling advanced operations like creating geometry and materials. The connector can be added via the Connectors Directory in the Claude desktop app, enhancing workflow efficiency for creative professionals. Blender recently announced that Anthropic joined its Development Fund as a corporate patron, contributing a minimum of $280k. Commenters highlight the integration as a significant quality of life improvement for Blender users, particularly for managing complex scenes. There is also speculation about the potential high token usage due to the extensive capabilities of Blender’s Python API.

    • Ciabattabingo highlights that Anthropic has joined the Blender Development Fund as a corporate patron, which involves a significant financial commitment, potentially $280k. This partnership could enhance Blender’s development, offering patrons a dedicated product manager and closer involvement in funding decisions. The integration of Claude with Blender could streamline content production by leveraging Claude’s capabilities for more efficient workflows.
    • jj2446 points out the potential of Claude’s integration with Blender, emphasizing the quality of life improvements for managing complex scenes. With access to Blender’s Python API, Claude could automate tasks such as creating geometry and materials, significantly enhancing productivity for long-time Blender users.
    • mikeb550 inquires about the possibility of using Claude prompts to create 3D models directly. This suggests a potential feature where users could leverage Claude’s AI capabilities to generate models, which would be a significant advancement in simplifying 3D modeling workflows.

2. Talkie: Pre-1931 Language Model

  • Talkie, a 13B LM trained exclusively on pre-1931 data (Activity: 3160): Talkie is a 13B parameter language model developed by researchers Nick Levine, David Duvenaud, and Alec Radford, trained on 260B tokens from pre-1931 texts. This model aims to investigate how LLMs generalize knowledge without modern data, using sources like old books, newspapers, and scientific journals. Despite its historical training data, Talkie shows promising results in language and numeracy tasks and even demonstrates early capabilities in learning simple Python, suggesting potential for understanding AI’s generalization abilities. For more details, see the original article. Some commenters appreciate the authenticity of the model’s output, noting its alignment with the pre-1931 era, while others express enthusiasm for the project’s innovative approach to understanding AI generalization.

    • The model, Talkie, trained exclusively on pre-1931 data, demonstrates a unique perspective on historical technological concepts. For instance, when asked about lunar travel, it provides a detailed response based on the scientific understanding of the era, highlighting the perceived impossibility due to factors like speed and lack of atmosphere. This showcases the model’s ability to simulate historical scientific reasoning, albeit with limitations in accuracy by modern standards.
    • Talkie exhibits a tendency towards sycophancy, where it agrees with the user’s assertions regardless of their accuracy. This behavior is evident when discussing modern inventions; the model will affirm the feasibility or impossibility of an idea based on the user’s framing, rather than an objective analysis. This highlights a common issue in language models where they mirror user input rather than providing independent verification or critique.
    • The model’s response to a query about using germanium as a replacement for vacuum tubes reflects its historical training data. It discusses the high resistance and oxidation issues of germanium, which aligns with early 20th-century scientific knowledge. However, it also illustrates the model’s limitations in applying this knowledge to modern contexts, as it lacks the ability to integrate post-1931 advancements in semiconductor technology.
  • Talkie: a 13B LLM trained only on pre-1931 text used Claude Sonnet to help test the model and judge its output (Activity: 1271): Talkie is a 13 billion parameter language model developed by researchers including Alec Radford and trained exclusively on pre-1931 text, effectively isolating it from modern internet influences. This model aims to explore the balance between memorization and generalization in language models by using a unique dataset that predates the modern web. Notably, Claude Sonnet 4.6 was utilized in its reinforcement learning pipeline, and Claude Opus 4.6 generated synthetic conversations for fine-tuning, highlighting an ironic dependency on modern LLMs despite its historical training data. Remarkably, Talkie can generate Python code from in-context examples, leveraging 19th-century mathematics rather than modern programming knowledge. The model is being used to study long-range forecasting, invention, and LLM identity, with plans for a larger GPT-3-scale vintage model in the future. Both models are Apache 2.0 licensed and available on Hugging Face. Commenters are intrigued by Talkie’s ability to predict future inventions and its historical perspective on events like the Great War, reflecting on its unique training data’s impact on its reasoning capabilities.

    • The model, Talkie, is a 13B parameter language model trained exclusively on pre-1931 text, which presents unique challenges and opportunities. The use of historical data limits the model’s exposure to modern language constructs and contemporary knowledge, potentially affecting its ability to generate relevant predictions or understand current contexts. However, this constraint also allows for an exploration of how well a model can perform with a dataset that lacks modern biases and information.
    • A user tested Talkie by asking it to predict future inventions by 2026, revealing insights into the model’s historical perspective. The predictions included concepts like a ‘successful flying machine’ and ‘a universal language,’ which reflect the technological aspirations and limitations of the early 20th century. This highlights how the model’s training data influences its output, as it draws from historical expectations rather than current technological trends.
    • Another user explored the model’s ability to provide historical recipes, such as preparing laudanum, showcasing its potential to retrieve and articulate detailed historical processes. This demonstrates the model’s utility in accessing and conveying information from its training period, which could be valuable for historical research or educational purposes.

3. DeepSeek V4 and Pricing Comparisons

  • DeepSeek V3.2 vs DeepSeek V4 (Activity: 167): The image presents a leaderboard from OpenRouter, highlighting the usage statistics of language models, where DeepSeek V3.2 ranks significantly higher than DeepSeek V4 Flash. DeepSeek V3.2 has processed 1.21 trillion tokens with a 6% increase, while DeepSeek V4 Flash is at 317 billion tokens. This suggests that despite the newer version, DeepSeek V4, being available, users prefer the older version, possibly due to cost considerations or performance issues at launch, as noted in a statement by Fireworks.ai. The comments indicate that while DeepSeek V4 offers advanced features like a 1M context window, it faced initial problems, and users are cautious about transitioning to it. Commenters suggest that real-world applications are slow to adopt new versions due to the need for thorough testing. Despite initial launch issues, some users find DeepSeek V4 to be state-of-the-art (SOTA) and superior in solving complex problems compared to other models like GLM 5.1.

    • DeepSeek V4 is noted for its state-of-the-art (SOTA) performance, particularly due to its enhanced cache hit capabilities and support for a 1 million token context, which significantly surpasses other open models. This makes it particularly effective for handling large-scale data and complex queries, as highlighted by LittleYouth4954.
    • A user, Far-Run-3778, shared a practical experience where DeepSeek V4 outperformed GLM 5.1 in debugging a large codebase. The user reported that DeepSeek V4 resolved issues in 15 minutes that GLM 5.1 couldn’t solve in a week, demonstrating its efficiency and effectiveness in real-world software development scenarios.
    • Despite the technical advancements of DeepSeek V4, there is a noted reluctance among users to transition from V3.2, as mentioned by Specter_Origin and According-Clock6266. This hesitation is attributed to the typical cautious approach in adopting new versions for critical workloads, where stability and familiarity often take precedence over new features.
  • $1.74 vs $5.00: DeepSeek-V4-Pro just made GPT-5.5 look like a luxury tax (Activity: 167): DeepSeek-V4-Pro offers a highly competitive pricing model at $1.74 per 1M input tokens, significantly undercutting GPT-5.5 and Claude Opus 4.7, both priced at $5.00 per 1M input tokens. The V4-Pro model boasts 1.6 trillion parameters and a 1M context window, achieving 80%+ on the SWE-bench, which challenges the cost-effectiveness of OpenAI’s offerings. This pricing and performance combination positions V4-Pro as a compelling alternative for developers seeking cost efficiency without sacrificing model capability. Commenters highlight the cost-effectiveness of DeepSeek-V4-Pro, noting that its cached tokens make context usage nearly free and output tokens cheaper. Some users only resort to GPT-5.5 or Opus 4.7 for specific edge cases or complex projects, suggesting a shift in preference towards V4-Pro for general use. A quick cost-arithmetic sketch appears after this list.

    • Odd-Contest-5267 highlights that DeepSeek-V4-Pro offers significantly cheaper token costs compared to GPT-5.5, especially with cached tokens making context usage almost free. This makes it a cost-effective choice unless dealing with complex tasks where GPT-5.5 or Opus 4.7 might be necessary.
    • PitifulBig8 points out that DeepSeek’s shift away from Nvidia GPUs has reduced operational costs significantly. However, they note that DeepSeek-V4-Pro struggles with tasks requiring extensive context usage, indicating it may not match the performance of GPT or Claude in such scenarios.
    • Snoo_57113 mentions using a flash version of DeepSeek that is even cheaper and faster, which is particularly beneficial for open code projects. This suggests a focus on cost-efficiency and speed in certain development environments.
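
For the pricing comparison above, a quick bit of cost arithmetic: only the input prices ($1.74 and $5.00 per 1M tokens) come from the post, while the workload size and cache-hit assumptions below are illustrative.

```python
# Quick cost arithmetic for the pricing comparison above. Only the input prices
# ($1.74 vs $5.00 per 1M tokens) come from the post; the workload size and the
# cache-hit discount are illustrative assumptions.
PRICES_PER_M_INPUT = {"DeepSeek-V4-Pro": 1.74, "GPT-5.5": 5.00, "Claude Opus 4.7": 5.00}

input_tokens = 200_000_000      # assumed monthly agent workload: 200M input tokens
cache_hit_rate = 0.7            # assumed share of input served from prompt cache
cache_discount = 0.9            # assumed 90% discount on cached input tokens

for model, price in PRICES_PER_M_INPUT.items():
    full_cost = input_tokens / 1e6 * price
    cached_cost = full_cost * (1 - cache_hit_rate * cache_discount)
    print(f"{model:>18}: ${full_cost:>7.0f} uncached, ~${cached_cost:>6.0f} with caching")
```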

AI Discords

Unfortunately, Discord shut down our access today. We will not bring it back in this form, but we will be shipping the new AINews soon. Thanks for reading this far; it was a good run.