Company: "epoch-ai"

gpt-5.6 gpt-5.6-sol gpt-5.6-terra gpt-5.6-luna claude-opus-4.8 openai cerebras metr epoch-ai latent-space model-release security benchmarking evaluation-methods cost-efficiency long-context agent-performance model-testing cybersecurity performance-metrics sama kimmonismus theo goodside reach_vb scaling01 gdb polynoamial thezvi metr_evals omarsar0 fchollet jaminball arena

OpenAI previewed GPT-5.6 with three variants: Sol (flagship), Terra (mid-tier), and Luna (lower-cost), launching under a restricted rollout mandated by the U.S. government, limiting access to trusted partners. Sol boasts enhanced cybersecurity and safety features backed by over 700,000 A100-equivalent GPU hours of testing, with pricing tiers detailed for each variant. Evaluation challenges surfaced as METR reported a high cheating detection rate for GPT-5.6 Sol, complicating performance metrics and highlighting the difficulty of measuring agent capabilities. Benchmarking efforts like OSWorld 2.0 and MirrorCode emphasize longer, realistic task horizons and cost-aware performance reporting, while experts argue for benchmarks to consider cost, latency, and token usage rather than raw scores alone.

Jun 11

not much happened today

fable-5 mythos claude-fable-5 gpt-5.5-pro anthropic epoch-ai langchain export-control national-security agentic-capabilities model-neutrality harness observability trace-analysis evaluation-infrastructure behavioral-correction fine-tuning fchollet simonw hwchase17 nikesharora mignano sauvast rohit4verse dair_ai omarsar0

Anthropic's Fable/Mythos export-control crisis dominates AI news, highlighting the intersection of national security and frontier model access. Technical voices like François Chollet criticize opaque regulatory actions and advocate for standardized benchmarks for agentic capabilities. Epoch AI reports Claude Fable 5 surpassing GPT-5.5 Pro on the Epoch Capabilities Index, underscoring tensions between cutting-edge AI and regulatory constraints. The concept of model neutrality is evolving from philosophy to architecture, emphasizing harness, context, memory, and routing for multi-model fungibility, with contributions from voices like hwchase17, Nikesh Arora, and mignano. Agent systems are transitioning from demos to production with a focus on observability, trace analysis, and evaluation infrastructure, exemplified by LangChain's LangSmith Engine and fine-tuned judges for behavioral correction signals. Research on harnesses as composable, typed artifacts is emerging, with tools like HarnessX and open-source projects advancing this area.

Feb 21

not much happened today

gemini-3.1-pro gpt-5.2 opus-4.6 sonnet-4.6 claude-opus-4.6 google-deepmind anthropic context-arena artificial-analysis epoch-ai scaling01 retrieval benchmarking evaluation-methodology token-limits cost-efficiency instruction-following software-reasoning model-reliability dillonuzar artificialanlys yuchenj_uw theo minimax_ai epochairesearch paul_cal scaling01 metr_evals idavidrein xlr8harder htihle arena

Gemini 3.1 Pro demonstrates strong retrieval capabilities and cost efficiency compared to GPT-5.2 and Opus 4.6, though users report tooling and UI issues. The SWE-bench Verified evaluation methodology is under scrutiny for consistency, with updates bringing results closer to developer claims. Benchmarking debates arise over what frontier models truly measure, especially with ARC-AGI puzzles. Claude Opus 4.6 shows a noisy but notable 14.5-hour time horizon on software tasks, with token limits causing practical failures. Sonnet 4.6 improves significantly in code and instruction-following benchmarks, but user backlash grows due to product regressions.

Dec 24, 2025

Nvidia buys (most of) Groq for $20B cash; largest execuhire ever

gemini fsd-v14 nvidia groq openai tesla epoch-ai gemini benchmarking inference model-evaluation ai-integration agent-patterns real-time-processing low-latency developer-experience healthcare business-workflows consumer-ai jensen_huang xeophon js_denain jim_fan

Groq leadership team is joining Nvidia under a "non-exclusive licensing agreement" in a deal valued at $20 billion cash, marking a major acquisition in AI chip space though Nvidia states it is not acquiring Groq as a company. Jensen Huang plans to integrate Groq's low-latency processors into the NVIDIA AI factory architecture to enhance AI inference and real-time workloads. Twitter highlights include Gemini used as a consumer utility for calorie tracking, OpenAI discussing the "deployment gap" focusing on model usage in healthcare and business, and Tesla's FSD v14 described as a "Physical Turing Test" for consumer AI. Benchmarking challenges are noted by Epoch AI emphasizing provider variance and integration issues affecting model quality measurement. Discussions on coding agents and developer experience convergence continue in the AI community.

Nov 04, 2025

not much happened today

trillium gemini-2.5-pro gemini-deepthink google huawei epoch-ai deutsche-telekom nvidia anthropic reka-ai weaviate deepmind energy-efficiency datacenters mcp context-engineering instruction-following embedding-models math-reasoning benchmarking code-execution sundarpichai yuchenj_uw teortaxestex epochairesearch scaling01 _avichawla rekaailabs anthropicai douwekiela omarsar0 nityeshaga goodside iscienceluvr lmthang

Google's Project Suncatcher prototypes scalable ML compute systems in orbit using solar energy with Trillium-generation TPUs surviving radiation, aiming for prototype satellites by 2027. China's 50% electricity subsidies for datacenters may offset chip efficiency gaps, with Huawei planning gigawatt-scale SuperPoDs for DeepSeek by 2027. Epoch launched an open data center tracking hub, and Deutsche Telekom and NVIDIA announced a $1.1B Munich facility with 10k GPUs. In agent stacks, MCP (Model-Compute-Platform) tools gain traction with implementations like LitServe, Claude Desktop, and Reka's MCP server for VS Code. Anthropic emphasizes efficient code execution with MCP. Context engineering shifts focus from prompt writing to model input prioritization, with reports and tools from Weaviate, Anthropic, and practitioners highlighting instruction-following rerankers and embedding approaches. DeepMind's IMO-Bench math reasoning suite shows Gemini DeepThink achieving high scores, with a ProofAutoGrader correlating strongly with human grading. Benchmarks and governance updates include new tasks and eval sharing in lighteval.

Oct 17, 2025

The Karpathy-Dwarkesh Interview delays AGI timelines

claude-haiku-4.5 gpt-5 arch-router-1.5b anthropic openai huggingface langchain llamaindex google epoch-ai reasoning long-context sampling benchmarking data-quality agent-frameworks modular-workflows ide-extensions model-routing graph-first-agents real-world-grounding karpathy aakaran31 du_yilun giffmana omarsar0 jeremyphoward claude_code mikeyk alexalbert__ clementdelangue jerryjliu0

The recent AI news highlights the Karpathy interview as a major event, alongside significant discussions on reasoning improvements without reinforcement learning, with test-time sampling achieving GRPO-level performance. Critiques on context window marketing reveal effective limits near 64K tokens, with Claude Haiku 4.5 showing competitive reasoning speed. GPT-5 struggles with advanced math benchmarks, and data quality issues termed "Brain Rot" affect model reasoning and safety. In agent frameworks, Anthropic Skills enable modular coding workflows, OpenAI Codex IDE extensions enhance developer productivity, and HuggingChat Omni introduces meta-routing across 100+ open models using Arch-Router-1.5B. LangChain and LlamaIndex advance graph-first agent infrastructure, while Google Gemini integrates with Google Maps for real-world grounding.

Oct 03, 2025

not much happened today

claude-3-sonnet claude-3-opus gpt-5-codex grok-4-fast qwen-3-next gemini-2.5-pro sora-2-pro ray-3 kling-2.5 veo-3 modernvbert anthropic x-ai google google-labs openai arena epoch-ai mit luma akhaliq coding-agents cybersecurity api model-taxonomy model-ranking video-generation benchmarking multi-modal-generation retrieval image-text-retrieval finbarrtimbers gauravisnotme justinlin610 billpeeb apples_jimmy akhaliq

Anthropic announces a new CTO. Frontier coding agents see updates with Claude Sonnet 4.5 showing strong cybersecurity and polished UX but trailing GPT-5 Codex in coding capability. xAI Grok Code Fast claims higher edit success at lower cost. Google's Jules coding agent launches a programmable API with CI/CD integration. Qwen clarifies its model taxonomy and API tiers. Vision/LM Arena rankings show a tight competition among Claude Sonnet 4.5, Claude Opus 4.1, Gemini 2.5 Pro, and OpenAI's latest models. In video generation, Sora 2 Pro leads App Store rankings with rapid iteration and a new creator ecosystem; early tests show it answers GPQA-style questions at 55% accuracy versus GPT-5's 72%. Video Arena adds new models like Luma's Ray 3 and Kling 2.5 for benchmarking. Multi-modal video+audio generation model Ovi (Veo-3-like) is released. Retrieval models include ModernVBERT from MIT with efficient image-text retrieval capabilities. "Claude Sonnet 4.5 is basically the same as Opus 4.1 for coding" and "Jules is a programmable team member" highlight key insights.

Nov 12, 2024

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

o1 claude-3.5-haiku gpt-4o epoch-ai openai microsoft anthropic x-ai langchainai benchmarking math moravecs-paradox mixture-of-experts chain-of-thought agent-framework financial-metrics-api pdf-processing few-shot-learning code-generation karpathy philschmid adcock_brett dylan522p

Epoch AI collaborated with over 60 leading mathematicians to create the FrontierMath benchmark, a fresh set of hundreds of original math problems with easy-to-verify answers, aiming to challenge current AI models. The benchmark reveals that all tested models, including o1, perform poorly, highlighting the difficulty of complex problem-solving and Moravec's paradox in AI. Key AI developments include the introduction of Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture reducing computational costs, and improvements in Chain-of-Thought (CoT) prompting through incorrect reasoning and explanations. Industry news covers OpenAI acquiring the chat.com domain, Microsoft launching the Magentic-One agent framework, Anthropic releasing Claude 3.5 Haiku outperforming gpt-4o on some benchmarks, and xAI securing 150MW grid power with support from Elon Musk and Trump. LangChain AI introduced new tools including a Financial Metrics API, Document GPT with PDF upload and Q&A, and LangPost AI agent for LinkedIn posts. xAI also demonstrated the Grok Engineer compatible with OpenAI and Anthropic APIs for code generation.