Person: "htihle"

gpt-5.6 claude-fable-5 openai model-stratification agentic-coding presentation benchmarking orchestration computer-use gui-automation reward-hacking instruction-following usage-limits model-costs reach_vb rasbt yuchenj_uw scaling01 simonw kimmonismus thsottiaux htihle teortaxestex mononofu omarsar0 hangsiin gdb mckbrando evi77ain

OpenAI rolled out GPT-5.6 with a new model stratification: tiers Luna / Terra / Sol and effort levels that include Max and Ultra, which makes configuration more complex. The launch hit UX problems around the ChatGPT Work / Codex split, and OpenAI responded quickly with usage-limit resets and UI improvements. Early benchmarks show GPT-5.6 excelling in agentic coding, presentation, and science tasks: it ties Claude Fable 5 in Code Arena Frontend at about half the cost and posts a 500-point Elo gain in presentations. Users, however, reported instruction-following issues and raised concerns about jailbreakability. The major advance is in orchestration and computer use, where Sol Ultra shows strong planner and verifier capabilities that enable high-throughput automation workflows. One operational pitfall is hidden cost growth: spawned subagents inherit premium settings, so quota depletes faster than expected.
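To put the 500-point Elo gain in perspective, here is a minimal sketch that converts a rating gap into an expected head-to-head score. It assumes the standard logistic Elo formula with a 400-point scale; the summary does not say which rating variant the arena uses, and the function name `elo_expected_score` is purely illustrative.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score for the higher-rated side, given its rating lead.

    Assumes the standard logistic Elo model (base 10, 400-point scale).
    A tie counts as half a win, so a gap of 0 gives 0.5.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


if __name__ == "__main__":
    # Compare a few rating gaps, including the 500-point gap from the summary.
    for gap in (0, 100, 500):
        print(f"gap={gap:>3} -> expected score {elo_expected_score(gap):.3f}")
```

Under that assumption, a 500-point lead corresponds to an expected score of about 0.95 against the earlier baseline, which is why a gain of this size reads as a step change rather than noise.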

Apr 27

not much happened today

gpt-5.5 gpt-5.4 opus-4.7 mimo-v2.5-pro mimo-v2.5 kimi-k2.6 codex copilot openai microsoft google amazon github xiaomi openai-devs vllm_project kimi-moonshot model-distribution cloud-computing benchmarking usage-based-billing model-orchestration open-source large-context-models agent-scaling coding model-training fp8 attention-mechanisms multi-agent-systems sama scaling01 kimmonismus ajassy simonw htihle arena gdb hangsiin eliebakouch _luofuli teortaxestex

OpenAI loosens its Azure exclusivity, allowing distribution across Google TPU, AWS Trainium, and Bedrock, with commitments through 2032 and revenue share through 2030. GPT-5.5 shows improved benchmarks but is not uniformly dominant, ranking variably across coding, document, math, and vision tasks. GitHub Copilot shifts to usage-based billing starting June 1, reflecting increased runtime costs. OpenAI open-sourced Symphony, an orchestration layer for issue tracking and Codex agents. Xiaomi released MiMo-V2.5 and MiMo-V2.5-Pro, large-context models with up to 1M tokens of context, trained on trillions of tokens, with an emphasis on complex agent and omni-modal capabilities. Kimi K2.6 leads OpenRouter's leaderboard and is noted for coding and long-horizon agent capabilities with large-scale sub-agent coordination.

Mar 12

not much happened today

gpt-5.4 openai anthropic uber nous-research cursor_ai redisinc artificialanlys langchain-js agent-infrastructure mcp-protocol harnesses coding-agents evaluation-methodologies agent-ui-ux runtime-environments multi-axis-evaluation automation workflow-optimization open-agent-platforms provider-integration filesystem-checkpoints mattturck hwchase17 omarsar0 gergelyorosz htihle theprimeagen sydneyrunkle corbtt

Harnesses, agent infrastructure, and MCP are the central themes, with emphasis on how harnesses, sandboxes, filesystem access, skills, memory, and observability shape agent UI/UX and runtime environments. Despite jokes about MCP's demise, it remains vital in production: Uber uses it internally and Anthropic supports it. The coding-agent stack is evolving, with CursorBench combining offline and online metrics to evaluate models on intelligence and efficiency; GPT-5.4 leads in correctness and token efficiency. Agent-assisted development is splitting between automation-heavy workflows and "stay-in-the-loop" tooling, with OpenAI advancing Codex Automations that offer worktree vs. branch choices and UI customization. The open agent platform Hermes Agent v0.2.0 introduces full MCP client support, an ACP server for editors, and expanded provider integrations including OpenAI OAuth.

Mar 02

not much happened today

qwen-3.5-0.8b qwen-3.5-2b qwen-3.5-4b qwen-3.5-9b codex-5.3 claude-3 alibaba ollama lm-studio openai anthropic multimodality reinforcement-learning long-context hybrid-attention on-device-ai model-deployment agent-reliability agent-observability coding-agents benchmarking runtime-optimization token-efficiency nrehiew_ kimmonismus lioronai danielhanchen theo htihle teortaxestex theprimeagen yuchenj_uw _lewtun saen_dev _philschmid omarsar0

Alibaba released the Qwen 3.5 series, with models ranging from 0.8B to 9B parameters that feature native multimodality and scaled reinforcement learning and target edge and lightweight agent deployments. The models support very long context windows of up to 262K tokens (extendable to 1M) and use a novel Gated DeltaNet hybrid attention architecture that combines linear and full attention layers. Deployment examples include Ollama and LM Studio, with a notable 6-bit on-device demo on iPhone 17 Pro. Evaluators are cautioned that reasoning is disabled by default on the smaller models. In coding agents, Codex 5.3 shows promising results on WeirdML with 79.3% accuracy, though availability and downtime remain critical challenges, as Claude outages highlighted. Agent reliability and observability are emphasized as cross-functional problems that require clear success criteria and practical evaluation strategies. Studies show that AGENTS.md and SKILL.md guardrails can significantly reduce runtime and token usage by mitigating worst-case thrashing in coding workflows.
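A rough way to see why 6-bit weights matter for on-device use is to estimate weight storage as parameters times bits per weight, divided by 8. The sketch below applies that arithmetic to the nominal sizes in the series name. It is an approximation under stated assumptions: it counts weights only (no KV cache, activations, or quantization metadata), it treats the nominal sizes as exact parameter counts, and the summary does not say which model size ran in the iPhone demo. The names `weight_memory_gb` and `NOMINAL_SIZES` are illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts taken from the model names; real counts may differ.
NOMINAL_SIZES = {"0.8B": 0.8e9, "2B": 2e9, "4B": 4e9, "9B": 9e9}

if __name__ == "__main__":
    for name, params in NOMINAL_SIZES.items():
        fp16 = weight_memory_gb(params, 16)
        q6 = weight_memory_gb(params, 6)
        print(f"{name:>4}: 16-bit ~{fp16:.2f} GB, 6-bit ~{q6:.2f} GB")
```

By this estimate the nominal 9B model drops from about 18 GB of weights at 16 bits to about 6.75 GB at 6 bits, and the smaller models fall well below that.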

Feb 21

not much happened today

gemini-3.1-pro gpt-5.2 opus-4.6 sonnet-4.6 claude-opus-4.6 google-deepmind anthropic context-arena artificial-analysis epoch-ai scaling01 retrieval benchmarking evaluation-methodology token-limits cost-efficiency instruction-following software-reasoning model-reliability dillonuzar artificialanlys yuchenj_uw theo minimax_ai epochairesearch paul_cal scaling01 metr_evals idavidrein xlr8harder htihle arena

Gemini 3.1 Pro demonstrates strong retrieval capabilities and cost efficiency compared to GPT-5.2 and Opus 4.6, though users report tooling and UI issues. The SWE-bench Verified evaluation methodology is under scrutiny for consistency, with updates bringing results closer to developer claims. Debate continues over what benchmarks actually measure for frontier models, especially ARC-AGI puzzles. Claude Opus 4.6 shows a noisy but notable 14.5-hour time horizon on software tasks, with token limits causing practical failures. Sonnet 4.6 improves significantly on code and instruction-following benchmarks, but user backlash is growing over product regressions.