All tags
Person: "htihle"
not much happened today
gpt-5.5 gpt-5.4 opus-4.7 mimo-v2.5-pro mimo-v2.5 kimi-k2.6 codex copilot openai microsoft google amazon github xiaomi openai-devs vllm_project kimi-moonshot model-distribution cloud-computing benchmarking usage-based-billing model-orchestration open-source large-context-models agent-scaling coding model-training fp8 attention-mechanisms multi-agent-systems sama scaling01 kimmonismus ajassy simonw htihle arena gdb hangsiin eliebakouch _luofuli teortaxestex
OpenAI loosens its Azure exclusivity, allowing distribution across Google TPU, AWS Trainium, and Bedrock with commitments through 2032 and revenue share through 2030. GPT-5.5 shows improved benchmarks but is not uniformly dominant, ranking variably across coding, document, math, and vision tasks. GitHub's Copilot shifts to usage-based billing starting June 1, reflecting increased runtime costs. OpenAI open-sourced Symphony, an orchestration layer for issue tracking and Codex agents. Xiaomi released MiMo-V2.5 and MiMo-V2.5-Pro, large context models with up to 1M-token context and trillions of tokens trained, emphasizing complex agent and omni-modal capabilities. Kimi K2.6 leads OpenRouter's leaderboard, noted for coding and long-horizon agent capabilities with large-scale sub-agent coordination.
not much happened today
gpt-5.4 openai anthropic uber nous-research cursor_ai redisinc artificialanlys langchain-js agent-infrastructure mcp-protocol harnesses coding-agents evaluation-methodologies agent-ui-ux runtime-environments multi-axis-evaluation automation workflow-optimization open-agent-platforms provider-integration filesystem-checkpoints mattturck hwchase17 omarsar0 gergelyorosz htihle theprimeagen sydneyrunkle corbtt
Harnesses, agent infrastructure, and the MCP protocol are central themes, with emphasis on how harnesses, sandboxes, filesystem access, skills, memory, and observability shape agent UI/UX and runtime environments. Despite jokes about MCP's demise, it remains vital in production, notably used internally by Uber and supported by Anthropic. The coding-agent stack is evolving with CursorBench combining offline and online metrics to evaluate models on intelligence and efficiency, where GPT-5.4 leads in correctness and token efficiency. Agent-assisted development is splitting between automation-heavy workflows and "stay-in-the-loop" tooling, with OpenAI advancing Codex Automations featuring worktree vs. branch choices and UI customization. The open agent platform Hermes Agent v0.2.0 introduces full MCP client support, ACP server for editors, and expanded provider integrations including OpenAI OAuth.
not much happened today
qwen-3.5-0.8b qwen-3.5-2b qwen-3.5-4b qwen-3.5-9b codex-5.3 claude-3 alibaba ollama lm-studio openai anthropic multimodality reinforcement-learning long-context hybrid-attention on-device-ai model-deployment agent-reliability agent-observability coding-agents benchmarking runtime-optimization token-efficiency nrehiew_ kimmonismus lioronai danielhanchen theo htihle teortaxestex theprimeagen yuchenj_uw _lewtun saen_dev _philschmid omarsar0
Alibaba released the Qwen 3.5 series with models ranging from 0.8B to 9B parameters, featuring native multimodality, scaled reinforcement learning, and targeting edge and lightweight agent deployments. The models support very long context windows up to 262K tokens (extendable to 1M) and use a novel Gated DeltaNet hybrid attention architecture combining linear and full attention layers. Deployment examples include Ollama and LM Studio, with a notable 6-bit on-device demo on iPhone 17 Pro. Evaluators are cautioned that reasoning is disabled by default on smaller models. In coding agents, Codex 5.3 shows promising benchmark results on WeirdML with 79.3% accuracy, though availability and downtime remain critical challenges, especially highlighted by Claude outages. Agent reliability and observability are emphasized as cross-functional problems requiring clear success criteria and practical evaluation strategies. Studies show that using AGENTS.md and SKILL.md guardrails can significantly reduce runtime and token usage by mitigating worst-case thrashing in coding workflows.
not much happened today
gemini-3.1-pro gpt-5.2 opus-4.6 sonnet-4.6 claude-opus-4.6 google-deepmind anthropic context-arena artificial-analysis epoch-ai scaling01 retrieval benchmarking evaluation-methodology token-limits cost-efficiency instruction-following software-reasoning model-reliability dillonuzar artificialanlys yuchenj_uw theo minimax_ai epochairesearch paul_cal scaling01 metr_evals idavidrein xlr8harder htihle arena
Gemini 3.1 Pro demonstrates strong retrieval capabilities and cost efficiency compared to GPT-5.2 and Opus 4.6, though users report tooling and UI issues. The SWE-bench Verified evaluation methodology is under scrutiny for consistency, with updates bringing results closer to developer claims. Benchmarking debates arise over what frontier models truly measure, especially with ARC-AGI puzzles. Claude Opus 4.6 shows a noisy but notable 14.5-hour time horizon on software tasks, with token limits causing practical failures. Sonnet 4.6 improves significantly in code and instruction-following benchmarks, but user backlash grows due to product regressions.