Person: "htihle"
not much happened today
qwen-3.5-0.8b qwen-3.5-2b qwen-3.5-4b qwen-3.5-9b codex-5.3 claude-3 alibaba ollama lm-studio openai anthropic multimodality reinforcement-learning long-context hybrid-attention on-device-ai model-deployment agent-reliability agent-observability coding-agents benchmarking runtime-optimization token-efficiency nrehiew_ kimmonismus lioronai danielhanchen theo htihle teortaxestex theprimeagen yuchenj_uw _lewtun saen_dev _philschmid omarsar0
Alibaba released the Qwen 3.5 series, with models ranging from 0.8B to 9B parameters that feature native multimodality and scaled reinforcement learning and that target edge and lightweight agent deployments. The models support long context windows of up to 262K tokens (extendable to 1M) and use a novel Gated DeltaNet hybrid attention architecture that combines linear and full attention layers. Deployment examples include Ollama and LM Studio, along with a notable 6-bit on-device demo on iPhone 17 Pro. Evaluators are cautioned that reasoning is disabled by default on the smaller models. In coding agents, Codex 5.3 posts a promising 79.3% accuracy on the WeirdML benchmark, though availability and downtime remain critical challenges, as Claude outages highlighted. Agent reliability and observability are emphasized as cross-functional problems that require clear success criteria and practical evaluation strategies. Studies show that AGENTS.md and SKILL.md guardrails can significantly reduce runtime and token usage by mitigating worst-case thrashing in coding workflows.
not much happened today
gemini-3.1-pro gpt-5.2 opus-4.6 sonnet-4.6 claude-opus-4.6 google-deepmind anthropic context-arena artificial-analysis epoch-ai scaling01 retrieval benchmarking evaluation-methodology token-limits cost-efficiency instruction-following software-reasoning model-reliability dillonuzar artificialanlys yuchenj_uw theo minimax_ai epochairesearch paul_cal scaling01 metr_evals idavidrein xlr8harder htihle arena
Gemini 3.1 Pro demonstrates strong retrieval capabilities and cost efficiency compared to GPT-5.2 and Opus 4.6, though users report tooling and UI issues. The SWE-bench Verified evaluation methodology is under scrutiny for consistency, with updates bringing results closer to developer claims. Debate continues over what benchmarks truly measure for frontier models, especially ARC-AGI puzzles. Claude Opus 4.6 shows a noisy but notable 14.5-hour time horizon on software tasks, with token limits causing practical failures. Sonnet 4.6 improves significantly on code and instruction-following benchmarks, but user backlash is growing over product regressions.