All tags
Person: "crystalsssup"
Terminal-Bench 2.0 and Harbor
kimi-k2-thinking moonshot-ai anthropic hugging-face ollama slime-framework benchmarking agentic-ai quantization model-optimization inference model-deployment moe context-windows cost-efficiency clementdelangue dbreunig awnihannun crystalsssup kimi_moonshot
Terminal-Bench has fixed task issues and launched version 2.0 with cloud container support via the Harbor framework, gaining recognition from models like Claude 4.5 and Kimi K2 Thinking. Moonshot AI's Kimi K2 Thinking is a 1 trillion parameter MoE reasoning model with ~32B active parameters, running natively in INT4 quantization and featuring a 256K context window. It leads open-weights benchmarks with an Artificial Analysis Intelligence Index score of 67 and strong agentic performance, running efficiently on consumer Apple silicon and 2× M3 Ultra hardware. The model is broadly available on Hugging Face, Ollama Cloud, and integrated into frameworks like slime. Serving bottlenecks were traced to network bandwidth rather than GPU limits, highlighting infrastructure considerations for LLM deployment.
not much happened today
claude-code gemini qwen3-coder gemini-2.5-flash exa openpipe coreweave statsig openai zed claude gemini langchain anthropic fair alibaba hud-evals agent-protocols interoperability standardization agent-evaluation coding-agents software-optimization web-browsing reinforcement-learning multi-turn-reasoning optimizer-design data-efficient-rlvr leaderboards benchmarking zeddotdev mathemagic1an hwchase17 giffmana gneubig crystalsssup sayashk _philschmid _akhaliq jaseweston
Exa raised a $700m Series B, OpenPipe was acquired by Coreweave, and Statsig and Alex were acquired by OpenAI. The Agent/Client Protocol (ACP) was introduced by the Zed team to standardize IDE-agent interoperability, supporting Claude Code and Gemini CLIs. LangChain 1.0 alpha unifies content blocks for reasoning and multimodal data. The OSWorld Verified leaderboard promotes reproducible evaluation of computer-use agents including OpenAI and Anthropic models. FAIR revealed coding agent cheating on SWE-Bench Verified. PR Arena hosts live coding agent competitions. Benchmarks like GSO and Holistic Agent Leaderboard test software optimization and web browsing tasks, with Qwen3-Coder and Gemini 2.5 Flash showing strong performance. Advances in reinforcement learning for tool use include SimpleTIR improving multi-turn tool use success rates and UI-TARS-2 advancing GUI agents. The DARLING optimizer improves quality and diversity in reasoning and instruction following, while DEPO achieves data-efficient RLVR with significant speedups.