Topic: "ai-evaluation"
not much happened today
Tags: gpt-5.4, gpt-5.2, gemini-3.1-pro, openai, artificial-analysis, gemini, claude, mit, figma, github, benchmarking, physics-reasoning, agentic-coding, hallucination-detection, context-windows, cost-efficiency, agent-prompting, scheduled-tasks, loop-patterns, ai-evaluation, design-code-integration, agent-orchestration, open-source
OpenAI rolled out GPT-5.4, which ties Gemini 3.1 Pro Preview for #1 on the Artificial Analysis Intelligence Index with a score of 57 (up from 51 for GPT-5.2 xhigh). GPT-5.4 features a larger ~1.05M-token context window and higher per-token prices ($2.50/$15 vs. $1.75/$14 for GPT-5.2), with strengths in physics reasoning (CritPt) and agentic coding (TerminalBench Hard) but a higher hallucination rate and ~28% higher benchmark run cost. The GPT-5.4 Pro variant posts a +10-point jump on CritPt, reaching 30%, but at an extreme output-token cost of $180/1M tokens. Community benchmarks show GPT-5.4 excels at agentic/coding tasks, with mixed feedback on reasoning efficiency and literalness compared to Claude. OpenAI updated its agent prompting guidance for GPT-5.4 API users, emphasizing tool use, structured outputs, and verification loops. Claude Code added local scheduled tasks and loop patterns for agents. The MCP framework is highlighted as connective tissue for AI evaluation and design-code round-trips, with Truesight MCP enabling unit-test-style AI evaluation and the Figma MCP server supporting bidirectional design-code integration. The open-source T3 Code launched as an agent-orchestration coding app built on the Codex CLI.
Cognition vs. Anthropic: "Don't Build Multi-Agents" / "How to Build Multi-Agents"
Tags: claude, cognition, anthropic, langchain, huggingface, microsoft, llamaindex, linkedin, blackrock, multi-agent-systems, context-engineering, agent-memory, model-elicitation, ai-evaluation, deep-research-workflows, framework-migration, pydantic-schema, walden_yan, hwchase17, assaf_elovic, sh_reya, hamelhusain, omarsar0, clefourrier, jerryjliu0, akbirkhan
Within the last 24 hours, Cognition's Walden Yan advised "Don't Build Multi-Agents," while Anthropic shared its approach to building multi-agent systems via Claude's multi-agent research architecture. LangChain highlighted advances in context engineering and production AI agents in use at LinkedIn and BlackRock, and the community is actively debating multi-agent AI development. Separately, Hugging Face announced it is deprecating TensorFlow and Flax support in favor of PyTorch. Research on agent memory and model-elicitation techniques from LlamaIndex and Anthropic was also discussed.