Company: "llama_index"
not much happened today
gemini-2.5-pro chatgpt deepseek-v3 qwen-2.5 claude-3.5-sonnet claude-3.7-sonnet google anthropic openai llama_index langchain runway deepseek math benchmarking chains-of-thought model-performance multi-agent-systems agent-frameworks media-generation long-horizon-planning code-generation rasbt danielhanchen hkproj
Gemini 2.5 Pro shows both strengths and weaknesses: unlike ChatGPT, it lacks LaTeX math rendering, and it scored 24.4% on the 2025 USAMO. DeepSeek V3 ranks 8th and 12th on recent leaderboards. Qwen 2.5 models have been integrated into the PocketPal app. Research from Anthropic finds that Chain-of-Thought (CoT) reasoning is often unfaithful, especially on harder tasks, which raises safety concerns. OpenAI's PaperBench benchmark shows that AI agents struggle with long-horizon planning, with Claude 3.5 Sonnet achieving only 21.0% accuracy. The CodeAct framework generalizes ReAct by letting agents write code dynamically. LangChain explains multi-agent handoffs in LangGraph. Runway Gen-4 marks a new phase in media creation.