Topic: "retrieval"
not much happened today
gemini-3.1-pro gpt-5.2 opus-4.6 sonnet-4.6 claude-opus-4.6 google-deepmind anthropic context-arena artificial-analysis epoch-ai scaling01 retrieval benchmarking evaluation-methodology token-limits cost-efficiency instruction-following software-reasoning model-reliability dillonuzar artificialanlys yuchenj_uw theo minimax_ai epochairesearch paul_cal scaling01 metr_evals idavidrein xlr8harder htihle arena
Gemini 3.1 Pro demonstrates strong retrieval capabilities and cost efficiency compared to GPT-5.2 and Opus 4.6, though users report tooling and UI issues. The SWE-bench Verified evaluation methodology is under scrutiny for consistency, with updates bringing results closer to developer claims. Debate continues over what benchmarks of frontier models truly measure, especially ARC-AGI puzzles. Claude Opus 4.6 shows a noisy but notable 14.5-hour time horizon on software tasks, with token limits causing practical failures. Sonnet 4.6 improves significantly on code and instruction-following benchmarks, but user backlash grows over product regressions.
not much happened today
7m-tiny-recursive-model jamba-reasoning-3b qwen3-omni qwen-image-edit-2509 colbert-nano agentflow samsung lecuun ai21-labs alibaba coreweave weights-biases openpipe stanford recursive-reasoning density-estimation multimodality long-context retrieval serverless-reinforcement-learning agentic-systems model-efficiency reinforcement-learning transformers rasbt jm_alexia jiqizhixin randall_balestr corbtt shawnup _akhaliq
Samsung's 7M Tiny Recursive Model (TRM) achieves superior reasoning on ARC-AGI and Sudoku with fewer layers and an MLP replacing self-attention. LeCun's team introduces JEPA-SCORE, enabling density estimation from encoders without retraining. AI21 Labs releases Jamba Reasoning 3B, a fast hybrid SSM-Transformer model supporting up to 64K context tokens. Alibaba's Qwen3 Omni/Omni Realtime offers a unified audio-video-text model with extensive language and speech support, outperforming Gemini 2.0 Flash on BigBench Audio. Alibaba also debuts Qwen Image Edit 2509, a top open-weight multi-image editing model. ColBERT Nano models demonstrate effective retrieval at micro-scale parameter sizes. In reinforcement learning, CoreWeave, Weights & Biases, and OpenPipe launch serverless RL infrastructure that reduces costs and speeds up training. Stanford's AgentFlow presents an in-the-flow RL system whose 7B backbone outperforms larger models on agentic tasks. This update highlights advances in recursive reasoning, density estimation, multimodal architectures, long-context modeling, retrieval, and serverless reinforcement learning.
not much happened today
claude-3-sonnet claude-3-opus gpt-5-codex grok-4-fast qwen-3-next gemini-2.5-pro sora-2-pro ray-3 kling-2.5 veo-3 modernvbert anthropic x-ai google google-labs openai arena epoch-ai mit luma akhaliq coding-agents cybersecurity api model-taxonomy model-ranking video-generation benchmarking multi-modal-generation retrieval image-text-retrieval finbarrtimbers gauravisnotme justinlin610 billpeeb apples_jimmy akhaliq
Anthropic announces a new CTO. Frontier coding agents see updates, with Claude Sonnet 4.5 showing strong cybersecurity capabilities and polished UX but trailing GPT-5 Codex in coding capability. xAI's Grok Code Fast claims higher edit success at lower cost. Google's Jules coding agent launches a programmable API with CI/CD integration. Qwen clarifies its model taxonomy and API tiers. Vision/LM Arena rankings show tight competition among Claude Sonnet 4.5, Claude Opus 4.1, Gemini 2.5 Pro, and OpenAI's latest models. In video generation, Sora 2 Pro leads App Store rankings with rapid iteration and a new creator ecosystem; early tests show it answers GPQA-style questions at 55% accuracy versus GPT-5's 72%. Video Arena adds new models such as Luma's Ray 3 and Kling 2.5 for benchmarking. Ovi, a Veo-3-like multimodal video+audio generation model, is released. In retrieval, MIT's ModernVBERT offers efficient image-text retrieval. Notable quotes: "Claude Sonnet 4.5 is basically the same as Opus 4.1 for coding" and "Jules is a programmable team member".