Topic: "test-time-scaling"
not much happened today
muse-spark llama-4-maverick glm-5.1 deepseek-v3.2 meta-ai-fair zhipu-ai deepseek multimodality tool-use visual-chain-of-thought multi-agent-systems training-efficiency test-time-scaling parallel-inference image-to-code model-benchmarking model-architecture alexandr_wang shengjia_zhao jack_w_rae ananyaku _jasonwei artificialanlys valsai epochairesearch scale_ai matthuang omarsar0 skirano mattdeitke garrytan sebastian_raschka
Meta Superintelligence Labs launched Muse Spark, a natively multimodal reasoning model with tool use, visual chain of thought, and multi-agent orchestration. It is live on meta.ai and in the Meta AI app, with a private API preview and plans to open-source future versions. Independent benchmarks rank Muse Spark highly on intelligence indices and on efficiency; notably, it uses over 10× less compute than Llama 4 Maverick. Key technical highlights include training efficiency, test-time scaling, and parallel multi-agent inference. Community testing shows strengths in image-to-code and one-shot game generation. Separately, Zhipu AI's GLM-5.1 is recognized as a leading open-weight model with an architecture similar to DeepSeek-V3.2.
OpenAI takes on Gemini's Deep Research
o3 o3-mini-high o3-deep-research-mini openai google-deepmind nyu uc-berkeley hku reinforcement-learning benchmarking inference-speed model-performance reasoning test-time-scaling agent-design sama danhendrycks ethan-mollick dan-shipper
OpenAI released the full version of the o3 agent, with a new Deep Research variant that shows significant improvements on the HLE benchmark and achieves SOTA results on GAIA. The release includes an "inference time scaling" chart that demonstrates rigorous research, though the use of public test set results drew some criticism. The agent is described as "extremely simple" and is currently limited to 100 queries/month, with a higher-rate version planned. Reception has been mostly positive, with some skepticism. Advances in reinforcement learning were also highlighted, including budget forcing, a simple test-time scaling technique that improved reasoning on math competitions by 27%. Researchers from Google DeepMind, NYU, UC Berkeley, and HKU contributed to these findings. The original Gemini Deep Research team will participate in the upcoming AI Engineer NYC event.