Company: "valsai"
Anthropic's Claude Opus 4.7
claude-opus-4.7 codex gpt-rosalind anthropic openai cursor replit perplexity-ai microsoft coding agentic-ai tokenization long-context benchmarking image-processing software-engineering computer-use plugin-integration multi-terminal-support ssh-access model-expansion bcherny kimmonismus scaling01 valsai artificialanlys natolambert nrehiew_
Anthropic launched Claude Opus 4.7, its most capable Opus model yet, featuring stronger coding and agentic performance, improved long-context handling, and a new xhigh reasoning tier. Benchmarks show substantial gains, including SWE-bench Pro 64.3%, SWE-bench Verified 87.6%, and TerminalBench 69.4%, with top rankings on Vals Index and GDPval-AA. Technical changes include a new tokenizer and an increase in image input resolution to 3.75MP. Some long-context benchmarks showed mixed results, with a shift in focus from MRCR to Graphwalks. Adoption was rapid across tools like Cursor, VS Code, Replit Agent, and Perplexity. Meanwhile, OpenAI expanded Codex into a broader computer agent with Mac computer use, an in-app browser, image generation/editing, 90+ plugins, multi-terminal support, SSH remote devbox access, and richer file previews. A new vertical life-sciences model, GPT-Rosalind, was also introduced.
not much happened today
muse-spark llama-4-maverick glm-5.1 deepseek-v3.2 meta-ai-fair zhipu-ai deepseek multimodality tool-use visual-chain-of-thought multi-agent-systems training-efficiency test-time-scaling parallel-inference image-to-code model-benchmarking model-architecture alexandr_wang shengjia_zhao jack_w_rae ananyaku _jasonwei artificialanlys valsai epochairesearch scale_ai matthuang omarsar0 skirano mattdeitke garrytan sebastian_raschka
Meta Superintelligence Labs launched Muse Spark, a natively multimodal reasoning model featuring tool use, visual chain of thought, and multi-agent orchestration. It is live on meta.ai and the Meta AI app with a private API preview and plans for open-sourcing future versions. Independent benchmarks rank Muse Spark highly, with strong performance on intelligence indices and efficiency, notably using over 10× less compute than Llama 4 Maverick. Key technical highlights include training efficiency, test-time scaling, and parallel multi-agent inference. Community testing shows strengths in image-to-code and one-shot game generation. Additionally, Zhipu AI's GLM-5.1 is recognized as a leading open-weight model with architecture similar to DeepSeek-V3.2.
not much happened today
glm-4.7 glm-4.6 minimax-m2.1 gemma-3 gemma-scope-2 google-deepmind valsai minimax-ai ollama trae alibaba sophont prime-intellect interpretability sparse-autoencoders agent-workflows model-benchmarking medical-evaluation multi-agent-systems model-performance model-optimization reinforcement-learning tool-use function-calling context-windows ivanfioravanti awnihannun deedydas cline omarsar0 adonis_singh eliebakouch teortaxestex ibragim_bad callum_mcdougall neelnanda5
The open-weight releases of GLM-4.7 and MiniMax M2.1 highlight day-0 ecosystem support, coding throughput, and agent workflows: GLM-4.7 achieves a +9.5% improvement over GLM-4.6, and MiniMax M2.1 is positioned as an OSS Claude-like MoE model with 230B total parameters and 200K context. Gemma Scope 2 from Google DeepMind introduces sparse autoencoders and transcoders for interpretability across Gemma 3 models, aiming to provide shared infrastructure for safety and debugging. The launch of Medmarks v0.1, an open medical evaluation suite and leaderboard, addresses the need for open medical benchmarking across 15+ environments, engaging clinicians and researchers.