Topic: "open-weights"
MCP -> Agentic AI Foundation, Mistral Devstral 2
devstral-2 devstral-small-2 sonnet-4.3 deepseek-v3.2 qwen3-vl openai anthropic block mistral-ai alibaba linux-foundation deepseek agentic-ai coding-models reinforcement-learning model-performance model-optimization open-weights cli-tools multi-file-code-automation data-decontamination moe reward-models rl-stability guillaumelample b_roziere qtnx_ charliermarsh omarsar0 eliebakouch justinwaugh cwolferesearch pan
AI engineering sees a significant collaborative milestone with the launch of the Agentic AI Foundation under the Linux Foundation, uniting projects from Anthropic, OpenAI, and Block. Mistral released Devstral 2, a 123B-parameter coding model with open weights, offering a cost-effective alternative to Sonnet 4.3 and competitive performance against DeepSeek v3.2. The new Mistral Vibe CLI supports agentic coding workflows and has seen rapid ecosystem integration. Alibaba introduced Soft Adaptive Policy Optimization (SAPO) for reinforcement learning tuning, improving stability and performance for Qwen3-VL across multiple tasks. Research highlights include the importance of data decontamination in RL and ongoing discussions on MoE RL stability and reward hacking mitigation.
Black Forest Labs FLUX.2 [pro|flex|dev|klein]: near-Nano Banana quality but Open Weights
flux-2 flux-2-dev claude-opus-4.5 gpt-5.1 gemini-3-pro black-forest-labs anthropic huggingface multi-reference-support variational-autoencoder image-generation open-weights agentic-coding token-efficiency benchmarking prompting model-performance
Black Forest Labs' FLUX.2 release features Multi-Reference Support for up to 10 images with consistency and output up to 4 megapixels, across four form factors: Pro, Flex, Dev (a 32B open-weight model), and Klein (open weights TBA). The new FLUX.2 VAE introduces a variational autoencoder that balances learnability, quality, and compression. Meanwhile, Anthropic's Claude Opus 4.5 demonstrates strong performance and efficiency, scoring 70 on the Artificial Analysis index, tying GPT-5.1 (high) and trailing Gemini 3 Pro (73). Opus 4.5 excels in agentic coding benchmarks and research evaluations, with notable token efficiency and reduced running costs. "Opus 4.5 leads Gemini 3 Pro on SWE-Bench Verified and tops the AICodeKing leaderboard," and it shows strong QA and systematic review capabilities. Anthropic also released a dense prompting guide for Opus 4.5.
Kimi K2 - SOTA Open MoE proves that Muon can scale to 15T tokens/1T params
kimi-k2 kimi-k2-1t deepseek-v3 grok-4 devstral-2507 gpt-4.1 sonnet-4 moonshot-ai alibaba tencent deepseek x-ai mistral-ai weights-biases hugging-face mixture-of-experts model-training model-optimization optimizer benchmarking long-context model-performance open-weights model-release yuchenj_uw andrew_n_carr scaling01 novita_labs teknium1 aravsrinivas mparakhin simonw
Moonshot AI has released Kimi K2, a 1 trillion parameter Mixture-of-Experts model trained on 15.5 trillion tokens using the new MuonClip optimizer, achieving state-of-the-art results on benchmarks like SWE-Bench Verified (65.8%) and TAU2 (58.4%). This model is competitive with GPT-4.1 and Sonnet 4 on non-thinking tasks and is available under an MIT license. Meanwhile, xAI announced Grok-4, noted for its "LEAST censored frontier model" status and strong long-context performance but criticized for rushed post-training. Mistral AI updated its Devstral 2507 models with improved performance and cost efficiency. The community is excited about the potential of the MuonClip optimizer, which may surpass the long-standing AdamW optimizer in machine learning.
not much happened today
deepseek-r1-0528 o3 gemini-2.5-pro claude-opus-4 deepseek_ai openai gemini meta-ai-fair anthropic x-ai ollama hugging-face alibaba bytedance xiaomi reasoning reinforcement-learning benchmarking quantization local-inference model-evaluation open-weights transparency post-training agentic-benchmarks long-context hallucination-detection teortaxestex wenfeng danielhanchen awnihannun reach_vb abacaj
The DeepSeek R1-0528 release brings major improvements in reasoning, hallucination reduction, JSON output, and function calling, matching or surpassing closed models such as OpenAI o3 and Gemini 2.5 Pro on benchmarks like the Artificial Analysis Intelligence Index, LiveBench, and GPQA Diamond. The model ranks #2 globally in open-weights intelligence, surpassing Meta AI, Anthropic, and xAI. Open weights and technical transparency have fueled rapid adoption across platforms like Ollama and Hugging Face. Chinese AI labs including DeepSeek, Alibaba, ByteDance, and Xiaomi now match or surpass US labs in model releases and intelligence, driven by open-weights strategies. Reinforcement learning post-training is proving critical for intelligence gains, mirroring trends seen at OpenAI. Optimized quantization techniques (1-bit, 4-bit) and local inference enable efficient experimentation on consumer hardware. New benchmarks like LisanBench test knowledge, planning, memory, and long-context reasoning, with OpenAI o3 and Claude Opus 4 leading. Discussions highlight concerns about benchmark contamination and overemphasis on RL-tuned gains.
DeepSeek-R1-0528 - Gemini 2.5 Pro-level model, SOTA Open Weights release
deepseek-r1-0528 gemini-2.5-pro qwen-3-8b qwen-3-235b deepseek-ai anthropic meta-ai-fair nvidia alibaba google-deepmind reinforcement-learning benchmarking model-performance open-weights reasoning quantization post-training model-comparison artificialanlys scaling01 cline reach_vb zizhpan andrewyng teortaxestex teknim1 lateinteraction abacaj cognitivecompai awnihannun
DeepSeek R1-0528 marks a significant upgrade, closing the gap with proprietary models like Gemini 2.5 Pro and surpassing models from Anthropic, Meta, NVIDIA, and Alibaba on benchmarks. This Chinese open-weights model leads several AI benchmarks, with gains driven by reinforcement learning post-training rather than architecture changes, and it shows increased reasoning token usage (around 23K tokens per question). The China-US AI race intensifies as Chinese labs accelerate innovation through transparency and an open research culture. Key benchmarks include AIME 2024, LiveCodeBench, and GPQA Diamond.