Person: "danielhanchen"
not much happened today
deepseek-r1-0528 o3 gemini-2.5-pro claude-opus-4 deepseek_ai openai gemini meta-ai-fair anthropic x-ai ollama hugging-face alibaba bytedance xiaomi reasoning reinforcement-learning benchmarking quantization local-inference model-evaluation open-weights transparency post-training agentic-benchmarks long-context hallucination-detection teortaxestex wenfeng danielhanchen awnihannun reach_vb abacaj
The DeepSeek R1-0528 release brings major improvements in reasoning, hallucination reduction, JSON output, and function calling, matching or surpassing closed models like OpenAI o3 and Gemini 2.5 Pro on benchmarks such as the Artificial Analysis Intelligence Index, LiveBench, and GPQA Diamond. The model ranks #2 globally in open-weights intelligence, surpassing Meta AI, Anthropic, and xAI. Open weights and technical transparency have fueled rapid adoption across platforms like Ollama and Hugging Face. Chinese AI labs including DeepSeek, Alibaba, ByteDance, and Xiaomi now match or surpass US labs in model releases and intelligence, driven by open-weights strategies. Reinforcement learning post-training is critical for intelligence gains, mirroring trends seen at OpenAI. Optimized quantization techniques (1-bit, 4-bit) and local inference enable efficient experimentation on consumer hardware. New benchmarks like LisanBench test knowledge, planning, memory, and long-context reasoning, with OpenAI o3 and Claude Opus 4 leading. Discussions highlight concerns about benchmark contamination and overemphasis on RL-tuned gains.
Google's Agent2Agent Protocol (A2A)
kimi-vl-a3b gpt-4o llama-4-scout llama-4-maverick llama-4-behemoth deepcoder-14b o3-mini o1 llama-3.1-nemotron-ultra-253b deepseek-r1 google google-deepmind moonshot-ai meta-ai-fair uc-berkeley openai nvidia hugging-face togethercompute deepseek agent-interoperability multimodality vision math reinforcement-learning coding model-training open-source model-benchmarking context-windows streaming push-notifications enterprise-authentication model-release reach_vb _akhaliq epochairesearch artificialanlys winglian danielhanchen yuchenj_uw jeremyphoward
Google Cloud Next announcements featured the launch of Google and DeepMind's full MCP support and a new Agent2Agent (A2A) protocol designed for agent interoperability with multiple partners. The protocol includes components like the Agent Card, Task communication channels, Enterprise Auth and Observability, and Streaming and Push Notification support. On the model front, Moonshot AI released Kimi-VL-A3B, a multimodal model with 128K context and strong vision and math benchmark performance, outperforming gpt-4o. Meta AI introduced the smaller models of the llama-4 family, llama-4-scout and llama-4-maverick, with a larger Behemoth model still in training. DeepCoder 14B from UC Berkeley is an open-source coding model rivaling OpenAI's o3-mini and o1 models, trained with reinforcement learning on 24K coding problems. Nvidia released llama-3.1-nemotron-ultra-253b on Hugging Face, noted for beating llama-4-behemoth and maverick and competing with deepseek-r1.
DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level
deepcoder-14b o3-mini o1 gemini-2.5-pro kimi-vl-a3b gpt-4o llama-4-scout maverick behemoth gen-4-turbo imagen-3 together-ai agentica openai bytedance google-deepmind moonshot-ai meta-ai-fair runway open-source reinforcement-learning code-generation multimodality model-training mixture-of-experts l2-normalization image-generation model-performance context-windows philschmid lepikhin reach_vb akhaliq yuchenj_uw epochairesearch danielhanchen c_valenzuelab
Together AI and Agentica released DeepCoder-14B, an open-source 14B parameter coding model rivaling OpenAI's o3-mini and o1 on coding benchmarks, trained with an open-source RL framework from ByteDance and costing about $26,880. Google DeepMind launched Gemini 2.5 Pro with experimental "Flash" versions available to subscribers. Moonshot AI introduced Kimi-VL-A3B, a multimodal model with 128K context outperforming gpt-4o on vision and math benchmarks. Meta AI released Llama 4 Scout and Maverick, with a larger Behemoth model in training, featuring mixture-of-experts and L2 norm techniques. Runway launched Gen-4 Turbo with 10x better results than Gen-3 at the same cost. Google announced Imagen 3, a high-quality text-to-image model now in Vertex AI, enabling easier object removal. The report highlights open-source contributions, reinforcement learning training optimizations, and significant model performance improvements across coding, multimodal, and image generation domains.
not much happened today
gemini-2.5-pro chatgpt deepseek-v3 qwen-2.5 claude-3.5-sonnet claude-3.7-sonnet google anthropic openai llama_index langchain runway deepseek math benchmarking chains-of-thought model-performance multi-agent-systems agent-frameworks media-generation long-horizon-planning code-generation rasbt danielhanchen hkproj
Gemini 2.5 Pro shows strengths and weaknesses: unlike ChatGPT, it notably lacks LaTeX math rendering, and it scored 24.4% on the 2025 USAMO. DeepSeek V3 ranks 8th and 12th on recent leaderboards. Qwen 2.5 models have been integrated into the PocketPal app. Research from Anthropic reveals that Chain-of-Thought (CoT) reasoning is often unfaithful, especially on harder tasks, raising safety concerns. OpenAI's PaperBench benchmark shows AI agents struggle with long-horizon planning, with Claude 3.5 Sonnet achieving only 21.0% accuracy. The CodeAct framework generalizes ReAct for dynamic code writing by agents. LangChain explains multi-agent handoffs in LangGraph. Runway Gen-4 marks a new phase in media creation.
Gemma 3 beats DeepSeek V3 in Elo, 2.0 Flash beats GPT4o with Native Image Gen
gemma-3 gemini-1.5-pro gemini-2 o1-preview o3-mini-high deepseek-v3 claude-3.7-sonnet qwen-2.5-max google-deepmind openai multimodality multilinguality context-window quantization image-generation model-benchmarking model-performance vision reach_vb _philschmid danielhanchen lmarena_ai osanseviero
Google DeepMind launched the Gemma 3 family of models featuring a 128k context window, multimodal input (image and video), and multilingual support for 140+ languages. The Gemma 3-27B model ranks among the top open models on LMArena, outperforming several competitors and matching Gemini-1.5-Pro on benchmarks. Additionally, Gemini 2.0 Flash introduced native image generation with advanced image editing capabilities, a feature teased by OpenAI but not launched. The updates highlight significant advances in context length, multimodality, and model efficiency via quantization.
DeepSeek's Open Source Stack
qwen-qwq-32b start character-3 gemini gemini-2.0 mercury-coder gpt-4.5 jamba-mini-1.6 gemini-2.0-flash gpt-4o-mini mistral-small-3 mistral-ocr deepseek pyspur hugging-face togethercompute hedra-labs google-deepmind deeplearningai openai ai21-labs mistral-ai fine-tuning benchmarking multimodality code-generation diffusion-models model-performance model-optimization ocr embedding-models context-windows runtime-limits _akhaliq lmarena_ai reach_vb danielhanchen _philschmid aidan_mclau vikhyatk jerryjliu0
DeepSeek's Open Source Week was summarized by PySpur, highlighting multiple interesting releases. The Qwen QwQ-32B model was fine-tuned into START, excelling in PhD-level science QA and math benchmarks. Character-3, an omnimodal AI video generation model by Hedra Labs and Together AI, enables realistic animated content creation. Google DeepMind introduced the Gemini embedding model with an 8k context window, ranking #1 on MMTEB, alongside the Gemini 2.0 Code Executor supporting Python libraries and auto-fix features. Inception Labs' Mercury Coder is a diffusion-based code generation model offering faster token processing. OpenAI released GPT-4.5, its largest model yet but with less reasoning ability than some competitors. AI21 Labs launched Jamba Mini 1.6, noted for superior output speed compared to Gemini 2.0 Flash, GPT-4o mini, and Mistral Small 3. A new dataset of 1.9M scanned pages was released for OCR benchmarking, with Mistral OCR showing competitive but not top-tier document parsing performance compared to LLM/LVM-powered methods. "Cracked engineers are all you need."
not much happened today
claude-3.7-sonnet claude-3.7 deepseek-r1 o3-mini deepseek-v3 gemini-2.0-pro gpt-4o qwen2.5-coder-32b-instruct anthropic perplexity-ai amazon google-cloud deepseek_ai coding reasoning model-benchmarking agentic-workflows context-window model-performance open-source moe model-training communication-libraries fp8 nvlink rdma cli-tools skirano omarsar0 reach_vb artificialanlys terryyuezhuo _akhaliq _philschmid catherineols goodside danielhanchen
Claude 3.7 Sonnet demonstrates exceptional coding and reasoning capabilities, outperforming models like DeepSeek R1, o3-mini, and GPT-4o on benchmarks such as SciCode and LiveCodeBench. It is available on platforms including Perplexity Pro, Anthropic, Amazon Bedrock, and Google Cloud, priced at $3 per million input tokens and $15 per million output tokens. Key features include a 64k-token thinking mode, a 200k context window, and the CLI-based coding assistant Claude Code. Meanwhile, DeepSeek released DeepEP, an open-source communication library optimized for MoE model training and inference with support for NVLink, RDMA, and FP8. These updates highlight advancements in coding AI and efficient model training infrastructure.
DeepSeek Janus and Meta SpiRit-LM: Decoupled Image and Expressive Voice Omnimodality
nemotron-70b claude claude-3.5-sonnet gpt-4o deepseek meta-ai-fair wandb nvidia anthropic hugging-face perplexity-ai multimodality image-generation speech-synthesis fine-tuning model-merging benchmarking open-source model-optimization reinforcement-learning bindureddy aravsrinivas danielhanchen clementdelangue cwolferesearch
DeepSeek Janus and Meta SpiRit-LM are two notable multimodal AI models recently released, showcasing advances in image generation and speech synthesis respectively. DeepSeek Janus uses separate vision encoders for image understanding and image generation, achieving better results in both tasks. Meta's SpiRit-LM introduces an expressive speech and writing model that generates pitch and style units, improving over standard TTS. Additionally, W&B Weave offers comprehensive LLM observability and multimodal fine-tuning tools. Industry updates include Nvidia's Nemotron 70B model underperforming, Meta open-sourcing Movie Gen Bench for media generation benchmarking, Perplexity launching internal search with multi-step reasoning, and Anthropic updating Claude apps. Open source progress includes Hugging Face's gradient accumulation fix in transformers and advocacy for open source AI to prevent Big Tech dominance. "Model merging for combining skills of multiple models" is also highlighted.