All tags
Person: "yuchenj_uw"
not much happened today
qwen3-coder-480b-a35b-instruct kimi-k2 alibaba openrouterai togethercompute vllm_project unslothai white-house code-generation benchmarking model-integration context-windows open-source national-security infrastructure ai-policy fchollet clementdelangue scaling01 aravsrinivas rasbt gregkamradt yuchenj_uw
Alibaba announced the release of Qwen3-Coder-480B-A35B-Instruct, an open agentic code model with 480B parameters and 256K context length, praised for rapid development and strong coding performance. Benchmark claims of 41.8% on ARC-AGI-1 faced skepticism from Fran ois Chollet and others due to reproducibility issues. The model quickly integrated into ecosystems like vLLM, Dynamic GGUFs, and OpenRouterAI. The White House unveiled a new AI Action Plan emphasizing Innovation, Infrastructure, and International Diplomacy, linking AI leadership to national security and prioritizing compute access for the Department of Defense. The plan sparked debate on open vs. closed-source AI, with calls from Clement Delangue to embrace open science to maintain US AI competitiveness.
not much happened today
kimi-k2 grok-4 gpt-5 gemini-2.5 gemini-embedding cognition windsurf moonshot-ai x-ai openai google stanfordnlp huggingface mixture-of-experts model-training model-performance fine-tuning benchmarking agentic-ai model-bugs embedding-models sama hardmaru jeremyphoward akhaliq teortaxestex yuchenj_uw demishassabis
Cognition is acquiring the remaining assets of Windsurf after a significant weekend deal. Moonshot AI released Kimi K2, an open-source, MIT-licensed agentic model with 1 Trillion total / 32B active parameters using a Mixture-of-Experts architecture, trained on 15.5 Trillion tokens with the MuonClip optimizer, showing top performance on benchmarks like EQ-Bench and Creative Writing. xAI launched Grok-4, ranking 5th on IQ Bench but with notable quirks including a bug causing it to respond only with "Heavy" and a high frequency of Elon Musk mentions. Rumors about OpenAI delaying an open-source model release surfaced, with speculation about CEO sama's PR strategy and a possible GPT-5 launch in September. The Gemini 2.5 paper was released with 3,295 authors, and Google introduced its Gemini Embedding model, topping the MTEB leaderboard.
Kimi K2 - SOTA Open MoE proves that Muon can scale to 15T tokens/1T params
kimi-k2 kimi-k2-1t deepseek-v3 grok-4 devstral-2507 gpt-4.1 sonnet-4 moonshot-ai alibaba tencent deepseek x-ai mistral-ai weights-biases hugging-face mixture-of-experts model-training model-optimization optimizer benchmarking long-context model-performance open-weights model-release yuchenj_uw andrew_n_carr scaling01 novita_labs teknium1 aravsrinivas mparakhin simonw
Moonshot AI has released Kimi K2, a 1 trillion parameter Mixture-of-Experts model trained on 15.5 trillion tokens using the new MuonClip optimizer, achieving state-of-the-art results on benchmarks like SWE-Bench Verified (65.8%) and TAU2 (58.4%). This model is competitive with GPT-4.1 and Sonnet 4 on non-thinking tasks and is available under an MIT license. Meanwhile, xAI announced Grok-4, noted for its "LEAST censored frontier model" status and strong long-context performance but criticized for rushed post-training. Mistral AI updated its Devstral 2507 models with improved performance and cost efficiency. The community is excited about the potential of the MuonClip optimizer, which may surpass the long-standing AdamW optimizer in machine learning.
Grok 4: xAI succeeds in going from 0 to new SOTA LLM in 2 years
grok-4 grok-4-heavy claude-4-opus xai perplexity-ai langchain cursor cline model-releases benchmarking long-context model-pricing model-integration voice performance scaling gpu-optimization elonmusk aravsrinivas igor_babuschkin yuchenj_uw
xAI launched Grok 4 and Grok 4 Heavy, large language models rumored to have 2.4 trillion parameters and trained with 100x more compute than Grok 2 on 100k H100 GPUs. Grok 4 achieved new state-of-the-art results on benchmarks like ARC-AGI-2 (15.9%), HLE (50.7%), and Vending-Bench, outperforming models such as Claude 4 Opus. The model supports a 256K context window and is priced at $3.00/M input tokens and $15.00/M output tokens. It is integrated into platforms like Cursor, Cline, LangChain, and Perplexity Pro/Max. The launch was accompanied by a controversial voice mode and sparked industry discussion about xAI's rapid development pace, with endorsements from figures like Elon Musk and Arav Srinivas.
not much happened today
o3-mini o1-mini llama hunyuan-a13b ernie-4.5 ernie-4.5-21b-a3b qwen3-30b-a3b gemini-2.5-pro meta-ai-fair openai tencent microsoft baidu gemini superintelligence ai-talent job-market open-source-models multimodality mixture-of-experts quantization fp8-training model-benchmarking model-performance model-releases api model-optimization alexandr_wang shengjia_zhao jhyuxm ren_hongyu shuchaobi saranormous teortaxesTex mckbrando yuchenj_uw francoisfleuret quanquangu reach_vb philschmid
Meta has poached top AI talent from OpenAI, including Alexandr Wang joining as Chief AI Officer to work towards superintelligence, signaling a strong push for the next Llama model. The AI job market shows polarization with high demand and compensation for top-tier talent, while credentials like strong GitHub projects gain importance. The WizardLM team moved from Microsoft to Tencent to develop open-source models like Hunyuan-A13B, highlighting shifts in China's AI industry. Rumors suggest OpenAI will release a new open-source model in July, potentially surpassing existing ChatGPT models. Baidu open-sourced multiple variants of its ERNIE 4.5 model series, featuring advanced techniques like 2-bit quantization, MoE router orthogonalization loss, and FP8 training, with models ranging from 0.3B to 424B parameters. Gemini 2.5 Pro returned to the free tier of the Gemini API, enabling developers to explore its features.
not much happened today
deepseek-r1-0528 pali-gemma-2 gemma-3 shieldgemma-2 txgemma gemma-3-qat gemma-3n-preview medgemma dolphingemma signgemma claude-4 opus-4 claude-sonnet-4 codestral-embed bagel qwen nemotron-cortexa gemini-2.5-pro deepseek-ai huggingface gemma claude bytedance qwen nemotron sakana-ai-labs benchmarking model-releases multimodality code-generation model-performance long-context reinforcement-learning model-optimization open-source yuchenj_uw _akhaliq clementdelangue osanseviero alexalbert__ guillaumelample theturingpost lmarena_ai epochairesearch scaling01 nrehiew_ ctnzr
DeepSeek R1 v2 model released with availability on Hugging Face and inference partners. The Gemma model family continues prolific development including PaliGemma 2, Gemma 3, and others. Claude 4 and its variants like Opus 4 and Claude Sonnet 4 show top benchmark performance, including new SOTA on ARC-AGI-2 and WebDev Arena. Codestral Embed introduces a 3072-dimensional code embedder. BAGEL, an open-source multimodal model by ByteDance, supports reading, reasoning, drawing, and editing with long mixed contexts. Benchmarking highlights include Nemotron-CORTEXA topping SWEBench and Gemini 2.5 Pro performing on VideoGameBench. Discussions on random rewards effectiveness focus on Qwen models. "Opus 4 NEW SOTA ON ARC-AGI-2. It's happening - I was right" and "Claude 4 launch has dev moving at a different pace" reflect excitement in the community.
Granola launches team notes, while Notion launches meeting transcription
gpt-4.1 gpt-4o-mini gpt-4.1-mini claude-opus claude-sonnet claude-o3 qwen3 seed1.5-vl llama-4 am-thinking-v1 openai anthropic alibaba meta-ai-fair huggingface granola coding instruction-following benchmarking model-releases reasoning image-generation collaborative-software model-performance kevinweil scaling01 steph_palazzolo andersonbcdefg reach_vb yuchenj_uw qtnx_ _akhaliq risingsayak
GPT-4.1 is now available in ChatGPT for Plus, Pro, and Team users, focusing on coding and instruction following, with GPT 4.1 mini replacing GPT 4o mini. Anthropic is releasing new Claude models including Claude Opus and Claude Sonnet, though some criticism about hallucinations in Claude O3 was noted. Alibaba shared the Qwen3 Technical Report with strong benchmark results from Seed1.5-VL. Meta FAIR announced new models and datasets but faced criticism on Llama 4. AM-Thinking-v1 launched on Hugging Face as a 32B scale reasoning model. Granola raised $43M in Series B and launched Granola 2.0 with a Notion-like UI. The AI ecosystem shows rapid iteration and cloning of ideas, emphasizing execution and distribution.
Google's Agent2Agent Protocol (A2A)
kimi-vl-a3b gpt-4o llama-4-scout llama-4-maverick llama-4-behemoth deepcoder-14b o3-mini o1 llama-3.1-nemotron-ultra-253b deepseek-r1 google google-deepmind moonshot-ai meta-ai-fair uc-berkeley openai nvidia hugging-face togethercompute deepseek agent-interoperability multimodality vision math reinforcement-learning coding model-training open-source model-benchmarking context-windows streaming push-notifications enterprise-authentication model-release reach_vb _akhaliq epochairesearch artificialanlys winglian danielhanchen yuchenj_uw jeremyphoward
Google Cloud Next announcements featured the launch of Google and DeepMind's full MCP support and a new Agent to Agent protocol designed for agent interoperability with multiple partners. The protocol includes components like the Agent Card, Task communication channels, Enterprise Auth and Observability, and Streaming and Push Notification support. On the model front, Moonshot AI released Kimi-VL-A3B, a multimodal model with 128K context and strong vision and math benchmark performance, outperforming gpt-4o. Meta AI introduced smaller versions of llama-4 family models: llama-4-scout and llama-4-maverick, with a larger Behemoth model still in training. DeepCoder 14B from UC Berkeley is an open-source coding model rivaling openai's o3-mini and o1 models, trained with reinforcement learning on 24K coding problems. Nvidia released llama-3.1-nemotron-ultra-253b on Hugging Face, noted for beating llama-4-behemoth and maverick and competing with deepseek-r1.
DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level
deepcoder-14b o3-mini o1 gemini-2.5-pro kimi-vl-a3b gpt-4o llama-4-scout maverick behemoth gen-4-turbo imagen-3 together-ai agentica opena bytedance google-deepmind moonshot-ai meta-ai-fair runway open-source reinforcement-learning code-generation multimodality model-training mixture-of-experts l2-normalization image-generation model-performance context-windows philschmid lepikhin reach_vb akhaliq yuchenj_uw epochairesearch danielhanchen c_valenzuelab
Together AI and Agentica released DeepCoder-14B, an open-source 14B parameter coding model rivaling OpenAI's o3-mini and o1 on coding benchmarks, trained with an open-source RL framework from ByteDance and costing about $26,880. Google DeepMind launched Gemini 2.5 Pro with experimental "Flash" versions available to subscribers. Moonshot AI introduced Kimi-VL-A3B, a multimodal model with 128K context outperforming gpt-4o on vision and math benchmarks. Meta AI released Llama 4 Scout and Maverick, with a larger Behemoth model in training, featuring mixture-of-experts and L2 norm techniques. Runway launched Gen-4 Turbo with 10x better results than Gen-3 at the same cost. Google announced Imagen 3, a high-quality text-to-image model now in Vertex AI, enabling easier object removal. The report highlights open-source contributions, reinforcement learning training optimizations, and significant model performance improvements across coding, multimodal, and image generation domains.
Llama 4's Controversial Weekend Release
llama-4 llama-3 llama-3-2 meta mixture-of-experts early-fusion attention-mechanisms fp8-training training-data benchmarking model-performance model-release multimodality open-models ahmad_al_dahle ylecun reach_vb yuchenj_uw
Meta released Llama 4, featuring two new medium-size MoE open models and a promised 2 Trillion parameter "behemoth" model, aiming to be the largest open model ever. The release included advanced training techniques like Chameleon-like early fusion with MetaCLIP, interleaved chunked attention without RoPE, native FP8 training, and training on up to 40 trillion tokens. Despite the hype, the release faced criticism for lack of transparency compared to Llama 3, implementation issues, and poor performance on some benchmarks. Meta leadership, including Ahmad Al Dahle, denied allegations of training on test sets. The smallest Scout model at 109B parameters is too large for consumer GPUs, and the claimed 10 million token context is disputed. The community response has been mixed, with some praising the openness and others pointing out discrepancies and quality concerns.
not much happened today
gpt-4.5 gpt-4 gpt-4o o1 claude-3.5-sonnet claude-3.7 claude-3-opus deepseek-v3 grok-3 openai anthropic perplexity-ai deepseek scaling01 model-performance humor emotional-intelligence model-comparison pricing context-windows model-size user-experience andrej-karpathy jeremyphoward abacaj stevenheidel yuchenj_uw aravsrinivas dylan522p random_walker
GPT-4.5 sparked mixed reactions on Twitter, with @karpathy noting users preferred GPT-4 in a poll despite his personal favor for GPT-4.5's creativity and humor. Critics like @abacaj highlighted GPT-4.5's slowness and questioned its practical value and pricing compared to other models. Performance-wise, GPT-4.5 ranks above GPT-4o but below o1 and Claude 3.5 Sonnet, with Claude 3.7 outperforming it on many tasks yet GPT-4.5 praised for its humor and "vibes." Speculation about GPT-4.5's size suggests around 5 trillion parameters. Discussions also touched on pricing disparities, with Perplexity Deep Research at $20/month versus ChatGPT at $200/month. The emotional intelligence and humor of models like Claude 3.7 were also noted.
not much happened today
helium-1 qwen-2.5 phi-4 sky-t1-32b-preview o1 codestral-25.01 phi-3 mistral llama-3 gpt-3.5 llama-3 gpt-3.5 llmquoter kyutai-labs lmstudio mistralai llamaindex huggingface langchainai hyperbolic-labs replit fchollet philschmid multilinguality token-level-distillation context-windows model-performance open-source reasoning coding retrieval-augmented-generation hybrid-retrieval multiagent-systems video large-video-language-models dynamic-ui voice-interaction gpu-rentals model-optimization semantic-deduplication model-inference reach_vb awnihannun lior_on_ai sophiamyang omarsar0 skirano yuchenj_uw fchollet philschmid
Helium-1 Preview by kyutai_labs is a 2B-parameter multilingual base LLM outperforming Qwen 2.5, trained on 2.5T tokens with a 4096 context size using token-level distillation from a 7B model. Phi-4 (4-bit) was released in lmstudio on an M4 max, noted for speed and performance. Sky-T1-32B-Preview is a $450 open-source reasoning model matching o1's performance with strong benchmark scores. Codestral 25.01 by mistralai is a new SOTA coding model supporting 80+ programming languages and offering 2x speed.
Innovations include AutoRAG for optimizing retrieval-augmented generation pipelines, Agentic RAG for autonomous query reformulation and critique, Multiagent Finetuning using societies of models like Phi-3, Mistral, LLaMA-3, and GPT-3.5 for reasoning improvements, and VideoRAG incorporating video content into RAG with LVLMs.
Applications include a dynamic UI AI chat app by skirano on Replit, LangChain tools like DocTalk for voice PDF conversations, AI travel agent tutorials, and news summarization agents. Hyperbolic Labs offers competitive GPU rentals including H100, A100, and RTX 4090. LLMQuoter enhances RAG accuracy by identifying key quotes.
Infrastructure updates include MLX export for LLM inference from Python to C++ by fchollet and SemHash semantic text deduplication by philschmid.
$1150m for SSI, Sakana, You.com + Claude 500m context
olmo llama2-13b-chat claude claude-3.5-sonnet safe-superintelligence sakana-ai you-com perplexity-ai anthropic ai2 mixture-of-experts model-architecture model-training gpu-costs retrieval-augmented-generation video-generation ai-alignment enterprise-ai agentic-ai command-and-control ilya-sutskever mervenoyann yuchenj_uw rohanpaul_ai ctojunior omarsar0
Safe Superintelligence raised $1 billion at a $5 billion valuation, focusing on safety and search approaches as hinted by Ilya Sutskever. Sakana AI secured a $100 million Series A funding round, emphasizing nature-inspired collective intelligence. You.com pivoted to a ChatGPT-like productivity agent after a $50 million Series B round, while Perplexity AI raised over $250 million this summer. Anthropic launched Claude for Enterprise with a 500 million token context window. AI2 released a 64-expert Mixture-of-Experts (MoE) model called OLMo, outperforming Llama2-13B-Chat. Key AI research trends include efficient MoE architectures, challenges in AI alignment and GPU costs, and emerging AI agents for autonomous tasks. Innovations in AI development feature command and control for video generation, Retrieval-Augmented Generation (RAG) efficiency, and GitHub integration under Anthropic's Enterprise plan. "Our logo is meant to invoke the idea of a school of fish coming together and forming a coherent entity from simple rules as we want to make use of ideas from nature such as evolution and collective intelligence in our research."