Person: "yuchenj_uw"
not much happened today
deepseek-r1-0528 pali-gemma-2 gemma-3 shieldgemma-2 txgemma gemma-3-qat gemma-3n-preview medgemma dolphingemma signgemma claude-4 opus-4 claude-sonnet-4 codestral-embed bagel qwen nemotron-cortexa gemini-2.5-pro deepseek-ai huggingface gemma claude bytedance nemotron sakana-ai-labs benchmarking model-releases multimodality code-generation model-performance long-context reinforcement-learning model-optimization open-source yuchenj_uw _akhaliq clementdelangue osanseviero alexalbert__ guillaumelample theturingpost lmarena_ai epochairesearch scaling01 nrehiew_ ctnzr
DeepSeek R1-0528 ("R1 v2") was released, with availability on Hugging Face and inference partners. The Gemma model family continues prolific development, including PaliGemma 2, Gemma 3, and others. Claude 4 and its variants, Opus 4 and Claude Sonnet 4, show top benchmark performance, including a new SOTA on ARC-AGI-2 and WebDev Arena. Codestral Embed introduces a 3072-dimensional code embedder. BAGEL, an open-source multimodal model from ByteDance, supports reading, reasoning, drawing, and editing over long mixed contexts. Benchmarking highlights include Nemotron-CORTEXA topping SWE-bench and Gemini 2.5 Pro's results on VideoGameBench. Debate over whether random rewards can be effective in RL training centers on Qwen models. "Opus 4 NEW SOTA ON ARC-AGI-2. It's happening - I was right" and "Claude 4 launch has dev moving at a different pace" reflect the community's excitement.
Granola launches team notes, while Notion launches meeting transcription
gpt-4.1 gpt-4o-mini gpt-4.1-mini claude-opus claude-sonnet claude-o3 qwen3 seed1.5-vl llama-4 am-thinking-v1 openai anthropic alibaba meta-ai-fair huggingface granola coding instruction-following benchmarking model-releases reasoning image-generation collaborative-software model-performance kevinweil scaling01 steph_palazzolo andersonbcdefg reach_vb yuchenj_uw qtnx_ _akhaliq risingsayak
GPT-4.1 is now available in ChatGPT for Plus, Pro, and Team users, focusing on coding and instruction following, with GPT-4.1 mini replacing GPT-4o mini. Anthropic is readying new Claude models, including Claude Opus and Claude Sonnet, though criticism of hallucinations in o3 was also noted. Alibaba shared the Qwen3 Technical Report, and Seed1.5-VL posted strong benchmark results. Meta FAIR announced new models and datasets but faced criticism over Llama 4. AM-Thinking-v1, a 32B-scale reasoning model, launched on Hugging Face. Granola raised a $43M Series B and launched Granola 2.0 with a Notion-like UI. The AI ecosystem shows rapid iteration and cloning of ideas, putting the emphasis on execution and distribution.
Google's Agent2Agent Protocol (A2A)
kimi-vl-a3b gpt-4o llama-4-scout llama-4-maverick llama-4-behemoth deepcoder-14b o3-mini o1 llama-3.1-nemotron-ultra-253b deepseek-r1 google google-deepmind moonshot-ai meta-ai-fair uc-berkeley openai nvidia hugging-face togethercompute deepseek agent-interoperability multimodality vision math reinforcement-learning coding model-training open-source model-benchmarking context-windows streaming push-notifications enterprise-authentication model-release reach_vb _akhaliq epochairesearch artificialanlys winglian danielhanchen yuchenj_uw jeremyphoward
Google Cloud Next announcements featured full MCP support from Google and DeepMind and a new Agent2Agent (A2A) protocol designed for agent interoperability with multiple launch partners. The protocol's components include the Agent Card, task communication channels, enterprise auth and observability, and streaming and push-notification support. On the model front, Moonshot AI released Kimi-VL-A3B, a multimodal model with a 128K context and strong vision and math benchmark performance, outperforming GPT-4o. Meta AI introduced the smaller Llama 4 family models, Llama 4 Scout and Llama 4 Maverick, with the larger Behemoth model still in training. DeepCoder-14B from UC Berkeley is an open-source coding model rivaling OpenAI's o3-mini and o1, trained with reinforcement learning on 24K coding problems. Nvidia released Llama-3.1-Nemotron-Ultra-253B on Hugging Face, noted for beating Llama 4 Behemoth and Maverick and competing with DeepSeek-R1.
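Per the public A2A spec, an agent advertises itself with an Agent Card, a JSON document conventionally served at `/.well-known/agent.json`, which other agents fetch to discover its endpoint, capabilities, and skills. A minimal sketch in Python (field names follow the published schema; the concrete agent name, URL, and skill entries are invented for illustration):

```python
import json

# Minimal A2A Agent Card sketch. Field names follow the public A2A
# schema; the concrete values (name, url, skill) are hypothetical.
agent_card = {
    "name": "example-travel-agent",           # hypothetical agent name
    "description": "Books flights and hotels.",
    "url": "https://agents.example.com/a2a",  # A2A endpoint (illustrative)
    "version": "1.0.0",
    "capabilities": {
        "streaming": True,          # supports streamed task updates
        "pushNotifications": True,  # supports server-initiated callbacks
    },
    "defaultInputModes": ["text"],
    "defaultOutputModes": ["text"],
    "skills": [
        {
            "id": "book-flight",    # hypothetical skill
            "name": "Book a flight",
            "description": "Finds and books flights for given dates.",
        }
    ],
}

# Serving this document at /.well-known/agent.json lets other agents
# discover the endpoint, auth requirements, and advertised skills.
print(json.dumps(agent_card, indent=2))
```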
DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level
deepcoder-14b o3-mini o1 gemini-2.5-pro kimi-vl-a3b gpt-4o llama-4-scout maverick behemoth gen-4-turbo imagen-3 together-ai agentica openai bytedance google-deepmind moonshot-ai meta-ai-fair runway open-source reinforcement-learning code-generation multimodality model-training mixture-of-experts l2-normalization image-generation model-performance context-windows philschmid lepikhin reach_vb akhaliq yuchenj_uw epochairesearch danielhanchen c_valenzuelab
Together AI and Agentica released DeepCoder-14B, an open-source 14B-parameter coding model rivaling OpenAI's o3-mini and o1 on coding benchmarks, trained with ByteDance's open-source RL framework at a cost of about $26,880. Google DeepMind launched Gemini 2.5 Pro, with experimental "Flash" versions available to subscribers. Moonshot AI introduced Kimi-VL-A3B, a multimodal model with a 128K context that outperforms GPT-4o on vision and math benchmarks. Meta AI released Llama 4 Scout and Maverick, with a larger Behemoth model in training, featuring mixture-of-experts and L2-normalization techniques. Runway launched Gen-4 Turbo, claiming 10x better results than Gen-3 at the same cost. Google announced Imagen 3, a high-quality text-to-image model now in Vertex AI that enables easier object removal. The issue highlights open-source contributions, reinforcement-learning training optimizations, and significant performance gains across coding, multimodal, and image-generation domains.
Llama 4's Controversial Weekend Release
llama-4 llama-3 llama-3-2 meta mixture-of-experts early-fusion attention-mechanisms fp8-training training-data benchmarking model-performance model-release multimodality open-models ahmad_al_dahle ylecun reach_vb yuchenj_uw
Meta released Llama 4, featuring two new medium-size MoE open models and a promised 2-trillion-parameter "Behemoth" model that would be the largest open model ever. The release touted advanced training techniques: Chameleon-like early fusion with MetaCLIP, interleaved chunked attention without RoPE, native FP8 training, and training on up to 40 trillion tokens. Despite the hype, the release drew criticism for a lack of transparency compared to Llama 3, implementation issues, and poor performance on some benchmarks. Meta leadership, including Ahmad Al Dahle, denied allegations of training on test sets. The smallest model, Scout, at 109B parameters is too large for consumer GPUs, and the claimed 10-million-token context is disputed. Community response has been mixed, with some praising the openness and others pointing out discrepancies and quality concerns.
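For intuition on the chunked-attention detail: chunked attention restricts each token to attend only within a fixed-size chunk, bounding attention cost as context grows, with some layers interleaving full global attention. A minimal mask construction in Python (the chunk size, interleaving schedule, and all specifics of Meta's actual kernels are assumptions for illustration, not Meta's implementation):

```python
import numpy as np

def chunked_causal_mask(seq_len: int, chunk_size: int) -> np.ndarray:
    """Boolean mask: True where query i may attend to key j.

    Illustrative sketch of chunked attention: token i attends only to
    tokens j that are (a) not in the future and (b) in the same chunk.
    In an interleaved scheme, only some layers would use this mask,
    while others use a full (global) causal mask.
    """
    idx = np.arange(seq_len)
    same_chunk = (idx[:, None] // chunk_size) == (idx[None, :] // chunk_size)
    causal = idx[:, None] >= idx[None, :]
    return same_chunk & causal

# Example: 8 tokens, chunks of 4. Token 5 sees tokens 4-5 but not 0-3.
print(chunked_causal_mask(8, 4).astype(int))
```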
not much happened today
gpt-4.5 gpt-4 gpt-4o o1 claude-3.5-sonnet claude-3.7 claude-3-opus deepseek-v3 grok-3 openai anthropic perplexity-ai deepseek scaling01 model-performance humor emotional-intelligence model-comparison pricing context-windows model-size user-experience andrej-karpathy jeremyphoward abacaj stevenheidel yuchenj_uw aravsrinivas dylan522p random_walker
GPT-4.5 sparked mixed reactions on Twitter, with @karpathy noting that users preferred GPT-4 in a poll despite his personal preference for GPT-4.5's creativity and humor. Critics like @abacaj highlighted GPT-4.5's slowness and questioned its practical value and pricing compared to other models. Performance-wise, GPT-4.5 ranks above GPT-4o but below o1 and Claude 3.5 Sonnet; Claude 3.7 outperforms it on many tasks, yet GPT-4.5 is praised for its humor and "vibes." Speculation puts GPT-4.5's size at around 5 trillion parameters. Discussion also touched on pricing disparities, with Perplexity Deep Research at $20/month versus ChatGPT Pro at $200/month. The emotional intelligence and humor of models like Claude 3.7 were also noted.
not much happened today
helium-1 qwen-2.5 phi-4 sky-t1-32b-preview o1 codestral-25.01 phi-3 mistral llama-3 gpt-3.5 llmquoter kyutai-labs lmstudio mistralai llamaindex huggingface langchainai hyperbolic-labs replit multilinguality token-level-distillation context-windows model-performance open-source reasoning coding retrieval-augmented-generation hybrid-retrieval multiagent-systems video large-video-language-models dynamic-ui voice-interaction gpu-rentals model-optimization semantic-deduplication model-inference reach_vb awnihannun lior_on_ai sophiamyang omarsar0 skirano yuchenj_uw fchollet philschmid
Helium-1 Preview by kyutai_labs is a 2B-parameter multilingual base LLM that outperforms Qwen 2.5, trained on 2.5T tokens with a 4096-token context using token-level distillation from a 7B model (a sketch of that loss follows below). Phi-4 (4-bit) was released in LM Studio, noted for speed and performance on an M4 Max. Sky-T1-32B-Preview is a $450 open-source reasoning model matching o1's performance with strong benchmark scores. Codestral 25.01 by mistralai is a new SOTA coding model supporting 80+ programming languages at 2x the speed of its predecessor.
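Token-level distillation trains the small model to match the teacher's full next-token distribution at every position, rather than just the hard labels. A minimal PyTorch sketch (the temperature, weighting, and Kyutai's actual recipe are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def token_level_distillation_loss(
    student_logits: torch.Tensor,  # (batch, seq, vocab) from the 2B student
    teacher_logits: torch.Tensor,  # (batch, seq, vocab) from the frozen 7B teacher
    temperature: float = 1.0,      # assumed hyperparameter
) -> torch.Tensor:
    """KL(teacher || student) averaged over every token position."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Flatten (batch, seq) into one axis so batchmean averages per token;
    # scale by t^2 as is conventional when distilling with temperature.
    kl = F.kl_div(
        student_log_probs.flatten(0, 1),
        teacher_probs.flatten(0, 1),
        reduction="batchmean",
    )
    return kl * (t ** 2)

# Toy shapes: 2 sequences of 8 tokens over a 1000-word vocabulary.
student = torch.randn(2, 8, 1000, requires_grad=True)
with torch.no_grad():
    teacher = torch.randn(2, 8, 1000)
loss = token_level_distillation_loss(student, teacher)
loss.backward()
```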
Innovations include AutoRAG for optimizing retrieval-augmented generation pipelines, Agentic RAG for autonomous query reformulation and critique, Multiagent Finetuning using societies of models such as Phi-3, Mistral, LLaMA-3, and GPT-3.5 to improve reasoning, and VideoRAG, which incorporates video content into RAG with large video-language models (LVLMs). A bare-bones RAG loop is sketched below for reference.
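Every variant above builds on the same retrieve-then-generate loop. A bare-bones sketch in Python (the `embed` and `rag_answer` helpers are placeholders standing in for a real embedding model and LLM, not any of the systems named above):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: a real pipeline would call an embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by cosine similarity to the query embedding."""
    q = embed(query)
    scores = [float(q @ embed(d)) for d in docs]
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

def rag_answer(query: str, docs: list[str]) -> str:
    """Stuff the top-k documents into a prompt for a generator LLM."""
    context = "\n".join(retrieve(query, docs))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return prompt  # a real pipeline would pass this prompt to an LLM

corpus = ["Helium-1 is a 2B multilingual LLM.", "Codestral targets code."]
print(rag_answer("What is Helium-1?", corpus))
```

Agentic RAG layers query reformulation and self-critique on top of this loop, while AutoRAG searches over its components (retriever, chunker, reranker) automatically.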
Applications include a dynamic UI AI chat app by skirano on Replit, LangChain tools like DocTalk for voice PDF conversations, AI travel agent tutorials, and news summarization agents. Hyperbolic Labs offers competitive GPU rentals including H100, A100, and RTX 4090. LLMQuoter enhances RAG accuracy by identifying key quotes.
Infrastructure updates include MLX export of LLM inference from Python to C++ by fchollet, and SemHash semantic text deduplication shared by philschmid; a sketch of embedding-based deduplication follows below.
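Semantic deduplication drops documents whose embeddings are nearly identical to one already kept. A minimal greedy-threshold sketch of the idea in Python (with the same placeholder `embed` as above; SemHash's real API and algorithm may differ):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: a real pipeline would use a sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def semantic_dedup(texts: list[str], threshold: float = 0.9) -> list[str]:
    """Greedily keep a text only if it is not too similar to any kept one."""
    kept_texts: list[str] = []
    kept_vecs: list[np.ndarray] = []
    for t in texts:
        v = embed(t)
        # Cosine similarity reduces to a dot product on unit vectors.
        if all(float(v @ u) < threshold for u in kept_vecs):
            kept_texts.append(t)
            kept_vecs.append(v)
    return kept_texts
```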
$1150m for SSI, Sakana, You.com + Claude 500k context
olmo llama2-13b-chat claude claude-3.5-sonnet safe-superintelligence sakana-ai you-com perplexity-ai anthropic ai2 mixture-of-experts model-architecture model-training gpu-costs retrieval-augmented-generation video-generation ai-alignment enterprise-ai agentic-ai command-and-control ilya-sutskever mervenoyann yuchenj_uw rohanpaul_ai ctojunior omarsar0
Safe Superintelligence raised $1 billion at a $5 billion valuation, focusing on safety, with Ilya Sutskever hinting at new search-based approaches. Sakana AI secured a $100 million Series A, emphasizing nature-inspired collective intelligence. You.com pivoted to a ChatGPT-like productivity agent after a $50 million Series B, while Perplexity AI raised over $250 million this summer. Anthropic launched Claude for Enterprise with a 500K-token context window. AI2 released a 64-expert Mixture-of-Experts (MoE) model, OLMoE, that outperforms Llama2-13B-Chat (a routing sketch follows below). Key research trends include efficient MoE architectures, challenges in AI alignment and GPU costs, and emerging AI agents for autonomous tasks. Development highlights feature command and control for video generation, Retrieval-Augmented Generation (RAG) efficiency, and GitHub integration under Anthropic's Enterprise plan. "Our logo is meant to invoke the idea of a school of fish coming together and forming a coherent entity from simple rules as we want to make use of ideas from nature such as evolution and collective intelligence in our research."
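In an MoE layer of this kind, a router sends each token to a small subset of the 64 experts, so only a fraction of the parameters are active per token. A toy top-k router in Python (dimensions, the linear experts, and the top-k value are illustrative assumptions, not OLMoE's exact configuration):

```python
import numpy as np

def moe_layer(x: np.ndarray, w_router: np.ndarray,
              experts: list, k: int = 8) -> np.ndarray:
    """Toy top-k MoE routing for a single token vector x.

    x: (d,) token hidden state; w_router: (d, num_experts);
    experts: list of callables, each standing in for a small FFN.
    """
    logits = x @ w_router               # (num_experts,) router scores
    top = np.argsort(logits)[::-1][:k]  # pick the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()            # softmax over the chosen k only
    # Weighted sum of the selected experts' outputs; the other experts
    # are never evaluated, which is where the compute saving comes from.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy setup: 64 experts, each a random linear map on a 16-dim state.
rng = np.random.default_rng(0)
d, n_exp = 16, 64
mats = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_exp)]
experts = [(lambda m: (lambda v: m @ v))(m) for m in mats]
x = rng.standard_normal(d)
w_router = rng.standard_normal((d, n_exp)) / np.sqrt(d)
y = moe_layer(x, w_router, experts, k=8)
print(y.shape)  # (16,)
```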