Company: "google-deepmind"
Reasoning Price War 2: Mistral Magistral + o3's 80% price cut + o3-pro
o3 o3-pro gpt-4.1 claude-4-sonnet gemini-2.5-pro magistral-small magistral-medium mistral-small-3.1 openai anthropic google-deepmind mistral-ai perplexity-ai reasoning token-efficiency price-cut benchmarking open-source model-releases context-windows gpu-optimization swyx sama scaling01 polynoamial nrehiew_ kevinweil gdb flavioad stevenheidel aravsrinivas
OpenAI announced an 80% price cut for its o3 model, making it competitively priced with GPT-4.1 and rivaling Anthropic's Claude 4 Sonnet and Google's Gemini 2.5 Pro. Alongside the cut, o3-pro was released as a more powerful and reliable variant, though early benchmarks showed mixed performance relative to cost. Mistral AI launched its Magistral reasoning models, including an open-source 24B parameter version optimized for efficient deployment on consumer GPUs. The price reduction and new model releases signal intensified competition in reasoning-focused large language models, with notable improvements in token efficiency and cost-effectiveness.
AI Engineer World's Fair Talks Day 1
gemini-2.5 gemma claude-code mistral cursor anthropic openai aie google-deepmind meta-ai-fair agent-based-architecture open-source model-memorization scaling-laws quantization mixture-of-experts language-model-memorization model-generalization langgraph model-architecture
Mistral launched Mistral Code, and Cursor released version 1.0. Anthropic improved Claude Code plans, while OpenAI announced expanded ChatGPT connections. The day was dominated by AIE keynotes and tracks including GraphRAG, RecSys, and Tiny Teams. On Reddit, Google open-sourced the DeepSearch stack for building AI agents with Gemini 2.5 and LangGraph, enabling flexible agent architectures and integration with local LLMs like Gemma. A new Meta paper analyzed language model memorization, showing GPT-style transformers store about 3.5–4 bits/parameter and exploring the transition from memorization to generalization, with implications for Mixture-of-Experts models and quantization effects.
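A back-of-the-envelope reading of the bits-per-parameter figure (the model size below is a hypothetical chosen for illustration, not from the paper):

```python
bits_per_param = 3.6   # mid-range of the reported 3.5-4 bits/parameter
n_params = 8e9         # hypothetical 8B-parameter model

capacity_gb = bits_per_param * n_params / 8 / 1e9  # bits -> gigabytes
print(f"~{capacity_gb:.1f} GB of memorizable content")  # ~3.6 GB
# Once a training set's unique information exceeds this budget, the model
# cannot memorize it all and is pushed toward generalization.
```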
DeepSeek-R1-0528 - Gemini 2.5 Pro-level model, SOTA Open Weights release
deepseek-r1-0528 gemini-2.5-pro qwen-3-8b qwen-3-235b deepseek-ai anthropic meta-ai-fair nvidia alibaba google-deepmind reinforcement-learning benchmarking model-performance open-weights reasoning quantization post-training model-comparison artificialanlys scaling01 cline reach_vb zizhpan andrewyng teortaxestex teknim1 lateinteraction abacaj cognitivecompai awnihannun
DeepSeek R1-0528 marks a significant upgrade, closing the gap with proprietary models like Gemini 2.5 Pro and surpassing models from Anthropic, Meta, NVIDIA, and Alibaba on benchmarks. This Chinese open-weights model leads in several AI benchmarks, driven by reinforcement learning post-training rather than architecture changes, and demonstrates increased reasoning token usage (23K tokens per question). The China-US AI race intensifies as Chinese labs accelerate innovation through transparency and open research culture. Key benchmarks include AIME 2024, LiveCodeBench, and GPQA Diamond.
not much happened today
claude-4 claude-4-opus claude-4-sonnet gemini-2.5-pro gemma-3n imagen-4-ultra anthropic google-deepmind openai codebase-understanding coding agentic-performance multimodality text-to-speech video-generation model-integration benchmarking memory-optimization cline amanrsanger ryanpgreenblatt johnschulman2 alexalbert__ nearcyan mickeyxfriedman jeremyphoward gneubig teortaxesTex scaling01 artificialanlys philschmid
Anthropic's Claude 4 models (Opus 4, Sonnet 4) demonstrate strong coding abilities, with Sonnet 4 achieving 72.7% on SWE-bench and Opus 4 at 72.5%. Claude Sonnet 4 excels in codebase understanding and is considered SOTA on large codebases. Criticism arose over Anthropic's handling of ASL-3 security requirements. Demand for Claude 4 is high, with integration into IDEs and support from Cherry Studio and FastHTML. Google DeepMind introduced Gemini 2.5 Pro Deep Think and Gemma 3n, a mobile multimodal model reducing RAM usage by nearly 3x. Google's Imagen 4 Ultra ranks third in the Artificial Analysis Image Arena, available on Vertex AI Studio. Google also promoted Google Beam, an AI video model for immersive 3D experiences, and new text-to-speech models with multi-speaker support. The GAIA benchmark shows Claude 4 Opus and Sonnet leading in agentic performance.
OpenAI buys Jony Ive's io for $6.5b, LMArena lands $100m seed from a16z
gemini-2.5-pro gemini-diffusion openai lmarena a16z mistral-ai google google-deepmind multimodality reasoning code-generation math model-fine-tuning ai-assistants voice memory-optimization sundar_pichai
OpenAI confirmed a partnership with Jony Ive to develop consumer hardware. LMArena secured a $100 million seed round from a16z. Mistral launched a new code model fine-tune. Google DeepMind announced multiple updates at Google I/O 2025, including over a dozen new models and 20 AI products. Key highlights include the release of Gemini 2.5 Pro and Gemini Diffusion, featuring advanced multimodal reasoning, coding, and math capabilities, and integration of Gemini in Google Chrome as an AI browsing assistant. The Deep Think enhanced reasoning mode and Project Astra improvements were also introduced, focusing on voice output, memory, and computer control for a universal AI assistant.
Google I/O: new Gemini native voice, Flash, DeepThink, AI Mode (DeepSearch+Mariner+Astra)
gemini-2.5-pro gemini-2.5 google google-deepmind ai-assistants reasoning generative-ai developer-tools ai-integration model-optimization ai-application model-updates ai-deployment model-performance demishassabis philschmid jack_w_rae
Google I/O 2025 showcased significant advancements with Gemini 2.5 Pro and the Deep Think reasoning mode from Google DeepMind, emphasizing AI-driven transformations and developer opportunities. The Gemini App aims to become a universal AI assistant on the path to AGI, with new features like AI Mode in Google Search expanding generative AI access. The event included multiple keynotes and updates on over a dozen models and 20+ AI products, highlighting Google's leadership in AI innovation. Influential voices like demishassabis and philschmid provided insights and recaps, while the launch of Jules as a competitor to Codex/Devin was noted.
ChatGPT Codex, OpenAI's first cloud SWE agent
codex-1 openai-o3 codex-mini gemma-3 blip3-o qwen-2.5 marigold-iid deepseek-v3 lightlab gemini-2.0 lumina-next openai runway salesforce qwen deepseek google google-deepmind j1 software-engineering parallel-processing multimodality diffusion-models depth-estimation scaling-laws reinforcement-learning fine-tuning model-performance multi-turn-conversation reasoning audio-processing sama kevinweil omarsar0 iscienceluvr akhaliq osanseviero c_valenzuelab mervenoyann arankomatsuzaki jasonwei demishassabis philschmid swyx teortaxestex jaseweston
OpenAI launched Codex, a cloud-based software engineering agent powered by codex-1 (an optimized version of OpenAI o3) available in research preview for Pro, Enterprise, and Team ChatGPT users, featuring parallel task execution like refactoring and bug fixing. The Codex CLI was enhanced with quick sign-in and a new low-latency model, codex-mini. Gemma 3 is highlighted as the best open model runnable on a single GPU. Runway released the Gen-4 References API for style transfer in generation. Salesforce introduced BLIP3-o, a unified multimodal model family using diffusion transformers for CLIP image features. The Qwen 2.5 models (1.5B and 3B versions) were integrated into the PocketPal app with various chat templates. Marigold IID, a new state-of-the-art open-source depth estimation model, was released.
In research, DeepSeek shared insights on scaling and hardware for DeepSeek-V3. Google unveiled LightLab, a diffusion-based method for controlling light sources in images. Google DeepMind's AlphaEvolve uses Gemini 2.0 to discover new math and reduce costs without reinforcement learning. Omni-R1 studied audio's role in fine-tuning audio LLMs. Qwen proposed a parallel scaling law inspired by classifier-free guidance. Salesforce released Lumina-Next on the Qwen base, outperforming Janus-Pro. A study found LLM performance degrades in multi-turn conversations due to unreliability. J1 incentivizes LLM-as-a-Judge thinking via reinforcement learning. A new Qwen study correlates question and strategy similarity to predict reasoning strategies.
Gemini's AlphaEvolve agent uses Gemini 2.0 to find new Math and cuts Gemini cost 1% — without RL
gemini gpt-4.1 gpt-4o-mini o3 o4-mini google-deepmind openai algorithm-discovery coding-agents matrix-multiplication optimization reinforcement-learning model-weights training-efficiency safety-evaluations instruction-following coding-tasks model-releases _philschmid scott_swingle alex_dimakis henry jason_wei kevinweil michpokrass scaling01 gdb
DeepMind's AlphaEvolve, a 2025 update to AlphaTensor and FunSearch, is a Gemini-powered coding agent for algorithm discovery that designs faster matrix multiplication algorithms, solves open math problems, and improves data center and AI training efficiency. It achieves a 23% kernel speedup in Gemini training and surpasses state-of-the-art on 20% of applied problems, including improvements on the Minimum Overlap Problem and the kissing number problem. Unlike deep RL, it optimizes pieces of code rather than model weights. Meanwhile, OpenAI released GPT-4.1 in ChatGPT, specializing in coding and instruction following, with a faster alternative, GPT-4.1 mini, replacing GPT-4o mini for all users. OpenAI also launched the Safety Evaluations Hub and the OpenAI to Z Challenge using o3/o4-mini and GPT-4.1 models to discover archaeological sites. "Maybe midtrain + good search is all you need for AI for scientific innovation" - Jason Wei.
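A toy sketch of the evolve-propose-evaluate loop AlphaEvolve is built around. This is not DeepMind's code: `llm_propose` would be a Gemini call that rewrites a code block, and `evaluate` a task-specific scorer such as measured kernel runtime; here a string-matching stand-in keeps the sketch runnable.

```python
import random
import string

TARGET = "fast_kernel"  # toy objective standing in for "a faster program"

def llm_propose(parent: str) -> str:
    """Mutate a candidate; AlphaEvolve prompts Gemini to rewrite code blocks."""
    i = random.randrange(len(parent))
    return parent[:i] + random.choice(string.ascii_lowercase + "_") + parent[i + 1:]

def evaluate(candidate: str) -> float:
    """Automatic scorer; the real system measures e.g. kernel speed."""
    return float(sum(a == b for a, b in zip(candidate, TARGET)))

def evolve(seed: str, generations: int = 3000, pop_size: int = 20) -> str:
    population = [(evaluate(seed), seed)]
    for _ in range(generations):
        # Tournament selection: the best of a small random sample reproduces.
        parent = max(random.sample(population, min(3, len(population))))[1]
        child = llm_propose(parent)
        population.append((evaluate(child), child))
        population = sorted(population, reverse=True)[:pop_size]  # keep elites
    return max(population)[1]

print(evolve("x" * len(TARGET)))  # converges toward "fast_kernel"
```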
not much happened today
gemini-2.5-flash gemini-2.0-flash mistral-medium-3 llama-4-maverick claude-3.7-sonnet qwen3 pangu-ultra-moe deepseek-r1 o4-mini x-reasoner google-deepmind mistral-ai alibaba huawei openai microsoft deepseek model-performance reasoning cost-analysis reinforcement-learning chain-of-thought multilinguality code-search model-training vision model-integration giffmana artificialanlys teortaxestex akhaliq john__allard
Gemini 2.5 Flash shows a 12 point increase in the Artificial Analysis Intelligence Index but costs 150x more than Gemini 2.0 Flash due to 9x more expensive output tokens and 17x higher token usage during reasoning. Mistral Medium 3 competes with Llama 4 Maverick, Gemini 2.0 Flash, and Claude 3.7 Sonnet with better coding and math reasoning at a significantly lower price. Alibaba's Qwen3 family supports reasoning and multilingual tasks across 119 languages and includes a Web Dev tool for app building. Huawei's Pangu Ultra MoE matches DeepSeek R1 performance on Ascend NPUs, with new compute and upcoming V4 training. OpenAI's o4-mini now supports Reinforcement Fine-Tuning (RFT) using chain-of-thought reasoning. Microsoft's X-REASONER enables generalizable reasoning across modalities post-trained on general-domain text. Deep research integration with GitHub repos in ChatGPT enhances codebase search and reporting. The AI Engineer World's Fair offers an Early Bird discount for upcoming tickets.
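The headline multiple is just the two factors compounding, since cost scales with price per output token times tokens emitted (a sketch of the arithmetic, not an exact billing model):

```python
price_multiplier = 9    # output tokens ~9x pricier than Gemini 2.0 Flash
usage_multiplier = 17   # ~17x more tokens emitted while reasoning
print(f"effective cost multiple ~ {price_multiplier * usage_multiplier}x")  # 153x
```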
AI Engineer World's Fair: Second Run, Twice The Fun
gemini-2.5-pro google-deepmind waymo tesla anthropic braintrust retrieval-augmentation graph-databases recommendation-systems software-engineering-agents agent-reliability reinforcement-learning voice image-generation video-generation infrastructure security evaluation ai-leadership enterprise-ai mcp tiny-teams product-management design-engineering robotics foundation-models coding web-development demishassabis
The 2025 AI Engineer World's Fair is expanding with 18 tracks covering topics like Retrieval + Search, GraphRAG, RecSys, SWE-Agents, Agent Reliability, Reasoning + RL, Voice AI, Generative Media, Infrastructure, Security, and Evals. New focuses include MCP, Tiny Teams, Product Management, Design Engineering, and Robotics and Autonomy featuring foundation models from Waymo, Tesla, and Google. The event highlights the growing importance of AI Architects and enterprise AI leadership. Additionally, Demis Hassabis announced the Gemini 2.5 Pro Preview 'I/O edition', which leads coding and web development benchmarks on LMArena.
Gemini 2.5 Pro Preview 05-06 (I/O edition) - the SOTA vision+coding model
gemini-2.5-pro claude-3.7-sonnet llama-nemotron qwen3 google-deepmind nvidia alibaba hugging-face multimodality coding reasoning model-release speech-recognition recommender-systems benchmarking demishassabis _philschmid lmarena_ai scaling01 fchollet
Gemini 2.5 Pro has been updated with enhanced multimodal image-to-code capabilities and dominates the WebDev Arena Leaderboard, surpassing Claude 3.7 Sonnet in coding and other tasks. Nvidia released the Llama-Nemotron model family on Hugging Face, noted for efficient reasoning and inference. Alibaba's Qwen3 models range from 0.6B to 235B parameters, including dense and MoE variants. KerasRS was released by François Chollet as a new recommender system library compatible with JAX, PyTorch, and TensorFlow, optimized for TPUs. These updates highlight advancements in coding, reasoning, and speech recognition models.
Qwen 3: 0.6B to 235B MoE full+base models that beat R1 and o1
qwen-3 qwen3-235b-a22b qwen3-30b-a3b deepseek-r1 o1 o3-mini grok-3 gemini-2.5-pro alibaba google-deepmind deepseek mistral-ai mixture-of-experts reinforcement-learning benchmarking model-release model-architecture long-context multi-agent-systems inference dataset-release awnihannun prince_canuma actuallyisaak oriolvinyalsml iscienceluvr reach_vb teortaxestex omarsar0
Qwen 3 has been released by Alibaba featuring a range of models including two MoE variants, Qwen3-235B-A22B and Qwen3-30B-A3B, which demonstrate competitive performance against top models like DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. The models introduce an "enable_thinking=True" mode with advanced soft switching for inference scaling. The release is notable for its Apache 2.0 license and broad inference platform support including MCP. The dataset improvements and multi-stage RL post-training contribute to performance gains. Meanwhile, Gemini 2.5 Pro from Google DeepMind shows strong coding and long-context reasoning capabilities, and DeepSeek R2 is anticipated soon. Twitter discussions highlight Qwen3's finegrained MoE architecture, large context window, and multi-agent system applications.
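A minimal sketch of toggling the thinking mode via the chat template, as documented on the Qwen3 model cards (the checkpoint size is chosen for illustration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-0.6B"  # illustrative; the same switch applies to larger sizes
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

messages = [{"role": "user", "content": "Is 9.11 larger than 9.9?"}]
text = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # False switches off the reasoning trace
)
out = model.generate(**tok(text, return_tensors="pt"), max_new_tokens=512)
print(tok.decode(out[0], skip_special_tokens=True))
```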
Grok 3 & 3-mini now API Available
grok-3 grok-3-mini gemini-2.5-flash o3 o4-mini llama-4-maverick gemma-3-27b openai llamaindex google-deepmind epochairesearch goodfireai mechanize agent-development agent-communication cli-tools reinforcement-learning model-evaluation quantization-aware-training model-compression training-compute hybrid-reasoning model-benchmarking
Grok 3 API is now available, including a smaller version called Grok 3 mini, which offers competitive pricing and full reasoning traces. OpenAI released a practical guide for building AI agents, while LlamaIndex supports the Agent2Agent protocol for multi-agent communication. Codex CLI is gaining traction with new features and competition from Aider and Claude Code. Google DeepMind launched Gemini 2.5 Flash, a hybrid reasoning model topping the Chatbot Arena leaderboard. OpenAI's o3 and o4-mini models show emergent behaviors from large-scale reinforcement learning. Epoch AI updated its methodology, removing Llama 4 Maverick from its list of high-FLOP models after revising its training compute estimate downward. Goodfire AI announced a $50M Series A for its Ember neural programming platform. Mechanize was founded to build virtual work environments and automation benchmarks. Google DeepMind's Quantization Aware Training for Gemma 3 models reduces model size significantly, with open-source checkpoints available.
GPT 4.1: The New OpenAI Workhorse
gpt-4.1 gpt-4.1-mini gpt-4.1-nano gpt-4o gemini-2.5-pro openai llama-index perplexity-ai google-deepmind coding instruction-following long-context benchmarks model-pricing model-integration model-deprecation sama kevinweil omarsar0 aidan_mclau danhendrycks polynoamial scaling01 aravsrinivas lmarena_ai
OpenAI released GPT-4.1, including GPT-4.1 mini and GPT-4.1 nano, highlighting improvements in coding, instruction following, and handling long contexts up to 1 million tokens. The model achieves a 54% score on SWE-bench Verified and shows a 60% improvement over GPT-4o on internal benchmarks. Pricing for GPT-4.1 nano is notably low at $0.10/1M input and $0.40/1M output tokens. GPT-4.5 Preview is being deprecated in favor of GPT-4.1. LlamaIndex shipped day-0 integration support. Some negative feedback was noted for GPT-4.1 nano. Additionally, Perplexity's Sonar API ties with Gemini-2.5 Pro for the top spot in the LM Search Arena leaderboard. New benchmarks like MRCR and GraphWalks were introduced alongside updated prompting guides and cookbooks.
Google's Agent2Agent Protocol (A2A)
kimi-vl-a3b gpt-4o llama-4-scout llama-4-maverick llama-4-behemoth deepcoder-14b o3-mini o1 llama-3.1-nemotron-ultra-253b deepseek-r1 google google-deepmind moonshot-ai meta-ai-fair uc-berkeley openai nvidia hugging-face togethercompute deepseek agent-interoperability multimodality vision math reinforcement-learning coding model-training open-source model-benchmarking context-windows streaming push-notifications enterprise-authentication model-release reach_vb _akhaliq epochairesearch artificialanlys winglian danielhanchen yuchenj_uw jeremyphoward
Google Cloud Next announcements featured full MCP support from Google and DeepMind and a new Agent2Agent (A2A) protocol designed for agent interoperability with multiple partners. The protocol includes components like the Agent Card, Task communication channels, Enterprise Auth and Observability, and Streaming and Push Notification support. On the model front, Moonshot AI released Kimi-VL-A3B, a multimodal model with 128K context and strong vision and math benchmark performance, outperforming gpt-4o. Meta AI introduced smaller versions of llama-4 family models: llama-4-scout and llama-4-maverick, with a larger Behemoth model still in training. DeepCoder 14B from UC Berkeley is an open-source coding model rivaling openai's o3-mini and o1 models, trained with reinforcement learning on 24K coding problems. Nvidia released llama-3.1-nemotron-ultra-253b on Hugging Face, noted for beating llama-4-behemoth and maverick and competing with deepseek-r1.
DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level
deepcoder-14b o3-mini o1 gemini-2.5-pro kimi-vl-a3b gpt-4o llama-4-scout maverick behemoth gen-4-turbo imagen-3 together-ai agentica openai bytedance google-deepmind moonshot-ai meta-ai-fair runway open-source reinforcement-learning code-generation multimodality model-training mixture-of-experts l2-normalization image-generation model-performance context-windows philschmid lepikhin reach_vb akhaliq yuchenj_uw epochairesearch danielhanchen c_valenzuelab
Together AI and Agentica released DeepCoder-14B, an open-source 14B parameter coding model rivaling OpenAI's o3-mini and o1 on coding benchmarks, trained with an open-source RL framework from ByteDance and costing about $26,880. Google DeepMind launched Gemini 2.5 Pro with experimental "Flash" versions available to subscribers. Moonshot AI introduced Kimi-VL-A3B, a multimodal model with 128K context outperforming gpt-4o on vision and math benchmarks. Meta AI released Llama 4 Scout and Maverick, with a larger Behemoth model in training, featuring mixture-of-experts and L2 norm techniques. Runway launched Gen-4 Turbo with 10x better results than Gen-3 at the same cost. Google announced Imagen 3, a high-quality text-to-image model now in Vertex AI, enabling easier object removal. The report highlights open-source contributions, reinforcement learning training optimizations, and significant model performance improvements across coding, multimodal, and image generation domains.
not much happened today
gpt-4o deepseek-v3 claude-3.7-sonnet o3-mini gemini-2.5-pro openai deepseek anthropic google-deepmind togethercompute hypertecgroup coreweave cursor-ai windsurf-ai coding instruction-following image-generation policy-compliance long-context audio-processing video-processing gpu-clusters ai-infrastructure api-access sama kevinweil joannejang nrehiew_ giffmana _philschmid scaling01 saranormous
GPT-4o was praised for its improved coding, instruction following, and freedom, becoming the leading non-reasoning coding model surpassing DeepSeek V3 and Claude 3.7 Sonnet in coding benchmarks, though it still lags behind reasoning models like o3-mini. Concerns about policy compliance in image generation were noted, with efforts to improve adherence. Gemini 2.5 Pro was highlighted for its advanced audio and video understanding, long context capabilities, and integration with platforms like Cursor AI and Windsurf AI. AI infrastructure developments include a partnership between Together AI and Hypertec Group to deliver large-scale GPU clusters, and CoreWeave's IPO was celebrated for advancing AI infrastructure. GPU and TPU usage is expected to increase significantly. "GPT-4o's transparency and background generation feature" and "Gemini 2.5 Pro scored above 50% on Simple-Bench AI Explanation" were key highlights.
OpenAI adopts MCP
gemini-2.5-pro gemini-1.5-pro gemini-2.0-flash qwen-2.5-omni-7b deepseek-v3-0324 deepseek-r1 openai google-deepmind alibaba togethercompute model-benchmarking multimodality reasoning scaling-laws model-quantization synthetic-data model-performance context-windows speech-recognition translation audio-processing video-processing swyx
OpenAI announced support for MCP, a significant technical update. Google's Gemini 2.5 Pro leads benchmarks with top scores in MMLU-Pro (86%), GPQA Diamond (83%), and AIME 2024 (88%), featuring a 1 million token context window and multimodal inputs. Alibaba's Qwen 2.5 Omni 7B was released as a fully multimodal, interactive, open-source model with a novel "thinker-talker" architecture supporting voice and video chat. DeepSeek V3-0324 outperforms its predecessor on multiple benchmarks. Research on reasoning features in large language models using sparse autoencoders was highlighted, alongside a study on scaling laws of synthetic data showing performance plateaus near 300B tokens. Discussions also covered the fastest output speeds of Gemini models and concerns about over-reliance on benchmarks for intelligence measurement. Swyx will curate the Data Council AI Engineering Track in April.
Gemini 2.5 Pro + 4o Native Image Gen
gemini-2.5-pro gpt-4o google-deepmind openai lmarena_ai autoregressive-models multimodality reasoning coding instruction-following model-release leaderboards noam-shazeer allan-jabri gabe-goh
Gemini 2.5 Pro from Google DeepMind has become the new top AI model, surpassing Grok 3 by 40 LMArena points, with contributions from Noam Shazeer integrating Flash Thinking techniques. It is available as a free, rate-limited experimental model. Meanwhile, OpenAI released GPT-4o native image generation, an autoregressive image model, with detailed insights shared by Allan Jabri and credits to Gabe Goh. Gemini 2.5 Pro excels in reasoning, coding, STEM, multimodal tasks, and instruction following, topping the LMArena leaderboard significantly. It is accessible via Google AI Studio and the Gemini App.
not much happened today
gemini-2.0-flash-thinking command-a qwq-32b gemma-3-27b gemma-3 shieldgemma-2 llama-3-70b deepseek-r1 o1-mini deepseek-v3 google-deepmind cohere meta-ai-fair alibaba hugging-face model-updates model-performance benchmarking reinforcement-learning transformers normalization-layers image-generation vision memory-efficiency context-windows fine-tuning yann-lecun
Google DeepMind announced updates to Gemini 2.0, including an upgraded Flash Thinking model with stronger reasoning and native image generation capabilities. Cohere launched Command A, a 111B parameter dense model with a 256K context window and competitive pricing, available on Hugging Face. Meta AI proposed Dynamic Tanh (DyT) as a replacement for normalization layers in Transformers, supported by Yann LeCun. Alibaba released QwQ-32B, a 32.5B parameter model excelling in math and coding, fine-tuned with reinforcement learning and freely available under Apache 2.0 license. Google DeepMind also released Gemma 3 models ranging from 1B to 27B parameters with a 128K token context window and over 140 language support, plus ShieldGemma 2, an image safety checker. Benchmarking shows Gemma 3 27B has strong vision and memory efficiency but is outperformed by larger models like Llama 3.3 70B and DeepSeek V3 671B. The Hugging Face LLM leaderboard history was shared by @_lewtun.
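A minimal PyTorch sketch of the proposed DyT layer, y = weight * tanh(alpha * x) + bias with a learnable scalar alpha, as a drop-in for a LayerNorm over the last dimension (the init value follows the paper's reported default; treat details as illustrative):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: replaces normalization with a learnable squashing."""
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learnable scalar
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * torch.tanh(self.alpha * x) + self.bias

layer = DyT(dim=768)
print(layer(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```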
Gemma 3 beats DeepSeek V3 in Elo, 2.0 Flash beats GPT4o with Native Image Gen
gemma-3 gemini-1.5-pro gemini-2 o1-preview o3-mini-high deepseek-v3 claude-3.7-sonnet qwen-2.5-max google-deepmind openai multimodality multilinguality context-window quantization image-generation model-benchmarking model-performance vision reach_vb _philschmid danielhanchen lmarena_ai osanseviero
Google DeepMind launched the Gemma 3 family of models featuring a 128k context window, multimodal input (image and video), and multilingual support for 140+ languages. The Gemma 3-27B model ranks among the top open models on LMArena benchmarks, outperforming several competitors and matching Gemini-1.5-Pro on benchmarks. Additionally, Gemini 2 introduced Flash Native Image Generation with advanced image editing capabilities, a feature teased by OpenAI but not launched. The updates highlight significant advances in context length, multimodality, and model efficiency via quantization.
DeepSeek's Open Source Stack
qwen-qwq-32b start character-3 gemini gemini-2.0 mercury-coder gpt-4.5 jamba-mini-1.6 gemini-2.0-flash gpt-4o-mini mistral-small-3 mistral-ocr deepseek pyspur hugging-face togethercompute hedra-labs google-deepmind deeplearningai openai ai21-labs mistral-ai fine-tuning benchmarking multimodality code-generation diffusion-models model-performance model-optimization ocr embedding-models context-windows runtime-limits _akhaliq lmarena_ai reach_vb danielhanchen _philschmid aidan_mclau vikhyatk jerryjliu0
DeepSeek's Open Source Week was summarized by PySpur, highlighting multiple interesting releases. The Qwen QwQ-32B model was fine-tuned into START, excelling in PhD-level science QA and math benchmarks. Character-3, an omnimodal AI video generation model by Hedra Labs and Together AI, enables realistic animated content creation. Google DeepMind introduced the Gemini embedding model with an 8k context window, ranking #1 on MMTEB, alongside the Gemini 2.0 Code Executor supporting Python libraries and auto-fix features. Inception Labs' Mercury Coder is a diffusion-based code generation model offering faster token processing. OpenAI released GPT-4.5, their largest model yet but with less reasoning ability than some competitors. AI21 Labs launched Jamba Mini 1.6, noted for superior output speed compared to Gemini 2.0 Flash, GPT-4o mini, and Mistral Small 3. A new dataset of 1.9M scanned pages was released for OCR benchmarking, with Mistral OCR showing competitive but not top-tier document parsing performance compared to LLM/LVM-powered methods. "Cracked engineers are all you need."
not much happened today
grok-3 deepseek-r1 siglip-2 o3-mini-high r1-1776 llamba-1b llamba-3b llamba-8b llama-3 alphamaze audiobox-aesthetics xai nvidia google-deepmind anthropic openai bytedance ollama meta-ai-fair benchmarking model-releases performance reasoning multimodality semantic-understanding ocr multilinguality model-distillation recurrent-neural-networks visual-reasoning audio-processing scaling01 iscienceluvr philschmid arankomatsuzaki reach_vb mervenoyann wightmanr lmarena_ai ollama akhaliq
Grok-3, a new family of LLMs from xAI trained on 200,000 Nvidia H100 GPUs for advanced reasoning, outperforms models from Google, Anthropic, and OpenAI on math, science, and coding benchmarks. DeepSeek-R1 achieves top accuracy on ByteDance Research's challenging SuperGPQA benchmark. SigLIP 2 from Google DeepMind improves semantic understanding and OCR with flexible resolutions and multilingual capabilities, available on Hugging Face. OpenAI's o3-mini-high ranks #1 in coding and math prompts. Perplexity's R1 1776, a post-trained version of DeepSeek R1, is available on Ollama. The Llamba family distills Llama-3.x into efficient recurrent models with higher throughput. AlphaMaze combines DeepSeek R1 with GRPO for visual reasoning on ARC-AGI puzzles. Audiobox Aesthetics from Meta AI offers unified quality assessment for audio. The community notes that Grok 3's compute increase yields only modest performance gains.
The Ultra-Scale Playbook: Training LLMs on GPU Clusters
deepseek-native-sparse-attention r1-1776 paligemma-2-mix muse baichuan-m1-14b stripedhyena-2 huggingface deepseek perplexity-ai google-deepmind microsoft baichuan stripedhyena gpu-training scaling multimodality vision model-training foundation-models medical-llm genome-modeling robotic-manipulation interactive-content eliebakouch nouamanetazi lvwerra thom-wolf proftomyeh alex-wang aravsrinivas _akhaliq _philschmid mervenoyann reach_vb arankomatsuzaki maximelabonne
Hugging Face released "The Ultra-Scale Playbook: Training LLMs on GPU Clusters," an interactive blogpost based on 4000 scaling experiments on up to 512 GPUs, providing detailed insights into modern GPU training strategies. DeepSeek introduced Native Sparse Attention (NSA), gaining significant community attention, while Perplexity AI launched R1-1776, an uncensored and unbiased version of DeepSeek's R1 model. Google DeepMind unveiled PaliGemma 2 Mix, a multi-task vision-language model available in 3B, 10B, and 28B sizes. Microsoft introduced Muse, a generative AI model trained on the game Bleeding Edge, and presented Magma, a foundation model for multimodal AI agents excelling in UI navigation and robotic manipulation. Baichuan-M1-14B was announced as a state-of-the-art medical LLM trained on 20T tokens, and a fully open-source 40B genome modeling model using the StripedHyena 2 architecture was also released. "Making your own gaming experience is coming sooner than you'd think," was noted in relation to Muse.
not much happened today
zonos-v0.1 audiobox-aesthetics moshi sonar llama-3-70b gpt-4o-mini claude-3.5-haiku gpt-4o claude-3.5-sonnet deepseek-r1-distilled-qwen-1.5b reasonflux-32b o1-preview zyphra-ai meta-ai-fair kyutai-labs perplexity-ai cerebras uc-berkeley brilliant-labs google-deepmind text-to-speech speech-to-speech benchmarking model-performance reinforcement-learning math real-time-processing open-source cross-platform-integration multilinguality zero-shot-learning danhendrycks
Zyphra AI launched Zonos-v0.1, a leading open-weight text-to-speech model supporting multiple languages and zero-shot voice cloning. Meta FAIR released the open-source Audiobox Aesthetics model trained on 562 hours of audio data. Kyutai Labs introduced Moshi, a real-time speech-to-speech system with low latency. Perplexity AI announced the Sonar model based on Llama 3.3 70b, outperforming top models like GPT-4o and Claude 3.5 Sonnet with 1200 tokens/second speed, powered by Cerebras infrastructure. UC Berkeley open-sourced a 1.5B model trained with reinforcement learning that beats o1-preview on math tasks. ReasonFlux-32B achieved 91.2% on the MATH benchmark, outperforming OpenAI o1-preview. CrossPoster, an AI agent for cross-platform posting, was released using LlamaIndex workflows. Brilliant Labs integrated the Google DeepMind Gemini Live API into smart glasses for real-time translation and object identification.
not much happened today
deepseek-r1 alphageometry-2 claude deepseek openai google-deepmind anthropic langchain adyen open-source reasoning agentic-ai javascript model-release memes ai-development benchmarking akhaliq lmthang aymericroucher vikhyatk swyx
DeepSeek-R1 surpasses OpenAI in GitHub stars, marking a milestone in open-source AI with rapid growth in community interest. AlphaGeometry2 achieves gold-medalist level performance with an 84% solving rate on IMO geometry problems, showcasing significant advancements in AI reasoning. LangChain releases a tutorial for building AI agents in JavaScript, enhancing developer capabilities in agent deployment. Reflections on Anthropic's Claude model reveal early access and influence on AI development timelines. Lighthearted AI humor includes calls to ban second-order optimizers and challenges in web development longevity. The AI Engineer Summit 2025 workshops were announced, continuing community engagement and education.
s1: Simple test-time scaling (and Kyutai Hibiki)
qwen-2.5-32b gemini-2.0-flash smollm2 granite-vision-3.1-2b google-deepmind qwen gemini hugging-face ibm deepseek reasoning fine-tuning scaling-laws open-source-models data-centric-training vision multilingual-models language-model-reasoning niklas-muennighoff
"Wait" is all you need introduces a novel reasoning model finetuned from Qwen 2.5 32B using just 1000 questions with reasoning traces distilled from Gemini 2.0 Flash Thinking, enabling controllable test-time compute by appending "Wait" to extend reasoning. Lead author Niklas Muennighoff, known for work on Bloom, StarCoder, and BIG-bench, highlights this method's efficiency and its reproduction of the famous o1 scaling chart. Additionally, Kyutai Moshi's Hibiki project demonstrates impressive offline French-English live translation on iPhone. Recent AI model releases include DeepSeek R1 and R3 open source models, potentially marking a major open-source milestone, Hugging Face's SmolLM2 emphasizing data-centric training for small LMs, and IBM's Granite-Vision-3.1-2B, a small vision-language model with strong performance. Key research papers spotlight LIMO for minimal demonstration reasoning achieving high accuracy on AIME and MATH benchmarks, and Token-Assisted Reasoning mixing latent and text tokens to improve language model reasoning.
Gemini 2.0 Flash GA, with new Flash Lite, 2.0 Pro, and Flash Thinking
gemini-2.0-flash gemini-2.0-flash-lite gemini-2.0-pro-experimental gemini-1.5-pro deepseek-r1 gpt-2 llama-3-1 google-deepmind hugging-face anthropic multimodality context-windows cost-efficiency pretraining fine-tuning reinforcement-learning transformer tokenization embeddings mixture-of-experts andrej-karpathy jayalammar maartengr andrewyng nearcyan
Google DeepMind officially launched Gemini 2.0 models including Flash, Flash-Lite, and Pro Experimental, with Gemini 2.0 Flash outperforming Gemini 1.5 Pro while being 12x cheaper and supporting multimodal input and a 1 million token context window. Andrej Karpathy released a 3h31m video deep dive into large language models, covering pretraining, fine-tuning, and reinforcement learning with examples like GPT-2 and Llama 3.1. A free course on Transformer architecture was introduced by Jay Alammar, Maarten Grootendorst, and Andrew Ng, focusing on tokenizers, embeddings, and mixture-of-experts models. DeepSeek-R1 reached 1.2 million downloads on Hugging Face with a detailed 36-page technical report. Anthropic increased rewards to $10K and $20K for their jailbreak challenge, while the BlueRaven extension was updated to hide Twitter metrics for unbiased engagement.
How To Scale Your Model, by DeepMind
qwen-0.5 google-deepmind deepseek hugging-face transformers inference high-performance-computing robotics sim2real mixture-of-experts reinforcement-learning bias-mitigation rust text-generation open-source omarsar0 drjimfan tairanhe99 guanyashi lioronai _philschmid awnihannun clementdelangue
Researchers at Google DeepMind (GDM) released a comprehensive "little textbook" titled "How To Scale Your Model" covering modern Transformer architectures, inference optimizations beyond O(N^2) attention, and high-performance computing concepts like rooflines. The resource includes practical problems and real-time comment engagement. On AI Twitter, several key updates include the open-sourced humanoid robotics model ASAP inspired by athletes like Cristiano Ronaldo, LeBron James, and Kobe Bryant; a new paper on Mixture-of-Agents proposing the Self-MoA method for improved LLM output aggregation; training of reasoning LLMs using the GRPO algorithm from DeepSeek demonstrated on Qwen 0.5; findings on bias in LLMs used as judges highlighting the need for multiple independent evaluations; and the release of mlx-rs, a Rust library for machine learning with examples including Mistral text generation. Additionally, Hugging Face launched an AI app store featuring over 400,000 apps with 2,000 new daily additions and 2.5 million weekly visits, enabling AI-powered app search and categorization.
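A minimal sketch of that GRPO recipe using TRL's GRPOTrainer (assuming a TRL release that includes GRPO; the dataset and toy length-based reward below are illustrative, not the demo's actual setup):

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions near 50 characters. A real reasoning
    # setup would verify answers (e.g. check the final boxed math result).
    return [-abs(50 - len(c)) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")
args = GRPOConfig(output_dir="qwen-grpo", logging_steps=10)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # small Qwen model, as in the demo
    reward_funcs=reward_len,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```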
OpenAI takes on Gemini's Deep Research
o3 o3-mini-high o3-deep-research-mini openai google-deepmind nyu uc-berkeley hku reinforcement-learning benchmarking inference-speed model-performance reasoning test-time-scaling agent-design sama danhendrycks ethan-mollick dan-shipper
OpenAI released the full version of the o3 agent, with a new Deep Research variant showing significant improvements on the HLE benchmark and achieving SOTA results on GAIA. The release includes an "inference time scaling" chart demonstrating rigorous research, though some criticism arose over public test set results. The agent is noted as "extremely simple" and currently limited to 100 queries/month, with plans for a higher-rate version. Reception has been mostly positive, with some skepticism. Additionally, advances in reinforcement learning were highlighted, including a simple test-time scaling technique called budget forcing that improved reasoning on math competitions by 27%. Researchers from Google DeepMind, NYU, UC Berkeley, and HKU contributed to these findings. The original Gemini Deep Research team will participate in the upcoming AI Engineer NYC event.
OpenAI launches Operator, its first Agent
operator deepseek-r1 videollama-3 llama-4 o1 claude openai anthropic deepseek-ai google-deepmind perplexity-ai computer-using-agent reasoning multimodality performance-benchmarks open-source ai-safety benchmarking video-generation model-evaluation sam-altman swyx
OpenAI launched Operator, a premium computer-using agent for web tasks like booking and ordering, available now for Pro users in the US with an API promised. It features long-horizon remote VMs up to 20 minutes and video export, showing state-of-the-art agent performance but not yet human-level. Anthropic had launched a similar agent 3 months earlier as an open-source demo. DeepSeek AI unveiled DeepSeek R1, an open-source reasoning model excelling on the Humanity's Last Exam dataset, outperforming models like LLaMA 4 and OpenAI's o1. Alibaba's DAMO Academy open-sourced VideoLLaMA 3, a multimodal foundation model for image and video understanding. Perplexity AI released Perplexity Assistant for Android with reasoning and search capabilities. The Humanity's Last Exam dataset contains 3,000 questions testing AI reasoning, with current models scoring below 10% accuracy, indicating room for improvement. OpenAI's Computer-Using Agent (CUA) shows improved performance on OSWorld and WebArena benchmarks but still lags behind humans. Anthropic AI introduced Citations for safer AI responses. Sam Altman and Swyx commented on Operator's launch and capabilities.
not much happened today
deepseek-v3 llama-3-1-405b gpt-4o gpt-5 minimax-01 claude-3-haiku cosmos-nemotron-34b openai deep-learning-ai meta-ai-fair google-deepmind saama langchain nvidia mixture-of-experts coding math scaling visual-tokenizers diffusion-models inference-time-scaling retrieval-augmented-generation ai-export-restrictions security-vulnerabilities prompt-injection gpu-optimization fine-tuning personalized-medicine clinical-trials ai-agents persistent-memory akhaliq
DeepSeek-V3, a 671 billion parameter mixture-of-experts model, surpasses Llama 3.1 405B and GPT-4o in coding and math benchmarks. Rumors circulated about an upcoming GPT-5 release. MiniMax-01 Coder mode in ai-gradio enables building a chess game in one shot. Meta research highlights trade-offs in scaling visual tokenizers. Google DeepMind improves diffusion model quality via inference-time scaling. The RA-DIT method fine-tunes LLMs and retrievers for better RAG responses. The U.S. proposes a three-tier export restriction system on AI chips and models, excluding countries like China and Russia. Security vulnerabilities in AI chatbots involving CSRF and prompt injection were revealed. Concerns about superintelligence and weapons-grade AI models were expressed. ai-gradio updates include NVIDIA NIM compatibility and new models like cosmos-nemotron-34b. LangChain integrates with Claude-3-haiku for AI agents with persistent memory. Triton Warp specialization optimizes GPU usage for matrix multiplication. Saama's fine-tuned Llama models, OpenBioLLM-8B and OpenBioLLM-70B, target personalized medicine and clinical trials.
ModernBert: small new Retriever/Classifier workhorse, 8k context, 2T tokens,
modernbert gemini-2.0-flash-thinking o1 llama answerdotai lightonio hugging-face google-deepmind openai meta-ai-fair figure encoder-only-models long-context alternating-attention natural-language-understanding reasoning robotics-simulation physics-engine humanoid-robots model-performance model-releases jeremyphoward alec-radford philschmid drjimfan bindureddy
Answer.ai/LightOn released ModernBERT, an updated encoder-only model with 8k token context, trained on 2 trillion tokens including code, with 139M/395M parameters and state-of-the-art performance on retrieval, NLU, and code tasks. It features Alternating Attention layers mixing global and local attention. Gemini 2.0 Flash Thinking debuted as #1 in Chatbot Arena, and the O1 model scored top in reasoning benchmarks. Llama downloads surpassed 650 million, doubling in 3 months. OpenAI launched desktop app integrations with voice capabilities. Figure delivered its first humanoid robots commercially. Advances in robotics simulation and a new physics engine Genesis claiming 430,000x faster than real-time were highlighted.
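A quick usage sketch, assuming the answerdotai/ModernBERT-base checkpoint on Hugging Face and a transformers version that includes the architecture:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
for pred in fill("The capital of France is [MASK]."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```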
Genesis: Generative Physics Engine for Robotics (o1-mini version)
o1 o1-preview gpt-4o claude-3.5-sonnet gemini-2.0-pro llama-3-3b llama-3-70b openai google-deepmind meta-ai-fair hugging-face function-calling structured-outputs vision performance-benchmarks sdk webrtc reasoning math code-generation transformer-architecture model-training humanoid-robots search model-efficiency dataset-sharing aidan_mclau sundarpichai adcock_brett
OpenAI launched the o1 model API featuring function calling, structured outputs, vision support, and developer messages, achieving 60% fewer reasoning tokens than its preview. The model excels in math and code with a 0.76 LiveBench Coding score, outperforming Sonnet 3.5. Beta SDKs for Go and Java and WebRTC support with 60% lower prices were also released. Google Gemini 2.0 Pro (Gemini Exp 1206) deployment accelerated, showing improved coding, math, and reasoning performance. Meta AI FAIR introduced research on training transformers directly on raw bytes using dynamic entropy-based patching. Commercial humanoid robots were successfully deployed by an industry player. Hugging Face researchers demonstrated that their 3B Llama model can outperform the 70B Llama model on MATH-500 accuracy using search techniques, highlighting efficiency gains with smaller models. Concerns about reproducibility and domain-specific limitations were noted.
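A minimal best-of-N sketch of the search idea: sample several candidate solutions from a small model and keep the one a scorer ranks highest. The HF study scored candidates with a process reward model; the scorer below is a placeholder, and the checkpoint name is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-3B-Instruct"  # illustrative small model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def score(solution: str) -> float:
    # Placeholder: a real setup scores each reasoning step with a PRM.
    return float(len(set(solution.split())))

prompt = "What is the sum of the first 100 positive integers? Show your work."
ids = tok(prompt, return_tensors="pt").input_ids

candidates = []
for _ in range(8):  # N = 8 sampled solutions
    out = model.generate(ids, max_new_tokens=256, do_sample=True, temperature=0.8)
    candidates.append(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))

print(max(candidates, key=score))  # keep the highest-scoring candidate
```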
OpenAI Voice Mode Can See Now - After Gemini Does
gemini-2.0-flash claude claude-3.5-sonnet llama-3-70b llama-3 mistral-large gpt-4o openai google-deepmind anthropic togethercompute scale-ai meta-ai-fair mistral-ai multimodality real-time-streaming roleplay prompt-handling model-comparison model-training creative-writing model-censorship code-execution developer-ecosystem ai-humor bindureddy
OpenAI launched Realtime Video shortly after Gemini, landing with less impact because Gemini arrived earlier, at lower cost and with fewer rate limits. Google DeepMind released Gemini 2.0 Flash featuring enhanced multimodal capabilities and real-time streaming. Anthropic introduced Clio, a system analyzing real-world usage of Claude models. Together AI acquired CodeSandbox to launch a code interpreter tool. Discussions highlighted Meta's Llama 3.3-70B for its advanced roleplay and prompt handling abilities, outperforming models like Mistral Large and GPT-4o in expressiveness and lighter censorship. The AI community also engaged in humorous takes on AI outages and model competition, with ChatGPT adding a Santa mode for holiday interactions. "Anthropic is capturing the developer ecosystem, Gemini has AI enthusiast mindshare, ChatGPT reigns over AI dabblers" was a noted observation from the community.
o1 API, 4o/4o-mini in Realtime API + WebRTC, DPO Finetuning
o1-2024-12-17 o1 o1-pro 4o 4o-mini gemini-2-0-flash claude-3.5-sonnet claude-3.5 openai google google-deepmind function-calling structured-outputs vision reasoning webrtc realtime-api preference-tuning fine-tuning api model-performance aidan_mclau kevinweil simonw michpokrass morgymcg juberti
OpenAI launched the o1 API with enhanced features including vision inputs, function calling, structured outputs, and a new reasoning_effort parameter, achieving 60% fewer reasoning tokens on average. The o1 pro variant is confirmed as a distinct implementation coming soon. Improvements to the Realtime API with WebRTC integration offer easier usage, longer sessions (up to 30 minutes), and significantly reduced pricing (up to 10x cheaper with mini models). DPO Preference Tuning for fine-tuning is introduced, currently available for the 4o model. Additional updates include official Go and Java SDKs and OpenAI DevDay videos. The news also highlights discussions of the Google Gemini 2.0 Flash model's performance reaching 83.6% accuracy.
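A sketch of calling the new parameter with the Python openai SDK (the model ID and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="o1-2024-12-17",
    reasoning_effort="low",  # "low" | "medium" | "high"
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
print(resp.choices[0].message.content)
```

Meta Apollo - Video Understanding up to 1 hour, SOTA Open Weights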
apollo-1b apollo-3b apollo-7b veo-2 imagen-3 llama-3-70b llama-3b command-r7b llama-1b llama-8b chatgpt meta-ai-fair hugging-face google-deepmind openai figure-ai klarna cohere notion video-understanding scaling-consistency benchmarking temporal-ocr egocentric-perception spatial-perception reasoning video-generation physics-simulation voice-features map-integration language-expansion test-time-compute-scaling humanoid-robots ai-integration search-optimization self-recognition self-preference-bias akhaliq _lewtun clementdelangue adcock_brett rohanpaul_ai swyx shaneguML
Meta released Apollo, a new family of state-of-the-art video-language models available in 1B, 3B, and 7B sizes, featuring "Scaling Consistency" for efficient scaling and introducing ApolloBench, which speeds up video understanding evaluation by 41× across five temporal perception categories. Google Deepmind launched Veo 2, a 4K video generation model with improved physics and camera control, alongside an enhanced Imagen 3 image model. OpenAI globally rolled out ChatGPT search with advanced voice and map features and discussed a potential $2,000/month "ChatGPT Max" tier. Research highlights include achieving Llama 70B performance using Llama 3B via test-time compute scaling and expanding Command R7B language support from 10 to 23 languages. Industry updates feature Figure AI delivering humanoid robots commercially and Klarna reducing workforce through AI. Notion integrated Cohere Rerank for better search. Studies reveal LLMs can recognize their own writing style and show self-preference bias. Discussions note video processing progress outpacing text due to better signal-per-compute and data evaluation.
Google wakes up: Gemini 2.0 et al
gemini-2.0-flash gemini-1.5-pro gemini-exp-1206 claude-3.5-sonnet opus google-deepmind openai apple multimodality agent-development multilinguality benchmarking model-releases demis-hassabis sundar-pichai paige-bailey bindureddy
Google DeepMind launched Gemini 2.0 Flash, a new multimodal model outperforming Gemini 1.5 Pro and o1-preview, featuring vision and voice APIs, multilingual capabilities, and native tool use. It powers new AI agents like Project Astra and Project Mariner, with Project Mariner achieving state-of-the-art 83.5% on the WebVoyager benchmark. OpenAI announced ChatGPT integration with Apple devices, enabling Siri access and visual intelligence features. Claude 3.5 Sonnet is speculated to be a distilled version of Opus. The AI community's response at NeurIPS 2024 has been overwhelmingly positive, signaling a strong comeback for Google in AI innovation. Key topics include multimodality, agent development, multilinguality, benchmarking, and model releases.
ChatGPT Canvas GA
llama-3-70b llama-3-1-8b tgi-v3 deepseek-v2.5-1210 coconut openai deepseek-ai meta-ai-fair huggingface cognition-labs hyperbolic google-deepmind code-execution gpt-integration model-finetuning gradient-checkpointing context-length latent-space-reasoning performance-optimization gpu-memory-optimization kubernetes gpu-marketplace ai-capabilities employment-impact neurips-2024 ai-scaling humor arav_srinivas sama jonathan-frankle dylan
OpenAI launched ChatGPT Canvas to all users, featuring code execution and GPT integration, effectively replacing Code Interpreter with a Google Docs-like interface. DeepSeek AI announced their V2.5-1210 update, improving performance on MATH-500 (82.8%) and LiveCodeBench. Meta AI FAIR introduced COCONUT, a new continuous latent space reasoning paradigm. Hugging Face released TGI v3, processing 3x more tokens and running 13x faster than vLLM on long prompts. Cognition Labs released Devin, an AI developer building Kubernetes operators. Hyperbolic raised a $12M Series A to build an open AI platform with an H100 GPU marketplace. Discussions included AI capabilities and employment impact, and NeurIPS 2024 announcements with Google DeepMind demos and a debate on AI scaling. On Reddit, Llama 3.3-70B supports 90K context length finetuning using Unsloth with gradient checkpointing and Apple's Cut Cross Entropy (CCE) algorithm, fitting on 41GB VRAM. Llama 3.1-8B reaches 342K context lengths with Unsloth, surpassing native limits.
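A minimal sketch of that long-context Unsloth setup (the checkpoint name and LoRA hyperparameters are illustrative, not the Reddit poster's exact configuration):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.3-70B-Instruct-bnb-4bit",  # illustrative
    max_seq_length=90_000,   # the 90K context reported above
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_gradient_checkpointing="unsloth",  # offloads activations to fit VRAM
)
```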
Meta Llama 3.3: 405B/Nova Pro performance at 70B price
llama-3-70b llama-3.3-70b gpt-4o gemini-exp-1206 meta-ai-fair openai google-deepmind hugging-face llamacloud reinforcement-learning fine-tuning model-performance document-processing pricing-models alignment online-rl sama steven-heidel aidan_mclau lmarena_ai oriolvinyalsml jerryjliu0
Meta AI released Llama 3.3 70B, matching the performance of the 405B model with improved efficiency using "a new alignment process and progress in online RL techniques". OpenAI announced Reinforcement Fine-Tuning (RFT) for building expert models with limited data, offering alpha access to researchers and enterprises. Google DeepMind's Gemini-Exp-1206 leads benchmarks, tying with GPT-4o in coding performance. LlamaCloud enhanced document processing with table extraction and analytics. Discussions on OpenAI's pricing plans continue in the community.
not much happened today
o1-full sora gpt-4.5 gpt-4 claude-3.5-sonnet llama-3-1-nemotron-51b llama-3-1 llama-3 nemotron-51b openai google-deepmind anthropic nvidia huggingface vision model-performance neural-architecture-search model-optimization multimodality model-release model-training reinforcement-learning image-generation lucas-beyer alexander-kolesnikov xiaohua-zhai aidan_mclau giffmana joannejang sama
OpenAI announced their "12 Days of OpenAI" event with daily livestreams and potential releases including the O1 full model, Sora video model, and GPT-4.5. Google DeepMind released the GenCast weather model capable of 15-day forecasts in 8 minutes using TPU chips, and launched Genie 2, a model generating playable 3D worlds from single images. Leading vision researchers Lucas Beyer, Alexander Kolesnikov, and Xiaohua Zhai moved from DeepMind to OpenAI, which is opening a Zürich office. Criticism arose over OpenAI's strategy and model quality compared to Anthropic and Claude 3.5 Sonnet. On Reddit, a modified llama.cpp supports Nvidia's Llama-3_1-Nemotron-51B, matching performance of larger 70B models via NAS optimization.
Olympus has dropped (aka, Amazon Nova Micro|Lite|Pro|Premier|Canvas|Reel)
amazon-nova claude-3 llama-3-70b gemini-1.5-flash gpt-4o amazon anthropic google-deepmind sakana-ai-labs multimodality benchmarking model-merging model-performance model-architecture model-optimization population-based-learning philschmid bindureddy
Amazon announced the Amazon Nova family of multimodal foundation models at AWS Re:Invent, available immediately with no waitlist in configurations like Micro, Lite, Pro, Canvas, and Reel, with Premier and speech-to-speech coming next year. These models offer 2-4x faster token speeds and are 25%-400% cheaper than competitors like Anthropic Claude models, positioning Nova as a serious contender in AI engineering. Pricing undercuts models such as Google DeepMind Gemini Flash 8B, and some Nova models extend context length up to 300k tokens. However, benchmarking controversy exists as some evaluations show Nova scoring below Llama-3 70B in LiveBench AI metrics. Separately, CycleQD was introduced by Sakana AI Labs, using evolutionary computation for population-based model merging to develop niche LLM agents.
not much happened to end the week
gemini deepseek-r1 o1 chatgpt gpt-4 claude-3.5-sonnet o1-preview o1-mini gpt4o qwq-32b google-deepmind deeplearningai amazon tesla x-ai alibaba ollama multimodality benchmarking quantization reinforcement-learning ai-safety translation reasoning interpretability model-comparison humor yoshua-bengio kevinweil ylecun
AI News for 11/29/2024-11/30/2024 covers key updates including the Gemini multimodal model advancing in musical structure understanding, a new quantized SWE-Bench for benchmarking at 1.3 bits per task, and the launch of the DeepSeek-R1 model focusing on transparent reasoning as an alternative to o1. The establishment of the 1st International Network of AI Safety Institutes highlights global collaboration on AI safety. Industry updates feature Amazon's Olympus AI model, Tesla's Optimus, and experiments with ChatGPT as a universal translator. Community reflections emphasize the impact of large language models on daily life and medical AI applications. Discussions include scaling sparse autoencoders to gpt-4 and the need for transparency in reasoning LLMs. The report also notes humor around ChatGPT's French nickname.
LMSys killed Model Versioning (gpt 4o 1120, gemini exp 1121)
gpt-4o-2024-11-20 gemini-exp-1121 deepseek-r1 openai google-deepmind anthropic deepseek mistral-ai model-release model-ranking open-source vision coding reasoning market-competition
AI News for 11/21/2024-11/22/2024 highlights the intense frontier lab race with OpenAI's gpt-4o-2024-11-20 and Google DeepMind's gemini-exp-1121 trading top spots on the Lmsys leaderboard. The trend of using date-based model identifiers instead of traditional versioning is noted across leading labs including Anthropic. DeepSeek R1 is gaining attention as a potent open-source alternative, especially in the context of the AI competition between China and the US. Gemini-Exp-1121 is praised for improvements in vision, coding, and reasoning, while MistralAI expands with a new Palo Alto office, signaling growth and hiring.
DeepSeek-R1 claims to beat o1-preview AND will be open sourced
deepseek-r1-lite-preview o1-preview hopper blackwell alphaqubit deepseek nvidia google-deepmind reasoning benchmarking quantum-error-correction quantum-computing model-performance model-release yann-lecun
DeepSeek has released DeepSeek-R1-Lite-Preview, an open-source reasoning model achieving o1-preview-level performance on math benchmarks with transparent thought processes, showing promise in real-time problem-solving. NVIDIA reported a record $35.1 billion revenue in Q3 with 112% year-on-year data center growth, driven by Hopper and Blackwell architectures, the latter offering 2.2x performance improvement. Google DeepMind introduced AlphaQubit, a quantum computing system improving error correction and outperforming leading decoders, though challenges remain in scaling and speed. The AI community continues to focus on reasoning models, benchmarking, and quantum error correction advancements.
GitHub Copilot Strikes Back
claude-3-5-sonnet gemini-1.5-pro o1-preview gemini-flash-8b github anthropic google-deepmind openai weights-biases model-picker-ui multi-model-integration natural-language-applications deployment-free-hosting model-prompting multimodal-observability audio-tracing codebase-optimization price-performance-ratio cassidy-williams fchollet rohanpaul_ai jxmnop
GitHub's tenth annual Universe conference introduced the Multi-model Copilot featuring Anthropic's Claude 3.5 Sonnet, Google's Gemini 1.5 Pro, and OpenAI's o1-preview models in a new picker UI, allowing developers to choose from multiple companies' models. The event also showcased GitHub Spark, an AI-native tool for building natural language applications with deployment-free hosting and integrated model prompting. Additionally, GitHub updated its Copilot Workspace with new agents and security Autofix features. Weights & Biases launched Weave with multimodal observability supporting audio, text, and images, integrating the OpenAI Realtime API. Twitter recaps highlighted tinygrad's codebase optimization and discussions on GenAI adoption and Gemini Flash-8B's cost efficiency at $0.0375 per million tokens.
not much happened this weekend
claude-3.5-sonnet llama-3 llama-3-8b notebookllama min-omni-2 moondream openai anthropic hugging-face mistral-ai google-deepmind langchain deepmind microsoft pattern-recognition reinforcement-learning prompt-optimization text-to-speech model-optimization tensor-parallelism hyperparameters multimodal modal-alignment multimodal-fine-tuning ai-productivity privacy generative-ai rag retrieval-augmentation enterprise-text-to-sql amanda-askell philschmid stasbekman francois-fleuret mervenoyann reach_vb dzhng aravsrinivas sama lateinteraction andrew-y-ng bindureddy jerryjliu0
Moondream, a 1.6b vision language model, secured seed funding, highlighting a trend of moon-themed tiny models alongside Moonshine (a 27-61m ASR model). Claude 3.5 Sonnet was used for AI Twitter recaps. Discussions included pattern recognition vs. intelligence in LLMs, reinforcement learning for prompt optimization, and NotebookLlama, an open-source NotebookLM variant using LLaMA models for tasks like text-to-speech. Advances in model optimization were noted, including async-TP in PyTorch for tensor parallelism and hyperparameter tuning. Mini-Omni 2 demonstrated multimodal capabilities across image, audio, and text for voice conversations, with emphasis on modal alignment and multimodal fine-tuning. AI productivity tools such as an AI email writer and LlamaCloud-based research assistants were introduced. Practical skill development and privacy-conscious AI tool usage with Llama3-8B were emphasized. Generative AI resources such as #AIPythonforBeginners and GenAI Agents with LangGraph were shared. Business insights covered rapid execution in AI product development and emerging AI-related job roles. Challenges in enterprise-grade text-to-SQL and advanced retrieval methods were discussed, with tutorials on RAG applications using LangChain and MongoDB.
Not much (in AI) happened this weekend
llama-3.1-8b llama-3.2 chatgpt movie-gen openai meta-ai-fair google-deepmind microsoft x-ai spacex harvard nvidia long-context feature-prediction-loss ai-agents privacy text-to-video text-to-image humanoid-robots gpu-deployment media-foundation-models ai-research-labs sam-altman yann-lecun rasbt bindureddy andrej-karpathy soumithchintala svpino adcock_brett rohanpaul_ai
OpenAI introduced an "edit this area" feature for image generation, praised by Sam Altman. Yann LeCun highlighted a NYU paper improving pixel generation with a feature prediction loss using pre-trained visual encoders like DINOv2. Long-context LLMs such as llama-3.1-8b and llama-3.2 variants now support up to 131k tokens, offering alternatives to RAG systems. Bindu Reddy announced AI agents capable of building and deploying code from English instructions, signaling AI's replacement of SQL and potential impact on Python. SpaceX's successful Starship rocket catch was celebrated by Andrej Karpathy and others, with Soumith Chintala praising SpaceX's efficient, low-bureaucracy research approach. Privacy concerns arose from Harvard students' AI glasses, I-XRAY, which can reveal personal information. Meta AI FAIR's Movie Gen model advances media foundation models with high-quality text-to-image and video generation, including synced audio. Humanoid robots like Ameca and Azi now engage in expressive conversations using ChatGPT. xAI rapidly deployed 100K Nvidia H100 GPUs in 19 days, with NVIDIA CEO Jensen Huang commending Elon Musk. Leading AI research labs compared include Meta-FAIR, Google DeepMind, and Microsoft Research. Skepticism about LLM intelligence was voiced by svpino, emphasizing limitations in novel problem-solving despite strong memorization.
Contextual Document Embeddings: `cde-small-v1`
llama-3 cde-small-v1 gemini-1.5-flash-8b chatgpt meta-ai-fair openai google-deepmind weights-biases togethercompute contextual-embeddings contextual-batching video-generation synthetic-data model-efficiency training-techniques rag algorithmic-efficiency jack-morris sasha-rush tim-brooks demis-hassabis karina-nguyen
Meta announced a new text-to-video model, Movie Gen, claiming it adapts Llama 3 to video generation better than the Diffusion Transformers behind OpenAI's Sora, though no release is available yet. Researchers Jack Morris and Sasha Rush introduced the cde-small-v1 model with a novel contextual batching training technique and contextual embeddings, achieving strong performance with only 143M parameters. OpenAI launched Canvas, a collaborative interface for ChatGPT trained with synthetic data. Google DeepMind welcomed Tim Brooks to work on video generation and world simulators. Google released Gemini 1.5 Flash-8B, improving cost and rate limits through algorithmic efficiency.
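The contextual-embedding idea is easiest to see in a sketch. Below is a minimal, hypothetical PyTorch illustration of the two-stage scheme (not the cde-small-v1 implementation; all module names are invented): a document's embedding is conditioned on a summary of sampled corpus neighbors, so the same text can embed differently in different corpora.

```python
import torch
import torch.nn as nn

class ContextualEncoder(nn.Module):
    """Toy two-stage contextual embedder (illustrative only)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.doc_proj = nn.Linear(dim, dim)   # embeds the target document
        self.ctx_proj = nn.Linear(dim, dim)   # embeds sampled corpus neighbors
        self.mix = nn.Linear(2 * dim, dim)    # fuses document and corpus context

    def forward(self, doc: torch.Tensor, corpus: torch.Tensor) -> torch.Tensor:
        # Stage 1: summarize a sample of corpus documents into one context vector.
        ctx = self.ctx_proj(corpus).mean(dim=0)
        # Stage 2: condition the document embedding on that corpus context.
        fused = torch.cat([self.doc_proj(doc), ctx], dim=-1)
        return torch.nn.functional.normalize(self.mix(fused), dim=-1)

enc = ContextualEncoder()
doc_vec = torch.randn(256)           # stand-in for a pooled text encoding
corpus_vecs = torch.randn(32, 256)   # stand-in for sampled corpus neighbors
print(enc(doc_vec, corpus_vecs).shape)  # torch.Size([256])
```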
Liquid Foundation Models: A New Transformers alternative + AINews Pod 2
llama-3-2 gemini-1.5-pro-002 gemini-1.5-flash-002 liquid-ai meta-ai-fair google-deepmind openai reinforcement-learning multimodality model-efficiency foundation-models audio-processing model-deployment open-source ylecun svpino
Liquid.ai emerged from stealth with three subquadratic foundation models demonstrating superior efficiency compared to state space models and Apple's on-device and server models, backed by a $37M seed round. Meta AI announced Llama 3.2 with multimodal vision-enabled models and lightweight text-only variants for mobile. Google DeepMind introduced production-ready Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002 models with improved pricing and rate limits, alongside AlphaChip, an AI-driven chip design system using reinforcement learning for rapid superhuman layouts. OpenAI enhanced ChatGPT Plus and Teams with Advanced Voice Mode featuring Custom Instructions, Memory, and new nature-inspired voices. California's Governor vetoed the SB-1047 AI regulation bill, celebrated by AI community figures like ylecun and svpino as a win for open-source AI. Google upgraded NotebookLM with audio overviews supporting YouTube and audio files, turning documents into AI-generated podcasts. "Open source in AI is thriving," noted ylecun, highlighting 1 million models on GitHub and HuggingFace.
not much happened today
llama-3-2 llama-3 molmo meta-ai-fair google-deepmind hugging-face on-device-ai multimodality chip-design retrieval-augmented-generation rag benchmarking reliability ai-regulation free-speech pytorch-optimization demis-hassabis clementdelangue svpino awnihannun osanseviero omarsar0 sarahookr ylecun
Meta released Llama 3.2, including lightweight 1B and 3B models for on-device AI with capabilities like summarization and retrieval-augmented generation. Molmo, a new multimodal model, was introduced with a large dense captioning dataset. Google DeepMind announced AlphaChip, an AI-driven chip design method improving TPU and CPU designs. Hugging Face surpassed 1 million free public models, highlighting the value of smaller specialized models. Discussions covered challenges in scaling RAG applications, the future of on-device AI running ChatGPT-level models, reliability issues in larger LLMs, and new Elo benchmarking accepted at NeurIPS 2024. AI ethics and regulation topics included free speech responsibilities and California's SB-1047 bill potentially affecting open-source AI. Notable quotes: "AlphaChip transformed computer chip design," and a prediction of ChatGPT-level AI on mobile devices within a year.
not much happened today
llama-3-2 llama-3 gemma-2 phi-3-5-mini claude-3-haiku gpt-4o-mini molmo gemini-1.5 gemini meta-ai-fair openai allenai google-deepmind multimodality model-optimization benchmarks ai-safety model-distillation pruning adapter-layers open-source-models performance context-windows mira-murati demis-hassabis ylecun sama
Meta AI released Llama 3.2 models including 1B, 3B text-only and 11B, 90B vision variants with 128K token context length and adapter layers for image-text integration. These models outperform competitors like Gemma 2 and Phi 3.5-mini, and are supported on major platforms including AWS, Azure, and Google Cloud. OpenAI CTO Mira Murati announced her departure. Allen AI released Molmo, an open-source multimodal model family outperforming proprietary systems. Google improved Gemini 1.5 with Flash and Pro models. Meta showcased Project Orion AR glasses and hinted at a Quest 3S priced at $300. Discussions covered new benchmarks for multimodal models, model optimization, and AI safety and alignment.
not much happened today
o1-preview o1-mini qwen-2.5 gpt-4o deepseek-v2.5 gpt-4-turbo-2024-04-09 grin llama-3-1-405b veo kat openai qwen deepseek-ai microsoft kyutai-labs perplexity-ai together-ai meta-ai-fair google-deepmind hugging-face google anthropic benchmarking math coding instruction-following model-merging model-expressiveness moe voice voice-models generative-video competition open-source model-deployment ai-agents hyung-won-chung noam-brown bindureddy akhaliq karpathy aravsrinivas fchollet cwolferesearch philschmid labenz ylecun
OpenAI's o1-preview and o1-mini models lead benchmarks in Math, Hard Prompts, and Coding. Qwen 2.5 72B model shows strong performance close to GPT-4o. DeepSeek-V2.5 tops Chinese LLMs, rivaling GPT-4-Turbo-2024-04-09. Microsoft's GRIN MoE achieves good results with 6.6B active parameters. Moshi voice model from Kyutai Labs runs locally on Apple Silicon Macs. Perplexity app introduces voice mode with push-to-talk. LlamaCoder by Together.ai uses Llama 3.1 405B for app generation. Google DeepMind's Veo is a new generative video model for YouTube Shorts. The 2024 ARC-AGI competition increases prize money and plans a university tour. A survey on model merging covers 50+ papers for LLM alignment. The Kolmogorov–Arnold Transformer (KAT) paper proposes replacing MLP layers with KAN layers for better expressiveness. Hugging Face Hub integrates with Google Cloud Vertex AI Model Garden for easier open-source model deployment. Agent.ai is introduced as a professional network for AI agents. "Touching grass is all you need."
a quiet weekend
o1 datagemma aloha demostart firefly-ai-video-model pixtral-12b gamegen-o openai google-deepmind adobe mistral-ai tencent supermaven 11x cohere anthropic latent-space-university stanford microsoft mila notre-dame reinforcement-learning chain-of-thought reasoning robotics diffusion-models multimodality video-generation model-training reflection-tuning mathematical-reasoning model-benchmarking fine-tuning george-hotz terence-tao adcock_brett rohanpaul_ai bindureddy fchollet philschmid
OpenAI released the new o1 model, leveraging reinforcement learning and chain-of-thought prompting to excel in reasoning benchmarks, achieving an IQ-like score of 120. Google DeepMind introduced DataGemma to reduce hallucinations by connecting LLMs with real-world data, and unveiled ALOHA and DemoStart for robot dexterity using diffusion methods. Adobe previewed its Firefly AI Video Model with text-to-video and generative extend features. Mistral launched the multimodal Pixtral 12B model, and Tencent presented the GameGen-O open-world video game generation model. Several research papers from Stanford, OpenAI, Microsoft, Mila, and Notre Dame focus on advanced reasoning, self-verification, and reflection tuning techniques. Experts like Terence Tao and George Hotz have shared mixed but optimistic views on o1's capabilities. Seed funding rounds include Supermaven ($12M) and 11x ($24M).
Summer of Code AI: $1.6b raised, 1 usable product
ltm-2 llama-3-1-405b gemini-advanced cognition poolside codeium magic google-deepmind nvidia google-cloud long-context model-efficiency custom-hardware cuda training-stack gpu-scaling neural-world-models diffusion-models quantization nat-friedman ben-chess rohan-paul
Code + AI is emphasized as a key modality in AI engineering, highlighting productivity and verifiability benefits. Recent major funding rounds include Cognition AI raising $175M, Poolside raising $400M, Codeium AI raising $150M, and Magic raising $320M. Magic announced their LTM-2 model with a 100 million token context window; its sequence-dimension algorithm is roughly 1000x cheaper than Llama 3.1 405B's attention at that context length, with drastically lower memory requirements. Magic's stack is built from scratch with custom CUDA and no open-source foundations, partnered with Google Cloud and powered by NVIDIA H100 and GB200 GPUs, aiming to scale to tens of thousands of GPUs. Google DeepMind revealed updates to Gemini Advanced with customizable expert "Gems." Neural Game Engines like GameNGen can run DOOM in a diffusion model trained on 0.9B frames. The content also references LLM quantization research by Rohan Paul.
Cerebras Inference: Faster, Better, AND Cheaper
llama-3.1-8b llama-3.1-70b gemini-1.5-flash gemini-1.5-pro cogvideox-5b mamba-2 rene-1.3b llama-3.1 gemini-1.5 claude groq cerebras cursor google-deepmind anthropic inference-speed wafer-scale-chips prompt-caching model-merging benchmarking open-source-models code-editing model-optimization jeremyphoward sam-altman nat-friedman daniel-gross swyx
Groq led early 2024 with superfast LLM inference speeds, achieving ~450 tokens/sec for Mixtral 8x7B and 240 tokens/sec for Llama 2 70B. Cursor introduced a specialized code edit model hitting 1000 tokens/sec. Now, Cerebras claims the fastest inference with their wafer-scale chips, running Llama3.1-8b at 1800 tokens/sec and Llama3.1-70B at 450 tokens/sec at full precision, with competitive pricing and a generous free tier. Google's Gemini 1.5 models showed significant benchmark improvements, especially Gemini-1.5-Flash and Gemini-1.5-Pro. New open-source models like CogVideoX-5B and Mamba-2 (Rene 1.3B) were released, optimized for consumer hardware. Anthropic's Claude now supports prompt caching, improving speed and cost efficiency. "Cerebras Inference runs Llama3.1 20x faster than GPU solutions at 1/5 the price."
not much happened today
grok-2 claude-3.5-sonnet claude-3.5 gpt-4 chatgpt-4o-latest anthropic x-ai google-deepmind openai mistral-ai meta-ai-fair salesforce box prompt-caching model-performance vision fine-tuning multilinguality ai-safety design-automation document-processing ai-agents ai-integration ai-job-market ai-acceleration humor demis-hassabis francois-chollet
Anthropic rolled out prompt caching in its API, reducing input costs by up to 90% and latency by 80%, letting long, stable prompts behave like instant fine-tuning. xAI released Grok-2, a new model competing with frontier models from Google DeepMind, OpenAI, Anthropic, Mistral AI, and Meta AI FAIR, supporting vision and text inputs and integrating external image generation models. Claude 3.5 Sonnet is reported to outperform GPT-4 in coding and reasoning, while ChatGPT-4o-latest shows reasoning improvements. François Chollet proposed a theory defining intelligence as the efficiency of operationalizing past information for future tasks. The Aya project involves 3000 collaborators building multilingual AI datasets. Demis Hassabis discussed AI hype and safe AI development in a podcast. Tools like Dora AI for Figma and Box's AI API enhance design automation and document processing. Salesforce released DEI, an open framework of AI software-engineering agents with a 55% resolve rate on SWE-Bench Lite. Industry trends highlight rapid AI integration, the importance of networking in the AI job market, and potential OpenAI GPT-4 expansion in response to competitors. Memes include humor about Apple Vision Pro.
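For concreteness, here is roughly how the prompt-caching beta looked in the Anthropic Python SDK at launch (a hedged sketch; the beta header, model string, and minimum-cacheable-prefix rules may have changed since): a large, stable prefix is marked with `cache_control` so repeat calls reuse it instead of re-billing its input tokens.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
LONG_REFERENCE_DOC = open("manual.txt").read()  # stand-in for a big, stable prefix

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {
            "type": "text",
            "text": LONG_REFERENCE_DOC,          # reused from cache on later calls
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize section 3."}],
)
print(response.content[0].text)
```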
not much happened today
llama-3 llama-3-1 grok-2 claude-3.5-sonnet gpt-4-turbo nous-research nvidia salesforce goodfire-ai anthropic x-ai google-deepmind box langchain fine-tuning prompt-caching mechanistic-interpretability model-performance multimodality agent-frameworks software-engineering-agents api document-processing text-generation model-releases vision image-generation efficiency scientific-discovery fchollet demis-hassabis
GPT-5 was delayed again amid a quiet news day. Nous Research released Hermes 3, a finetune of Llama 3 base models that rivals FAIR's instruct tunes but sparked debate over emergent existential-crisis behavior linked to its 6% roleplay data. Nvidia introduced a Minitron finetune of Llama 3.1. Salesforce launched a DEI agent scoring 55% on SWE-Bench Lite. Goodfire AI secured $7M seed funding for mechanistic interpretability work. Anthropic rolled out prompt caching in their API, cutting input costs by up to 90% and latency by 80%, aiding coding assistants and large-document processing. xAI released Grok-2, matching Claude 3.5 Sonnet and GPT-4 Turbo on the LMSYS leaderboard with vision+text inputs and image generation integration. Claude 3.5 Sonnet reportedly outperforms GPT-4 in coding and reasoning. François Chollet defined intelligence as efficient operationalization of past info for future tasks. Salesforce's DEI framework surpasses individual agent performance. Google DeepMind's Demis Hassabis discussed AGI's role in scientific discovery and safe AI development. The Dora AI plugin generates landing pages in under 60 seconds, boosting web team efficiency. The Box AI API beta enables document chat, data extraction, and content summarization. LangChain updated its Python & JavaScript integration docs.
Grok 2! and ChatGPT-4o-latest confuses everybody
gpt-4o grok-2 claude-3.5-sonnet flux-1 stable-diffusion-3 gemini-advanced openai x-ai black-forest-labs google-deepmind benchmarking model-performance tokenization security-vulnerabilities multi-agent-systems research-automation text-to-image conversational-ai model-integration ylecun rohanpaul_ai karpathy
OpenAI quietly released a new GPT-4o model in ChatGPT, distinct from the API version, reclaiming the #1 spot on LMSys arena benchmarks across multiple categories including math, coding, and instruction-following. Meanwhile, xAI launched Grok 2, outperforming Claude 3.5 Sonnet and previous GPT-4o versions, with plans for an enterprise API release. Grok 2 integrates Black Forest Labs' FLUX.1, an open-source text-to-image model surpassing Stable Diffusion 3. Google DeepMind announced Gemini Advanced with enhanced conversational features and Pixel device integration. AI researcher ylecun highlighted LLM limitations in learning and creativity, while rohanpaul_ai discussed an AI Scientist system generating publishable ML research at low cost. karpathy warned of security risks in LLM tokenizers akin to SQL injection.
Too Cheap To Meter: AI prices cut 50-70% in last 30 days
gpt-4o gpt-4o-mini llama-3-1-405b mistral-large-2 gemini-1.5-flash deepseek-v2 sonnet-3.5 exaone-3.0 minicpm-v-2.6 claude-3.5 gpt-4o-2024-08-06 llamaindex together-ai deepinfra deepseek-ai mistral-ai google-deepmind lg-ai-research price-cuts context-caching instruction-tuning vision benchmarks pytorch attention-mechanisms reinforcement-learning-from-human-feedback compute-optimal-scaling rohanpaul_ai akhaliq mervenoyann sophiamyang chhillee karpathy
Gemini 1.5 Flash has cut prices by approximately 70%, offering a generous free tier, rate limits of 1 million tokens per minute, and paid pricing of $0.075/mtok, intensifying the AI model price war. Other significant price reductions include GPT-4o (~50% cut to $2.50/mtok), GPT-4o mini (70-98.5% cut to $0.15/mtok), Llama 3.1 405b (46% cut to $2.7/mtok), and Mistral Large 2 (62% cut to $3/mtok). Deepseek v2 introduced context caching, reducing input token costs by up to 90% to $0.014/mtok. New model releases include Llama 3.1 405b, Sonnet 3.5, EXAONE-3.0 (7.8B, instruction-tuned by LG AI Research), and MiniCPM V 2.6 (a vision-language model combining SigLIP 400M and Qwen2-7B). Benchmarks show Mistral Large performing well on ZebraLogic and Claude-3.5 leading LiveBench. FlexAttention, a new PyTorch API, simplifies and optimizes attention mechanisms. Andrej Karpathy analyzed RLHF, highlighting its limitations compared to traditional reinforcement learning. Google DeepMind research on compute-optimal scaling was also summarized.
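FlexAttention's pitch is that attention variants become a few lines of Python rather than hand-written kernels: a `score_mod` callback rewrites individual attention logits. A minimal sketch (assumes PyTorch 2.5+; the bias slope is illustrative, and `torch.compile` is needed for real performance):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal_alibi(score, b, h, q_idx, kv_idx):
    # Causal mask plus a simple ALiBi-style distance penalty.
    bias = -0.1 * (q_idx - kv_idx)
    return torch.where(q_idx >= kv_idx, score + bias, float("-inf"))

B, H, S, D = 2, 4, 128, 64
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))
out = flex_attention(q, k, v, score_mod=causal_alibi)
print(out.shape)  # torch.Size([2, 4, 128, 64])
```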
GPT4o August + 100% Structured Outputs for All (GPT4o August edition)
gpt-4o-2024-08-06 llama-3-1-405b llama-3 claude-3.5-sonnet gemini-1.5-pro gpt-4o yi-large-turbo openai meta-ai-fair google-deepmind yi-large nvidia groq langchain jamai langsmith structured-output context-windows model-pricing benchmarking parameter-efficient-expert-retrieval retrieval-augmented-generation mixture-of-experts model-performance ai-hardware model-deployment filtering multi-lingual vision john-carmack jonathan-ross rohanpaul_ai
OpenAI released the new gpt-4o-2024-08-06 model with a 16k output-token limit and 33-50% lower pricing than the previous 4o-May version, featuring a new Structured Output API that improves output quality and reduces retry costs. Meta AI launched Llama 3.1, a 405-billion parameter model surpassing GPT-4 and Claude 3.5 Sonnet on benchmarks, alongside expanding the Llama Impact Grant program. Google DeepMind quietly released Gemini 1.5 Pro, outperforming GPT-4o, Claude-3.5, and Llama 3.1 on LMSYS benchmarks and leading the Vision Leaderboard. Yi-Large Turbo was introduced as a cost-effective upgrade priced at $0.19 per million tokens. In hardware, NVIDIA H100 GPUs were highlighted by John Carmack for their massive AI workload power, and Groq announced plans to deploy 108,000 LPUs by Q1 2025. New AI tools and techniques include RAG (Retrieval-Augmented Generation), the JamAI Base platform for Mixture of Agents systems, and LangSmith's enhanced filtering capabilities. Google DeepMind also introduced the PEER (Parameter Efficient Expert Retrieval) architecture.
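The Structured Output API is easiest to see through the SDK's Pydantic integration. A hedged sketch of the pattern as it shipped alongside gpt-4o-2024-08-06 (the schema and prompt are illustrative): the model is constrained to emit JSON matching the class, so replies parse without retry loops.

```python
from pydantic import BaseModel
from openai import OpenAI

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob meet for lunch on Friday."},
    ],
    response_format=CalendarEvent,  # output is guaranteed to match this schema
)
event = completion.choices[0].message.parsed  # a CalendarEvent instance
print(event.name, event.participants)
```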
Execuhires: Tempting The Wrath of Khan
gemini-1.5-pro gpt-4o claude-3.5 flux-1 llama-3-1-405b character.ai google adept amazon inflection microsoft stability-ai black-forest-labs schelling google-deepmind openai anthropic meta-ai-fair lmsys langchainai execuhire model-benchmarking multilinguality math coding text-to-image agent-ide open-source-models post-training data-driven-performance noam-shazeer emad-mostaque david-friedman robin-rombach alexandr-wang svpino rohanpaul_ai
Character.ai's $2.5b execuhire to Google marks a significant leadership move alongside Adept's $429m execuhire to Amazon and Inflection's $650m execuhire to Microsoft. Despite strong user growth and content momentum, Character.ai's CEO Noam Shazeer returns to Google, signaling shifting vibes in the AI industry. Google DeepMind's Gemini 1.5 Pro tops Chatbot Arena benchmarks, outperforming GPT-4o and Claude-3.5, excelling in multilingual, math, and coding tasks. The launch of Black Forest Labs' FLUX.1 text-to-image model and LangGraph Studio agent IDE highlight ongoing innovation. Llama 3.1 405B is released as the largest open-source model, fostering developer use and competition with closed models. The industry is focusing increasingly on post-training and data as key competitive factors, raising questions about acquisition practices and regulatory scrutiny.
Rombach et al: FLUX.1 [pro|dev|schnell], $31m seed for Black Forest Labs
gemma-2-2b gpt-3.5-turbo-0613 mixtral-8x7b flux-1 stability-ai google-deepmind nvidia text-to-image text-to-video model-benchmarking open-weight-models model-distillation safety-classifiers sparse-autoencoders ai-coding-tools rohanpaul_ai fchollet bindureddy clementdelangue ylecun svpino
Stability AI co-founder Robin Rombach launched FLUX.1, a new text-to-image model with three variants: pro (API only), dev (open-weight, non-commercial), and schnell (Apache 2.0). FLUX.1 outperforms Midjourney and Ideogram based on Black Forest Labs' ELO scores, and the lab plans to expand into text-to-video. Google DeepMind released Gemma-2 2B, a 2 billion parameter open-source model that outperforms larger models like GPT-3.5-Turbo-0613 and Mixtral-8x7b on Chatbot Arena, optimized with NVIDIA TensorRT-LLM. The release includes safety classifiers (ShieldGemma) and sparse autoencoder analysis (Gemma Scope). Discussions highlight benchmarking discrepancies and US government support for open-weight AI models. Critiques of AI coding tools' productivity gains were also noted.
Gemma 2 2B + Scope + Shield
gemma-2b gemma-2-9b gemma-2-27b llama-3-1-405b sam-2 gpt-3.5 vicuna alpacaeval g-eval google-deepmind anthropic meta-ai-fair openai perplexity-ai nvidia lmsys knowledge-distillation leaderboards model-interpretability finetuning harm-detection video-segmentation voice publishers-program robotics-data-scaling quantization llm-evaluation prompt-engineering
Gemma 2 2B, a 2 billion parameter model trained on 2 trillion tokens and distilled from a larger unnamed LLM, has been released by Google DeepMind and shows strong leaderboard performance despite weaknesses in math. The Gemma 2 series, including the 9B and 27B models, has gained popularity since its June release. The team also released 400 SAEs for interpretability, inspired by Anthropic's research. A finetuned classifier called ShieldGemma outperforms Meta's LlamaGuard in harm detection. Meanwhile, Meta AI announced Llama-3.1-405B reaching #3 on the Overall Arena leaderboard, and released SAM 2, a video and image segmentation model with significant speed improvements. OpenAI is rolling out an advanced Voice Mode to Plus users. Perplexity AI launched a Publishers Program with major media partners and a status page. NVIDIA introduced Project GR00T for scaling robot data using Apple Vision Pro and generative simulation. Interest in quantization for compressing LLMs is growing, and LLM-as-a-Judge implementations from Vicuna, AlpacaEval, and G-Eval highlight the effectiveness of simple prompts and domain-specific evaluation.
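The LLM-as-a-Judge point is worth making concrete: most working setups are little more than a careful prompt and a strict output format. A minimal pairwise-judge sketch in that spirit (prompt wording and model choice are illustrative, not taken from Vicuna or AlpacaEval):

```python
from openai import OpenAI

JUDGE_PROMPT = """You are an impartial judge. Compare the two answers to the
question below and reply with exactly one token: "A", "B", or "TIE".

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""

def judge(question: str, answer_a: str, answer_b: str) -> str:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic verdicts keep evals reproducible
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    return resp.choices[0].message.content.strip()

print(judge("What is 2+2?", "4", "5"))  # expected: "A"
```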
not much happened today
sam-2 gemini-1.5-pro chatgpt midjourney-v6.1 meta-ai-fair google-deepmind scale-ai apple canva hugging-face object-segmentation quantization web-development-framework adversarial-robustness on-device-ai open-source robotics voice vision jeremyphoward demis-hassabis ylecun maartengrootendorst jimfan
Meta released SAM 2, a unified model for real-time object segmentation, with a new dataset 4.5x larger and with 53x more annotations than previous ones. FastHTML, a new Python web framework by Jeremy Howard, enables easy creation and deployment of interactive web apps. Scale AI launched the SEAL Leaderboard on adversarial robustness, topped by Gemini 1.5 Pro from Google DeepMind. Apple published a technical report on its Intelligence Foundation Language Models for on-device and server use. Yann LeCun emphasized the importance of open-source AI in an article co-authored with Martin Casado and Ion Stoica. Maarten Grootendorst's "Visual Guide to Quantization" on efficient LLM inference went viral. ChatGPT started rolling out advanced voice and vision-enabled modes to select users. Leonardo AI was acquired by Canva. Jim Fan shared insights on Project GR00T augmenting human demonstration data for robotics. Midjourney v6.1 was released.
AlphaProof + AlphaGeometry2 reach 1 point short of IMO Gold
gemini alphageometry-2 alphaproof llama-3-1-405b llama-3-70b llama-3-8b mistral-large-2 google-deepmind meta-ai-fair mistral-ai neurosymbolic-ai mathematical-reasoning synthetic-data knowledge-sharing model-fine-tuning alpha-zero multilinguality context-windows model-scaling benchmarking performance-comparison tim-gowers guillaume-lample osanseviero
Search+Verifier highlights advances in neurosymbolic AI at the 2024 International Mathematical Olympiad. Google DeepMind's combination of AlphaProof and AlphaGeometry 2 solved four of six IMO problems, with AlphaProof being a finetuned Gemini model using an AlphaZero-style approach, and AlphaGeometry 2 trained on significantly more synthetic data with a novel knowledge-sharing mechanism. Despite the impressive results, human judges noted the AI required far more time than human competitors. Meanwhile, Meta AI released Llama 3.1 with a 405B parameter model and smaller variants, and Mistral AI launched Mistral Large 2 with 123B parameters and a 128k context window, outperforming Llama 3.1 on coding tasks and multilingual benchmarks. This marks significant progress in AI mathematical reasoning, model scaling, and multilingual capabilities.
That GPT-4o Demo
gpt-4o gemma-2 meta-code-llama openai google-deepmind meta-ai-fair voice-generation ocr screen-sharing vision code-understanding model-customization efficiency textual-intelligence multimodal-agents sft distillation rlhf model-merging model-optimization safety romain-huet fchollet
Romain Huet demonstrated an unreleased version of GPT-4o on ChatGPT Desktop, showcasing capabilities like low-latency voice generation, voice modulation down to a whisper, camera mode streaming video to GPT-4o, rapid OCR, screen sharing with ChatGPT for programming help, clipboard reading, and vision-based conversation about code. OpenAI's four highlighted investment areas are textual intelligence, efficiency/cost, model customization, and multimodal agents. Google DeepMind released Gemma 2 models in 9B and 27B sizes trained on 8T and 13T tokens respectively, using SFT, distillation, RLHF, and model merging, optimized for TPUv5e with strong performance and safety measures. Meta AI announced the Meta LLM Compiler, built on Meta Code Llama, with enhanced code optimization and compiler features.
Gemma 2: The Open Model for Everyone
gemma-2 qwen-72b mixtral-8x22b-instruct claude-3.5-sonnet google-deepmind alibaba mistral-ai anthropic knowledge-distillation attention-mechanisms multilingual-models multimodality model-training model-optimization memory-optimization fine-tuning kathleen-kenealy daniel-han
Gemma 2, a 27B parameter model from google-deepmind, was released with innovations like 1:1 local-global attention alternation and logit soft-capping, leveraging knowledge distillation to train smaller models on over 50× the compute-optimal token quantity. The model supports multilingual and multimodal capabilities, with fine-tuning success on over 200 Indic language variants. The Open LLM Leaderboard highlights alibaba's Qwen 72B as the top model, with mistral-ai's Mixtral-8x22B-Instruct also ranking highly. Anthropic launched Claude 3.5 Sonnet, improving intelligence at mid-tier cost and speed. Research on eliminating matrix multiplication in LLMs promises significant memory savings without performance loss. Kathleen Kenealy and Daniel Han provided insights on Gemma 2's tokenizer and attention scaling respectively.
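Logit soft-capping, one of the Gemma 2 tricks above, is a one-liner: logits pass through a scaled tanh so they stay within ±cap, which stabilizes training. A tiny sketch (the caps of 50 for attention logits and 30 for final logits are as reported for Gemma 2):

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Smoothly bounds logits to (-cap, cap) while staying ~linear near zero.
    return cap * torch.tanh(logits / cap)

x = torch.tensor([-200.0, -10.0, 0.0, 10.0, 200.0])
print(soft_cap(x, 50.0))   # attention-logit cap
print(soft_cap(x, 30.0))   # final-logit cap
```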
There's Ilya!
chameleon-7b chameleon-34b deepseek-coder-v2 gpt-4-turbo claude-3-opus voco-llama safe-superintelligence-inc openai anthropic meta deepseek google-deepmind parallel-decoding code-generation quantization training-dynamics vision benchmarks datasets image-captioning reasoning memory-optimization ilya-sutskever jan-leike ylecun akhaliq philschmid rohanpaul_ai mervenoyann fchollet
Ilya Sutskever has co-founded Safe Superintelligence Inc shortly after leaving OpenAI, while Jan Leike moved to Anthropic. Meta released new models including Chameleon 7B and 34B, which quantize mixed-modal inputs into a unified token space. DeepSeek-Coder-V2 shows code capabilities comparable to GPT-4 Turbo, supporting 338 programming languages and 128K context length. Consistency Large Language Models (CLLMs) enable parallel decoding, generating multiple tokens per step. Grokked Transformers demonstrate reasoning emerging from training dynamics that shape memory formation and generalization. VoCo-LLaMA compresses vision tokens with LLMs, improving video temporal correlation understanding. The BigCodeBench benchmark evaluates LLMs on 1,140 coding tasks across 139 Python libraries, topped by DeepSeek-Coder-V2 and Claude 3 Opus. PixelProse is a large 16M image-caption dataset with reduced toxicity.
Contextual Position Encoding (CoPE)
cope gemini-1.5-flash gemini-1.5-pro claude gpt-3 meta-ai-fair google-deepmind anthropic perplexity-ai langchain openai positional-encoding transformers counting copying language-modeling coding external-memory tool-use model-evaluation inference-speed model-benchmarking scaling research-synthesis jason-weston alexandr-wang karpathy arav-srinivas
Meta AI researcher Jason Weston introduced CoPE, a novel positional encoding method for transformers that incorporates context through learnable gates, improving handling of counting and copying tasks and performance on language modeling and coding. The approach can potentially be extended with external memory for gate calculation. Google DeepMind released Gemini 1.5 Flash and Pro models optimized for fast inference. Anthropic announced general availability of tool use for Claude, enhancing its ability to orchestrate tools for complex tasks. Alexandr Wang launched SEAL Leaderboards for private, expert evaluations of frontier models. Karpathy reflected on the 4th anniversary of GPT-3, emphasizing scaling and practical improvements. Perplexity AI launched Perplexity Pages to convert research into visually appealing articles, described as an "AI Wikipedia" by Aravind Srinivas.
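CoPE's core mechanism fits in a few lines. A toy sketch of the paper's idea (illustrative, not the reference code): each query computes sigmoid gates over preceding tokens, and a token's "position" is the running sum of those gates, so the model can learn to count tokens, words, or sentences depending on context.

```python
import torch

def cope_positions(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # q, k: (seq, dim). gates[i, j] in (0, 1) says whether token j "counts"
    # from the perspective of query i.
    gates = torch.sigmoid(q @ k.T)
    gates = gates * torch.tril(torch.ones_like(gates))  # causal positions only
    # p[i, j] = sum of gates[i, j..i]: a contextual, possibly fractional
    # distance. Fractional values are handled in the paper by interpolating
    # between learned integer position embeddings (omitted here).
    return gates.flip(-1).cumsum(-1).flip(-1)

q, k = torch.randn(6, 16), torch.randn(6, 16)
print(cope_positions(q, k))  # lower-triangular matrix of contextual positions
```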
ALL of AI Engineering in One Place
claude-3-sonnet claude-3 openai google-deepmind anthropic mistral-ai cohere hugging-face adept midjourney character-ai microsoft amazon nvidia salesforce mastercard palo-alto-networks axa novartis discord twilio tinder khan-academy sourcegraph mongodb neo4j hasura modular cognition anysphere perplexity-ai groq mozilla nous-research galileo unsloth langchain llamaindex instructor weights-biases lambda-labs neptune datastax crusoe covalent qdrant baseten e2b octo-ai gradient-ai lancedb log10 deepgram outlines crew-ai factory-ai interpretability feature-steering safety multilinguality multimodality rag evals-ops open-models code-generation gpus agents ai-leadership
The upcoming AI Engineer World's Fair in San Francisco from June 25-27 will feature a significantly expanded format with booths, talks, and workshops from top model labs like OpenAI, DeepMind, Anthropic, Mistral, Cohere, HuggingFace, and Character.ai. It includes participation from Microsoft Azure, Amazon AWS, Google Vertex, and major companies such as Nvidia, Salesforce, Mastercard, Palo Alto Networks, and more. The event covers 9 tracks including RAG, multimodality, evals/ops, open models, code generation, GPUs, agents, AI in Fortune 500, and a new AI leadership track. Additionally, Anthropic shared interpretability research on Claude 3 Sonnet, revealing millions of interpretable features that can be steered to modify model behavior, including safety-relevant features related to bias and unsafe content, though more research is needed for practical applications. The event offers a discount code for AI News readers.
Skyfall
gemini-1.5-pro gemini-1.5-flash yi-1.5 kosmos-2.5 paligemma falcon-2 deepseek-v2 hunyuan-dit gemini-1.5 google-deepmind yi-ai microsoft hugging-face langchain maven multimodality mixture-of-experts transformer model-optimization long-context model-performance model-inference fine-tuning local-ai scaling-laws causal-models hallucination-detection model-distillation model-efficiency hamel-husain dan-becker clement-delangue philschmid osanseviero arankomatsuzaki jason-wei rohanpaul_ai
Between 5/17 and 5/20/2024, key AI updates include Google DeepMind's Gemini 1.5 Pro and Flash models, featuring sparse multimodal MoE architecture with up to 10M context and a dense Transformer decoder that is 3x faster and 10x cheaper. Yi AI released Yi-1.5 models with extended context windows of 32K and 16K tokens. Other notable releases include Kosmos 2.5 (Microsoft), PaliGemma (Google), Falcon 2, DeepSeek v2 lite, and HunyuanDiT diffusion model. Research highlights feature an Observational Scaling Laws paper predicting model performance across families, a Layer-Condensed KV Cache technique boosting inference throughput by up to 26×, and the SUPRA method converting LLMs into RNNs for reduced compute costs. Hugging Face expanded local AI capabilities enabling on-device AI without cloud dependency. LangChain updated its v0.2 release with improved documentation. The community also welcomed a new LLM Finetuning Discord by Hamel Husain and Dan Becker for Maven course users. "Hugging Face is profitable, or close to profitable," enabling $10 million in free shared GPUs for developers.
Chameleon: Meta's (unreleased) GPT4o-like Omnimodal Model
chameleon gpt-4o gemini-1.5-flash claude-3 meta-ai-fair openai google-deepmind anthropic reddit multimodality early-fusion benchmarking model-training tokenization streaming tool-use vision coding hallucination-detection model-performance armen-aghajanyan sama alexandr-wang abacaj alexalbert__
Meta AI FAIR introduced Chameleon, a new multimodal model family with 7B and 34B parameter versions trained on 10T tokens of interleaved text and image data enabling "early fusion" multimodality that can natively output any modality. While reasoning benchmarks are modest, its "omnimodality" approach competes well with pre-GPT4o multimodal models. OpenAI launched GPT-4o, a model excelling in benchmarks like MMLU and coding tasks, with strong multimodal capabilities but some regression in ELO scores and hallucination issues. Google DeepMind announced Gemini 1.5 Flash, a small model with 1M context window and flash performance, highlighting convergence trends between OpenAI and Google models. Anthropic updated Claude 3 with streaming support, forced tool use, and vision tool integration for multimodal knowledge extraction. OpenAI also partnered with Reddit, raising industry attention.
Cursor reaches >1000 tok/s finetuning Llama3-70b for fast file editing
gpt-4 gpt-4o gpt-4-turbo gpt-4o-mini llama bloom stable-diffusion cursor openai anthropic google-deepmind huggingface speculative-decoding code-edits multimodality image-generation streaming tool-use fine-tuning benchmarking mmlu model-performance evaluation synthetic-data context-windows sama abacaj imjaredz erhartford alexalbert svpino maximelabonne _philschmid
Cursor, an AI-native IDE, announced a speculative edits algorithm for code editing that surpasses GPT-4 and GPT-4o in accuracy and latency, achieving speeds of over 1000 tokens/s on a 70b model. OpenAI released GPT-4o with multimodal capabilities including audio, vision, and text, noted to be 2x faster and 50% cheaper than GPT-4 turbo, though with mixed coding performance. Anthropic introduced streaming, forced tool use, and vision features for developers. Google DeepMind unveiled Imagen Video and Gemini 1.5 Flash, a small model with a 1M-context window. HuggingFace is distributing $10M in free GPUs for open-source AI models like Llama, BLOOM, and Stable Diffusion. Evaluation insights highlight challenges with LLMs on novel problems and benchmark saturation, with new benchmarks like MMLU-Pro showing significant drops in top model performance.
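Speculative edits generalize speculative decoding to editing: since most of an edited file is unchanged, the original text serves as the draft, long unchanged stretches are verified in one batched pass, and real decoding happens only where the edit diverges. A toy, self-contained sketch of that idea (illustrating the general technique, not Cursor's algorithm; it assumes the edit preserves alignment, so insertions or deletions would need realignment):

```python
class MockModel:
    """Stands in for an LLM that deterministically wants `target`."""
    def __init__(self, target):
        self.target = target

    def verify(self, out, draft):
        # In a real system this is ONE batched forward pass scoring every
        # draft position; here we just compare against the target text.
        pos, n = len(out), 0
        while (n < len(draft) and pos + n < len(self.target)
               and draft[n] == self.target[pos + n]):
            n += 1
        return n

    def decode_one(self, out):
        return self.target[len(out)]

def speculative_edit(model, original, chunk=8):
    out = []
    while len(out) < len(model.target):
        draft = original[len(out) : len(out) + chunk]
        accepted = model.verify(out, draft)
        out.extend(draft[:accepted])           # cheap: tokens taken from the draft
        if accepted == len(draft) and draft:
            continue                           # whole chunk accepted, keep going
        if len(out) < len(model.target):
            out.append(model.decode_one(out))  # expensive: real decoding step
    return out

original = list("def add(a, b):\n    return a + b\n")
edited   = list("def add(a, b):\n    return a - b\n")
print("".join(speculative_edit(MockModel(edited), original)))
```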
Not much happened today
gpt-4o gemini-1.5-pro gemini-1.5-flash imagen-3 veo reka-core qwen-1.5-110b openai google-deepmind anthropic rekailabs alibaba salesforce multimodality long-context model-releases reinforcement-learning model-benchmarking text-to-image video-generation ai-assistants ilya-sutskever jakub-pachocki mike-krieger sama
Ilya Sutskever steps down as Chief Scientist at OpenAI after nearly a decade, with Jakub Pachocki named as his successor. Google DeepMind announces Gemini 1.5 Pro and Gemini 1.5 Flash models featuring 2 million token context and improved multimodal capabilities, alongside demos of Project Astra AI assistant, Imagen 3 text-to-image model, and Veo generative video model. GPT-4o tops the VHELM leaderboard and outperforms competitors on LMSYS Chatbot Arena. Reka Core multimodal model with 128K context and Alibaba's Qwen1.5-110B open-source model are released. Salesforce shares an online RLHF recipe.
Google I/O in 60 seconds
gemini-1.5-pro gemini-flash gemini-ultra gemini-pro gemini-nano gemma-2 llama-3-70b paligemma imagen-3 veo google google-deepmind youtube tokenization model-performance fine-tuning vision multimodality model-release model-training model-optimization ai-integration image-generation watermarking hardware-optimization voice video-understanding
Google announced updates to the Gemini model family, including Gemini 1.5 Pro with 2 million token support, and the new Gemini Flash model optimized for speed with 1 million token capacity. The Gemini suite now includes Ultra, Pro, Flash, and Nano models, with Gemini Nano integrated into Chrome 126. Additional Gemini features include Gemini Gems (custom GPTs), Gemini Live for voice conversations, and Project Astra, a live video understanding assistant. The Gemma model family was updated with Gemma 2 at 27B parameters, offering near-llama-3-70b performance at half the size, plus PaliGemma, a vision-language open model inspired by PaLI-3. Other launches include DeepMind's Veo, Imagen 3 for photorealistic image generation, and a Music AI Sandbox collaboration with YouTube. SynthID watermarking now extends to text, images, audio, and video. The Trillium TPUv6 codename was revealed. Google also integrated AI across its product suite including Workspace, Email, Docs, Sheets, Photos, Search, and Lens. "The world awaits Apple's answer."
LMSys advances Llama 3 eval analysis
llama-3-70b llama-3 claude-3-sonnet alphafold-3 lmsys openai google-deepmind isomorphic-labs benchmarking model-behavior prompt-complexity model-specification molecular-structure-prediction performance-analysis leaderboards demis-hassabis sam-altman mira-murati karina-nguyen joanne-jang john-schulman
LMSys is enhancing LLM evaluation by categorizing performance across 8 query subcategories and 7 prompt complexity levels, revealing uneven strengths in models like Llama-3-70b. DeepMind released AlphaFold 3, advancing molecular structure prediction with holistic modeling of protein-DNA-RNA complexes, impacting biology and genetics research. OpenAI introduced the Model Spec, a public standard to clarify model behavior and tuning, inviting community feedback and aiming for models to learn directly from it. Llama 3 has reached top leaderboard positions on LMSys, nearly matching Claude-3-sonnet in performance, with notable variations on complex prompts. The analysis highlights the evolving landscape of model benchmarking and behavior shaping.
OpenAI's PR Campaign?
alphafold-3 xlstm gpt-4 openai microsoft google-deepmind memory-management model-spec scaling multimodality performance transformers dynamic-memory model-architecture demis-hassabis sama joanne-jang omarsar0 arankomatsuzaki drjimfan
OpenAI faces user data deletion backlash over its new partnership with StackOverflow amid GDPR complaints and US newspaper lawsuits, while addressing election year concerns with efforts like the Media Manager tool for content opt-in/out by 2025 and source link attribution. Microsoft develops a top-secret airgapped GPT-4 AI service for US intelligence agencies. OpenAI releases the Model Spec outlining responsible AI content generation policies, including NSFW content handling and profanity use, emphasizing clear distinctions between bugs and design decisions. Google DeepMind announces AlphaFold 3, a state-of-the-art model predicting molecular structures with high accuracy, showcasing cross-domain AI techniques. New research on xLSTM proposes scaling LSTMs to billions of parameters, competing with transformers in performance and scaling. Microsoft introduces vAttention, a dynamic memory management method for efficient large language model serving without PagedAttention.
DeepSeek-V2 beats Mixtral 8x22B with >160 experts at HALF the cost
deepseek-v2 llama-3-120b llama-3-400b gpt-4 mistral phi claude gemini mai-1 med-gemini deepseek-ai mistral-ai microsoft openai scale-ai tesla nvidia google-deepmind mixture-of-experts multi-head-attention model-inference benchmarking overfitting robotics teleoperation open-source multimodality hallucination-detection fine-tuning medical-ai model-training erhartford maximelabonne bindureddy adcock_brett drjimfan clementdelangue omarsar0 rohanpaul_ai
DeepSeek V2 introduces a new state-of-the-art MoE model with 236B parameters and a novel Multi-Head Latent Attention mechanism, achieving faster inference and surpassing GPT-4 on AlignBench. Llama 3 120B shows strong creative writing skills, while Microsoft is reportedly developing a 500B parameter LLM called MAI-1. Research from Scale AI highlights overfitting issues in models like Mistral and Phi, whereas GPT-4, Claude, Gemini, and Llama maintain benchmark robustness. In robotics, Tesla Optimus advances with superior data collection and teleoperation, LeRobot marks a move toward open-source robotics AI, and Nvidia's DrEureka automates robot skill training. Multimodal LLM hallucinations are surveyed with new mitigation strategies, and Google's Med-Gemini achieves SOTA on medical benchmarks with fine-tuned multimodal models.
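Multi-Head Latent Attention's key trick can be sketched schematically (dimensions are illustrative, and this mirrors the public description of DeepSeek-V2 rather than its code): keys and values are jointly down-projected into one small latent per token, only that latent is cached, and full-size K/V are re-expanded on the fly, which is what shrinks the KV cache and speeds inference.

```python
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 1024, 128, 8, 128

down = nn.Linear(d_model, d_latent, bias=False)           # joint KV compression
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # decompress to keys
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # decompress to values

h = torch.randn(1, 16, d_model)     # hidden states for 16 tokens
latent_cache = down(h)              # (1, 16, 128): this is all that is cached
k = up_k(latent_cache).view(1, 16, n_heads, d_head)
v = up_v(latent_cache).view(1, 16, n_heads, d_head)
print(latent_cache.shape, k.shape)  # tiny cache, full-size K/V rebuilt on the fly
```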
Cohere Command R+, Anthropic Claude Tool Use, OpenAI Finetuning
c4ai-command-r-plus claude-3 gpt-3.5-turbo gemini mistral-7b gemma-2 claude-3-5 llama-3 vicuna cohere anthropic openai microsoft stability-ai opera-software meta-ai-fair google-deepmind mistral-ai tool-use multilingual-models rag fine-tuning quantum-computing audio-generation local-inference context-windows model-size-analysis model-comparison
Cohere launched Command R+, a 104B dense model with 128k context length focusing on RAG, tool-use, and multilingual capabilities across 10 key languages. It supports Multi-Step Tool use and offers open weights for research. Anthropic introduced tool use in beta for Claude, supporting over 250 tools with new cookbooks for practical applications. OpenAI enhanced its fine-tuning API with new upgrades and case studies from Indeed, SK Telecom, and Harvey, promoting DIY fine-tuning and custom model training. Microsoft achieved a quantum computing breakthrough with an 800x error rate improvement and the most usable qubits to date. Stability AI released Stable Audio 2.0, improving audio generation quality and control. The Opera browser added local inference support for large language models like Meta's Llama, Google's Gemma, and Vicuna. Discussions on Reddit highlighted Gemini's large context window, analysis of GPT-3.5-Turbo model size, and a battle simulation between Claude 3 and ChatGPT using local 7B models like Mistral and Gemma.
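Tool use in the Anthropic API follows the shape below (a hedged sketch; the beta originally required an `anthropic-beta: tools-2024-04-04` header, and the weather tool here is invented): tools are declared with JSON schemas, and Claude returns a `tool_use` content block when it wants one called.

```python
import anthropic

client = anthropic.Anthropic()
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

resp = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=512,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
)
for block in resp.content:
    if block.type == "tool_use":
        # The caller runs the tool and sends a tool_result message back.
        print(block.name, block.input)  # e.g. get_weather {'city': 'Paris'}
```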
Shipping and Dipping: Inflection + Stability edition
inflection-ai-2.5 stable-diffusion-3 claude-3-haiku claude-3-sonnet claude-3-opus tacticai inflection-ai stability-ai microsoft nvidia google-deepmind anthropic executive-departures gpu-acceleration ai-assistants geometric-deep-learning ai-integration ai-cost-reduction ai-job-displacement ai-healthcare model-release mustafa-suleyman
Inflection AI and Stability AI recently shipped major updates (Inflection AI 2.5 and Stable Diffusion 3) but are now experiencing significant executive departures, signaling potential consolidation in the GPU-rich startup space. Mustafa Suleyman has joined Microsoft AI as CEO, overseeing consumer AI products like Copilot, Bing, and Edge. Microsoft Azure is collaborating with NVIDIA on the Grace Blackwell 200 Superchip. Google DeepMind announced TacticAI, an AI assistant for football tactics developed with Liverpool FC, using geometric deep learning and achieving 90% expert approval in blind tests. Anthropic released Claude 3 Haiku and Claude 3 Sonnet on Google Cloud's Vertex AI, with Claude 3 Opus coming soon. Concerns about AI job displacement arise as NVIDIA introduces AI nurses that outperform humans at bedside manner at 90% lower cost.
One Year of Latent Space
gemini-1.5 gemma-7b mistral-next opus-v1 orca-2-13b nous-hermes-2-dpo-7b google-deepmind nous-research mistral-ai hugging-face nvidia langchain jetbrains ai-ethics bias-mitigation fine-tuning performance-optimization model-merging knowledge-transfer text-to-3d ai-hallucination hardware-optimization application-development vulnerability-research jim-keller richard-socher
Latent Space podcast celebrated its first anniversary, reaching #1 among AI Engineering podcasts and 1 million unique readers on Substack. The Gemini 1.5 image generator by Google DeepMind sparked controversy over bias and inaccurate representation, leading to community debates on AI ethics. Discussions in the TheBloke and LM Studio Discords highlighted AI's growing role in creative industries, especially game development and text-to-3D tools. Fine-tuning and performance optimization of models like Gemma 7B and Mistral-next were explored in the Nous Research AI and Mistral Discords, with shared solutions including learning rates and open-source tools. Emerging trends in AI hardware and application development were discussed in the CUDA MODE and LangChain AI Discords, including critiques of Nvidia's CUDA by Jim Keller and advances in reducing AI hallucinations hinted at by Richard Socher.
Sora pushes SOTA
gemini-1.5 sora h20-gpt mistral-7b llama-13b mistralcasualml mixtral-instruct yi-models openai google-deepmind nvidia mistral-ai h2oai multimodality gpu-power-management long-context model-merging fine-tuning retrieval-augmented-generation role-play-model-optimization cross-language-integration training-loss synthetic-data-generation coding-support
Analysis of over 20 Discord guilds, 312 channels, and 10,550 messages reveals intense discussion of AI developments. Key highlights include the Dungeon Master AI assistant for Dungeons and Dragons using models like H20 GPT, GPU power-supply debates involving 3090 and 3060 GPUs, and excitement around Google's Gemini 1.5 with its 1 million token context window and OpenAI's Sora model. Challenges with large world models (LWM) multimodality, GPT-assisted coding, and role-play model optimization with Yi models and Mixtral Instruct were discussed. Technical issues like model merging errors with MistralCasualML, fine-tuning scripts like AutoFineTune, and cross-language engineering via JSPyBridge were also prominent. NVIDIA's Chat with RTX feature leveraging retrieval-augmented generation (RAG) on 30+ series GPUs was compared to LM Studio's support for Mistral 7b and Llama 13b models. The community is cautiously optimistic about these frontier models' applications in media and coding.
12/26/2023: not much happened today
llava exllama2 meta-ai-fair google-deepmind gpu-offloading vram-utilization model-conversion moe-models multimodality model-performance hardware-configuration model-saving chatml installation-issues music-generation
LM Studio users extensively discussed its performance, installation issues on macOS, and upcoming features like Exllama2 support and multimodality with the Llava model. Conversations covered GPU offloading, vRAM utilization, MoE model expert selection, and model conversion compatibility. The community also addressed inefficient help requests referencing the blog 'Don't Ask to Ask, Just Ask'. Technical challenges with ChromaDB Plugin, server vs desktop hardware performance, and saving model states with Autogen were highlighted. Discussions included comparisons with other chatbots and mentions of AudioCraft from meta-ai-fair and MusicLM from google-deepmind for music generation.
12/18/2023: Gaslighting Mistral for fun and profit
gpt-4-turbo gpt-3.5-turbo claude-2.1 claude-instant-1 gemini-pro gpt-4.5 dalle-3 openai anthropic google-deepmind prompt-engineering api model-performance ethics role-play user-experience ai-impact-on-jobs ai-translation technical-issues sam-altman
OpenAI Discord discussions reveal comparisons among language models including GPT-4 Turbo, GPT-3.5 Turbo, Claude 2.1, Claude Instant 1, and Gemini Pro, with GPT-4 Turbo noted for user-centric explanations. Rumors about GPT-4.5 remain unconfirmed, with skepticism prevailing until official announcements. Users discuss technical challenges like slow responses and API issues, and explore role-play prompt techniques to enhance model performance. Ethical concerns about AI's impact on academia and employment are debated. Future features for DALL-E 3 and a proposed new GPT model are speculated upon, while a school project seeks help using the OpenAI API. The community also touches on AI glasses and the job-market implications of AI adoption.
12/16/2023: ByteDance suspended by OpenAI
claude-2.1 gpt-4-turbo gemini-1.5-pro gpt-5 gpt-4.5 gpt-4 openai google-deepmind anthropic hardware gpu api-costs coding model-comparison subscription-issues payment-processing feature-confidentiality ai-art-generation organizational-productivity model-speculation
The OpenAI Discord community discussed hardware options like Mac racks and the A6000 GPU, highlighting their value for AI workloads. They compared Claude 2.1 and GPT-4 Turbo on coding tasks, with GPT-4 Turbo outperforming Claude 2.1. The benefits of the Bard API for Gemini Pro were noted, including a free quota of 60 queries per minute. Users shared experiences with ChatGPT Plus membership issues, payment problems, and speculated about the upcoming GPT-5 and the rumored GPT-4.5. Discussions also covered the confidentiality of the Alpha feature, AI art generation policies, and improvements in organizational work features. The community expressed mixed feelings about GPT-4's performance and awaited future model updates.