All tags
Topic: "model-distillation"
The new OpenAI Agents Platform
reka-flash-3 o1-mini claude-3-7-sonnet llama-3-3-70b sonic-2 qwen-chat olympiccoder openai reka-ai hugging-face deepseek togethercompute alibaba ai-agents api model-releases fine-tuning reinforcement-learning model-training model-inference multimodality voice-synthesis gpu-clusters model-distillation performance-optimization open-source sama reach_vb
OpenAI introduced a comprehensive suite of new tools for AI agents, including the Responses API, Web Search Tool, Computer Use Tool, File Search Tool, and an open-source Agents SDK with integrated observability tools, marking a significant step towards the "Year of Agents." Meanwhile, Reka AI open-sourced Reka Flash 3, a 21B parameter reasoning model that outperforms o1-mini and powers their Nexus platform, with weights available on Hugging Face. The OlympicCoder series surpassed Claude 3.7 Sonnet and much larger models on competitive coding benchmarks. DeepSeek built a 32K GPU cluster capable of training V3-level models in under a week and is exploring AI distillation. Hugging Face announced Cerebras inference support, achieving over 2,000 tokens/s on Llama 3.3 70B, 70x faster than leading GPUs. Reka's Sonic-2 voice AI model delivers 40ms latency via the Together API. Alibaba's Qwen Chat enhanced its multimodal interface with video understanding up to 500MB, voice-to-text, guest mode, and expanded file uploads. Sama praised OpenAI's new API as "one of the most well-designed and useful APIs ever."
not much happened today
gpt-4.5 claude-3.7-sonnet deepseek-r1 smolagents-codeagent gpt-4o llama-3-8b tinyr1-32b-preview r1-searcher forgetting-transformer nanomoe openai deepseek hugging-face mixture-of-experts reinforcement-learning kv-cache-compression agentic-ai model-distillation attention-mechanisms model-compression minimax model-pretraining andrej-karpathy cwolferesearch aymericroucher teortaxestex jonathanross321 akhaliq
The AI news recap highlights several key developments: nanoMoE, a PyTorch implementation of a mid-sized Mixture-of-Experts (MoE) model inspired by Andrej Karpathy's nanoGPT, enables pretraining on commodity hardware within a week. An agentic leaderboard ranks LLMs powering smolagents CodeAgent, with GPT-4.5 leading, followed by Claude-3.7-Sonnet. Discussions around DeepSeek-R1 emphasize AI model commoditization, with DeepSeek dubbed the "OpenAI of China." Q-Filters offer a training-free method for KV cache compression in autoregressive models, achieving 32x compression with minimal perplexity loss. The PokéChamp minimax language agent, powered by GPT-4o and Llama-3-8b, demonstrates strong performance in Pokémon battles. Other notable models include TinyR1-32B-Preview with Branch-Merge Distillation, R1-Searcher incentivizing search capability via reinforcement learning, and the Forgetting Transformer using a Forget Gate in softmax attention. These advancements reflect ongoing innovation in model architectures, compression, reinforcement learning, and agentic AI.
not much happened today
grok-3 deepseek-r1 siglip-2 o3-mini-high r1-1776 llamba-1b llamba-3b llamba-8b llama-3 alphamaze audiobox-aesthetics xai nvidia google-deepmind anthropic openai bytedance ollama meta-ai-fair benchmarking model-releases performance reasoning multimodality semantic-understanding ocr multilinguality model-distillation recurrent-neural-networks visual-reasoning audio-processing scaling01 iscienceluvr philschmid arankomatsuzaki reach_vb mervenoyann wightmanr lmarena_ai ollama akhaliq
Grok-3, a new family of LLMs from xAI using 200,000 Nvidia H100 GPUs for advanced reasoning, outperforms models from Google, Anthropic, and OpenAI on math, science, and coding benchmarks. DeepSeek-R1 from ByteDance Research achieves top accuracy on the challenging SuperGPQA dataset. SigLIP 2 from GoogleDeepMind improves semantic understanding and OCR with flexible resolutions and multilingual capabilities, available on HuggingFace. OpenAI's o3-mini-high ranks #1 in coding and math prompts. Perplexity's R1 1776, a post-trained version of DeepSeek R1, is available on Ollama. The Llamba family distills Llama-3.x into efficient recurrent models with higher throughput. AlphaMaze combines DeepSeek R1 with GRPO for visual reasoning on ARC-AGI puzzles. Audiobox Aesthetics from Meta AI offers unified quality assessment for audio. The community notes that Grok 3's compute increase yields only modest performance gains.
Bespoke-Stratos + Sky-T1: The Vicuna+Alpaca moment for reasoning
sky-t1-32b-preview qwen-2.5-32b r1 o1-preview gpt-4o claude-3-sonnet bespoke-stratos-32b gemini-2.0-flash-thinking berkeley usc deepseek bespoke-labs google llmsys stanford lm-sys reasoning supervised-finetuning reinforcement-learning multimodality model-distillation context-windows code-execution model-repeatability behavioral-self-awareness rlhf teortaxestex cwolferesearch madiator chakraai philschmid abacaj omarsar0
Reasoning Distillation has emerged as a key technique, with Berkeley/USC researchers releasing Sky-T1-32B-Preview, a finetuned model of Qwen 2.5 32B using 17k reasoning traces for just $450, matching benchmarks of o1-preview. DeepSeek introduced R1, a model surpassing o1-preview and enabling distillation to smaller models like a 1.5B Qwen to match gpt-4o and claude-3-sonnet levels. Bespoke Labs further distilled R1 on Qwen, outperforming o1-preview with fewer samples. This progress suggests that "SFT is all you need" for reasoning without major architecture changes. Additionally, DeepSeek-R1 uses pure reinforcement learning with supervised finetuning to accelerate convergence and shows strong reasoning and multimodal capabilities. Google's Gemini 2.0 Flash Thinking model boasts a 1 million token context window, code execution, and excels in math, science, and multimodal reasoning. Critiques highlight challenges in model repeatability, behavioral self-awareness, and RLHF limitations in reasoning robustness.
Project Stargate: $500b datacenter (1.7% of US GDP) and Gemini 2 Flash Thinking 2
gemini-2.0-flash deepseek-r1 qwen-32b openai softbank oracle arm microsoft nvidia huggingface deepseek-ai long-context quantization code-interpretation model-distillation open-source agi-research model-performance memory-optimization noam-shazeer liang-wenfeng
Project Stargate, a US "AI Manhattan project" led by OpenAI and Softbank, supported by Oracle, Arm, Microsoft, and NVIDIA, was announced with a scale comparable to the original Manhattan project costing $35B inflation adjusted. Despite Microsoft's reduced role as exclusive compute partner, the project is serious but not immediately practical. Meanwhile, Noam Shazeer revealed a second major update to Gemini 2.0 Flash Thinking, enabling 1M token long context usable immediately. Additionally, AI Studio introduced a new code interpreter feature. On Reddit, DeepSeek R1, a distillation of Qwen 32B, was released for free on HuggingChat, sparking discussions on self-hosting, performance issues, and quantization techniques. DeepSeek's CEO Liang Wenfeng highlighted their focus on fundamental AGI research, efficient MLA architecture, and commitment to open-source development despite export restrictions, positioning DeepSeek as a potential alternative to closed-source AI trends.
DeepSeek R1: o1-level open weights model and a simple recipe for upgrading 1.5B models to Sonnet/4o level
deepseek-r1 deepseek-v3 qwen-2.5 llama-3.1 llama-3.3-70b deepseek ollama qwen llama reinforcement-learning fine-tuning model-distillation model-optimization reasoning reward-models multi-response-sampling model-training
DeepSeek released DeepSeek R1, a significant upgrade over DeepSeek V3 from just three weeks prior, featuring 8 models including full-size 671B MoE models and multiple distillations from Qwen 2.5 and Llama 3.1/3.3. The models are MIT licensed, allowing finetuning and distillation. Pricing is notably cheaper than o1 by 27x-50x. The training process used GRPO (reward for correctness and style outcomes) without relying on PRM, MCTS, or reward models, focusing on reasoning improvements through reinforcement learning. Distilled models can run on Ollama and show strong capabilities like writing Manim code. The release emphasizes advances in reinforcement-learning, fine-tuning, and model-distillation with a novel RL framework from DeepSeekMath.
Moondream 2025.1.9: Structured Text, Enhanced OCR, Gaze Detection in a 2B Model
o1 vdr-2b-multi-v1 llava-mini openai llamaindex langchainai qdrant genmoai vision model-efficiency structured-output gaze-detection reasoning model-distillation multimodality embedding-models gan diffusion-models self-attention training-optimizations development-frameworks api cross-language-deployment semantic-search agentic-document-processing developer-experience philschmid saranormous jxmnop reach_vb iscienceluvr multimodalart arohan adcock_brett awnihannun russelljkaplan ajayj_
Moondream has released a new version that advances VRAM efficiency and adds structured output and gaze detection, marking a new frontier in vision model practicality. Discussions on Twitter highlighted advancements in reasoning models like OpenAI's o1, model distillation techniques, and new multimodal embedding models such as vdr-2b-multi-v1 and LLaVA-Mini, which significantly reduce computational costs. Research on GANs and decentralized diffusion models showed improved stability and performance. Development tools like MLX and vLLM received updates for better portability and developer experience, while frameworks like LangChain and Qdrant enable intelligent data workflows. Company updates include new roles and team expansions at GenmoAI. "Efficiency tricks are all you need."
DeepSeek v3: 671B finegrained MoE trained for $5.5m USD of compute on 15T tokens
deepseek-v3 gpt-4o claude-3.5-sonnet llama-3 deepseek-ai hugging-face openai anthropic mixture-of-experts model-training model-optimization reinforcement-learning chain-of-thought multi-token-prediction synthetic-data model-distillation fine-tuning attention-mechanisms gpu-optimization nrehiew_ denny_zhou
DeepSeek-V3 has launched with 671B MoE parameters and trained on 14.8T tokens, outperforming GPT-4o and Claude-3.5-sonnet in benchmarks. It was trained with only 2.788M H800 GPU hours, significantly less than Llama-3's 30.8M GPU-hours, showcasing major compute efficiency and cost reduction. The model is open-source and deployed via Hugging Face with API support. Innovations include native FP8 mixed precision training, Multi-Head Latent Attention scaling, distillation from synthetic reasoning data, pruning and healing for MoEs with up to 256 experts, and a new multi-token prediction objective enabling lookahead token planning. Research highlights also cover the OREO method and Natural Language Reinforcement Learning (NLRL) for multi-step reasoning and agent control.
s{imple|table|calable} Consistency Models
llama-3-70b llama-3-405b llama-3-1 stable-diffusion-3.5 gpt-4 stability-ai tesla cerebras cohere langchain model-distillation diffusion-models continuous-time-consistency-models image-generation ai-hardware inference-speed multilingual-models yang-song
Model distillation significantly accelerates diffusion models, enabling near real-time image generation with only 1-4 sampling steps, as seen in BlinkShot and Flux Schnell. Research led by Yang Song introduced simplified continuous-time consistency models (sCMs), achieving under 10% FID difference in just 2 steps and scaling up to 1.5B parameters for higher quality. On AI hardware, Tesla is deploying a 50k H100 cluster potentially capable of completing GPT-4 training in under three weeks, while Cerebras Systems set a new inference speed record on Llama 3.1 70B with their wafer-scale AI chips. Stability AI released Stable Diffusion 3.5 and its Turbo variant, and Cohere launched new multilingual models supporting 23 languages with state-of-the-art performance. LangChain also announced ecosystem updates.
OpenAI Realtime API and other Dev Day Goodies
gpt-4o-realtime-preview gpt-4o openai livekit agora twilio grab automat voice-activity-detection function-calling ephemeral-sessions auto-truncation vision-fine-tuning model-distillation prompt-caching audio-processing
OpenAI launched the gpt-4o-realtime-preview Realtime API featuring text and audio token processing with pricing details and future plans including vision and video support. The API supports voice activity detection modes, function calling, and ephemeral sessions with auto-truncation for context limits. Partnerships with LiveKit, Agora, and Twilio enhance audio components and AI virtual agent voice calls. Additionally, OpenAI introduced vision fine-tuning with only 100 examples improving mapping accuracy for Grab and RPA success for Automat. Model distillation and prompt caching features were also announced, including free eval inference for users opting to share data.
not much happened today
llama-3-2 llama-3 gemma-2 phi-3-5-mini claude-3-haiku gpt-4o-mini molmo gemini-1.5 gemini meta-ai-fair openai allenai google-deepmind multimodality model-optimization benchmarks ai-safety model-distillation pruning adapter-layers open-source-models performance context-windows mira-murati demis-hassabis ylecun sama
Meta AI released Llama 3.2 models including 1B, 3B text-only and 11B, 90B vision variants with 128K token context length and adapter layers for image-text integration. These models outperform competitors like Gemma 2 and Phi 3.5-mini, and are supported on major platforms including AWS, Azure, and Google Cloud. OpenAI CTO Mira Murati announced her departure. Allen AI released Molmo, an open-source multimodal model family outperforming proprietary systems. Google improved Gemini 1.5 with Flash and Pro models. Meta showcased Project Orion AR glasses and hinted at a Quest 3S priced at $300. Discussions covered new benchmarks for multimodal models, model optimization, and AI safety and alignment.
Rombach et al: FLUX.1 [pro|dev|schnell], $31m seed for Black Forest Labs
gemma-2-2b gpt-3.5-turbo-0613 mixtral-8x7b flux-1 stability-ai google-deepmind nvidia text-to-image text-to-video model-benchmarking open-weight-models model-distillation safety-classifiers sparse-autoencoders ai-coding-tools rohanpaul_ai fchollet bindureddy clementdelangue ylecun svpino
Stability AI co-founder Rombach launched FLUX.1, a new text-to-image model with three variants: pro (API only), dev (open-weight, non-commercial), and schnell (Apache 2.0). FLUX.1 outperforms Midjourney and Ideogram based on Black Forest Labs' ELO score and plans to expand into text-to-video. Google DeepMind released Gemma-2 2B, a 2 billion parameter open-source model that outperforms larger models like GPT-3.5-Turbo-0613 and Mixtral-8x7b on Chatbot Arena, optimized with NVIDIA TensorRT-LLM. The release includes safety classifiers (ShieldGemma) and sparse autoencoder analysis (Gemma Scope). Discussions highlight benchmarking discrepancies and US government support for open-weight AI models. Critiques of AI coding tools' productivity gains were also noted.
Skyfall
gemini-1.5-pro gemini-1.5-flash yi-1.5 kosmos-2.5 paligemma falcon-2 deepseek-v2 hunyuan-dit gemini-1.5 gemini-1.5-flash yi-1.5 google-deepmind yi-ai microsoft hugging-face langchain maven multimodality mixture-of-experts transformer model-optimization long-context model-performance model-inference fine-tuning local-ai scaling-laws causal-models hallucination-detection model-distillation model-efficiency hamel-husain dan-becker clement-delangue philschmid osanseviero arankomatsuzaki jason-wei rohanpaul_ai
Between 5/17 and 5/20/2024, key AI updates include Google DeepMind's Gemini 1.5 Pro and Flash models, featuring sparse multimodal MoE architecture with up to 10M context and a dense Transformer decoder that is 3x faster and 10x cheaper. Yi AI released Yi-1.5 models with extended context windows of 32K and 16K tokens. Other notable releases include Kosmos 2.5 (Microsoft), PaliGemma (Google), Falcon 2, DeepSeek v2 lite, and HunyuanDiT diffusion model. Research highlights feature an Observational Scaling Laws paper predicting model performance across families, a Layer-Condensed KV Cache technique boosting inference throughput by up to 26×, and the SUPRA method converting LLMs into RNNs for reduced compute costs. Hugging Face expanded local AI capabilities enabling on-device AI without cloud dependency. LangChain updated its v0.2 release with improved documentation. The community also welcomed a new LLM Finetuning Discord by Hamel Husain and Dan Becker for Maven course users. "Hugging Face is profitable, or close to profitable," enabling $10 million in free shared GPUs for developers.