All tags
Person: "reach_vb"
not much happened today
codex claude-4-opus claude-4-sonnet gemini-2.5-pro gemini-2.5 qwen-2.5-vl qwen-3 playdiffusion openai anthropic google perplexity-ai bing playai suno hugging-face langchain-ai qwen mlx assemblyai llamacloud fine-tuning model-benchmarking text-to-video agentic-ai retrieval-augmented-generation open-source-models speech-editing audio-processing text-to-speech ultra-low-latency multimodality public-notebooks sama gdb kevinweil lmarena_ai epochairesearch reach_vb wightmanr deeplearningai mervenoyann awnihannun jordirib1 aravsrinivas omarsar0 lioronai jerryjliu0 nerdai tonywu_71 _akhaliq clementdelangue _mfelfel
OpenAI rolled out Codex to ChatGPT Plus users with internet access and fine-grained controls, improving memory features for free users. Anthropic's Claude 4 Opus and Sonnet models lead coding benchmarks, while Google's Gemini 2.5 Pro and Flash models gain recognition with new audio capabilities. Qwen 2.5-VL and Qwen 3 quantizations are noted for versatility and support. Bing Video Creator launched globally enabling text-to-video generation, and Perplexity Labs sees increased demand for travel search. New agentic AI tools and RAG innovations include LlamaCloud and FedRAG. Open-source releases include Holo-1 for web navigation and PlayAI's PlayDiffusion for speech editing. Audio and multimodal advances feature Suno's music editing upgrades, Google's native TTS in 24+ languages, and Universal Streaming's ultra-low latency speech-to-text. Google NotebookLM now supports public notebooks. "Codex's internet access brings tradeoffs, with explicit warnings about risk" and "Gemini 2.5 Pro is cited as a daily driver by users".
not much happened today
deepseek-r1-0528 o3 gemini-2.5-pro claude-opus-4 deepseek_ai openai gemini meta-ai-fair anthropic x-ai ollama hugging-face alibaba bytedance xiaomi reasoning reinforcement-learning benchmarking quantization local-inference model-evaluation open-weights transparency post-training agentic-benchmarks long-context hallucination-detection teortaxestex wenfeng danielhanchen awnihannun reach_vb abacaj
DeepSeek R1-0528 release brings major improvements in reasoning, hallucination reduction, JSON output, and function calling, matching or surpassing closed models like OpenAI o3 and Gemini 2.5 Pro on benchmarks such as Artificial Analysis Intelligence Index, LiveBench, and GPQA Diamond. The model ranks #2 globally in open weights intelligence, surpassing Meta AI, Anthropic, and xAI. Open weights and technical transparency have fueled rapid adoption across platforms like Ollama and Hugging Face. Chinese AI labs including DeepSeek, Alibaba, ByteDance, and Xiaomi now match or surpass US labs in model releases and intelligence, driven by open weights strategies. Reinforcement learning post-training is critical for intelligence gains, mirroring trends seen at OpenAI. Optimized quantization techniques (1-bit, 4-bit) and local inference enable efficient experimentation on consumer hardware. New benchmarks like LisanBench test knowledge, planning, memory, and long-context reasoning, with OpenAI o3 and Claude Opus 4 leading. Discussions highlight concerns about benchmark contamination and overemphasis on RL-tuned gains.
DeepSeek-R1-0528 - Gemini 2.5 Pro-level model, SOTA Open Weights release
deepseek-r1-0528 gemini-2.5-pro qwen-3-8b qwen-3-235b deepseek-ai anthropic meta-ai-fair nvidia alibaba google-deepmind reinforcement-learning benchmarking model-performance open-weights reasoning quantization post-training model-comparison artificialanlys scaling01 cline reach_vb zizhpan andrewyng teortaxestex teknim1 lateinteraction abacaj cognitivecompai awnihannun
DeepSeek R1-0528 marks a significant upgrade, closing the gap with proprietary models like Gemini 2.5 Pro and surpassing benchmarks from Anthropic, Meta, NVIDIA, and Alibaba. This Chinese open-weights model leads in several AI benchmarks, driven by reinforcement learning post-training rather than architecture changes, and demonstrates increased reasoning token usage (23K tokens per question). The China-US AI race intensifies as Chinese labs accelerate innovation through transparency and open research culture. Key benchmarks include AIME 2024, LiveCodeBench, and GPQA Diamond.
not much happened today
kernelllm-8b gpt-4o deepseek-v3 mistral-medium-3 qwen3 blip3-o xgen-small anisora stable-audio-open-small alphaevolve meta-ai-fair mistral-ai qwen deepseek salesforce bilibili stability-ai google benchmarking model-performance multilinguality hardware-optimization multimodality image-generation video-generation text-to-audio model-parallelism chain-of-thought instruction-following reasoning mitigation-strategies reach_vb lmarena_ai theadimeline adcock_brett jxmnop dair_ai omarsar0
Meta released KernelLLM 8B, outperforming GPT-4o and DeepSeek V3 on KernelBench-Triton Level 1. Mistral Medium 3 debuted strongly in multiple benchmarks. Qwen3 models introduced a unified framework with multilingual support. DeepSeek-V3 features hardware-aware co-design. BLIP3-o family released for multimodal tasks using diffusion transformers. Salesforce launched xGen-Small models excelling in long-context and math benchmarks. Bilibili released AniSORA for anime video generation. Stability AI open-sourced Stable Audio Open Small optimized for Arm devices. Google’s AlphaEvolve coding agent improved Strassen's algorithm for the first time since 1969. Research shows chain-of-thought reasoning can harm instruction-following ability, with mitigation strategies like classifier-selective reasoning being most effective, but reasoning techniques show high variance and limited generalization. "Chain-of-thought (CoT) reasoning can harm a model’s ability to follow instructions" and "Mitigation strategies such as few-shot in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning can counteract reasoning-induced failures".
Granola launches team notes, while Notion launches meeting transcription
gpt-4.1 gpt-4o-mini gpt-4.1-mini claude-opus claude-sonnet claude-o3 qwen3 seed1.5-vl llama-4 am-thinking-v1 openai anthropic alibaba meta-ai-fair huggingface granola coding instruction-following benchmarking model-releases reasoning image-generation collaborative-software model-performance kevinweil scaling01 steph_palazzolo andersonbcdefg reach_vb yuchenj_uw qtnx_ _akhaliq risingsayak
GPT-4.1 is now available in ChatGPT for Plus, Pro, and Team users, focusing on coding and instruction following, with GPT 4.1 mini replacing GPT 4o mini. Anthropic is releasing new Claude models including Claude Opus and Claude Sonnet, though some criticism about hallucinations in Claude O3 was noted. Alibaba shared the Qwen3 Technical Report with strong benchmark results from Seed1.5-VL. Meta FAIR announced new models and datasets but faced criticism on Llama 4. AM-Thinking-v1 launched on Hugging Face as a 32B scale reasoning model. Granola raised $43M in Series B and launched Granola 2.0 with a Notion-like UI. The AI ecosystem shows rapid iteration and cloning of ideas, emphasizing execution and distribution.
not much happened today
hunyuan-turbos qwen3-235b-a22b o3 gpt-4.1-nano grok-3 gemini-2.5-pro seed1.5-vl kling-2.0 tencent openai bytedance meta-ai-fair nvidia deepseek benchmarking model-performance moe reasoning vision video-understanding vision-language multimodality model-evaluation model-optimization lmarena_ai artificialanlys gdb _jasonwei iScienceLuvr _akhaliq _philschmid teortaxesTex mervenoyann reach_vb
Tencent's Hunyuan-Turbos has risen to #8 on the LMArena leaderboard, showing strong performance across major categories and significant improvement since February. The Qwen3 model family, especially the Qwen3 235B-A22B (Reasoning) model, is noted for its intelligence and efficient parameter usage. OpenAI introduced HealthBench, a new health evaluation benchmark developed with input from over 250 physicians, where models like o3, GPT-4.1 nano, and Grok 3 showed strong results. ByteDance released Seed1.5-VL, a vision-language model with a 532M-parameter vision encoder and a 20B active parameter MoE LLM, achieving state-of-the-art results on 38 public benchmarks. In vision-language, Kling 2.0 leads image-to-video generation, and Gemini 2.5 Pro excels in video understanding with advanced multimodal capabilities. Meta's Vision-Language-Action framework and updates on VLMs for 2025 were also highlighted.
Prime Intellect's INTELLECT-2 and PRIME-RL advance distributed reinforcement learning
intellect-2 dreamo qwen gemini-2.5-pro dynamic-byte-latent-transformer gen-4-references mistral-medium-3 le-chat-enterprise primeintellect bytedance qwen gemma meta-ai-fair runwayml mistral-ai google distributed-training reinforcement-learning gpu-clusters model-optimization quantization multimodality agentic-ai video-understanding fine-tuning _akhaliq reach_vb osanseviero aiatmeta c_valenzuelab lmarena_ai adcock_brett
Prime Intellect released INTELLECT-2, a decentralized GPU training and RL framework with a vision for distributed AI training overcoming colocation limits. ByteDance launched DreamO, a unified image customization model on Hugging Face. Qwen released models optimized for GPTQ, GGUF, and AWQ quantization. Gemma surpassed 150 million downloads on Hugging Face. Meta released weights for the Dynamic Byte Latent Transformer and the Collaborative Reasoner framework to improve language model efficiency and reasoning. RunwayML introduced Gen-4 References, a near-realtime model requiring no fine-tuning. Mistral AI released Mistral Medium 3, a strong multimodal model, and Le Chat Enterprise, an agentic AI assistant for business. Google updated Gemini 2.5 Pro Preview with video understanding and UI improvements. "Airbnb for spare GPUs from all over the world" highlights the ongoing challenges and potential of distributed GPU training.
not much happened today
open-code-reasoning-32b open-code-reasoning-14b open-code-reasoning-7b mistral-medium-3 llama-4-maverick gemini-2.5-pro gemini-2.5-flash claude-3.7-sonnet absolute-zero-reasoner x-reasoner fastvlm parakeet-asr openai nvidia mistral-ai google apple huggingface reinforcement-learning fine-tuning code-generation reasoning vision on-device-ai model-performance dataset-release model-optimization reach_vb artificialanlys scaling01 iscienceluvr arankomatsuzaki awnihannun risingsayak
OpenAI launched both Reinforcement Finetuning and Deep Research on GitHub repos, drawing comparisons to Cognition's DeepWiki. Nvidia open-sourced Open Code Reasoning models (32B, 14B, 7B) with Apache 2.0 license, showing 30% better token efficiency and compatibility with llama.cpp, vLLM, transformers, and TGI. Independent evaluations highlight Mistral Medium 3 rivaling Llama 4 Maverick, Gemini 2.0 Flash, and Claude 3.7 Sonnet in coding and math reasoning, priced significantly lower but no longer open-source. Google's Gemini 2.5 Pro is noted as their most intelligent model with improved coding from simple prompts, while Gemini 2.5 Flash incurs a 150x cost increase over Gemini 2.0 Flash due to higher token usage and cost. The Absolute Zero Reasoner (AZR) achieves SOTA performance in coding and math reasoning via reinforced self-play without external data. Vision-language model X-REASONER is post-trained on general-domain text for reasoning. Apple ML research released FastVLM with on-device iPhone demo. HiDream LoRA trainer supports QLoRA fine-tuning under memory constraints. Nvidia's Parakeet ASR model tops Hugging Face ASR leaderboard with MLX implementation. New datasets SwallowCode and SwallowMath boost LLM performance in math and code. Overall, a quiet day with significant model releases and performance insights.
not much happened today
qwen3-14b qwen3-32b qwen3-235b phi-4-reasoning o3-mini command-a gemini-2.5-pro o4-mini olm-o2-1b o3 alibaba together-ai scaling01 microsoft deepseek cohere google epoch-ai-research inception-labs openai allenai quantization fine-tuning reinforcement-learning benchmarking video-generation diffusion-models model-performance model-evaluation model-release text-generation cline _philschmid iscienceluvr alexalbert__ _lewtun teortaxestex sarahookr reach_vb
Qwen model family released quantized versions of Qwen3 models including 14B, 32B, and 235B parameters, with promising coding capabilities in Qwen3-235B. Microsoft launched Phi-4-reasoning, a 14B parameter model distilled from OpenAI's o3-mini, emphasizing supervised fine-tuning and reinforcement learning, outperforming larger models in some benchmarks. Cohere's Command A leads SQL performance on Bird Bench. Google introduced the TRAJAN eval for video generation temporal consistency and updated the Gemini OpenAI compatibility layer. Inception Labs launched a diffusion LLM API claiming 5x speed improvements over autoregressive models. Community rankings show OpenAI's o3 model debuting strongly in web app-building tasks. Other releases include AllenAI's OLMo2 1B and additional Phi 4 variants. "Qwen3-235B shows promise for coding" and "Phi-4-reasoning tech report emphasizes SFT gains" highlight key advancements.
not much happened today
phi-4 phi-4-mini-reasoning qwen3-235b qwen3-moe-235b qwen3-moe-30b qwen3-dense-32b qwen3-dense-14b qwen3-dense-8b qwen3-dense-4b qwen3-dense-0.6b qwen2.5-omni-3b deepseek-prover-v2 llama llama-guard-4 prompt-guard-2 mimo-7b microsoft anthropic cursor alibaba togethercompute deepseek meta-ai-fair xiaomi openrouterai cohere reasoning model-fine-tuning model-evaluation benchmarking model-popularity open-source math model-scaling model-filtering jailbreak-prevention cline reach_vb vipulved akhaliq omarsar0 zhs05232838 huajian_xin mervenoyann karpathy random_walker sarahookr blancheminerva clefourrier
Microsoft released Phi-reasoning 4, a finetuned 14B reasoning model slightly behind QwQ but limited by data transparency and token efficiency issues. Anthropic introduced remote MCP server support and a 45-minute Research mode in Claude. Cursor published a model popularity list. Alibaba launched Qwen3-235B and other Qwen3 variants, highlighting budget-friendly coding and reasoning capabilities, with availability on Together AI API. Microsoft also released Phi-4-Mini-Reasoning with benchmark performance on AIME 2025 and OmniMath. DeepSeek announced DeepSeek-Prover V2 with state-of-the-art math problem solving, scaling to 671B parameters. Meta AI's Llama models hit 1.2 billion downloads, with new Llama Guard 4 and Prompt Guard 2 for input/output filtering and jailbreak prevention. Xiaomi released the open-source reasoning model MiMo-7B trained on 25 trillion tokens. Discussions on AI model evaluation highlighted issues with the LMArena leaderboard, data access biases favoring proprietary models, and challenges in maintaining fair benchmarking, with suggestions for alternatives like OpenRouterAI rankings. "LMArena slop and biased" and "61.3% of all data going to proprietary model providers" were noted concerns.
ChatGPT responds to GlazeGate + LMArena responds to Cohere
qwen3-235b-a22b qwen3 qwen3-moe llama-4 openai cohere lm-arena deepmind x-ai meta-ai-fair alibaba vllm llamaindex model-releases model-benchmarking performance-evaluation open-source multilinguality model-integration fine-tuning model-optimization joannejang arankomatsuzaki karpathy sarahookr reach_vb
OpenAI faced backlash after a controversial ChatGPT update, leading to an official retraction admitting they "focused too much on short-term feedback." Researchers from Cohere published a paper criticizing LMArena for unfair practices favoring incumbents like OpenAI, DeepMind, X.ai, and Meta AI Fair. The Qwen3 family by Alibaba was released, featuring models up to 235B MoE, supporting 119 languages and trained on 36 trillion tokens, with integration into vLLM and support in tools like llama.cpp. Meta announced the second round of Llama Impact Grants to promote open-source AI innovation. Discussions on AI Twitter highlighted concerns about leaderboard overfitting and fairness in model benchmarking, with notable commentary from karpathy and others.
LlamaCon: Meta AI gets into the Llama API platform business
llama-4 qwen3 qwen3-235b-a22b qwen3-30b-a3b qwen3-4b qwen2-5-72b-instruct o3-mini meta-ai-fair cerebras groq alibaba vllm ollama llamaindex hugging-face llama-cpp model-release fine-tuning reinforcement-learning moe multilingual-models model-optimization model-deployment coding benchmarking apache-license reach_vb huybery teortaxestex awnihannun thezachmueller
Meta celebrated progress in the Llama ecosystem at LlamaCon, launching an AI Developer platform with finetuning and fast inference powered by Cerebras and Groq hardware, though it remains waitlisted. Meanwhile, Alibaba released the Qwen3 family of large language models, including two MoE models and six dense models ranging from 0.6B to 235B parameters, with the flagship Qwen3-235B-A22B achieving competitive benchmark results and supporting 119 languages and dialects. The Qwen3 models are optimized for coding and agentic capabilities, are Apache 2.0 licensed, and have broad deployment support including local usage with tools like vLLM, Ollama, and llama.cpp. Community feedback highlights Qwen3's scalable performance and superiority over models like OpenAI's o3-mini.
Qwen 3: 0.6B to 235B MoE full+base models that beat R1 and o1
qwen-3 qwen3-235b-a22b qwen3-30b-a3b deepseek-r1 o1 o3-mini grok-3 gemini-2.5-pro alibaba google-deepmind deepseek mistral-ai mixture-of-experts reinforcement-learning benchmarking model-release model-architecture long-context multi-agent-systems inference dataset-release awnihannun prince_canuma actuallyisaak oriolvinyalsml iscienceluvr reach_vb teortaxestex omarsar0
Qwen 3 has been released by Alibaba featuring a range of models including two MoE variants, Qwen3-235B-A22B and Qwen3-30B-A3B, which demonstrate competitive performance against top models like DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. The models introduce an "enable_thinking=True" mode with advanced soft switching for inference scaling. The release is notable for its Apache 2.0 license and broad inference platform support including MCP. The dataset improvements and multi-stage RL post-training contribute to performance gains. Meanwhile, Gemini 2.5 Pro from Google DeepMind shows strong coding and long-context reasoning capabilities, and DeepSeek R2 is anticipated soon. Twitter discussions highlight Qwen3's finegrained MoE architecture, large context window, and multi-agent system applications.
Cognition's DeepWiki, a free encyclopedia of all GitHub repos
o4-mini perception-encoder qwen-2.5-vl dia-1.6b grok-3 gemini-2.5-pro claude-3.7 gpt-4.1 cognition meta-ai-fair alibaba hugging-face openai perplexity-ai vllm vision text-to-speech reinforcement-learning ocr model-releases model-integration open-source frameworks chatbots model-selector silas-alberti mervenoyann reach_vb aravsrinivas vikparuchuri lioronai
Silas Alberti of Cognition announced DeepWiki, a free encyclopedia of all GitHub repos providing Wikipedia-like descriptions and Devin-backed chatbots for public repos. Meta released Perception Encoders (PE) with A2.0 license, outperforming InternVL3 and Qwen2.5VL on vision tasks. Alibaba launched the Qwen Chat App for iOS and Android. Hugging Face integrated the Dia 1.6B SoTA text-to-speech model via FAL. OpenAI expanded deep research usage with a lightweight version powered by o4-mini model, now available to free users. Perplexity AI updated their model selector with Grok 3 Beta, o4-mini, and support for models like gemini 2.5 pro, claude 3.7, and gpt-4.1. vLLM project introduced OpenRLHF framework for reinforcement learning with human feedback. Surya OCR alpha model supports 90+ languages and LaTeX. MegaParse open-source library was introduced for LLM-ready data formats.
gpt-image-1 - ChatGPT's imagegen model, confusingly NOT 4o, now available in API
gpt-image-1 o3 o4-mini gpt-4.1 eagle-2.5-8b gpt-4o qwen2.5-vl-72b openai nvidia hugging-face x-ai image-generation content-moderation benchmarking long-context multimodality model-performance supercomputing virology video-understanding model-releases kevinweil lmarena_ai _philschmid willdepue arankomatsuzaki epochairesearch danhendrycks reach_vb mervenoyann _akhaliq
OpenAI officially launched the gpt-image-1 API for image generation and editing, supporting features like alpha channel transparency and a "low" content moderation policy. OpenAI's models o3 and o4-mini are leading in benchmarks for style control, math, coding, and hard prompts, with o3 ranking #1 in several categories. A new benchmark called Vending-Bench reveals performance variance in LLMs on extended tasks. GPT-4.1 ranks in the top 5 for hard prompts and math. Nvidia's Eagle 2.5-8B matches GPT-4o and Qwen2.5-VL-72B in long-video understanding. AI supercomputer performance doubles every 9 months, with xAI's Colossus costing an estimated $7 billion and the US dominating 75% of global performance. The Virology Capabilities Test shows OpenAI's o3 outperforms 94% of expert virologists. Nvidia also released the Describe Anything Model (DAM), a multimodal LLM for detailed image and video captioning, now available on Hugging Face.
QwQ-32B claims to match DeepSeek R1-671B
qwen-2.5-plus qwq-32b deepseek-r1 gpt-4.5 gpt-3 davinci alibaba openai deepseek-ai reinforcement-learning math code-execution instruction-following alignment reasoning model-release model-benchmarking scaling performance inference-costs aidan_mclau sama scaling01 juberti polynoamial reach_vb
Alibaba Qwen released their QwQ-32B model, a 32 billion parameter reasoning model using a novel two-stage reinforcement learning approach: first scaling RL for math and coding tasks with accuracy verifiers and code execution servers, then applying RL for general capabilities like instruction following and alignment. Meanwhile, OpenAI rolled out GPT-4.5 to Plus users, with mixed feedback on coding performance and noted inference cost improvements. The QwQ model aims to compete with larger MoE models like DeepSeek-R1. "GPT-4.5 is unusable for coding" was a notable user critique, while others praised its reasoning improvements due to scaling pretraining.
Google's Agent2Agent Protocol (A2A)
kimi-vl-a3b gpt-4o llama-4-scout llama-4-maverick llama-4-behemoth deepcoder-14b o3-mini o1 llama-3.1-nemotron-ultra-253b deepseek-r1 google google-deepmind moonshot-ai meta-ai-fair uc-berkeley openai nvidia hugging-face togethercompute deepseek agent-interoperability multimodality vision math reinforcement-learning coding model-training open-source model-benchmarking context-windows streaming push-notifications enterprise-authentication model-release reach_vb _akhaliq epochairesearch artificialanlys winglian danielhanchen yuchenj_uw jeremyphoward
Google Cloud Next announcements featured the launch of Google and DeepMind's full MCP support and a new Agent to Agent protocol designed for agent interoperability with multiple partners. The protocol includes components like the Agent Card, Task communication channels, Enterprise Auth and Observability, and Streaming and Push Notification support. On the model front, Moonshot AI released Kimi-VL-A3B, a multimodal model with 128K context and strong vision and math benchmark performance, outperforming gpt-4o. Meta AI introduced smaller versions of llama-4 family models: llama-4-scout and llama-4-maverick, with a larger Behemoth model still in training. DeepCoder 14B from UC Berkeley is an open-source coding model rivaling openai's o3-mini and o1 models, trained with reinforcement learning on 24K coding problems. Nvidia released llama-3.1-nemotron-ultra-253b on Hugging Face, noted for beating llama-4-behemoth and maverick and competing with deepseek-r1.
DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level
deepcoder-14b o3-mini o1 gemini-2.5-pro kimi-vl-a3b gpt-4o llama-4-scout maverick behemoth gen-4-turbo imagen-3 together-ai agentica opena bytedance google-deepmind moonshot-ai meta-ai-fair runway open-source reinforcement-learning code-generation multimodality model-training mixture-of-experts l2-normalization image-generation model-performance context-windows philschmid lepikhin reach_vb akhaliq yuchenj_uw epochairesearch danielhanchen c_valenzuelab
Together AI and Agentica released DeepCoder-14B, an open-source 14B parameter coding model rivaling OpenAI's o3-mini and o1 on coding benchmarks, trained with an open-source RL framework from ByteDance and costing about $26,880. Google DeepMind launched Gemini 2.5 Pro with experimental "Flash" versions available to subscribers. Moonshot AI introduced Kimi-VL-A3B, a multimodal model with 128K context outperforming gpt-4o on vision and math benchmarks. Meta AI released Llama 4 Scout and Maverick, with a larger Behemoth model in training, featuring mixture-of-experts and L2 norm techniques. Runway launched Gen-4 Turbo with 10x better results than Gen-3 at the same cost. Google announced Imagen 3, a high-quality text-to-image model now in Vertex AI, enabling easier object removal. The report highlights open-source contributions, reinforcement learning training optimizations, and significant model performance improvements across coding, multimodal, and image generation domains.
Llama 4's Controversial Weekend Release
llama-4 llama-3 llama-3-2 meta mixture-of-experts early-fusion attention-mechanisms fp8-training training-data benchmarking model-performance model-release multimodality open-models ahmad_al_dahle ylecun reach_vb yuchenj_uw
Meta released Llama 4, featuring two new medium-size MoE open models and a promised 2 Trillion parameter "behemoth" model, aiming to be the largest open model ever. The release included advanced training techniques like Chameleon-like early fusion with MetaCLIP, interleaved chunked attention without RoPE, native FP8 training, and training on up to 40 trillion tokens. Despite the hype, the release faced criticism for lack of transparency compared to Llama 3, implementation issues, and poor performance on some benchmarks. Meta leadership, including Ahmad Al Dahle, denied allegations of training on test sets. The smallest Scout model at 109B parameters is too large for consumer GPUs, and the claimed 10 million token context is disputed. The community response has been mixed, with some praising the openness and others pointing out discrepancies and quality concerns.
not much happened today
o3 o4-mini gpt-5 sonnet-3.7 gemma-3 qwen-2.5-vl gemini-2.5-pro gemma-7b llama-3-1-405b openai deepseek anthropic google meta-ai-fair inference-scaling reward-modeling coding-models ocr model-preview rate-limiting model-pricing architectural-advantage benchmarking long-form-reasoning attention-mechanisms mixture-of-experts gpu-throughput sama akhaliq nearcyan fchollet reach_vb philschmid teortaxestex epochairesearch omarsar0
OpenAI announced that o3 and o4-mini models will be released soon, with GPT-5 expected in a few months, delayed for quality improvements and capacity planning. DeepSeek introduced Self-Principled Critique Tuning (SPCT) to enhance inference-time scalability for generalist reward models. Anthropic's Sonnet 3.7 remains a top coding model. Google's Gemma 3 is available on KerasHub, and Qwen 2.5 VL powers a new Apache 2.0 licensed OCR model. Gemini 2.5 Pro entered public preview with increased rate limits and pricing announced, becoming a preferred model for many tasks except image generation. Meta's architectural advantage and the FrontierMath benchmark challenge AI's long-form reasoning and worldview development. Research reveals LLMs focus attention on the first token as an "attention sink," preserving representation diversity, demonstrated in Gemma 7B and LLaMa 3.1 models. MegaScale-Infer offers efficient serving of large-scale Mixture-of-Experts models with up to 1.90x higher per-GPU throughput.
Promptable Prosody, SOTA ASR, and Semantic VAD: OpenAI revamps Voice AI
gpt-4o-transcribe gpt-4o-mini-tts o1-pro kokoro-82m openai replicate speech-to-text text-to-speech voice-activity-detection prompt-engineering real-time-processing model-release api function-calling structured-outputs model-performance juberti sama reach_vb kevinweil omarsar0
OpenAI has launched three new state-of-the-art audio models in their API, including gpt-4o-transcribe, a speech-to-text model outperforming Whisper, and gpt-4o-mini-tts, a text-to-speech model with promptable prosody allowing control over timing and emotion. The Agents SDK now supports audio, enabling voice agents. OpenAI also updated turn detection for real-time voice activity detection (VAD) based on speech content. Additionally, OpenAI's o1-pro model is available to select developers with advanced features like vision and function calling, though at higher compute costs. The community shows strong enthusiasm for these audio advancements, with a radio contest for TTS creations underway. Meanwhile, Kokoro-82M v1.0 emerges as a leading open weights TTS model with competitive pricing on Replicate.
Every 7 Months: The Moore's Law for Agent Autonomy
claude-3-7-sonnet llama-4 phi-4-multimodal gpt-2 cosmos-transfer1 gr00t-n1-2b orpheus-3b metr nvidia hugging-face canopy-labs meta-ai-fair microsoft agent-autonomy task-completion multimodality text-to-speech robotics foundation-models model-release scaling-laws fine-tuning zero-shot-learning latency reach_vb akhaliq drjimfan scaling01
METR published a paper measuring AI agent autonomy progress, showing it has doubled every 7 months since 2019 (GPT-2). They introduced a new metric, the 50%-task-completion time horizon, where models like Claude 3.7 Sonnet achieve 50% success in about 50 minutes. Projections estimate 1 day autonomy by 2028 and 1 month autonomy by late 2029. Meanwhile, Nvidia released Cosmos-Transfer1 for conditional world generation and GR00T-N1-2B, an open foundation model for humanoid robot reasoning with 2B parameters. Canopy Labs introduced Orpheus 3B, a high-quality text-to-speech model with zero-shot voice cloning and low latency. Meta reportedly delayed Llama-4 release due to performance issues. Microsoft launched Phi-4-multimodal.
Cohere's Command A claims #3 open model spot (after DeepSeek and Gemma)
command-a mistral-ai-small-3.1 smoldocling qwen-2.5-vl cohere mistral-ai hugging-face context-windows multilinguality multimodality fine-tuning benchmarking ocr model-performance model-releases model-optimization aidangomez sophiamyang mervenoyann aidan_mclau reach_vb lateinteraction
Cohere's Command A model has solidified its position on the LMArena leaderboard, featuring an open-weight 111B parameter model with an unusually long 256K context window and competitive pricing. Mistral AI released the lightweight, multilingual, and multimodal Mistral AI Small 3.1 model, optimized for single RTX 4090 or Mac 32GB RAM setups, with strong performance on instruct and multimodal benchmarks. The new OCR model SmolDocling offers fast document reading with low VRAM usage, outperforming larger models like Qwen2.5VL. Discussions highlight the importance of system-level improvements over raw LLM advancements, and MCBench is recommended as a superior AI benchmark for evaluating model capabilities across code, aesthetics, and awareness.
Gemma 3 beats DeepSeek V3 in Elo, 2.0 Flash beats GPT4o with Native Image Gen
gemma-3 gemini-1.5-pro gemini-2 o1-preview o3-mini-high deepseek-v3 claude-3.7-sonnet qwen-2.5-max google-deepmind openai multimodality multilinguality context-window quantization image-generation model-benchmarking model-performance vision reach_vb _philschmid danielhanchen lmarena_ai osanseviero
Google DeepMind launched the Gemma 3 family of models featuring a 128k context window, multimodal input (image and video), and multilingual support for 140+ languages. The Gemma 3-27B model ranks among the top open models on LMArena benchmarks, outperforming several competitors and matching Gemini-1.5-Pro on benchmarks. Additionally, Gemini 2 introduced Flash Native Image Generation with advanced image editing capabilities, a feature teased by OpenAI but not launched. The updates highlight significant advances in context length, multimodality, and model efficiency via quantization.
The new OpenAI Agents Platform
reka-flash-3 o1-mini claude-3-7-sonnet llama-3-3-70b sonic-2 qwen-chat olympiccoder openai reka-ai hugging-face deepseek togethercompute alibaba ai-agents api model-releases fine-tuning reinforcement-learning model-training model-inference multimodality voice-synthesis gpu-clusters model-distillation performance-optimization open-source sama reach_vb
OpenAI introduced a comprehensive suite of new tools for AI agents, including the Responses API, Web Search Tool, Computer Use Tool, File Search Tool, and an open-source Agents SDK with integrated observability tools, marking a significant step towards the "Year of Agents." Meanwhile, Reka AI open-sourced Reka Flash 3, a 21B parameter reasoning model that outperforms o1-mini and powers their Nexus platform, with weights available on Hugging Face. The OlympicCoder series surpassed Claude 3.7 Sonnet and much larger models on competitive coding benchmarks. DeepSeek built a 32K GPU cluster capable of training V3-level models in under a week and is exploring AI distillation. Hugging Face announced Cerebras inference support, achieving over 2,000 tokens/s on Llama 3.3 70B, 70x faster than leading GPUs. Reka's Sonic-2 voice AI model delivers 40ms latency via the Together API. Alibaba's Qwen Chat enhanced its multimodal interface with video understanding up to 500MB, voice-to-text, guest mode, and expanded file uploads. Sama praised OpenAI's new API as "one of the most well-designed and useful APIs ever."
DeepSeek's Open Source Stack
qwen-qwq-32b start character-3 gemini gemini-2.0 mercury-coder gpt-4.5 jamba-mini-1.6 gemini-2.0-flash gpt-4o-mini mistral-small-3 mistral-ocr deepseek pyspur hugging-face togethercompute hedra-labs google-deepmind deeplearningai openai ai21-labs mistral-ai fine-tuning benchmarking multimodality code-generation diffusion-models model-performance model-optimization ocr embedding-models context-windows runtime-limits _akhaliq lmarena_ai reach_vb danielhanchen _philschmid aidan_mclau vikhyatk jerryjliu0
DeepSeek's Open Source Week was summarized by PySpur, highlighting multiple interesting releases. The Qwen QwQ-32B model was fine-tuned into START, excelling in PhD-level science QA and math benchmarks. Character-3, an omnimodal AI video generation model by Hedra Labs and Together AI, enables realistic animated content creation. Google DeepMind introduced the Gemini embedding model with an 8k context window, ranking #1 on MMTEB, alongside the Gemini 2.0 Code Executor supporting Python libraries and auto-fix features. Inception Labs' Mercury Coder is a diffusion-based code generation model offering faster token processing. OpenAI released GPT-4.5, their largest model yet but with less reasoning ability than some competitors. AI21 Labs launched Jamba Mini 1.6, noted for superior output speed compared to Gemini 2.0 Flash, GPT-4o mini, and Mistral Small 3. A new dataset of 1.9M scanned pages was released for OCR benchmarking, with Mistral OCR showing competitive but not top-tier document parsing performance compared to LLM/LVM-powered methods. "Cracked engineers are all you need."
not much happened today
aya-vision-8b aya-vision-32b llama-3-2-90b-vision molmo-72b phi-4-mini phi-4-multimodal cogview4 wan-2-1 weights-and-biases coreweave cohereforai microsoft alibaba google llamaindex weaviate multilinguality vision multimodality image-generation video-generation model-releases benchmarking funding agentic-ai model-performance mervenoyann reach_vb jayalammar sarahookr aidangomez nickfrosst dair_ai akhaliq bobvanluijt jerryjliu0
Weights and Biases announced a $1.7 billion acquisition by CoreWeave ahead of CoreWeave's IPO. CohereForAI released the Aya Vision models (8B and 32B parameters) supporting 23 languages, outperforming larger models like Llama-3.2 90B Vision and Molmo 72B. Microsoft introduced Phi-4-Mini (3.8B parameters) and Phi-4-Multimodal models, excelling in math, coding, and multimodal benchmarks. CogView4, a 6B parameter text-to-image model with 2048x2048 resolution and Apache 2.0 license, was released. Alibaba launched Wan 2.1, an open-source video generation model with 720p output and 16 fps generation. Google announced new AI features for Pixel devices including Scam Detection and Gemini integrations. LlamaCloud reached General Availability and raised $19M Series A funding, serving over 100 Fortune 500 companies. Weaviate launched the Query Agent, the first of three Weaviate Agents.
Anthropic's $61.5B Series E
gpt-4.5 claude-3.7-sonnet deepseek-r1 anthropic openai deepseek lmsys perplexity-ai deutsche-telekom model-performance benchmarking style-control coding multi-turn funding partnerships workflow lmarena_ai teortaxestex casper_hansen_ omarsar0 aidan_mclau willdepue vikhyatk teknim1 reach_vb _aidan_clark_ cto_junior aravsrinivas
Anthropic raised a $3.5 billion Series E funding round at a $61.5 billion valuation, signaling strong financial backing for the Claude AI model. GPT-4.5 achieved #1 rank across all categories on the LMArena leaderboard, excelling in multi-turn conversations, coding, math, creative writing, and style control. DeepSeek R1 tied with GPT-4.5 for top performance on hard prompts with style control. Discussions highlighted comparisons between GPT-4.5 and Claude 3.7 Sonnet in coding and workflow applications. The importance of the LMSYS benchmark was emphasized, though some questioned the relevance of benchmarks versus user acquisition. Additionally, Perplexity AI partnered with Deutsche Telekom to integrate the Perplexity Assistant into a new AI phone.
GPT 4.5 — Chonky Orion ships!
gpt-4.5 phi-4-multimodal phi-4-mini command-r7b-arabic openai microsoft cohere creative-writing natural-language-processing multimodality math coding context-windows model-releases open-source arabic-language sama kevinweil aidan_mclau omarsar0 rasbt reach_vb
OpenAI released GPT-4.5 as a research preview, highlighting its deep world knowledge, improved understanding of user intent, and a 128,000 token context window. It is noted for excelling in writing, creative tasks, image understanding, and data extraction but is not a reasoning model. Microsoft unveiled Phi-4 Multimodal and Phi-4 Mini, open-source models integrating text, vision, and speech/audio, with strong performance in math and coding tasks. Cohere released Command R7B Arabic, an open-weights model optimized for Arabic language capabilities targeting enterprises in the MENA region. The community is exploring the impact of larger models on creative writing, intent understanding, and world knowledge, with GPT-4.5 expected to be a basis for GPT-5.
lots of small launches
gpt-4o claude-3.7-sonnet claude-3.7 claude-3.5-sonnet deepseek-r1 deepseek-v3 grok-3 openai anthropic amazon cloudflare perplexity-ai deepseek-ai togethercompute elevenlabs elicitorg inceptionailabs mistral-ai voice model-releases cuda gpu-optimization inference open-source api model-performance token-efficiency context-windows cuda jit-compilation lmarena_ai alexalbert__ aravsrinivas reach_vb
GPT-4o Advanced Voice Preview is now available for free ChatGPT users with enhanced daily limits for Plus and Pro users. Claude 3.7 Sonnet has achieved the top rank in WebDev Arena with improved token efficiency. DeepSeek-R1 with 671B parameters benefits from the Together Inference platform optimizing NVIDIA Blackwell GPU usage, alongside the open-source DeepGEMM CUDA library delivering up to 2.7x speedups on Hopper GPUs. Perplexity launched a new Voice Mode and a Deep Research API. The upcoming Grok 3 API will support a 1M token context window. Several companies including Elicit, Amazon, Anthropic, Cloudflare, FLORA, Elevenlabs, and Inception Labs announced new funding rounds, product launches, and model releases.
not much happened today
claude-3.7-sonnet claude-3.7 deepseek-r1 o3-mini deepseek-v3 gemini-2.0-pro gpt-4o qwen2.5-coder-32b-instruct anthropic perplexity-ai amazon google-cloud deepseek_ai coding reasoning model-benchmarking agentic-workflows context-window model-performance open-source moe model-training communication-libraries fp8 nvlink rdma cli-tools skirano omarsar0 reach_vb artificialanlys terryyuezhuo _akhaliq _philschmid catherineols goodside danielhanchen
Claude 3.7 Sonnet demonstrates exceptional coding and reasoning capabilities, outperforming models like DeepSeek R1, O3-mini, and GPT-4o on benchmarks such as SciCode and LiveCodeBench. It is available on platforms including Perplexity Pro, Anthropic, Amazon Bedrock, and Google Cloud, with pricing at $3/$15 per million tokens. Key features include a 64k token thinking mode, 200k context window, and the CLI-based coding assistant Claude Code. Meanwhile, DeepSeek released DeepEP, an open-source communication library optimized for MoE model training and inference with support for NVLink, RDMA, and FP8. These updates highlight advancements in coding AI and efficient model training infrastructure.
not much happened today
grok-3 deepseek-r1 siglip-2 o3-mini-high r1-1776 llamba-1b llamba-3b llamba-8b llama-3 alphamaze audiobox-aesthetics xai nvidia google-deepmind anthropic openai bytedance ollama meta-ai-fair benchmarking model-releases performance reasoning multimodality semantic-understanding ocr multilinguality model-distillation recurrent-neural-networks visual-reasoning audio-processing scaling01 iscienceluvr philschmid arankomatsuzaki reach_vb mervenoyann wightmanr lmarena_ai ollama akhaliq
Grok-3, a new family of LLMs from xAI using 200,000 Nvidia H100 GPUs for advanced reasoning, outperforms models from Google, Anthropic, and OpenAI on math, science, and coding benchmarks. DeepSeek-R1 from ByteDance Research achieves top accuracy on the challenging SuperGPQA dataset. SigLIP 2 from GoogleDeepMind improves semantic understanding and OCR with flexible resolutions and multilingual capabilities, available on HuggingFace. OpenAI's o3-mini-high ranks #1 in coding and math prompts. Perplexity's R1 1776, a post-trained version of DeepSeek R1, is available on Ollama. The Llamba family distills Llama-3.x into efficient recurrent models with higher throughput. AlphaMaze combines DeepSeek R1 with GRPO for visual reasoning on ARC-AGI puzzles. Audiobox Aesthetics from Meta AI offers unified quality assessment for audio. The community notes that Grok 3's compute increase yields only modest performance gains.
The Ultra-Scale Playbook: Training LLMs on GPU Clusters
deepseek-native-sparse-attention r1-1776 paligemma-2-mix muse baichuan-m1-14b stripedhyena-2 huggingface deepseek perplexity-ai google-deepmind microsoft baichuan stripedhyena gpu-training scaling multimodality vision model-training foundation-models medical-llm genome-modeling robotic-manipulation interactive-content eliebakouch nouamanetazi lvwerra thom-wolf proftomyeh alex-wang aravsrinivas _akhaliq _philschmid mervenoyann reach_vb arankomatsuzaki maximelabonne
Huggingface released "The Ultra-Scale Playbook: Training LLMs on GPU Clusters," an interactive blogpost based on 4000 scaling experiments on up to 512 GPUs, providing detailed insights into modern GPU training strategies. DeepSeek introduced the Native Sparse Attention (NSA) model, gaining significant community attention, while Perplexity AI launched R1-1776, an uncensored and unbiased version of DeepSeek's R1 model. Google DeepMind unveiled PaliGemma 2 Mix, a multi-task vision-language model available in 3B, 10B, and 28B sizes. Microsoft introduced Muse, a generative AI model trained on the game Bleeding Edge, and presented Magma, a foundation model for multimodal AI agents excelling in UI navigation and robotic manipulation. Baichuan-M1-14B was announced as a state-of-the-art medical LLM trained on 20T tokens, and a fully open-source 40B genome modeling model using StripedHyena 2 architecture was also released. "Making your own gaming experience is coming sooner than you'd think," noted in relation to Muse.
LLaDA: Large Language Diffusion Models
llada-8b llama-3-8b step-video-t2v-30b step-audio-chat-132b llama-2-7b stepfun-ai scale-ai cambridge llamaindex diffusion-models text-generation multimodality video-generation voice-processing benchmarking instruction-following model-scaling gpu-usage long-context multi-turn-dialogue arankomatsuzaki _akhaliq omarsar0 iscienceluvr gallabytes maximelabonne reach_vb
LLaDA (Large Language Diffusion Model) 8B is a breakthrough diffusion-based language model that rivals LLaMA 3 8B while training on 7x fewer tokens (2 trillion tokens) and using 0.13 million H800 GPU hours. It introduces a novel text generation approach by predicting uniformly masked tokens in a diffusion process, enabling multi-turn dialogue and instruction-following. Alongside, StepFun AI released two major models: Step-Video-T2V 30B, a text-to-video model generating up to 204 frames with high coherence and motion quality, and Step-Audio-Chat 132B, a voice-to-voice model. Additionally, challenging multimodal benchmarks like Scale AI's EnigmaEval and Cambridge's ZeroBench highlight current frontier models scoring zero, emphasizing the difficulty of these tasks. The community also noted the return of diffusion models in language modeling, a previously speculative architecture now scaled successfully.
not much happened today
gemini-2.0-flash-thinking-experimental-1-21 zonos openr1-math-220k huginn-3.5b deepseek-r1 o1 claude google zyphraai hugging-face anthropic deepseek openai vision multilingual-models text-to-speech voice-cloning math reasoning latent-reasoning chain-of-thought dataset-release fine-tuning model-training model-performance context-windows benchmarking jeremyphoward andrej-karpathy tom-goldstein reach_vb iscienceluvr
Google released Gemini 2.0 Flash Thinking Experimental 1-21, a vision-language reasoning model with a 1 million-token context window and improved accuracy on science, math, and multimedia benchmarks, surpassing DeepSeek-R1 but trailing OpenAI's o1. ZyphraAI launched Zonos, a multilingual Text-to-Speech model with instant voice cloning and controls for speaking rate, pitch, and emotions, running at ~2x real-time speed on RTX 4090. Hugging Face released OpenR1-Math-220k, a large-scale math reasoning dataset with 220K problems and 800K reasoning traces generated on 512 H100 GPUs. Tom Goldstein introduced Huginn-3.5B, an open-source latent reasoning model trained on 800B tokens that outperforms larger models on reasoning tasks like GSM8K. Discussions by Jeremy Howard and iScienceLuvr highlight advances in implicit latent reasoning and debate the future of human-readable reasoning traces. Anthropic launched the Anthropic Economic Index to analyze AI's economic impact using millions of Claude conversations.
Mistral Small 3 24B and Tulu 3 405B
mistral-small-3 tulu-3-405b llama-3 tiny-swallow-1.5b qwen-2.5-max deepseek-v3 claude-3.5-sonnet gemini-1.5-pro gpt4o-mini llama-3-3-70b mistral-ai ai2 sakana-ai alibaba_qwen deepseek ollama llamaindex reinforcement-learning model-fine-tuning local-inference model-performance model-optimization on-device-ai instruction-following api training-data natural-language-processing clementdelangue dchaplot reach_vb
Mistral AI released Mistral Small 3, a 24B parameter model optimized for local inference with low latency and 81% accuracy on MMLU, competing with Llama 3.3 70B, Qwen-2.5 32B, and GPT4o-mini. AI2 released Tülu 3 405B, a large finetuned model of Llama 3 using Reinforcement Learning from Verifiable Rewards (RVLR), competitive with DeepSeek v3. Sakana AI launched TinySwallow-1.5B, a Japanese language model using TAID for on-device use. Alibaba_Qwen released Qwen 2.5 Max, trained on 20 trillion tokens, with performance comparable to DeepSeek V3, Claude 3.5 Sonnet, and Gemini 1.5 Pro, and updated API pricing. These releases highlight advances in open models, efficient inference, and reinforcement learning techniques.
not much happened today
deepseek-r1 qwen-2.5 qwen-2.5-max deepseek-v3 deepseek-janus-pro gpt-4 nvidia anthropic openai deepseek huawei vercel bespoke-labs model-merging multimodality reinforcement-learning chain-of-thought gpu-optimization compute-infrastructure compression crypto-api image-generation saranormous zizhpan victormustar omarsar0 markchen90 sakanaailabs reach_vb madiator dain_mclau francoisfleuret garygodchaux arankomatsuzaki id_aa_carmack lavanyasant virattt
Huawei chips are highlighted in a diverse AI news roundup covering NVIDIA's stock rebound, new open music foundation models like Local Suno, and competitive AI models such as Qwen 2.5 Max and Deepseek V3. The release of DeepSeek Janus Pro, a multimodal LLM with image generation capabilities, and advancements in reinforcement learning and chain-of-thought reasoning are noted. Discussions include GPU rebranding with NVIDIA's H6400 GPUs, data center innovations, and enterprise AI applications like crypto APIs in hedge funds. "Deepseek R1's capabilities" and "Qwen 2.5 models added to applications" are key highlights.
TinyZero: Reproduce DeepSeek R1-Zero for $30
deepseek-r1 qwen o1 claude-3-sonnet claude-3 prime ppo grpo llama-stack deepseek berkeley hugging-face meta-ai-fair openai deeplearningai reinforcement-learning fine-tuning chain-of-thought multi-modal-benchmark memory-management model-training open-source agentic-workflow-automation model-performance jiayi-pan saranormous reach_vb lmarena_ai nearcyan omarsar0 philschmid hardmaru awnihannun winglian
DeepSeek Mania continues to reshape the frontier model landscape with Jiayi Pan from Berkeley reproducing the OTHER result from the DeepSeek R1 paper, R1-Zero, in a cost-effective Qwen model fine-tune for two math tasks. A key finding is a lower bound to the distillation effect at 1.5B parameters, with RLCoT reasoning emerging as an intrinsic property. Various RL techniques like PPO, DeepSeek's GRPO, or PRIME show similar outcomes, and starting from an Instruct model speeds convergence. The Humanity’s Last Exam (HLE) Benchmark introduces a challenging multi-modal test with 3,000 expert-level questions across 100+ subjects, where models perform below 10%, with DeepSeek-R1 achieving 9.4%. DeepSeek-R1 excels in chain-of-thought reasoning, outperforming models like o1 while being 20x cheaper and MIT licensed. The WebDev Arena Leaderboard ranks DeepSeek-R1 #2 in technical domains and #1 under Style Control, closing in on Claude 3.5 Sonnet. OpenAI's Operator is deployed to 100% of Pro users in the US, enabling tasks like ordering meals and booking reservations, and functions as a research assistant for AI paper searches and summaries. Hugging Face announces a leadership change after significant growth, while Meta AI releases the first stable version of Llama Stack with streamlined upgrades and automated verification. DeepSeek-R1's open-source success is celebrated, and technical challenges like memory management on macOS 15+ are addressed with residency sets in MLX for stability.
not much happened today
oute-tts-0.3-1b oute-tts-0.3-500m olm-1b qwen-2.5-0.5b hover gpt-4o deepseek-v3 harvey meta-ai-fair stability-ai alibaba deepseek hugging-face text-to-speech zero-shot-learning multilinguality emotion-control motor-control reinforcement-learning local-ai distributed-inference pipeline-parallelism mathematical-reasoning process-reward-models legal-ai education-ai ai-security humor reach_vb drjimfan vikhyatk mervenoyann aiatmeta iscienceluvr alibaba_qwen awnihannun ajeya_cotra emollick qtnx_ designerx
Harvey secured a new $300M funding round. OuteTTS 0.3 1B & 500M text-to-speech models were released featuring zero-shot voice cloning, multilingual support (en, jp, ko, zh, fr, de), and emotion control, powered by OLMo-1B and Qwen 2.5 0.5B. The HOVER model, a 1.5M-parameter neural net for agile motor control, was introduced, leveraging human motion capture datasets and massively parallel reinforcement learning. kokoro.js enables running AI models locally in browsers with minimal dependencies. Meta AI awarded $200K LLM evaluation grants for projects on regional language understanding, complex reasoning, and interactive programming environments. Stability AI's Twitter account was hacked, prompting security warnings. Alibaba Qwen improved Process Reward Models (PRMs) for better mathematical reasoning using a consensus filtering mechanism. DeepSeek V3 uses pipeline parallelism to enhance distributed inference and long-context generation efficiency. Discussions on AI policy in legal frameworks and AI's role in democratizing education were highlighted. Lighthearted AI-related humor was also shared.
not much happened today
helium-1 qwen-2.5 phi-4 sky-t1-32b-preview o1 codestral-25.01 phi-3 mistral llama-3 gpt-3.5 llama-3 gpt-3.5 llmquoter kyutai-labs lmstudio mistralai llamaindex huggingface langchainai hyperbolic-labs replit fchollet philschmid multilinguality token-level-distillation context-windows model-performance open-source reasoning coding retrieval-augmented-generation hybrid-retrieval multiagent-systems video large-video-language-models dynamic-ui voice-interaction gpu-rentals model-optimization semantic-deduplication model-inference reach_vb awnihannun lior_on_ai sophiamyang omarsar0 skirano yuchenj_uw fchollet philschmid
Helium-1 Preview by kyutai_labs is a 2B-parameter multilingual base LLM outperforming Qwen 2.5, trained on 2.5T tokens with a 4096 context size using token-level distillation from a 7B model. Phi-4 (4-bit) was released in lmstudio on an M4 max, noted for speed and performance. Sky-T1-32B-Preview is a $450 open-source reasoning model matching o1's performance with strong benchmark scores. Codestral 25.01 by mistralai is a new SOTA coding model supporting 80+ programming languages and offering 2x speed.
Innovations include AutoRAG for optimizing retrieval-augmented generation pipelines, Agentic RAG for autonomous query reformulation and critique, Multiagent Finetuning using societies of models like Phi-3, Mistral, LLaMA-3, and GPT-3.5 for reasoning improvements, and VideoRAG incorporating video content into RAG with LVLMs.
Applications include a dynamic UI AI chat app by skirano on Replit, LangChain tools like DocTalk for voice PDF conversations, AI travel agent tutorials, and news summarization agents. Hyperbolic Labs offers competitive GPU rentals including H100, A100, and RTX 4090. LLMQuoter enhances RAG accuracy by identifying key quotes.
Infrastructure updates include MLX export for LLM inference from Python to C++ by fchollet and SemHash semantic text deduplication by philschmid.
Moondream 2025.1.9: Structured Text, Enhanced OCR, Gaze Detection in a 2B Model
o1 vdr-2b-multi-v1 llava-mini openai llamaindex langchainai qdrant genmoai vision model-efficiency structured-output gaze-detection reasoning model-distillation multimodality embedding-models gan diffusion-models self-attention training-optimizations development-frameworks api cross-language-deployment semantic-search agentic-document-processing developer-experience philschmid saranormous jxmnop reach_vb iscienceluvr multimodalart arohan adcock_brett awnihannun russelljkaplan ajayj_
Moondream has released a new version that advances VRAM efficiency and adds structured output and gaze detection, marking a new frontier in vision model practicality. Discussions on Twitter highlighted advancements in reasoning models like OpenAI's o1, model distillation techniques, and new multimodal embedding models such as vdr-2b-multi-v1 and LLaVA-Mini, which significantly reduce computational costs. Research on GANs and decentralized diffusion models showed improved stability and performance. Development tools like MLX and vLLM received updates for better portability and developer experience, while frameworks like LangChain and Qdrant enable intelligent data workflows. Company updates include new roles and team expansions at GenmoAI. "Efficiency tricks are all you need."
not much happened today
rstar-math o1-preview qwen2.5-plus qwen2.5-coder-32b-instruct phi-4 claude-3.5-sonnet openai anthropic alibaba microsoft cohere langchain weights-biases deepseek rakuten rbc amd johns-hopkins math process-reward-model mcts vision reasoning synthetic-data pretraining rag automation private-deployment multi-step-workflow open-source-dataset text-embeddings image-segmentation chain-of-thought multimodal-reasoning finetuning recursive-self-improvement collaborative-platforms ai-development partnerships cuda triton ai-efficiency ai-assisted-coding reach_vb rasbt akshaykagrawal arankomatsuzaki teortaxestex aidangomez andrewyng
rStar-Math surpasses OpenAI's o1-preview in math reasoning with 90.0% accuracy using a 7B LLM and MCTS with a Process Reward Model. Alibaba launches Qwen Chat featuring Qwen2.5-Plus and Qwen2.5-Coder-32B-Instruct models enhancing vision-language and reasoning. Microsoft releases Phi-4, trained on 40% synthetic data with improved pretraining. Cohere introduces North, a secure AI workspace integrating LLMs, RAG, and automation for private deployments. LangChain showcases a company research agent with multi-step workflows and open-source datasets. Transformers.js demos released for text embeddings and image segmentation in JavaScript. Research highlights include Meta Meta-CoT for enhanced chain-of-thought reasoning, DeepSeek V3 with recursive self-improvement, and collaborative AI development platforms. Industry partnerships include Rakuten with LangChain, North with RBC supporting 90,000 employees, and Agent Laboratory collaborating with AMD and Johns Hopkins. Technical discussions emphasize CUDA and Triton for AI efficiency and evolving AI-assisted coding stacks by Andrew Ng.
Common Corpus: 2T Open Tokens with Provenance
qwen-2.5-coder claude-3.5-sonnet janusflow-1.3b ocronos-vintage pleais huggingface langchainai deepseek alibaba anthropic provenance ocr multilingual-datasets prompt-engineering multimodality image-generation code-generation quantization model-scaling inference-efficiency tim-dettmers tom-doerr omarsar0 swyx madiator reach_vb
Pleais via Huggingface released Common Corpus, the largest fully open multilingual dataset with over 2 trillion tokens including detailed provenance information. They also introduced OCRonos-Vintage, a 124M-parameter OCR correction model that efficiently fixes digitization errors on CPU and GPU, unlocking knowledge from PDFs. On AI tools, LangChainAI launched Prompt Canvas for collaborative prompt engineering, while DeepSeek released JanusFlow 1.3B, a unified multimodal LLM integrating autoregressive and rectified flow models for enhanced image understanding and generation. Alibaba Cloud announced Qwen2.5-Coder, a code-focused LLM with advanced coding capabilities, and Claude 3.5 Sonnet was highlighted for superior code generation. Discussions on quantization challenges and scaling laws for precision by Tim Dettmers and others emphasized the impact of low-precision training on model scalability and inference efficiency. "Scaling Laws for Precision" paper insights and alternative efficiency methods were also noted.
not much happened this weekend
claude-3.5-sonnet llama-3 llama-3-8b notebookllama min-omni-2 moondream openai anthropic hugging-face mistral-ai google-deepmind langchain deepmind microsoft pattern-recognition reinforcement-learning prompt-optimization text-to-speech model-optimization tensor-parallelism hyperparameters multimodal modal-alignment multimodal-fine-tuning ai-productivity privacy generative-ai rag retrieval-augmentation enterprise-text-to-sql amanda-askell philschmid stasbekman francois-fleuret mervenoyann reach_vb dzhng aravsrinivas sama lateinteraction andrew-y-ng bindureddy jerryjliu0
Moondream, a 1.6b vision language model, secured seed funding, highlighting a trend in moon-themed tiny models alongside Moonshine (27-61m ASR model). Claude 3.5 Sonnet was used for AI Twitter recaps. Discussions included pattern recognition vs. intelligence in LLMs, reinforcement learning for prompt optimization, and NotebookLlama, an open-source NotebookLM variant using LLaMA models for tasks like text-to-speech. Advances in model optimization with async-TP in PyTorch for tensor parallelism and hyperparameter tuning were noted. Mini-Omni 2 demonstrated multimodal capabilities across image, audio, and text for voice conversations with emphasis on modal alignment and multimodal fine-tuning. AI productivity tools like an AI email writer and LlamaCloud-based research assistants were introduced. Emphasis on practical skill development and privacy-conscious AI tool usage with Llama3-8B was highlighted. Generative AI tools such as #AIPythonforBeginners and GenAI Agents with LangGraph were shared. Business insights covered rapid execution in AI product development and emerging AI-related job roles. Challenges in enterprise-grade text-to-SQL and advanced retrieval methods were discussed with tutorials on RAG applications using LangChain and MongoDB.
Did Nvidia's Nemotron 70B train on test?
nemotron-70b llama-3.1-70b llama-3.1 ministral-3b ministral-8b gpt-4o claude-3.5-sonnet claude-3.5 nvidia mistral-ai hugging-face zep benchmarking reinforcement-learning reward-models temporal-knowledge-graphs memory-layers context-windows model-releases open-source reach_vb philschmid swyx
NVIDIA's Nemotron-70B model has drawn scrutiny despite strong benchmark performances on Arena Hard, AlpacaEval, and MT-Bench, with some standard benchmarks like GPQA and MMLU Pro showing no improvement over the base Llama-3.1-70B. The new HelpSteer2-Preference dataset improves some benchmarks with minimal losses elsewhere. Meanwhile, Mistral released Ministral 3B and 8B models featuring 128k context length and outperforming Llama-3.1 and GPT-4o on various benchmarks under the Mistral Commercial License. NVIDIA's Nemotron 70B also surpasses GPT-4o and Claude-3.5-Sonnet on key benchmarks using RLHF (REINFORCE) training. Additionally, Zep introduced Graphiti, an open-source temporal knowledge graph memory layer for AI agents, built on Neo4j.
Pixtral 12B: Mistral beats Llama to Multimodality
pixtral-12b mistral-nemo-12b llama-3-1-70b llama-3-1-8b deeps-eek-v2-5 gpt-4-turbo llama-3-1 strawberry claude mistral-ai meta-ai-fair hugging-face arcee-ai deepseek-ai openai anthropic vision multimodality ocr benchmarking model-release model-architecture model-performance fine-tuning model-deployment reasoning code-generation api access-control reach_vb devendra_chapilot _philschmid rohanpaul_ai
Mistral AI released Pixtral 12B, an open-weights vision-language model with a Mistral Nemo 12B text backbone and a 400M vision adapter, featuring a large vocabulary of 131,072 tokens and support for 1024x1024 pixel images. This release notably beat Meta AI in launching an open multimodal model. At the Mistral AI Summit, architecture details and benchmark performances were shared, showing strong OCR and screen understanding capabilities. Additionally, Arcee AI announced SuperNova, a distilled Llama 3.1 70B & 8B model outperforming Meta's Llama 3.1 70B instruct on benchmarks. DeepSeek released DeepSeek-V2.5, scoring 89 on HumanEval, surpassing GPT-4-Turbo, Opus, and Llama 3.1 in coding tasks. OpenAI plans to release Strawberry as part of ChatGPT soon, though its capabilities are debated. Anthropic introduced Workspaces for managing multiple Claude deployments with enhanced access controls.
super quiet day
jamba-1.5 phi-3.5 dracarys llama-3-1-70b llama-3-1 ai21-labs anthropic stanford hugging-face langchain qdrant aws elastic state-space-models long-context benchmarking ai-safety virtual-environments multi-agent-systems resource-management community-engagement model-performance bindu-reddy rohanpaul_ai jackclarksf danhendrycks reach_vb iqdotgraph
AI21 Labs released Jamba 1.5, a scaled-up State Space Model optimized for long context windows with 94B parameters and up to 2.5X faster inference, outperforming models like Llama 3.1 70B on benchmarks. The Phi-3.5 model was praised for its safety and performance, while Dracarys, a new 70B open-source coding model announced by Bindu Reddy, claims superior benchmarks over Llama 3.1 70B. Discussions on California's SB 1047 AI safety legislation involve Stanford and Anthropic, highlighting a balance between precaution and industry growth. Innovations include uv virtual environments for rapid setup, LangChain's LangSmith resource tags for project management, and multi-agent systems in Qdrant enhancing data workflows. Community events like the RAG workshop by AWS, LangChain, and Elastic continue to support AI learning and collaboration. Memes remain a popular way to engage with AI industry culture.
Apple Intelligence Beta + Segment Anything Model 2
llama-3-405b llama-3 segment-anything-model meta-ai-fair apple image-segmentation memory-attention video-processing pretraining cloud-tpus post-training synthetic-data instruction-following reasoning writing benchmarking bindureddy maximelabonne reach_vb
Meta advanced its open source AI with a sequel to the Segment Anything Model, enhancing image segmentation with memory attention for video applications using minimal data and compute. Apple Intelligence delayed its official release to iOS 18.1 in October but launched developer previews on MacOS Sequoia, iOS 18, and iPadOS 18, accompanied by a detailed 47-page paper revealing extensive pretraining on 6.3T tokens and use of Cloud TPUs rather than Apple Silicon. The paper highlights improvements in instruction following, reasoning, and writing through post-training and synthetic data. Benchmarks show Apple’s model scores lower than Llama 3, but with trusted human evaluations. Additionally, Meta released Llama 3.1 with a 405B parameter model, marking a significant open-source frontier model release.