Topic: "speech-recognition"
not much happened today
kimi-k2 gpt-4.1 voxtral goedel-prover-v2 llama-3 mistral-ai moonshot-ai nous-research google-deepmind openai groq anthropic speech-recognition mixture-of-experts benchmarking dataset-release model-architecture theorem-proving reinforcement-learning asymmetry-of-verification inference-speed model-performance cline _jasonwei
Mistral released Voxtral, which it claims are the world's best open speech recognition models, available via API and Hugging Face. Moonshot AI launched Kimi K2, a trillion-parameter Mixture-of-Experts (MoE) model that outperforms GPT-4.1 on benchmarks, scoring 65.4% on SWE-Bench Verified and reaching 200 tokens/second inference speed on Groq hardware. Nous Research open-sourced the Hermes 3 dataset with 1 million samples, which supports SOTA models on the Llama-3 series. Google DeepMind introduced the Mixture-of-Recursions (MoR) architecture, which promises 2x inference speed and a 50% parameter reduction but was met with skepticism. Goedel-Prover V2 topped the PutnamBench theorem-proving benchmark. The AtCoder World Finals saw a human winner, with OpenAI placing second. Research highlights include Jason Wei's insights on reinforcement learning and the "Verifier's Law", which emphasizes the asymmetry of verification in AI training.
minor ai followups: MultiAgents, Meta-SSI-Scale, Karpathy, AI Engineer
gpt-4o afm-4.5b gemma qwen stt-1b-en_fr stt-2.6b-en hunyuan-3d-2.1 openai meta-ai-fair scale-ai huggingface tencent arcee-ai ai-safety alignment ai-regulation memory-optimization scalable-oversight speech-recognition 3d-generation foundation-models sama polynoamial neelnanda5 teortaxestex yoshua_bengio zachtratar ryanpgreenblatt reach_vb arankomatsuzaki code_star
OpenAI released a paper revealing how training models like GPT-4o on insecure code can cause broad misalignment, drawing reactions from experts like @sama and @polynoamial. California's AI regulation efforts were highlighted by @Yoshua_Bengio, who emphasized transparency and whistleblower protections. The term "context rot" was coined to describe the degradation of long LLM conversations, with systems like Embra using CRM-like memory for robustness. Scalable oversight research, which aims to improve human control over smarter AIs, was discussed by @RyanPGreenblatt. New model releases include Kyutai's speech-to-text models, capable of 400 real-time streams on a single H100 GPU; Tencent's Hunyuan 3D 2.1, the first open-source production-ready PBR 3D generative model; and Arcee's AFM-4.5B foundation model family, which targets enterprise use and is competitive with Gemma and Qwen.
Gemini 2.5 Pro Preview 05-06 (I/O edition) - the SOTA vision+coding model
gemini-2.5-pro claude-3.7-sonnet llama-nemotron qwen3 google-deepmind nvidia alibaba hugging-face multimodality coding reasoning model-release speech-recognition recommender-systems benchmarking demishassabis _philschmid lmarena_ai scaling01 fchollet
Gemini 2.5 Pro has been updated with enhanced multimodal image-to-code capabilities and dominates the WebDev Arena Leaderboard, surpassing Claude 3.7 Sonnet in coding and other tasks. Nvidia released the Llama-Nemotron model family on Hugging Face, noted for efficient reasoning and inference. Alibaba's Qwen3 models range from 0.6B to 235B parameters, including dense and MoE variants. KerasRS was released by François Chollet as a new recommender system library compatible with JAX, PyTorch, and TensorFlow, optimized for TPUs. These updates highlight advancements in coding, reasoning, and speech recognition models.
OpenAI adopts MCP
gemini-2.5-pro gemini-1.5-pro gemini-2.0-flash qwen-2.5-omni-7b deepseek-v3-0324 deepseek-r1 openai google-deepmind alibaba togethercompute model-benchmarking multimodality reasoning scaling-laws model-quantization synthetic-data model-performance context-windows speech-recognition translation audio-processing video-processing swyx
OpenAI announced support for the Model Context Protocol (MCP), a significant technical update. Google's Gemini 2.5 Pro leads benchmarks with top scores in MMLU-Pro (86%), GPQA Diamond (83%), and AIME 2024 (88%), and features a 1 million token context window and multimodal inputs. Alibaba's Qwen 2.5 Omni 7B was released as a fully multimodal, interactive, open-source model with a novel "thinker-talker" architecture supporting voice and video chat. DeepSeek V3-0324 outperforms its predecessor on multiple benchmarks. Research on reasoning features in large language models using sparse autoencoders was highlighted, alongside a study on scaling laws of synthetic data showing that performance plateaus near 300B tokens. Discussions also covered the fastest output speeds of Gemini models and concerns about over-reliance on benchmarks for measuring intelligence. Swyx will curate the Data Council AI Engineering Track in April.
a quiet weekend
sam-2 qwen2-math gpt-4 claude-3.5 figure deepmind boston-dynamics alibaba llamaindex robotics object-segmentation real-time-processing disease-prediction speech-recognition cli-tools model-performance adcock_brett rasbt hamel-husain rohanpaul_ai
Figure unveiled Figure 02, claimed as the most advanced humanoid robot, operating autonomously at BMW's Plant Spartanburg. DeepMind developed a table tennis robot that won 100% of its matches against beginners and 55% against intermediate players. Boston Dynamics showcased the dexterity of its fully electric Atlas robot performing pushups and burpees. An autonomous dental robot performed the world's first fully automated dental procedure on a human, reducing a 2-hour process to 15 minutes using a 3D volumetric scanner. SAM 2 was introduced as an open model for real-time object segmentation without custom adaptation. Alibaba released Qwen2-Math, outperforming GPT-4 and Claude 3.5 in math capabilities. A new Listening-While-Speaking Language Model (LSLM) enables simultaneous listening and speaking in real time. Researchers developed a disease prediction AI with 95% accuracy for diseases like coronary artery disease, type 2 diabetes, and breast cancer. Tools like the LlamaParse CLI and the MLX Whisper package enhance PDF parsing and speech recognition respectively, with the latter running 40x faster than real time on an M1 Max. The news highlights significant advancements in robotics, AI models, and practical AI tools.
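To make the "40x faster than real time" figure concrete, here is a minimal Python sketch of the arithmetic behind such a speedup. The function names and the one-hour recording are hypothetical, chosen only for illustration; they are not part of the MLX Whisper package or any benchmark cited above.

```python
def speedup_over_realtime(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many times faster than real time a transcription ran."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time_at(audio_seconds: float, speedup: float) -> float:
    """Return the wall-clock seconds needed to transcribe audio at a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# Hypothetical example: a one-hour recording at the 40x figure quoted above.
one_hour = 60 * 60
print(processing_time_at(one_hour, 40.0))  # 90.0 seconds
```

Under these assumptions, an hour of audio would take about a minute and a half to transcribe; actual throughput depends on the model size and hardware.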
GPT-4o: the new SOTA-EVERYTHING Frontier model (GPT4O version)
gpt-4o gpt-4-turbo openai lmsys multion adept multimodality vision speech-recognition tokenization real-time-processing coding model-performance model-optimization desktop-agents sama gdb
OpenAI has released GPT-4o, a new multimodal model capable of reasoning across text, audio, and video in real time with low latency (~300ms). It features voice and vision capabilities, improved non-English language performance with an expanded 200k vocabulary tokenizer, and is available to all ChatGPT users, including those on free plans. GPT-4o is half the price and twice as fast as GPT-4-turbo, with 5x higher rate limits. The model supports real-time voice and video input/output and shows strong coding capabilities. The release includes a new desktop app that can read the screen and clipboard history, challenging existing desktop agent startups. The announcement was accompanied by demos including image generation and 3D object handling, with OpenAI achieving state-of-the-art performance in ASR and vision tasks. The update was widely discussed on social media, with comparisons to GPT-4T highlighting GPT-4o's speed and versatility. Reactions described it as "smart, fast, natively multimodal, and a step towards more natural human-computer interaction" and "extremely versatile and fun to play with".