All tags
Topic: "voice"
Apple exposes Foundation Models API and... no new Siri
chatgpt apple openai langchain llamaindex on-device-ai foundation-models reasoning reinforcement-learning voice translation software-automation agentic-workflows gdb scaling01 giffmana kevinweil
Apple released on-device foundation models for iOS developers, though their recent "Illusion of Reasoning" paper faced significant backlash for flawed methodology regarding LLM reasoning. OpenAI updated ChatGPT's Advanced Voice Mode with more natural voice and improved translation, demonstrated by Greg Brockman. LangChain and LlamaIndex launched new AI agents and tools, including a SWE Agent for software automation and an Excel agent using reinforcement learning for data transformation. The AI community engaged in heated debate over reasoning capabilities of LLMs, highlighting challenges in evaluation methods.
OpenAI buys Jony Ive's io for $6.5b, LMArena lands $100m seed from a16z
gemini-2.5-pro gemini-diffusion openai lmarena a16z mistral-ai google google-deepmind multimodality reasoning code-generation math model-fine-tuning ai-assistants voice memory-optimization sundar_pichai
OpenAI confirmed a partnership with Jony Ive to develop consumer hardware. LMArena secured a $100 million seed round from a16z. Mistral launched a new code model fine-tune. Google DeepMind announced multiple updates at Google I/O 2024, including over a dozen new models and 20 AI products. Key highlights include the release of Gemini 2.5 Pro and Gemini Diffusion, featuring advanced multimodal reasoning, coding, and math capabilities, and integration of Gemini in Google Chrome as an AI browsing assistant. Deep Think enhanced reasoning mode and Project Astra improvements were also introduced, focusing on voice output, memory, and computer control for a universal AI assistant.
AI Engineer World's Fair: Second Run, Twice The Fun
gemini-2.5-pro google-deepmind waymo tesla anthropic braintrust retrieval-augmentation graph-databases recommendation-systems software-engineering-agents agent-reliability reinforcement-learning voice image-generation video-generation infrastructure security evaluation ai-leadership enterprise-ai mcp tiny-teams product-management design-engineering robotics foundation-models coding web-development demishassabis
The 2025 AI Engineer World's Fair is expanding with 18 tracks covering topics like Retrieval + Search, GraphRAG, RecSys, SWE-Agents, Agent Reliability, Reasoning + RL, Voice AI, Generative Media, Infrastructure, Security, and Evals. New focuses include MCP, Tiny Teams, Product Management, Design Engineering, and Robotics and Autonomy featuring foundation models from Waymo, Tesla, and Google. The event highlights the growing importance of AI Architects and enterprise AI leadership. Additionally, Demis Hassabis announced the Gemini 2.5 Pro Preview 'I/O edition', which leads coding and web development benchmarks on LMArena.
lots of small launches
gpt-4o claude-3.7-sonnet claude-3.7 claude-3.5-sonnet deepseek-r1 deepseek-v3 grok-3 openai anthropic amazon cloudflare perplexity-ai deepseek-ai togethercompute elevenlabs elicitorg inceptionailabs mistral-ai voice model-releases cuda gpu-optimization inference open-source api model-performance token-efficiency context-windows cuda jit-compilation lmarena_ai alexalbert__ aravsrinivas reach_vb
GPT-4o Advanced Voice Preview is now available for free ChatGPT users with enhanced daily limits for Plus and Pro users. Claude 3.7 Sonnet has achieved the top rank in WebDev Arena with improved token efficiency. DeepSeek-R1 with 671B parameters benefits from the Together Inference platform optimizing NVIDIA Blackwell GPU usage, alongside the open-source DeepGEMM CUDA library delivering up to 2.7x speedups on Hopper GPUs. Perplexity launched a new Voice Mode and a Deep Research API. The upcoming Grok 3 API will support a 1M token context window. Several companies including Elicit, Amazon, Anthropic, Cloudflare, FLORA, Elevenlabs, and Inception Labs announced new funding rounds, product launches, and model releases.
not much happened today
o1-preview o1-mini qwen-2.5 gpt-4o deepseek-v2.5 gpt-4-turbo-2024-04-09 grin llama-3-1-405b veo kat openai qwen deepseek-ai microsoft kyutai-labs perplexity-ai together-ai meta-ai-fair google-deepmind hugging-face google anthropic benchmarking math coding instruction-following model-merging model-expressiveness moe voice voice-models generative-video competition open-source model-deployment ai-agents hyung-won-chung noam-brown bindureddy akhaliq karpathy aravsrinivas fchollet cwolferesearch philschmid labenz ylecun
OpenAI's o1-preview and o1-mini models lead benchmarks in Math, Hard Prompts, and Coding. Qwen 2.5 72B model shows strong performance close to GPT-4o. DeepSeek-V2.5 tops Chinese LLMs, rivaling GPT-4-Turbo-2024-04-09. Microsoft's GRIN MoE achieves good results with 6.6B active parameters. Moshi voice model from Kyutai Labs runs locally on Apple Silicon Macs. Perplexity app introduces voice mode with push-to-talk. LlamaCoder by Together.ai uses Llama 3.1 405B for app generation. Google DeepMind's Veo is a new generative video model for YouTube Shorts. The 2024 ARC-AGI competition increases prize money and plans a university tour. A survey on model merging covers 50+ papers for LLM alignment. The Kolmogorov–Arnold Transformer (KAT) paper proposes replacing MLP layers with KAN layers for better expressiveness. Hugging Face Hub integrates with Google Cloud Vertex AI Model Garden for easier open-source model deployment. Agent.ai is introduced as a professional network for AI agents. "Touching grass is all you need."
The DSPy Roadmap
dspy litel-lm gemini chatgpt-4o grok-2 hermes-3 databricks mit google openai x-ai nous-research astribot apple sakana-ai model-optimization fine-tuning optimizers interactive-optimization robotics autonomous-systems voice image-generation open-source-models scientific-research streaming caching omar-khattab giffmana
Omar Khattab announced joining Databricks before his MIT professorship and outlined the roadmap for DSPy 2.5 and 3.0+, focusing on improving core components like LMs, signatures, optimizers, and assertions with features such as adopting LiteLLM to reduce code and enhance caching and streaming. The roadmap also includes developing more accurate, cost-effective optimizers, building tutorials, and enabling interactive optimization tracking. On AI Twitter, Google launched Gemini Live, a mobile conversational AI with voice and 10 voices, alongside Pixel Buds Pro 2 with a custom Tensor A1 chip. OpenAI updated ChatGPT-4o, reclaiming the top spot on LMSYS Arena. xAI released Grok-2 in beta, achieving SOTA in image generation with FLUX 1. Nous Research released open-source Hermes 3 models in 8B, 70B, and 405B sizes, with the 405B model achieving SOTA. Robotics updates include Astribot's humanoid robot and Apple's tabletop robot with Siri voice commands. Sakana AI introduced "The AI Scientist," an autonomous AI research system.
Gemma 2 2B + Scope + Shield
gemma-2b gemma-2-9b gemma-2-27b llama-3-1-405b sam-2 gpt-3.5 vicuna alpacaeval g-eval google-deepmind anthropic meta-ai-fair openai perplexity-ai nvidia lmsys knowledge-distillation leaderboards model-interpretability finetuning harm-detection video-segmentation voice publishers-program robotics-data-scaling quantization llm-evaluation prompt-engineering
Gemma 2B, a 2 billion parameter model trained on 2 trillion tokens and distilled from a larger unnamed LLM, has been released by Google DeepMind and shows strong leaderboard performance despite weaknesses in math. The Gemma series, including 9B and 27B models, has gained popularity since its June release. The team also released 400 SAEs for interpretability, inspired by Anthropic's research. A finetuned classifier called ShieldGemma outperforms Meta's LlamaGuard in harm detection. Meanwhile, Meta AI announced Llama-3.1-405B reaching #3 on the Overall Arena leaderboard, and released SAM 2, a video and image segmentation model with significant speed improvements. OpenAI is rolling out an advanced Voice Mode to Plus users. Perplexity AI launched a Publishers Program with major media partners and a status page. NVIDIA introduced Project GR00T for scaling robot data using Apple Vision Pro and generative simulation. Interest in quantization for compressing LLMs is growing, and LLM-as-a-Judge implementations from Vicuna, AlpacaEval, and G-Eval highlight the effectiveness of simple prompts and domain-specific evaluation.
not much happened today
sam-2 gemini-1.5-pro chatgpt midjourney-v6.1 meta-ai-fair google-deepmind scale-ai apple canva hugging-face object-segmentation quantization web-development-framework adversarial-robustness on-device-ai open-source robotics voice vision jeremyphoward demis-hassabis ylecun maartengrootendorst jimfan
Meta released SAM 2, a unified model for real-time object segmentation with a new dataset 4.5x larger and 53x more annotated than previous ones. FastHTML, a new Python web framework by Jeremy Howard, enables easy creation and deployment of interactive web apps. Scale AI launched the SEAL Leaderboard on adversarial robustness, topped by Gemini 1.5 Pro from Google DeepMind. Apple published a technical report on their Intelligence Foundation Language Models for on-device and server use. Yann LeCun emphasized the importance of open source AI in an article co-authored with Martin Casado and Ion Stoica. Maarten Grootendorst's "Visual Guide to Quantization" on efficient LLM inference went viral. ChatGPT started rolling out advanced voice and vision-enabled modes to select users. Leonardo AI was acquired by Canva. Jim Fan shared insights on Project Groot augmenting human demonstration data for robotics. Midjourney v6.1 was released.
Google I/O in 60 seconds
gemini-1.5-pro gemini-flash gemini-ultra gemini-pro gemini-nano gemma-2 llama-3-70b paligemma imagen-3 veo google google-deepmind youtube tokenization model-performance fine-tuning vision multimodality model-release model-training model-optimization ai-integration image-generation watermarking hardware-optimization voice video-understanding
Google announced updates to the Gemini model family, including Gemini 1.5 Pro with 2 million token support, and the new Gemini Flash model optimized for speed with 1 million token capacity. The Gemini suite now includes Ultra, Pro, Flash, and Nano models, with Gemini Nano integrated into Chrome 126. Additional Gemini features include Gemini Gems (custom GPTs), Gemini Live for voice conversations, and Project Astra, a live video understanding assistant. The Gemma model family was updated with Gemma 2 at 27B parameters, offering near-llama-3-70b performance at half the size, plus PaliGemma, a vision-language open model inspired by PaLI-3. Other launches include DeepMind's Veo, Imagen 3 for photorealistic image generation, and a Music AI Sandbox collaboration with YouTube. SynthID watermarking now extends to text, images, audio, and video. The Trillium TPUv6 codename was revealed. Google also integrated AI across its product suite including Workspace, Email, Docs, Sheets, Photos, Search, and Lens. "The world awaits Apple's answer."