All tags
Person: "juberti"
OpenAI Realtime API GA and new `gpt-realtime` model, 20% cheaper than 4o
gpt-realtime gpt-4o-realtime grok-code-fast-1 codex mai-1-preview mai-voice-1 gemini-cli openai xai microsoft google speech-to-speech instruction-following function-calling telephony webrtc voice-agents multilingual-switching voice-control benchmarks coding-models ide-integration developer-tools model-updates swyx juberti omarsar0 reach_vb pbbakkum skcd42 mohitreddy13 cline kevinweil gdb sama _philschmid
OpenAI launched the gpt-realtime model and Realtime API to GA, featuring advanced speech-to-speech capabilities, new voices (Cedar, Marin), image input, SIP telephony, and a ~20% price cut. Benchmarks show improvements over gpt-4o-realtime on BigBench and ComplexFuncBench. xAI introduced Grok Code Fast 1, a speed-optimized coding model integrated with popular IDEs, while OpenAI Codex received major upgrades for local and cloud development workflows. Google’s Gemini CLI improved multi-editor support, and new models like Microsoft MAI-1-preview and MAI-Voice-1 were announced. "The new all-in-one WebRTC API removes the ephemeral token step and supports video on the same connection," highlighting enhanced developer tooling.
not much happened today
gpt-5 gpt-4o grok-4 claude-4-sonnet openai microsoft reasoning latency model-routing benchmarking reinforcement-learning hallucination-control creative-writing priority-processing api-traffic model-deprecation user-experience model-selection voice-mode documentation sama nickaturley elaineyale6 scaling01 mustafasuleyman kevinweil omarsar0 jeremyphoward juberti epochairesearch lechmazur gdb
OpenAI launched GPT-5 with a unified user experience removing manual model selection, causing initial routing and access issues for Plus users that are being addressed with fixes including restored model options and increased usage limits. GPT-5 introduces "Priority Processing" for lower latency at higher price tiers, achieving ~750ms median time-to-first-token in some cases. Microsoft reports full Copilot adoption of GPT-5, and API traffic doubled within 24 hours, peaking at 2 billion tokens per minute. Early benchmarks show GPT-5 leading in reasoning tasks like FrontierMath and LiveBench, with improvements in hallucination control and creative writing, though some models like Grok-4 and Claude-4 Sonnet Thinking outperform it in specific RL-heavy reasoning benchmarks. OpenAI also released extensive migration and feature guides but faced some rollout issues including a broken code sample and a problematic Voice Mode launch. "Unified GPT-5" ends model pickers, pushing developers away from manual model selection.
QwQ-32B claims to match DeepSeek R1-671B
qwen-2.5-plus qwq-32b deepseek-r1 gpt-4.5 gpt-3 davinci alibaba openai deepseek-ai reinforcement-learning math code-execution instruction-following alignment reasoning model-release model-benchmarking scaling performance inference-costs aidan_mclau sama scaling01 juberti polynoamial reach_vb
Alibaba Qwen released their QwQ-32B model, a 32 billion parameter reasoning model using a novel two-stage reinforcement learning approach: first scaling RL for math and coding tasks with accuracy verifiers and code execution servers, then applying RL for general capabilities like instruction following and alignment. Meanwhile, OpenAI rolled out GPT-4.5 to Plus users, with mixed feedback on coding performance and noted inference cost improvements. The QwQ model aims to compete with larger MoE models like DeepSeek-R1. "GPT-4.5 is unusable for coding" was a notable user critique, while others praised its reasoning improvements due to scaling pretraining.
not much happened today
gpt-2 r1 gemma-3 gemmacoder3-12b qwen2.5-omni openai deepseek berkeley alibaba togethercompute nvidia azure runway langchain bmw amazon open-source function-calling benchmarking code-reasoning multimodality inference-speed image-generation voice-generation animation robotics realtime-transcription webrtc sama clémentdelangue lioronai scaling01 cognitivecompai osanseviero jack_w_rae ben_burtenshaw theturingpost vipulved kevinweil tomlikesrobots adcock_brett juberti
OpenAI plans to release its first open-weight language model since GPT-2 in the coming months, signaling a move towards more open AI development. DeepSeek launched its open-source R1 model earlier this year, challenging perceptions of China's AI progress. Gemma 3 has achieved function calling capabilities and ranks on the Berkeley Function-Calling Leaderboard, while GemmaCoder3-12b improves code reasoning performance on LiveCodeBench. Alibaba_Qwen's Qwen2.5-Omni introduces a novel Thinker-Talker system and TMRoPE for multimodal input understanding. The TogetherCompute team achieved 140 TPS on a 671B parameter model, outperforming Azure and DeepSeek API on Nvidia GPUs. OpenAI also expanded ChatGPT features with image generation for all free users and a new voice release. Runway Gen-4 enhances animation for miniature dioramas, and LangChain launched a chat-based generative UI agent. Commercial deployment of Figure 03 humanoid robots at BMW highlights advances in autonomy and manufacturing scaling. New tools include OpenAI's realtime transcription API with WebRTC support and Amazon's Nova Act AI browser agent.
Promptable Prosody, SOTA ASR, and Semantic VAD: OpenAI revamps Voice AI
gpt-4o-transcribe gpt-4o-mini-tts o1-pro kokoro-82m openai replicate speech-to-text text-to-speech voice-activity-detection prompt-engineering real-time-processing model-release api function-calling structured-outputs model-performance juberti sama reach_vb kevinweil omarsar0
OpenAI has launched three new state-of-the-art audio models in their API, including gpt-4o-transcribe, a speech-to-text model outperforming Whisper, and gpt-4o-mini-tts, a text-to-speech model with promptable prosody allowing control over timing and emotion. The Agents SDK now supports audio, enabling voice agents. OpenAI also updated turn detection for real-time voice activity detection (VAD) based on speech content. Additionally, OpenAI's o1-pro model is available to select developers with advanced features like vision and function calling, though at higher compute costs. The community shows strong enthusiasm for these audio advancements, with a radio contest for TTS creations underway. Meanwhile, Kokoro-82M v1.0 emerges as a leading open weights TTS model with competitive pricing on Replicate.
o1 API, 4o/4o-mini in Realtime API + WebRTC, DPO Finetuning
o1-2024-12-17 o1 o1-pro 4o 4o-mini gemini-2-0-flash claude-3.5-sonnet claude-3.5 openai google google-deepmind function-calling structured-outputs vision reasoning webrtc realtime-api preference-tuning fine-tuning api model-performance aidan_mclau kevinweil simonw michpokrass morgymcg juberti
OpenAI launched the o1 API with enhanced features including vision inputs, function calling, structured outputs, and a new
reasoning_effort
parameter, achieving 60% fewer reasoning tokens on average. The o1 pro variant is confirmed as a distinct implementation coming soon. Improvements to the Realtime API with WebRTC integration offer easier usage, longer sessions (up to 30 minutes), and significantly reduced pricing (up to 10x cheaper with mini models). DPO Preference Tuning for fine-tuning is introduced, currently available for the 4o model. Additional updates include official Go and Java SDKs and OpenAI DevDay videos. The news also highlights discussions on Google Gemini 2.0 Flash model's performance reaching 83.6% accuracy.