All tags
Topic: "text-to-speech"
not much happened today
codex claude-4-opus claude-4-sonnet gemini-2.5-pro gemini-2.5 qwen-2.5-vl qwen-3 playdiffusion openai anthropic google perplexity-ai bing playai suno hugging-face langchain-ai qwen mlx assemblyai llamacloud fine-tuning model-benchmarking text-to-video agentic-ai retrieval-augmented-generation open-source-models speech-editing audio-processing text-to-speech ultra-low-latency multimodality public-notebooks sama gdb kevinweil lmarena_ai epochairesearch reach_vb wightmanr deeplearningai mervenoyann awnihannun jordirib1 aravsrinivas omarsar0 lioronai jerryjliu0 nerdai tonywu_71 _akhaliq clementdelangue _mfelfel
OpenAI rolled out Codex to ChatGPT Plus users with internet access and fine-grained controls, and improved memory features for free users. Anthropic's Claude 4 Opus and Sonnet models lead coding benchmarks, while Google's Gemini 2.5 Pro and Flash models gain recognition with new audio capabilities. Qwen 2.5-VL and Qwen 3 quantizations are noted for versatility and support. Bing Video Creator launched globally enabling text-to-video generation, and Perplexity Labs sees increased demand for travel search. New agentic AI tools and RAG innovations include LlamaCloud and FedRAG. Open-source releases include Holo-1 for web navigation and PlayAI's PlayDiffusion for speech editing. Audio and multimodal advances feature Suno's music editing upgrades, Google's native TTS in 24+ languages, and Universal Streaming's ultra-low latency speech-to-text. Google NotebookLM now supports public notebooks. "Codex's internet access brings tradeoffs, with explicit warnings about risk" and "Gemini 2.5 Pro is cited as a daily driver by users".
not much happened today
claude-4 claude-4-opus claude-4-sonnet gemini-2.5-pro gemma-3n imagen-4-ultra anthropic google-deepmind openai codebase-understanding coding agentic-performance multimodality text-to-speech video-generation model-integration benchmarking memory-optimization cline amanrsanger ryanpgreenblatt johnschulman2 alexalbert__ nearcyan mickeyxfriedman jeremyphoward gneubig teortaxesTex scaling01 artificialanlys philschmid
Anthropic's Claude 4 models (Opus 4, Sonnet 4) demonstrate strong coding abilities, with Sonnet 4 achieving 72.7% on SWE-bench and Opus 4 at 72.5%. Claude Sonnet 4 excels in codebase understanding and is considered SOTA on large codebases. Criticism arose over Anthropic's handling of ASL-3 security requirements. Demand for Claude 4 is high, with integration into IDEs and support from Cherry Studio and FastHTML. Google DeepMind introduced Gemini 2.5 Pro Deep Think and Gemma 3n, a mobile multimodal model reducing RAM usage by nearly 3x. Google's Imagen 4 Ultra ranks third in the Artificial Analysis Image Arena, available on Vertex AI Studio. Google also promoted Google Beam, an AI video model for immersive 3D experiences, and new text-to-speech models with multi-speaker support. The GAIA benchmark shows Claude 4 Opus and Sonnet leading in agentic performance.
Cognition's DeepWiki, a free encyclopedia of all GitHub repos
o4-mini perception-encoder qwen-2.5-vl dia-1.6b grok-3 gemini-2.5-pro claude-3.7 gpt-4.1 cognition meta-ai-fair alibaba hugging-face openai perplexity-ai vllm vision text-to-speech reinforcement-learning ocr model-releases model-integration open-source frameworks chatbots model-selector silas-alberti mervenoyann reach_vb aravsrinivas vikparuchuri lioronai
Silas Alberti of Cognition announced DeepWiki, a free encyclopedia of all GitHub repos providing Wikipedia-like descriptions and Devin-backed chatbots for public repos. Meta released Perception Encoders (PE) under an Apache 2.0 license, outperforming InternVL3 and Qwen2.5VL on vision tasks. Alibaba launched the Qwen Chat App for iOS and Android. Hugging Face integrated the Dia 1.6B SoTA text-to-speech model via FAL. OpenAI expanded deep research usage with a lightweight version powered by the o4-mini model, now available to free users. Perplexity AI updated their model selector with Grok 3 Beta, o4-mini, and support for models like Gemini 2.5 Pro, Claude 3.7, and GPT-4.1. The vLLM project introduced the OpenRLHF framework for reinforcement learning from human feedback. The Surya OCR alpha model supports 90+ languages and LaTeX. MegaParse, an open-source library for LLM-ready data formats, was introduced.
Promptable Prosody, SOTA ASR, and Semantic VAD: OpenAI revamps Voice AI
gpt-4o-transcribe gpt-4o-mini-tts o1-pro kokoro-82m openai replicate speech-to-text text-to-speech voice-activity-detection prompt-engineering real-time-processing model-release api function-calling structured-outputs model-performance juberti sama reach_vb kevinweil omarsar0
OpenAI has launched three new state-of-the-art audio models in their API, including gpt-4o-transcribe, a speech-to-text model outperforming Whisper, and gpt-4o-mini-tts, a text-to-speech model with promptable prosody allowing control over timing and emotion. The Agents SDK now supports audio, enabling voice agents. OpenAI also updated turn detection for real-time voice activity detection (VAD) based on speech content. Additionally, OpenAI's o1-pro model is available to select developers with advanced features like vision and function calling, though at higher compute costs. The community shows strong enthusiasm for these audio advancements, with a radio contest for TTS creations underway. Meanwhile, Kokoro-82M v1.0 emerges as a leading open weights TTS model with competitive pricing on Replicate.
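As an illustration of promptable prosody, a request to the speech endpoint might be assembled like this. The endpoint and the `instructions` field follow OpenAI's published audio API, but treat the exact parameter set as an assumption and check the current docs; the voice name and wording are illustrative:

```python
import json

# Sketch of a request body for OpenAI's /v1/audio/speech endpoint.
# `instructions` steers prosody (pacing, emotion) in plain language.
payload = {
    "model": "gpt-4o-mini-tts",
    "voice": "alloy",
    "input": "Your package has shipped!",
    "instructions": "Sound upbeat, with a brief pause before 'shipped'.",
}
body = json.dumps(payload).encode("utf-8")
# POST `body` to https://api.openai.com/v1/audio/speech with an
# "Authorization: Bearer <API key>" header; the response is audio bytes.
```

The key difference from classic TTS APIs is that delivery is controlled with free-form text rather than SSML-style markup.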
Every 7 Months: The Moore's Law for Agent Autonomy
claude-3-7-sonnet llama-4 phi-4-multimodal gpt-2 cosmos-transfer1 gr00t-n1-2b orpheus-3b metr nvidia hugging-face canopy-labs meta-ai-fair microsoft agent-autonomy task-completion multimodality text-to-speech robotics foundation-models model-release scaling-laws fine-tuning zero-shot-learning latency reach_vb akhaliq drjimfan scaling01
METR published a paper measuring AI agent autonomy progress, showing it has doubled every 7 months since 2019 (GPT-2). They introduced a new metric, the 50%-task-completion time horizon, on which models like Claude 3.7 Sonnet achieve a 50% success rate on tasks that take humans about 50 minutes. Projections estimate 1 day of autonomy by 2028 and 1 month of autonomy by late 2029. Meanwhile, Nvidia released Cosmos-Transfer1 for conditional world generation and GR00T-N1-2B, an open foundation model for humanoid robot reasoning with 2B parameters. Canopy Labs introduced Orpheus 3B, a high-quality text-to-speech model with zero-shot voice cloning and low latency. Meta reportedly delayed the Llama-4 release due to performance issues. Microsoft launched Phi-4-multimodal.
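The doubling arithmetic behind these projections can be sketched directly. The ~50-minute starting horizon and the 7-month doubling period come from the text; the helper name is made up, and calendar-year projections depend on when the starting horizon was measured:

```python
import math

def months_to_reach(target_minutes, current_minutes=50, doubling_months=7):
    """Months until the 50%-task-completion horizon reaches `target_minutes`,
    assuming it keeps doubling every `doubling_months` months."""
    doublings = math.log2(target_minutes / current_minutes)
    return doublings * doubling_months

# From a ~50-minute horizon: roughly 34 months to a 1-day (1440-minute)
# horizon, and roughly 68 months to a 1-month (30-day) horizon.
to_one_day = months_to_reach(24 * 60)
to_one_month = months_to_reach(30 * 24 * 60)
```

Because the trend is exponential, each extra doubling buys a fixed 7 months regardless of how large the horizon already is.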
not much happened today
zonos-v0.1 audiobox-aesthetics moshi sonar llama-3-70b gpt-4o-mini claude-3.5-haiku gpt-4o claude-3.5-sonnet deepseek-r1-distilled-qwen-1.5b reasonflux-32b o1-preview zyphra-ai meta-ai-fair kyutai-labs perplexity-ai cerebras uc-berkeley brilliant-labs google-deepmind text-to-speech speech-to-speech benchmarking model-performance reinforcement-learning math real-time-processing open-source cross-platform-integration multilinguality zero-shot-learning danhendrycks
Zyphra AI launched Zonos-v0.1, a leading open-weight text-to-speech model supporting multiple languages and zero-shot voice cloning. Meta FAIR released the open-source Audiobox Aesthetics model trained on 562 hours of audio data. Kyutai Labs introduced Moshi, a real-time speech-to-speech system with low latency. Perplexity AI announced the Sonar model based on Llama 3.3 70b, outperforming top models like GPT-4o and Claude 3.5 Sonnet with 1200 tokens/second speed, powered by Cerebras infrastructure. UC Berkeley open-sourced a 1.5B model trained with reinforcement learning that beats o1-preview on math tasks. ReasonFlux-32B achieved 91.2% on the MATH benchmark, outperforming OpenAI o1-preview. CrossPoster, an AI agent for cross-platform posting, was released using LlamaIndex workflows. Brilliant Labs integrated the Google DeepMind Gemini Live API into smart glasses for real-time translation and object identification.
not much happened today
gemini-2.0-flash-thinking-experimental-1-21 zonos openr1-math-220k huginn-3.5b deepseek-r1 o1 claude google zyphraai hugging-face anthropic deepseek openai vision multilingual-models text-to-speech voice-cloning math reasoning latent-reasoning chain-of-thought dataset-release fine-tuning model-training model-performance context-windows benchmarking jeremyphoward andrej-karpathy tom-goldstein reach_vb iscienceluvr
Google released Gemini 2.0 Flash Thinking Experimental 1-21, a vision-language reasoning model with a 1 million-token context window and improved accuracy on science, math, and multimedia benchmarks, surpassing DeepSeek-R1 but trailing OpenAI's o1. ZyphraAI launched Zonos, a multilingual Text-to-Speech model with instant voice cloning and controls for speaking rate, pitch, and emotions, running at ~2x real-time speed on RTX 4090. Hugging Face released OpenR1-Math-220k, a large-scale math reasoning dataset with 220K problems and 800K reasoning traces generated on 512 H100 GPUs. Tom Goldstein introduced Huginn-3.5B, an open-source latent reasoning model trained on 800B tokens that outperforms larger models on reasoning tasks like GSM8K. Discussions by Jeremy Howard and iScienceLuvr highlight advances in implicit latent reasoning and debate the future of human-readable reasoning traces. Anthropic launched the Anthropic Economic Index to analyze AI's economic impact using millions of Claude conversations.
not much happened today
oute-tts-0.3-1b oute-tts-0.3-500m olm-1b qwen-2.5-0.5b hover gpt-4o deepseek-v3 harvey meta-ai-fair stability-ai alibaba deepseek hugging-face text-to-speech zero-shot-learning multilinguality emotion-control motor-control reinforcement-learning local-ai distributed-inference pipeline-parallelism mathematical-reasoning process-reward-models legal-ai education-ai ai-security humor reach_vb drjimfan vikhyatk mervenoyann aiatmeta iscienceluvr alibaba_qwen awnihannun ajeya_cotra emollick qtnx_ designerx
Harvey secured a new $300M funding round. OuteTTS 0.3 1B & 500M text-to-speech models were released featuring zero-shot voice cloning, multilingual support (en, jp, ko, zh, fr, de), and emotion control, powered by OLMo-1B and Qwen 2.5 0.5B. The HOVER model, a 1.5M-parameter neural net for agile motor control, was introduced, leveraging human motion capture datasets and massively parallel reinforcement learning. kokoro.js enables running AI models locally in browsers with minimal dependencies. Meta AI awarded $200K LLM evaluation grants for projects on regional language understanding, complex reasoning, and interactive programming environments. Stability AI's Twitter account was hacked, prompting security warnings. Alibaba Qwen improved Process Reward Models (PRMs) for better mathematical reasoning using a consensus filtering mechanism. DeepSeek V3 uses pipeline parallelism to enhance distributed inference and long-context generation efficiency. Discussions on AI policy in legal frameworks and AI's role in democratizing education were highlighted. Lighthearted AI-related humor was also shared.
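To see why pipeline parallelism helps inference throughput, here is a toy step count for the classic fill/drain micro-batch schedule. This illustrates the general technique only, not DeepSeek V3's actual schedule; the function names are hypothetical:

```python
def pipeline_steps(stages, microbatches):
    """Time steps to push all micro-batches through a pipeline of stages,
    assuming each stage processes one micro-batch per step: after the
    pipeline fills, a new micro-batch finishes every step."""
    return stages + microbatches - 1

def sequential_steps(stages, microbatches):
    """Time steps if each micro-batch traverses every stage before the
    next one starts (no overlap between stages)."""
    return stages * microbatches

# e.g. 4 stages and 8 micro-batches: 11 pipelined steps vs 32 sequential.
```

The gap widens with more micro-batches, which is why long-context generation, with many chunks in flight, benefits most.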
not much happened this weekend
claude-3.5-sonnet llama-3 llama-3-8b notebookllama min-omni-2 moondream openai anthropic hugging-face mistral-ai google-deepmind langchain deepmind microsoft pattern-recognition reinforcement-learning prompt-optimization text-to-speech model-optimization tensor-parallelism hyperparameters multimodal modal-alignment multimodal-fine-tuning ai-productivity privacy generative-ai rag retrieval-augmentation enterprise-text-to-sql amanda-askell philschmid stasbekman francois-fleuret mervenoyann reach_vb dzhng aravsrinivas sama lateinteraction andrew-y-ng bindureddy jerryjliu0
Moondream, a 1.6b vision language model, secured seed funding, highlighting a trend in moon-themed tiny models alongside Moonshine (27-61m ASR model). Claude 3.5 Sonnet was used for AI Twitter recaps. Discussions included pattern recognition vs. intelligence in LLMs, reinforcement learning for prompt optimization, and NotebookLlama, an open-source NotebookLM variant using LLaMA models for tasks like text-to-speech. Advances in model optimization with async-TP in PyTorch for tensor parallelism and hyperparameter tuning were noted. Mini-Omni 2 demonstrated multimodal capabilities across image, audio, and text for voice conversations with emphasis on modal alignment and multimodal fine-tuning. AI productivity tools like an AI email writer and LlamaCloud-based research assistants were introduced. Emphasis on practical skill development and privacy-conscious AI tool usage with Llama3-8B was highlighted. Generative AI tools such as #AIPythonforBeginners and GenAI Agents with LangGraph were shared. Business insights covered rapid execution in AI product development and emerging AI-related job roles. Challenges in enterprise-grade text-to-SQL and advanced retrieval methods were discussed with tutorials on RAG applications using LangChain and MongoDB.
DeepMind SIMA: one AI, 9 games, 600 tasks, vision+language ONLY
llama-3 claude-3-opus claude-3 gpt-3.5-turbo deepmind cognition-labs deepgram modal-labs meta-ai-fair anthropic multimodality transformer software-engineering ai-agents ai-infrastructure training text-to-speech speech-to-text real-time-processing model-architecture benchmarking andrej-karpathy arav-srinivas francois-chollet yann-lecun soumith-chintala john-carmack
DeepMind SIMA is a generalist AI agent for 3D virtual environments evaluated on 600 tasks across 9 games using only screengrabs and natural language instructions, achieving 34% success compared to humans' 60%. The model uses a multimodal Transformer architecture. Andrej Karpathy outlines AI autonomy progression in software engineering, while Arav Srinivas praises Cognition Labs' AI agent demo. François Chollet expresses skepticism about automating software engineering fully. Yann LeCun suggests moving away from generative models and reinforcement learning towards human-level AI. Meta's Llama-3 training infrastructure with 24k H100 Cluster Pods is shared by Soumith Chintala and Yann LeCun. Deepgram's Aura offers low-latency speech APIs, and Modal Labs' Devin AI demonstrates document navigation and interaction with ComfyUI. Memes and humor circulate in the AI community.
MetaVoice & RIP Bard
mixtral nous-mixtral-dpo miqu-70b gpt-4 llama-2-70b-instruct llama-2 llama-2-70b coqui metavoice google openai thebloke text-to-speech voice-cloning longform-synthesis prompt-engineering direct-preference-optimization lora-fine-tuning transformers gpu-acceleration apple-silicon content-authenticity metadata ai-censorship open-source-ai model-comparison usability model-limitations
MetaVoice, a small startup, released a new TTS model supporting voice cloning and longform synthesis, picking up where Coqui, a TTS startup that recently shut down, left off. Google discontinued the Bard brand in favor of Gemini. On TheBloke Discord, discussions focused on AI training with models like Mixtral, Nous Mixtral DPO, and Miqu 70B, comparing them to OpenAI's GPT models, and debated prompt engineering, lorebooks, and removing safety features via LoRA fine-tuning on models such as Llama 2 70B Instruct. Technical topics included transformer layer offloading limitations and adapting LLaMA 2 for Apple Silicon. On OpenAI Discord, DALL-E images now include C2PA metadata for content authenticity, sparking debates on AI censorship, metadata manipulation, and open-source AI models versus commercial giants like GPT-4. Users discussed GPT-4 usability, limitations, and practical applications.
1/3/2024: RIP Coqui
sdxl diffusers-0.25 coqui mozilla hugging-face google text-to-speech performance-optimization token-management transformer-architecture image-datasets web-crawling pytorch leaderboards
Coqui, a prominent open source text-to-speech project from the Mozilla ML group, officially shut down. Discussions in the HuggingFace Discord highlighted skepticism about the claimed "3X faster" speed of SDXL, attributing improvements more to techniques like torch.compile and the removal of fp16 and attention overhead than to diffusers 0.25 features. Users confirmed that a HuggingFace user token can be used across multiple machines, though distinct tokens are recommended for safety. The LLM (large language model) Leaderboard briefly experienced issues but was later confirmed operational. A Kaggle notebook was shared demonstrating how to build Transformer architectures from scratch using PyTorch. Additionally, a new image dataset with 15k shoe, sandal, and boot images was introduced for multiclass classification tasks. Explanations of how the Common Crawl web-crawling process works were also shared.
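The multi-machine token note can be put into practice by exporting the token per machine rather than copying credential files around. `HF_TOKEN` is the environment variable that huggingface_hub reads; the value below is a placeholder, not a real token:

```python
import os

# Sketch: each machine sets its own token (ideally a distinct one per
# machine, as recommended) via the environment instead of sharing the
# cached token file. "hf_example" is a placeholder value.
os.environ.setdefault("HF_TOKEN", "hf_example")

def current_token():
    # Libraries like huggingface_hub pick this up automatically.
    return os.environ.get("HF_TOKEN")
```

Using distinct fine-grained tokens per machine means one leaked machine can be revoked without rotating credentials everywhere.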