All tags
Topic: "text-to-video"
not much happened today
codex claude-4-opus claude-4-sonnet gemini-2.5-pro gemini-2.5 qwen-2.5-vl qwen-3 playdiffusion openai anthropic google perplexity-ai bing playai suno hugging-face langchain-ai qwen mlx assemblyai llamacloud fine-tuning model-benchmarking text-to-video agentic-ai retrieval-augmented-generation open-source-models speech-editing audio-processing text-to-speech ultra-low-latency multimodality public-notebooks sama gdb kevinweil lmarena_ai epochairesearch reach_vb wightmanr deeplearningai mervenoyann awnihannun jordirib1 aravsrinivas omarsar0 lioronai jerryjliu0 nerdai tonywu_71 _akhaliq clementdelangue _mfelfel
OpenAI rolled out Codex to ChatGPT Plus users with internet access and fine-grained controls, improving memory features for free users. Anthropic's Claude 4 Opus and Sonnet models lead coding benchmarks, while Google's Gemini 2.5 Pro and Flash models gain recognition with new audio capabilities. Qwen 2.5-VL and Qwen 3 quantizations are noted for versatility and support. Bing Video Creator launched globally enabling text-to-video generation, and Perplexity Labs sees increased demand for travel search. New agentic AI tools and RAG innovations include LlamaCloud and FedRAG. Open-source releases include Holo-1 for web navigation and PlayAI's PlayDiffusion for speech editing. Audio and multimodal advances feature Suno's music editing upgrades, Google's native TTS in 24+ languages, and Universal Streaming's ultra-low latency speech-to-text. Google NotebookLM now supports public notebooks. "Codex's internet access brings tradeoffs, with explicit warnings about risk" and "Gemini 2.5 Pro is cited as a daily driver by users".
Not much (in AI) happened this weekend
llama-3.1-8b llama-3.2 chatgpt movie-gen openai meta-ai-fair google-deepmind microsoft x-ai spacex harvard nvidia long-context feature-prediction-loss ai-agents privacy text-to-video text-to-image humanoid-robots gpu-deployment media-foundation-models ai-research-labs sam-altman yann-lecun rasbt bindureddy andrej-karpathy soumithchintala svpino adcock_brett rohanpaul_ai
OpenAI introduced an "edit this area" feature for image generation, praised by Sam Altman. Yann LeCun highlighted a NYU paper improving pixel generation with feature prediction loss using pre-trained visual encoders like DINOv2. Long-context LLMs such as llama-3.1-8b and llama-3.2 variants now support up to 131k tokens, offering alternatives to RAG systems. Bindu Reddy announced AI agents capable of building and deploying code from English instructions, signaling AI's replacement of SQL and potential impact on Python. SpaceX's successful Starship rocket catch was celebrated by Andrej Karpathy and others, with Soumith Chintala praising SpaceX's efficient, low-bureaucracy research approach. Privacy concerns arose from Harvard students' AI glasses, I-XRAY, which can reveal personal information. Meta AI FAIR's Movie Gen model advances media foundation models with high-quality text-to-image and video generation, including synced audio. Humanoid robots like Ameca and Azi now engage in expressive conversations using ChatGPT. xAI rapidly deployed 100K Nvidia H100 GPUs in 19 days, with CEO Jensen Huang commending Elon Musk. Leading AI research labs compared include Meta-FAIR, Google DeepMind, and Microsoft Research. Skepticism about LLM intelligence was voiced by Sam Pino, emphasizing limitations in novel problem-solving despite strong memorization.
not much happened this weekend
jamba-1.5 dream-machine-1.5 ideogram-v2 mistral-nemo-minitron-8b mistral-7b llama-3-8b nous-research cursor-ai gdm george-hotz agibot unitree eth-zurich disney uc-san-diego ai21-labs luma-labs ideogram nvidia mistral-ai meta-ai-fair distributed-ai optimizer inter-gpu-communication low-latency-training open-source humanoid-robots robotics physics-based-motion teleoperation multilingual-models long-context text-to-video text-to-image model-performance george-hotz adcock_brett aman
Nous Research announced DisTrO, a new optimizer that drastically reduces inter-GPU communication by 1000x to 10,000x enabling efficient training on slow networks, offering an alternative to GDM's DiLoCo. Cursor AI gained viral attention from an 8-year-old user and announced a new fundraise, with co-host Aman returning to their podcast. George Hotz launched tinybox for sale. In robotics, AGIBOT revealed 5 new humanoid robots with open-source plans, and Unitree showcased its G1 humanoid robot nearing mass production at $16,000. ETH Zurich and Disney developed an AI system for physics-based robot motion generation from text or images. UC San Diego released ACE, an open-source teleoperation system for controlling multiple robots. AI21 Labs unveiled Jamba 1.5, a multilingual model with 256k context length and permissive licensing. Luma Labs released Dream Machine 1.5 for improved text-to-video generation. Ideogram launched v2 of its text-to-image model with near-perfect text generation. Nvidia and Mistral released Mistral-NeMo-Minitron 8B, a small model outperforming Mistral-7B and llama-3-8b on the Open LLM leaderboard.
How Carlini Uses AI
gemma-2-2b gpt-3.5-turbo-0613 mixtral-8x7b gen-3-alpha segment-anything-model-2 stable-fast-3d groq intel deepmind box figure-ai openai google meta-ai-fair nvidia stability-ai runway benchmarking adversarial-attacks large-language-models text-generation multimodality robotics emotion-detection structured-data-extraction real-time-processing teleoperation 3d-generation text-to-video nicholas-carlini chris-dixon rasbt
Groq's shareholders' net worth rises while others fall, with Intel's CEO expressing concern. Nicholas Carlini of DeepMind gains recognition and criticism for his extensive AI writings, including an 80,000-word treatise on AI use and a benchmark for large language models. Chris Dixon comments on AI Winter skepticism, emphasizing long-term impact. Box introduces an AI API for extracting structured data from documents, highlighting potential and risks of LLM-driven solutions. Recent AI developments include Figure AI launching the advanced humanoid robot Figure 02, OpenAI rolling out Advanced Voice Mode for ChatGPT with emotion detection, Google open-sourcing Gemma 2 2B model matching GPT-3.5-Turbo-0613 performance, Meta AI Fair releasing Segment Anything Model 2 (SAM 2) for real-time object tracking, NVIDIA showcasing Project GR00T for humanoid teleoperation with Apple Vision Pro, Stability AI launching Stable Fast 3D for rapid 3D asset generation, and Runway unveiling Gen-3 Alpha for AI text-to-video generation.
Rombach et al: FLUX.1 [pro|dev|schnell], $31m seed for Black Forest Labs
gemma-2-2b gpt-3.5-turbo-0613 mixtral-8x7b flux-1 stability-ai google-deepmind nvidia text-to-image text-to-video model-benchmarking open-weight-models model-distillation safety-classifiers sparse-autoencoders ai-coding-tools rohanpaul_ai fchollet bindureddy clementdelangue ylecun svpino
Stability AI co-founder Rombach launched FLUX.1, a new text-to-image model with three variants: pro (API only), dev (open-weight, non-commercial), and schnell (Apache 2.0). FLUX.1 outperforms Midjourney and Ideogram based on Black Forest Labs' ELO score and plans to expand into text-to-video. Google DeepMind released Gemma-2 2B, a 2 billion parameter open-source model that outperforms larger models like GPT-3.5-Turbo-0613 and Mixtral-8x7b on Chatbot Arena, optimized with NVIDIA TensorRT-LLM. The release includes safety classifiers (ShieldGemma) and sparse autoencoder analysis (Gemma Scope). Discussions highlight benchmarking discrepancies and US government support for open-weight AI models. Critiques of AI coding tools' productivity gains were also noted.
Hybrid SSM/Transformers > Pure SSMs/Pure Transformers
mamba-2-hybrid gpt-4 qwen-72b table-llava-7b nvidia lamini-ai sakana-ai luma-labs mixture-of-experts benchmarking fine-tuning multimodality text-to-video model-performance memory-optimization preference-optimization video-understanding multimodal-tables bryan-catanzaro bindureddy ylecun ctnzr corbtt realsharonzhou andrew-n-carr karpathy _akhaliq omarsar0
NVIDIA's Bryan Catanzaro highlights a new paper on Mamba models, showing that mixing Mamba and Transformer blocks outperforms either alone, with optimal attention below 20%. Mixture-of-Agents (MoA) architecture improves LLM generation quality, scoring 65.1% on AlpacaEval 2.0 versus GPT-4 Omni's 57.5%. The LiveBench AI benchmark evaluates reasoning, coding, writing, and data analysis. A hybrid Mamba-2-Hybrid model with 7% attention surpasses a Transformer on MMLU accuracy, jumping from 50% to 53.6%. GPT-4 performs better at temperature=1. Qwen 72B leads open-source models on LiveBench AI. LaminiAI Memory Tuning achieves 95% accuracy on a SQL agent task, improving over instruction fine-tuning. Sakana AI Lab uses evolutionary strategies for preference optimization. Luma Labs Dream Machine demonstrates advanced text-to-video generation. The MMWorld benchmark evaluates multimodal video understanding, and Table-LLaVa 7B competes with GPT-4V on multimodal table tasks.
World_sim.exe
gpt-4 gpt-4o grok-1 llama-cpp claude-3-opus claude-3 gpt-5 nvidia nous-research stability-ai hugging-face langchain anthropic openai multimodality foundation-models hardware-optimization model-quantization float4 float6 retrieval-augmented-generation text-to-video prompt-engineering long-form-rag gpu-optimization philosophy-of-ai agi-predictions jensen-huang yann-lecun sam-altman
NVIDIA announced Project GR00T, a foundation model for humanoid robot learning using multimodal instructions, built on their tech stack including Isaac Lab, OSMO, and Jetson Thor. They revealed the DGX Grace-Blackwell GB200 with over 1 exaflop compute, capable of training GPT-4 1.8T parameters in 90 days on 2000 Blackwells. Jensen Huang confirmed GPT-4 has 1.8 trillion parameters. The new GB200 GPU supports float4/6 precision with ~3 bits per parameter and achieves 40,000 TFLOPs on fp4 with 2x sparsity.
Open source highlights include the release of Grok-1, a 340B parameter model, and Stability AI's SV3D, an open-source text-to-video generation solution. Nous Research collaborated on implementing Steering Vectors in Llama.CPP.
In Retrieval Augmented Generation (RAG), a new 5.5-hour tutorial builds a pipeline using open-source HF models, and LangChain released a video on query routing and announced integration with NVIDIA NIM for GPU-optimized LLM inference.
Prominent opinions include Yann LeCun distinguishing language from other cognitive abilities, Sam Altman predicting AGI arrival in 6 years with a leap from GPT-4 to GPT-5 comparable to GPT-3 to GPT-4, and discussions on the philosophical status of LLMs like Claude. There is also advice against training models from scratch for most companies.
Google Solves Text to Video
mistral-7b llava google-research amazon-science huggingface mistral-ai together-ai text-to-video inpainting space-time-diffusion code-evaluation fine-tuning inference gpu-rentals multimodality api model-integration learning-rates
Google Research introduced Lumiere, a text-to-video model featuring advanced inpainting capabilities using a Space-Time diffusion process, surpassing previous models like Pika and Runway. Manveer from UseScholar.org compiled a comprehensive list of code evaluation benchmarks beyond HumanEval, including datasets from Amazon Science, Hugging Face, and others. Discord communities such as TheBloke discussed topics including running Mistral-7B via API, GPU rentals, and multimodal model integration with LLava. Nous Research AI highlighted learning rate strategies for LLM fine-tuning, issues with inference, and benchmarks like HumanEval and MBPP. RestGPT gained attention for controlling applications via RESTful APIs, showcasing LLM application capabilities.