Topic: "video-understanding"
not much happened today
hunyuan-turbos qwen3-235b-a22b o3 gpt-4.1-nano grok-3 gemini-2.5-pro seed1.5-vl kling-2.0 tencent openai bytedance meta-ai-fair nvidia deepseek benchmarking model-performance moe reasoning vision video-understanding vision-language multimodality model-evaluation model-optimization lmarena_ai artificialanlys gdb _jasonwei iScienceLuvr _akhaliq _philschmid teortaxesTex mervenoyann reach_vb
Tencent's Hunyuan-Turbos has risen to #8 on the LMArena leaderboard, showing strong performance across major categories and significant improvement since February. The Qwen3 model family, especially the Qwen3 235B-A22B (Reasoning) model, is noted for its intelligence and efficient parameter usage. OpenAI introduced HealthBench, a new health evaluation benchmark developed with input from over 250 physicians, where models like o3, GPT-4.1 nano, and Grok 3 showed strong results. ByteDance released Seed1.5-VL, a vision-language model with a 532M-parameter vision encoder and a 20B active parameter MoE LLM, achieving state-of-the-art results on 38 public benchmarks. In vision-language, Kling 2.0 leads image-to-video generation, and Gemini 2.5 Pro excels in video understanding with advanced multimodal capabilities. Meta's Vision-Language-Action framework and updates on VLMs for 2025 were also highlighted.
Prime Intellect's INTELLECT-2 and PRIME-RL advance distributed reinforcement learning
intellect-2 dreamo qwen gemini-2.5-pro dynamic-byte-latent-transformer gen-4-references mistral-medium-3 le-chat-enterprise primeintellect bytedance qwen gemma meta-ai-fair runwayml mistral-ai google distributed-training reinforcement-learning gpu-clusters model-optimization quantization multimodality agentic-ai video-understanding fine-tuning _akhaliq reach_vb osanseviero aiatmeta c_valenzuelab lmarena_ai adcock_brett
Prime Intellect released INTELLECT-2, a decentralized GPU training and RL framework with a vision for distributed AI training overcoming colocation limits. ByteDance launched DreamO, a unified image customization model on Hugging Face. Qwen released models optimized for GPTQ, GGUF, and AWQ quantization. Gemma surpassed 150 million downloads on Hugging Face. Meta released weights for the Dynamic Byte Latent Transformer and the Collaborative Reasoner framework to improve language model efficiency and reasoning. RunwayML introduced Gen-4 References, a near-real-time model requiring no fine-tuning. Mistral AI released Mistral Medium 3, a strong multimodal model, and Le Chat Enterprise, an agentic AI assistant for business. Google updated Gemini 2.5 Pro Preview with video understanding and UI improvements. The quote "Airbnb for spare GPUs from all over the world" captures both the ongoing challenges and the potential of distributed GPU training.
gpt-image-1 - ChatGPT's imagegen model, confusingly NOT 4o, now available in API
gpt-image-1 o3 o4-mini gpt-4.1 eagle-2.5-8b gpt-4o qwen2.5-vl-72b openai nvidia hugging-face x-ai image-generation content-moderation benchmarking long-context multimodality model-performance supercomputing virology video-understanding model-releases kevinweil lmarena_ai _philschmid willdepue arankomatsuzaki epochairesearch danhendrycks reach_vb mervenoyann _akhaliq
OpenAI officially launched the gpt-image-1 API for image generation and editing, supporting features like alpha channel transparency and a "low" content moderation policy. OpenAI's models o3 and o4-mini are leading in benchmarks for style control, math, coding, and hard prompts, with o3 ranking #1 in several categories. A new benchmark called Vending-Bench reveals performance variance in LLMs on extended tasks. GPT-4.1 ranks in the top 5 for hard prompts and math. Nvidia's Eagle 2.5-8B matches GPT-4o and Qwen2.5-VL-72B in long-video understanding. AI supercomputer performance doubles every 9 months, with xAI's Colossus costing an estimated $7 billion and the US dominating 75% of global performance. The Virology Capabilities Test shows OpenAI's o3 outperforms 94% of expert virologists. Nvidia also released the Describe Anything Model (DAM), a multimodal LLM for detailed image and video captioning, now available on Hugging Face.
Meta Apollo - Video Understanding up to 1 hour, SOTA Open Weights
apollo-1b apollo-3b apollo-7b veo-2 imagen-3 llama-3-70b llama-3b command-r7b llama-1b llama-8b chatgpt meta-ai-fair hugging-face google-deepmind openai figure-ai klarna cohere notion video-understanding scaling-consistency benchmarking temporal-ocr egocentric-perception spatial-perception reasoning video-generation physics-simulation voice-features map-integration language-expansion test-time-compute-scaling humanoid-robots ai-integration search-optimization self-recognition self-preference-bias akhaliq _lewtun clementdelangue adcock_brett rohanpaul_ai swyx shaneguML
Meta released Apollo, a new family of state-of-the-art video-language models available in 1B, 3B, and 7B sizes, featuring "Scaling Consistency" for efficient scaling and introducing ApolloBench, which speeds up video understanding evaluation by 41× across five temporal perception categories. Google DeepMind launched Veo 2, a 4K video generation model with improved physics and camera control, alongside an enhanced Imagen 3 image model. OpenAI globally rolled out ChatGPT search with advanced voice and map features and discussed a potential $2,000/month "ChatGPT Max" tier. Research highlights include achieving Llama 70B performance using Llama 3B via test-time compute scaling and expanding Command R7B language support from 10 to 23 languages. Industry updates feature Figure AI delivering humanoid robots commercially and Klarna reducing workforce through AI. Notion integrated Cohere Rerank for better search. Studies reveal LLMs can recognize their own writing style and show self-preference bias. Discussions note video processing progress outpacing text due to better signal-per-compute and data evaluation.
AIPhone 16: the Visual Intelligence Phone
reflection-70b llama-3-70b qwen-2-72b llama-3-1-405b claude gpt-4 gemini apple openai weights-biases vision video-understanding benchmarking planning model-evaluation privacy ai-integration instruction-following yann-lecun
Apple announced the new iPhone 16 lineup featuring Visual Intelligence, a new AI capability integrated with Camera Control, Apple Maps, and Siri, emphasizing privacy and the use of default services over third-party AI such as OpenAI. Apple Photos now includes advanced video understanding with timestamp recognition. Meanwhile, Reflection-70B claims to be a top open-source model, but benchmarks show it performs close to Llama 3 70B and slightly worse than Qwen 2 72B. Yann LeCun highlighted ongoing challenges with LLM planning abilities, noting models like Llama-3.1-405B and Claude show some skill, while GPT-4 and Gemini lag behind. Weights & Biases is sponsoring an event to advance LLM evaluation techniques with prizes and API access.
Hybrid SSM/Transformers > Pure SSMs/Pure Transformers
mamba-2-hybrid gpt-4 qwen-72b table-llava-7b nvidia lamini-ai sakana-ai luma-labs mixture-of-experts benchmarking fine-tuning multimodality text-to-video model-performance memory-optimization preference-optimization video-understanding multimodal-tables bryan-catanzaro bindureddy ylecun ctnzr corbtt realsharonzhou andrew-n-carr karpathy _akhaliq omarsar0
NVIDIA's Bryan Catanzaro highlights a new paper on Mamba models, showing that mixing Mamba and Transformer blocks outperforms either alone, with optimal attention below 20%. Mixture-of-Agents (MoA) architecture improves LLM generation quality, scoring 65.1% on AlpacaEval 2.0 versus GPT-4 Omni's 57.5%. The LiveBench AI benchmark evaluates reasoning, coding, writing, and data analysis. A hybrid Mamba-2-Hybrid model with 7% attention surpasses a Transformer on MMLU accuracy, jumping from 50% to 53.6%. GPT-4 performs better at temperature=1. Qwen 72B leads open-source models on LiveBench AI. LaminiAI Memory Tuning achieves 95% accuracy on a SQL agent task, improving over instruction fine-tuning. Sakana AI Lab uses evolutionary strategies for preference optimization. Luma Labs Dream Machine demonstrates advanced text-to-video generation. The MMWorld benchmark evaluates multimodal video understanding, and Table-LLaVA 7B competes with GPT-4V on multimodal table tasks.
Google I/O in 60 seconds
gemini-1.5-pro gemini-flash gemini-ultra gemini-pro gemini-nano gemma-2 llama-3-70b paligemma imagen-3 veo google google-deepmind youtube tokenization model-performance fine-tuning vision multimodality model-release model-training model-optimization ai-integration image-generation watermarking hardware-optimization voice video-understanding
Google announced updates to the Gemini model family, including Gemini 1.5 Pro with 2 million token support, and the new Gemini Flash model optimized for speed with 1 million token capacity. The Gemini suite now includes Ultra, Pro, Flash, and Nano models, with Gemini Nano integrated into Chrome 126. Additional Gemini features include Gemini Gems (custom GPTs), Gemini Live for voice conversations, and Project Astra, a live video understanding assistant. The Gemma model family was updated with Gemma 2 at 27B parameters, offering near-Llama-3-70B performance at half the size, plus PaliGemma, a vision-language open model inspired by PaLI-3. Other launches include DeepMind's Veo, Imagen 3 for photorealistic image generation, and a Music AI Sandbox collaboration with YouTube. SynthID watermarking now extends to text, images, audio, and video. The sixth-generation TPU, codenamed Trillium, was revealed. Google also integrated AI across its product suite including Workspace, Email, Docs, Sheets, Photos, Search, and Lens. "The world awaits Apple's answer."
Google AI: Win some (Gemma, 1.5 Pro), Lose some (Image gen)
gemma-2b gemma-7b gemma gemini-pro-1.5 llama-2 llama-3 mistral google hugging-face nvidia benchmarking license-policies image-generation video-understanding long-context dataset-editing model-integration gpu-hardware bug-fixes quantization
Google's Gemma open models (2B and 7B parameters) outperform Llama 2 and Mistral in benchmarks but face criticism for an unusual license and poor image generation quality, which Google partially acknowledges. The upcoming Gemini Pro 1.5 model features a 1 million token context window, excelling in video understanding and needle-in-haystack tasks. Discord communities like TheBloke and LM Studio discuss the mixed reception of Gemma models, anticipation for the Llama 3 release, challenges in dataset editing, and hardware considerations such as NVIDIA GeForce RTX 3090 and RTX 4090 GPUs. LM Studio users report issues with version 0.2.15 Beta and ongoing integration of Gemma models, with resources shared on Hugging Face.