All tags
Person: "dair_ai"
not much happened today
kernelllm-8b gpt-4o deepseek-v3 mistral-medium-3 qwen3 blip3-o xgen-small anisora stable-audio-open-small alphaevolve meta-ai-fair mistral-ai qwen deepseek salesforce bilibili stability-ai google benchmarking model-performance multilinguality hardware-optimization multimodality image-generation video-generation text-to-audio model-parallelism chain-of-thought instruction-following reasoning mitigation-strategies reach_vb lmarena_ai theadimeline adcock_brett jxmnop dair_ai omarsar0
Meta released KernelLLM 8B, outperforming GPT-4o and DeepSeek V3 on KernelBench-Triton Level 1. Mistral Medium 3 debuted strongly in multiple benchmarks. Qwen3 models introduced a unified framework with multilingual support. DeepSeek-V3 features hardware-aware co-design. BLIP3-o family released for multimodal tasks using diffusion transformers. Salesforce launched xGen-Small models excelling in long-context and math benchmarks. Bilibili released AniSORA for anime video generation. Stability AI open-sourced Stable Audio Open Small optimized for Arm devices. Google’s AlphaEvolve coding agent improved Strassen's algorithm for the first time since 1969. Research shows chain-of-thought reasoning can harm instruction-following ability, with mitigation strategies like classifier-selective reasoning being most effective, but reasoning techniques show high variance and limited generalization. "Chain-of-thought (CoT) reasoning can harm a model’s ability to follow instructions" and "Mitigation strategies such as few-shot in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning can counteract reasoning-induced failures".
not much happened today
aya-vision-8b aya-vision-32b llama-3-2-90b-vision molmo-72b phi-4-mini phi-4-multimodal cogview4 wan-2-1 weights-and-biases coreweave cohereforai microsoft alibaba google llamaindex weaviate multilinguality vision multimodality image-generation video-generation model-releases benchmarking funding agentic-ai model-performance mervenoyann reach_vb jayalammar sarahookr aidangomez nickfrosst dair_ai akhaliq bobvanluijt jerryjliu0
Weights and Biases announced a $1.7 billion acquisition by CoreWeave ahead of CoreWeave's IPO. CohereForAI released the Aya Vision models (8B and 32B parameters) supporting 23 languages, outperforming larger models like Llama-3.2 90B Vision and Molmo 72B. Microsoft introduced Phi-4-Mini (3.8B parameters) and Phi-4-Multimodal models, excelling in math, coding, and multimodal benchmarks. CogView4, a 6B parameter text-to-image model with 2048x2048 resolution and Apache 2.0 license, was released. Alibaba launched Wan 2.1, an open-source video generation model with 720p output and 16 fps generation. Google announced new AI features for Pixel devices including Scam Detection and Gemini integrations. LlamaCloud reached General Availability and raised $19M Series A funding, serving over 100 Fortune 500 companies. Weaviate launched the Query Agent, the first of three Weaviate Agents.
OpenAI beats Anthropic to releasing Speculative Decoding
claude-3-sonnet mrt5 openai anthropic nvidia microsoft boston-dynamics meta-ai-fair runway elevenlabs etched osmo physical-intelligence langchain speculative-decoding prompt-lookup cpu-inference multimodality retrieval-augmented-generation neural-networks optimization ai-safety governance model-architecture inference-economics content-generation adcock_brett vikhyatk dair_ai rasbt bindureddy teortaxestex svpino c_valenzuelab davidsholz
Prompt lookup and Speculative Decoding techniques are gaining traction with implementations from Cursor, Fireworks, and teased features from Anthropic. OpenAI has introduced faster response times and file edits with these methods, offering about 50% efficiency improvements. The community is actively exploring AI engineering use cases with these advancements. Recent updates highlight progress from companies like NVIDIA, OpenAI, Anthropic, Microsoft, Boston Dynamics, and Meta. Key technical insights include CPU inference capabilities, multimodal retrieval-augmented generation (RAG), and neural network fundamentals. New AI products include fully AI-generated games and advanced content generation tools. Challenges in AI research labs such as bureaucracy and resource allocation were also discussed, alongside AI safety and governance concerns.
Microsoft AgentInstruct + Orca 3
mistral-7b orca-2.5 microsoft-research apple tencent hugging-face synthetic-data fine-tuning instruction-following transformers model-performance hallucination-detection dataset-quality flashattention mixture-of-experts philschmid sama bindureddy rohanpaul_ai zachtratar dair_ai
Microsoft Research released AgentInstruct, the third paper in its Orca series, introducing a generative teaching pipeline that produces 25.8 million synthetic instructions to fine-tune mistral-7b, achieving significant performance gains: +40% AGIEval, +19% MMLU, +54% GSM8K, +38% BBH, +45% AlpacaEval, and a 31.34% reduction in hallucinations. This synthetic data approach follows the success of FineWeb and Apple's Rephrasing research in improving dataset quality. Additionally, Tencent claims to have generated 1 billion diverse personas for synthetic data. On AI Twitter, notable discussions included a shooting incident at a Trump rally and recent ML research highlights such as FlashAttention-3, RankRAG, and Mixture of A Million Experts.
Gemini Nano: 50-90% of Gemini Pro, <100ms inference, on device, in Chrome Canary
gemini-nano gemini-pro claude-3.5-sonnet gpt-4o deepseek-coder-v2 glm-0520 nemotron-4-340b gpt-4-turbo-0409 google gemini huggingface anthropic deepseek zhipu-ai tsinghua nvidia model-quantization prompt-api optimization model-weights benchmarking code-generation math synthetic-data automatic-differentiation retrieval-augmented-generation mitigating-memorization tree-search inference-time-algorithms adcock_brett dair_ai lmsysorg
The latest Chrome Canary now includes a feature flag for Gemini Nano, offering a prompt API and on-device optimization guide, with models Nano 1 and 2 at 1.8B and 3.25B parameters respectively, showing decent performance relative to Gemini Pro. The base and instruct-tuned model weights have been extracted and posted to HuggingFace. In AI model releases, Anthropic launched Claude 3.5 Sonnet, which outperforms GPT-4o on some benchmarks, is twice as fast as Opus, and is free to try. DeepSeek-Coder-V2 achieves 90.2% on HumanEval and 75.7% on MATH, surpassing GPT-4-Turbo-0409, with models up to 236B parameters and 128K context length. GLM-0520 from Zhipu AI/Tsinghua ranks highly in coding and overall benchmarks. NVIDIA announced Nemotron-4 340B, an open model family for synthetic data generation. Research highlights include TextGrad, a framework for automatic differentiation on textual feedback; PlanRAG, an iterative plan-then-RAG decision-making technique; a paper on goldfish loss to mitigate memorization in LLMs; and a tree search algorithm for language model agents.