All tags
Topic: "vision"
not much happened today
hunyuan-turbos qwen3-235b-a22b o3 gpt-4.1-nano grok-3 gemini-2.5-pro seed1.5-vl kling-2.0 tencent openai bytedance meta-ai-fair nvidia deepseek benchmarking model-performance moe reasoning vision video-understanding vision-language multimodality model-evaluation model-optimization lmarena_ai artificialanlys gdb _jasonwei iScienceLuvr _akhaliq _philschmid teortaxesTex mervenoyann reach_vb
Tencent's Hunyuan-Turbos has risen to #8 on the LMArena leaderboard, showing strong performance across major categories and significant improvement since February. The Qwen3 model family, especially the Qwen3 235B-A22B (Reasoning) model, is noted for its intelligence and efficient parameter usage. OpenAI introduced HealthBench, a new health evaluation benchmark developed with input from over 250 physicians, where models like o3, GPT-4.1 nano, and Grok 3 showed strong results. ByteDance released Seed1.5-VL, a vision-language model with a 532M-parameter vision encoder and a 20B active parameter MoE LLM, achieving state-of-the-art results on 38 public benchmarks. In vision-language, Kling 2.0 leads image-to-video generation, and Gemini 2.5 Pro excels in video understanding with advanced multimodal capabilities. Meta's Vision-Language-Action framework and updates on VLMs for 2025 were also highlighted.
not much happened today
gemini-2.5-flash gemini-2.0-flash mistral-medium-3 llama-4-maverick claude-3.7-sonnet qwen3 pangu-ultra-moe deepseek-r1 o4-mini x-reasoner google-deepmind mistral-ai alibaba huawei openai microsoft deepseek model-performance reasoning cost-analysis reinforcement-learning chain-of-thought multilinguality code-search model-training vision model-integration giffmana artificialanlys teortaxestex akhaliq john__allard
Gemini 2.5 Flash shows a 12 point increase in the Artificial Analysis Intelligence Index but costs 150x more than Gemini 2.0 Flash due to 9x more expensive output tokens and 17x higher token usage during reasoning. Mistral Medium 3 competes with Llama 4 Maverick, Gemini 2.0 Flash, and Claude 3.7 Sonnet with better coding and math reasoning at a significantly lower price. Alibaba's Qwen3 family supports reasoning and multilingual tasks across 119 languages and includes a Web Dev tool for app building. Huawei's Pangu Ultra MoE matches DeepSeek R1 performance on Ascend NPUs, with new compute and upcoming V4 training. OpenAI's o4-mini now supports Reinforcement Fine-Tuning (RFT) using chain-of-thought reasoning. Microsoft's X-REASONER enables generalizable reasoning across modalities post-trained on general-domain text. Deep research integration with GitHub repos in ChatGPT enhances codebase search and reporting. The AI Engineer World's Fair offers an Early Bird discount for upcoming tickets.
not much happened today
open-code-reasoning-32b open-code-reasoning-14b open-code-reasoning-7b mistral-medium-3 llama-4-maverick gemini-2.5-pro gemini-2.5-flash claude-3.7-sonnet absolute-zero-reasoner x-reasoner fastvlm parakeet-asr openai nvidia mistral-ai google apple huggingface reinforcement-learning fine-tuning code-generation reasoning vision on-device-ai model-performance dataset-release model-optimization reach_vb artificialanlys scaling01 iscienceluvr arankomatsuzaki awnihannun risingsayak
OpenAI launched both Reinforcement Finetuning and Deep Research on GitHub repos, drawing comparisons to Cognition's DeepWiki. Nvidia open-sourced Open Code Reasoning models (32B, 14B, 7B) with Apache 2.0 license, showing 30% better token efficiency and compatibility with llama.cpp, vLLM, transformers, and TGI. Independent evaluations highlight Mistral Medium 3 rivaling Llama 4 Maverick, Gemini 2.0 Flash, and Claude 3.7 Sonnet in coding and math reasoning, priced significantly lower but no longer open-source. Google's Gemini 2.5 Pro is noted as their most intelligent model with improved coding from simple prompts, while Gemini 2.5 Flash incurs a 150x cost increase over Gemini 2.0 Flash due to higher token usage and cost. The Absolute Zero Reasoner (AZR) achieves SOTA performance in coding and math reasoning via reinforced self-play without external data. Vision-language model X-REASONER is post-trained on general-domain text for reasoning. Apple ML research released FastVLM with on-device iPhone demo. HiDream LoRA trainer supports QLoRA fine-tuning under memory constraints. Nvidia's Parakeet ASR model tops Hugging Face ASR leaderboard with MLX implementation. New datasets SwallowCode and SwallowMath boost LLM performance in math and code. Overall, a quiet day with significant model releases and performance insights.
Cognition's DeepWiki, a free encyclopedia of all GitHub repos
o4-mini perception-encoder qwen-2.5-vl dia-1.6b grok-3 gemini-2.5-pro claude-3.7 gpt-4.1 cognition meta-ai-fair alibaba hugging-face openai perplexity-ai vllm vision text-to-speech reinforcement-learning ocr model-releases model-integration open-source frameworks chatbots model-selector silas-alberti mervenoyann reach_vb aravsrinivas vikparuchuri lioronai
Silas Alberti of Cognition announced DeepWiki, a free encyclopedia of all GitHub repos providing Wikipedia-like descriptions and Devin-backed chatbots for public repos. Meta released Perception Encoders (PE) with A2.0 license, outperforming InternVL3 and Qwen2.5VL on vision tasks. Alibaba launched the Qwen Chat App for iOS and Android. Hugging Face integrated the Dia 1.6B SoTA text-to-speech model via FAL. OpenAI expanded deep research usage with a lightweight version powered by o4-mini model, now available to free users. Perplexity AI updated their model selector with Grok 3 Beta, o4-mini, and support for models like gemini 2.5 pro, claude 3.7, and gpt-4.1. vLLM project introduced OpenRLHF framework for reinforcement learning with human feedback. Surya OCR alpha model supports 90+ languages and LaTeX. MegaParse open-source library was introduced for LLM-ready data formats.
OpenAI o3, o4-mini, and Codex CLI
o3 o4-mini gemini-2.5-pro claude-3-sonnet chatgpt openai reinforcement-learning performance vision tool-use open-source coding-agents model-benchmarking multimodality scaling inference sama aidan_mclau markchen90 gdb aidan_clark_ kevinweil swyx polynoamial scaling01
OpenAI launched the o3 and o4-mini models, emphasizing improvements in reinforcement-learning scaling and overall efficiency, making o4-mini cheaper and better across prioritized metrics. These models showcase enhanced vision and tool use capabilities, though API access for these features is pending. The release includes Codex CLI, an open-source coding agent that integrates with these models to convert natural language into working code. Accessibility extends to ChatGPT Plus, Pro, and Team users, with o3 being notably more expensive than Gemini 2.5 Pro. Performance benchmarks highlight the intelligence gains from scaling inference, with comparisons against models like Sonnet and Gemini. The launch has been well received despite some less favorable evaluation results.
not much happened today
grok-3 grok-3-mini gpt-4.5 claude-3.7-sonnet quasar-alpha optimus-alpha gpt-4.1 kaleidoscope internvl3 internvit qwen2.5vl transmamba fantasytalking openai alibaba cmu reinforcement-learning reasoning benchmarks vision multilinguality multimodality transformers attention-mechanisms agents code-generation model-performance rasbt sarahookr mervenoyann gneubig svpino mathemagic1an
The AI news recap highlights independent evaluations showing Grok-3 outperforming models like GPT-4.5 and Claude 3.7 Sonnet on reasoning benchmarks, while Grok-3 mini excels in reasoning tasks. Research on reinforcement learning (RL) fine-tuning reveals potential improvements for small reasoning models but also notes instability in reported gains. Benchmark results suggest Quasar Alpha and Optimus Alpha may be versions of GPT-4.1. Vision and multimodal models like Kaleidoscope, supporting 18 languages, and InternVL3, built on InternViT and Qwen2.5VL, demonstrate advances in multilingual vision and reasoning. The fusion model TransMamba combines transformer precision with speed via SSM mechanisms. Alibaba's FantasyTalking generates realistic talking portraits. Agent-focused events at CMU and tools like FilmAgent AI for virtual film production and BrowseComp benchmark for browsing agents were announced. The coding assistant Augment supports multiple IDEs with code analysis and suggestions. Discussions also covered Google’s new agent-to-agent protocol concept.
Google's Agent2Agent Protocol (A2A)
kimi-vl-a3b gpt-4o llama-4-scout llama-4-maverick llama-4-behemoth deepcoder-14b o3-mini o1 llama-3.1-nemotron-ultra-253b deepseek-r1 google google-deepmind moonshot-ai meta-ai-fair uc-berkeley openai nvidia hugging-face togethercompute deepseek agent-interoperability multimodality vision math reinforcement-learning coding model-training open-source model-benchmarking context-windows streaming push-notifications enterprise-authentication model-release reach_vb _akhaliq epochairesearch artificialanlys winglian danielhanchen yuchenj_uw jeremyphoward
Google Cloud Next announcements featured the launch of Google and DeepMind's full MCP support and a new Agent to Agent protocol designed for agent interoperability with multiple partners. The protocol includes components like the Agent Card, Task communication channels, Enterprise Auth and Observability, and Streaming and Push Notification support. On the model front, Moonshot AI released Kimi-VL-A3B, a multimodal model with 128K context and strong vision and math benchmark performance, outperforming gpt-4o. Meta AI introduced smaller versions of llama-4 family models: llama-4-scout and llama-4-maverick, with a larger Behemoth model still in training. DeepCoder 14B from UC Berkeley is an open-source coding model rivaling openai's o3-mini and o1 models, trained with reinforcement learning on 24K coding problems. Nvidia released llama-3.1-nemotron-ultra-253b on Hugging Face, noted for beating llama-4-behemoth and maverick and competing with deepseek-r1.
not much happened today
gemini-2.0-flash-thinking command-a qwq-32b gemma-3-27b gemma-3 shieldgemma-2 llama-3-70b deepseek-r1 o1-mini deepseek-v3 google-deepmind cohere meta-ai-fair alibaba hugging-face model-updates model-performance benchmarking reinforcement-learning transformers normalization-layers image-generation vision memory-efficiency context-windows fine-tuning yann-lecun
Google DeepMind announced updates to Gemini 2.0, including an upgraded Flash Thinking model with stronger reasoning and native image generation capabilities. Cohere launched Command A, a 111B parameter dense model with a 256K context window and competitive pricing, available on Hugging Face. Meta AI proposed Dynamic Tanh (DyT) as a replacement for normalization layers in Transformers, supported by Yann LeCun. Alibaba released QwQ-32B, a 32.5B parameter model excelling in math and coding, fine-tuned with reinforcement learning and freely available under Apache 2.0 license. Google DeepMind also released Gemma 3 models ranging from 1B to 27B parameters with a 128K token context window and over 140 language support, plus ShieldGemma 2, an image safety checker. Benchmarking shows Gemma 3 27B has strong vision and memory efficiency but is outperformed by larger models like Llama 3.3 70B and DeepSeek V3 671B. The Hugging Face LLM leaderboard history was shared by @_lewtun.
Gemma 3 beats DeepSeek V3 in Elo, 2.0 Flash beats GPT4o with Native Image Gen
gemma-3 gemini-1.5-pro gemini-2 o1-preview o3-mini-high deepseek-v3 claude-3.7-sonnet qwen-2.5-max google-deepmind openai multimodality multilinguality context-window quantization image-generation model-benchmarking model-performance vision reach_vb _philschmid danielhanchen lmarena_ai osanseviero
Google DeepMind launched the Gemma 3 family of models featuring a 128k context window, multimodal input (image and video), and multilingual support for 140+ languages. The Gemma 3-27B model ranks among the top open models on LMArena benchmarks, outperforming several competitors and matching Gemini-1.5-Pro on benchmarks. Additionally, Gemini 2 introduced Flash Native Image Generation with advanced image editing capabilities, a feature teased by OpenAI but not launched. The updates highlight significant advances in context length, multimodality, and model efficiency via quantization.
not much happened today
aya-vision-8b aya-vision-32b llama-3-2-90b-vision molmo-72b phi-4-mini phi-4-multimodal cogview4 wan-2-1 weights-and-biases coreweave cohereforai microsoft alibaba google llamaindex weaviate multilinguality vision multimodality image-generation video-generation model-releases benchmarking funding agentic-ai model-performance mervenoyann reach_vb jayalammar sarahookr aidangomez nickfrosst dair_ai akhaliq bobvanluijt jerryjliu0
Weights and Biases announced a $1.7 billion acquisition by CoreWeave ahead of CoreWeave's IPO. CohereForAI released the Aya Vision models (8B and 32B parameters) supporting 23 languages, outperforming larger models like Llama-3.2 90B Vision and Molmo 72B. Microsoft introduced Phi-4-Mini (3.8B parameters) and Phi-4-Multimodal models, excelling in math, coding, and multimodal benchmarks. CogView4, a 6B parameter text-to-image model with 2048x2048 resolution and Apache 2.0 license, was released. Alibaba launched Wan 2.1, an open-source video generation model with 720p output and 16 fps generation. Google announced new AI features for Pixel devices including Scam Detection and Gemini integrations. LlamaCloud reached General Availability and raised $19M Series A funding, serving over 100 Fortune 500 companies. Weaviate launched the Query Agent, the first of three Weaviate Agents.
The Ultra-Scale Playbook: Training LLMs on GPU Clusters
deepseek-native-sparse-attention r1-1776 paligemma-2-mix muse baichuan-m1-14b stripedhyena-2 huggingface deepseek perplexity-ai google-deepmind microsoft baichuan stripedhyena gpu-training scaling multimodality vision model-training foundation-models medical-llm genome-modeling robotic-manipulation interactive-content eliebakouch nouamanetazi lvwerra thom-wolf proftomyeh alex-wang aravsrinivas _akhaliq _philschmid mervenoyann reach_vb arankomatsuzaki maximelabonne
Huggingface released "The Ultra-Scale Playbook: Training LLMs on GPU Clusters," an interactive blogpost based on 4000 scaling experiments on up to 512 GPUs, providing detailed insights into modern GPU training strategies. DeepSeek introduced the Native Sparse Attention (NSA) model, gaining significant community attention, while Perplexity AI launched R1-1776, an uncensored and unbiased version of DeepSeek's R1 model. Google DeepMind unveiled PaliGemma 2 Mix, a multi-task vision-language model available in 3B, 10B, and 28B sizes. Microsoft introduced Muse, a generative AI model trained on the game Bleeding Edge, and presented Magma, a foundation model for multimodal AI agents excelling in UI navigation and robotic manipulation. Baichuan-M1-14B was announced as a state-of-the-art medical LLM trained on 20T tokens, and a fully open-source 40B genome modeling model using StripedHyena 2 architecture was also released. "Making your own gaming experience is coming sooner than you'd think," noted in relation to Muse.
not much happened today
gemini-2.0-flash-thinking-experimental-1-21 zonos openr1-math-220k huginn-3.5b deepseek-r1 o1 claude google zyphraai hugging-face anthropic deepseek openai vision multilingual-models text-to-speech voice-cloning math reasoning latent-reasoning chain-of-thought dataset-release fine-tuning model-training model-performance context-windows benchmarking jeremyphoward andrej-karpathy tom-goldstein reach_vb iscienceluvr
Google released Gemini 2.0 Flash Thinking Experimental 1-21, a vision-language reasoning model with a 1 million-token context window and improved accuracy on science, math, and multimedia benchmarks, surpassing DeepSeek-R1 but trailing OpenAI's o1. ZyphraAI launched Zonos, a multilingual Text-to-Speech model with instant voice cloning and controls for speaking rate, pitch, and emotions, running at ~2x real-time speed on RTX 4090. Hugging Face released OpenR1-Math-220k, a large-scale math reasoning dataset with 220K problems and 800K reasoning traces generated on 512 H100 GPUs. Tom Goldstein introduced Huginn-3.5B, an open-source latent reasoning model trained on 800B tokens that outperforms larger models on reasoning tasks like GSM8K. Discussions by Jeremy Howard and iScienceLuvr highlight advances in implicit latent reasoning and debate the future of human-readable reasoning traces. Anthropic launched the Anthropic Economic Index to analyze AI's economic impact using millions of Claude conversations.
s1: Simple test-time scaling (and Kyutai Hibiki)
qwen-2.5-32b gemini-2.0-flash smollm2 granite-vision-3.1-2b google-deepmind qwen gemini hugging-face ibm deepseek reasoning fine-tuning scaling-laws open-source-models data-centric-training vision multilingual-models language-model-reasoning niklas-muennighoff
"Wait" is all you need introduces a novel reasoning model finetuned from Qwen 2.5 32B using just 1000 questions with reasoning traces distilled from Gemini 2.0 Flash Thinking, enabling controllable test-time compute by appending "Wait" to extend reasoning. Lead author Niklas Muennighoff, known for work on Bloom, StarCoder, and BIG-bench, highlights this method's efficiency and its reproduction of the famous o1 scaling chart. Additionally, Kyutai Moshi's Hibiki project demonstrates impressive offline French-English live translation on iPhone. Recent AI model releases include DeepSeek R1 and R3 open source models, potentially marking a major open-source milestone, Hugging Face's SmolLM2 emphasizing data-centric training for small LMs, and IBM's Granite-Vision-3.1-2B, a small vision-language model with strong performance. Key research papers spotlight LIMO for minimal demonstration reasoning achieving high accuracy on AIME and MATH benchmarks, and Token-Assisted Reasoning mixing latent and text tokens to improve language model reasoning.
DeepSeek #1 on US App Store, Nvidia stock tanks -17%
deepseek-r1 deepseek-v3 qwen2.5-vl o1 deepseek openai nvidia langchain moe-architecture chain-of-thought fp8-precision multimodality vision agentic-ai inference-scaling gpu-optimization model-efficiency ai-chatbots memory-integration tool-use stock-market-reactions sama mervenoyann omarasar0 teortaxestex nptacek carpeetti finbarrtimbers cwolferesearch arthurrapier danhendrycks scaling01 janusflow
DeepSeek has made a significant cultural impact by hitting mainstream news unexpectedly in 2025. The DeepSeek-R1 model features a massive 671B parameter MoE architecture and demonstrates chain-of-thought (CoT) capabilities comparable to OpenAI's o1 at a lower cost. The DeepSeek V3 model trains a 236B parameter model 42% faster than its predecessor using fp8 precision. The Qwen2.5 multimodal models support images and videos with sizes ranging from 3B to 72B parameters, featuring strong vision and agentic capabilities. LangChain and LangGraph integration enable AI chatbots with memory and tool use, including applications like the DeFi Agent. Discussions highlight NVIDIA's role in hardware acceleration, with concerns about stock drops due to DeepSeek's efficiency and market fears. The compute demand is expected to rise despite efficiency gains, driven by inference scaling and MoE design improvements.
Titans: Learning to Memorize at Test Time
minimax-01 gpt-4o claude-3.5-sonnet internlm3-8b-instruct transformer2 google meta-ai-fair openai anthropic langchain long-context mixture-of-experts self-adaptive-models prompt-injection agent-authentication diffusion-models zero-trust-architecture continuous-adaptation vision agentic-systems omarsar0 hwchase17 abacaj hardmaru rez0__ bindureddy akhaliq saranormous
Google released a new paper on "Neural Memory" integrating persistent memory directly into transformer architectures at test time, showing promising long-context utilization. MiniMax-01 by @omarsar0 features a 4 million token context window with 456B parameters and 32 experts, outperforming GPT-4o and Claude-3.5-Sonnet. InternLM3-8B-Instruct is an open-source model trained on 4 trillion tokens with state-of-the-art results. Transformer² introduces self-adaptive LLMs that dynamically adjust weights for continuous adaptation. Advances in AI security highlight the need for agent authentication, prompt injection defenses, and zero-trust architectures. Tools like Micro Diffusion enable budget-friendly diffusion model training, while LeagueGraph and Agent Recipes support open-source social media agents.
Moondream 2025.1.9: Structured Text, Enhanced OCR, Gaze Detection in a 2B Model
o1 vdr-2b-multi-v1 llava-mini openai llamaindex langchainai qdrant genmoai vision model-efficiency structured-output gaze-detection reasoning model-distillation multimodality embedding-models gan diffusion-models self-attention training-optimizations development-frameworks api cross-language-deployment semantic-search agentic-document-processing developer-experience philschmid saranormous jxmnop reach_vb iscienceluvr multimodalart arohan adcock_brett awnihannun russelljkaplan ajayj_
Moondream has released a new version that advances VRAM efficiency and adds structured output and gaze detection, marking a new frontier in vision model practicality. Discussions on Twitter highlighted advancements in reasoning models like OpenAI's o1, model distillation techniques, and new multimodal embedding models such as vdr-2b-multi-v1 and LLaVA-Mini, which significantly reduce computational costs. Research on GANs and decentralized diffusion models showed improved stability and performance. Development tools like MLX and vLLM received updates for better portability and developer experience, while frameworks like LangChain and Qdrant enable intelligent data workflows. Company updates include new roles and team expansions at GenmoAI. "Efficiency tricks are all you need."
not much happened today
rstar-math o1-preview qwen2.5-plus qwen2.5-coder-32b-instruct phi-4 claude-3.5-sonnet openai anthropic alibaba microsoft cohere langchain weights-biases deepseek rakuten rbc amd johns-hopkins math process-reward-model mcts vision reasoning synthetic-data pretraining rag automation private-deployment multi-step-workflow open-source-dataset text-embeddings image-segmentation chain-of-thought multimodal-reasoning finetuning recursive-self-improvement collaborative-platforms ai-development partnerships cuda triton ai-efficiency ai-assisted-coding reach_vb rasbt akshaykagrawal arankomatsuzaki teortaxestex aidangomez andrewyng
rStar-Math surpasses OpenAI's o1-preview in math reasoning with 90.0% accuracy using a 7B LLM and MCTS with a Process Reward Model. Alibaba launches Qwen Chat featuring Qwen2.5-Plus and Qwen2.5-Coder-32B-Instruct models enhancing vision-language and reasoning. Microsoft releases Phi-4, trained on 40% synthetic data with improved pretraining. Cohere introduces North, a secure AI workspace integrating LLMs, RAG, and automation for private deployments. LangChain showcases a company research agent with multi-step workflows and open-source datasets. Transformers.js demos released for text embeddings and image segmentation in JavaScript. Research highlights include Meta Meta-CoT for enhanced chain-of-thought reasoning, DeepSeek V3 with recursive self-improvement, and collaborative AI development platforms. Industry partnerships include Rakuten with LangChain, North with RBC supporting 90,000 employees, and Agent Laboratory collaborating with AMD and Johns Hopkins. Technical discussions emphasize CUDA and Triton for AI efficiency and evolving AI-assisted coding stacks by Andrew Ng.
not much happened today
phi-4 reinforce++ arc-agi-2 ai21-labs ollama langchain togethercompute groq reinforcement-learning ppo model-optimization memory-efficiency python-packages vision text-extraction frontend-code-generation workflow-automation coding-agents compute-cost-reduction ethical-ai agi-benchmarks scam-alerts sebastien-bubeck fchollet tom-doerr arohan_ bindureddy hwchase17 jonathanross321 clementdelangue vikhyatk
Sebastien Bubeck introduced REINFORCE++, enhancing classical REINFORCE with PPO-inspired techniques for 30% faster training. AI21 Labs released Phi-4 under the MIT License, accessible via Ollama. François Chollet announced plans for ARC-AGI-2 and a next-generation AGI benchmark. LangChain launched 10 new integration packages to boost LLM application development. Tom Doerr introduced Ollama-OCR, a Python package for text extraction using vision language models. Arohan optimized Shampoo for memory efficiency, reducing usage from 20 to 6 bytes per parameter. Bindu Reddy showcased CodeLLM's v1 for frontend code generation and highlighted LlamaIndex Workflows for academic summarization and slide generation. Hwchase17 collaborated with Together Compute to enhance WebDev Arena with complex coding agents for LLM coding evaluations. Jonathan Ross detailed Groq's mission to reduce compute costs by 1000x amid rising generative AI spending. Clement Delangue warned about scam alerts involving false claims of association with AI21. Vikhyat K raised concerns about the ethical implications and trade-offs of AGI. Memes and humor included creative AI prompts and critiques of LLM behaviors.
not much happened today
qwen-o1 qvq claude-3.5-sonnet gpt-4o o3 o3-mini alibaba openai mit idsia llamaindex ollama vision benchmarking llm-calibration intentionality alignment-faking deliberative-alignment artificial-life gdpr-compliance contract-review-agent app-creation synthetic-data post-transformers smol-models agents bret-taylor
The Qwen team launched QVQ, a vision-enabled version of their experimental QwQ o1 clone, benchmarking comparably to Claude 3.5 Sonnet. Discussions include Bret Taylor's insights on autonomous software development distinct from the Copilot era. The Latent Space LIVE! talks cover highlights of 2024 AI startups, vision, open models, post-transformers, synthetic data, smol models, and agents. Twitter recaps by Claude 3.5 Sonnet highlight proposals for benchmarks measuring LLM calibration and falsehood confidence, with QVQ outperforming GPT-4o and Claude Sonnet 3.5. AI alignment debates focus on intentionality and critiques of alignment faking in models like Claude. Updates from OpenAI include new o3 and o3-mini models and a deliberative alignment strategy. The ASAL project is a collaboration between MIT, OpenAI, and Swiss AI Lab IDSIA to automate artificial life discovery. Personal stories reveal frustrations with USCIS green card denials despite high qualifications. New tools like GeminiCoder enable rapid app creation, and a contract review agent using Reflex and Llama Index checks GDPR compliance. Holiday greetings and memes were also shared.
Genesis: Generative Physics Engine for Robotics (o1-mini version)
o1 o1-preview gpt-4o claude-3.5-sonnet gemini-2.0-pro llama-3-3b llama-3-70b openai google-deepmind meta-ai-fair hugging-face function-calling structured-outputs vision performance-benchmarks sdk webrtc reasoning math code-generation transformer-architecture model-training humanoid-robots search model-efficiency dataset-sharing aidan_mclau sundarpichai adcock_brett
OpenAI launched the o1 model API featuring function calling, structured outputs, vision support, and developer messages, achieving 60% fewer reasoning tokens than its preview. The model excels in math and code with a 0.76 LiveBench Coding score, outperforming Sonnet 3.5. Beta SDKs for Go and Java and WebRTC support with 60% lower prices were also released. Google Gemini 2.0 Pro (Gemini Exp 1206) deployment accelerated, showing improved coding, math, and reasoning performance. Meta AI FAIR introduced research on training transformers directly on raw bytes using dynamic entropy-based patching. Commercial humanoid robots were successfully deployed by an industry player. Hugging Face researchers demonstrated that their 3B Llama model can outperform the 70B Llama model on MATH-500 accuracy using search techniques, highlighting efficiency gains with smaller models. Concerns about reproducibility and domain-specific limitations were noted.
Genesis: Generative Physics Engine for Robotics (o1-2024-12-17)
o1 gemini-2.0-pro openai google carnegie-mellon-university universal-physics-engine robotics-simulation physics-simulation photo-realistic-rendering generative-data simulation-platform open-source function-calling vision performance-benchmarks sdk realtime-api zhou-xian aidan_mclau sundar-pichai
Genesis is a newly announced universal physics engine developed by a large-scale collaboration led by CMU PhD student Zhou Xian. It integrates multiple state-of-the-art physics solvers to simulate diverse materials and physical phenomena, targeting robotics applications with features like lightweight, ultra-fast simulation, photo-realistic rendering, and generative data capabilities. The engine is open source and designed for robotics simulation beyond just video generation. Additionally, OpenAI released the o1 model to API with advanced features like function calling and vision support, showing strong math and coding performance. Google teased updates on Gemini 2.0 Pro, accelerating deployment for advanced users.
o1 API, 4o/4o-mini in Realtime API + WebRTC, DPO Finetuning
o1-2024-12-17 o1 o1-pro 4o 4o-mini gemini-2-0-flash claude-3.5-sonnet claude-3.5 openai google google-deepmind function-calling structured-outputs vision reasoning webrtc realtime-api preference-tuning fine-tuning api model-performance aidan_mclau kevinweil simonw michpokrass morgymcg juberti
OpenAI launched the o1 API with enhanced features including vision inputs, function calling, structured outputs, and a new
reasoning_effort
parameter, achieving 60% fewer reasoning tokens on average. The o1 pro variant is confirmed as a distinct implementation coming soon. Improvements to the Realtime API with WebRTC integration offer easier usage, longer sessions (up to 30 minutes), and significantly reduced pricing (up to 10x cheaper with mini models). DPO Preference Tuning for fine-tuning is introduced, currently available for the 4o model. Additional updates include official Go and Java SDKs and OpenAI DevDay videos. The news also highlights discussions on Google Gemini 2.0 Flash model's performance reaching 83.6% accuracy. Meta BLT: Tokenizer-free, Byte-level LLM
byte-latent-transformer llama-3 phi-4 gpt-4o command-r7b meta-ai-fair llamaindex microsoft deepseek-ai openai cohere anthropic tokenization transformer-architecture model-efficiency benchmarking multimodality vision reinforcement-learning model-scaling jailbreaking model-optimization
Meta AI introduces the Byte Latent Transformer (BLT), a tokenizer-free architecture that dynamically forms byte patches for efficient compute allocation, outperforming Llama 3 on benchmarks including the CUTE benchmark. The model was trained on approximately 1 trillion tokens and features a three-block transformer design with local and global components. This approach challenges traditional tokenization and may enable new multimodal capabilities such as direct file interaction without retrieval-augmented generation. Additionally, Microsoft announced the Phi-4 14B parameter model achieving state-of-the-art results on STEM and reasoning benchmarks, surpassing GPT-4o. DeepSeek AI launched new vision-language models based on their MoE architecture with sizes ranging from 1.0B to 27B parameters. OpenAI released a new Projects feature for ChatGPT, and Cohere introduced their smallest and fastest Command R7B model. Anthropic published research on "Best-of-N Jailbreaking" vulnerabilities across text, vision, and audio models. Industry discussion highlights a trend of decreasing frontier LLM sizes, with GPT-4 at approximately 1.8 trillion parameters compared to newer models.
$200 ChatGPT Pro and o1-full/pro, with vision, without API, and mixed reviews
o1 o1-pro claude-3.5-sonnet pali-gemma-2 openai google llamaindex multimodality vision fine-tuning benchmarking model-performance image-generation document-processing model-release sama bindureddy mervenoyann fchollet
OpenAI launched the o1 model with multimodal capabilities, faster reasoning, and image input support, marking it as a state-of-the-art model despite some bugs and mixed community reviews. The new o1-pro tier offers unlimited access for $200/month with notable benchmark improvements but some performance trade-offs compared to claude-3.5-sonnet. Google released the PaliGemma 2 vision-language model family in sizes 3B, 10B, and 28B, excelling in visual question answering, image segmentation, and OCR, with day-0 support for fine-tuning. LlamaIndex announced discounts and feature updates for large-scale document processing. The AI community also reacted humorously to the new pricing tiers and model comparisons. "o1 can see now, which makes it the SOTA multimodal model" and "most users will be best served by free/Plus tiers" were notable sentiments.
not much happened today
o1-full sora gpt-4.5 gpt-4 claude-3.5-sonnet llama-3-1-nemotron-51b llama-3-1 llama-3 nemotron-51b openai google-deepmind anthropic nvidia huggingface vision model-performance neural-architecture-search model-optimization multimodality model-release model-training reinforcement-learning image-generation lucas-beyer alexander-kolesnikov xiaohua-zhai aidan_mclau giffmana joannejang sama
OpenAI announced their "12 Days of OpenAI" event with daily livestreams and potential releases including the O1 full model, Sora video model, and GPT-4.5. Google DeepMind released the GenCast weather model capable of 15-day forecasts in 8 minutes using TPU chips, and launched Genie 2, a model generating playable 3D worlds from single images. Leading vision researchers Lucas Beyer, Alexander Kolesnikov, and Xiaohua Zhai moved from DeepMind to OpenAI, which is opening a Zürich office. Criticism arose over OpenAI's strategy and model quality compared to Anthropic and Claude 3.5 Sonnet. On Reddit, a modified llama.cpp supports Nvidia's Llama-3_1-Nemotron-51B, matching performance of larger 70B models via NAS optimization.
OLMo 2 - new SOTA Fully Open LLM
llama-3-1-8b olmo-2 qwen2-5-72b-instruct smolvlm tulu-3 ai2 huggingface intel reinforcement-learning quantization learning-rate-annealing ocr fine-tuning model-training vision
AI2 has updated OLMo-2 to roughly Llama 3.1 8B equivalent, training with 5T tokens and using learning rate annealing and new high-quality data (Dolmino). They credit Tülu 3 and its "Reinforcement Learning with Verifiable Rewards" approach. On Reddit, Qwen2.5-72B instruct model shows near lossless performance with AutoRound 4-bit quantization, available on HuggingFace in 4-bit and 2-bit versions, with discussions on MMLU benchmark and quantization-aware training. HuggingFace released SmolVLM, a 2B parameter vision-language model running efficiently on consumer GPUs, supporting fine-tuning on Google Colab and demonstrating strong OCR capabilities with adjustable resolution and quantization options.
Vision Everywhere: Apple AIMv2 and Jina CLIP v2
aimv2-3b jina-clip-v2 tulu-3 llama-3-1 claude-3-5 llama-3-1-70b apple jina allen_ai autoregressive-objectives vision multilinguality multimodality image-generation model-training model-optimization reinforcement-learning fine-tuning model-benchmarking
Apple released AIMv2, a novel vision encoder pre-trained with autoregressive objectives that achieves 89.5% accuracy on ImageNet and integrates joint visual and textual objectives. Jina launched Jina CLIP v2, a multimodal embedding model supporting 89 languages and high-resolution images with efficient Matryoshka embeddings reducing dimensions by 94% with minimal accuracy loss. Allen AI introduced Tülu 3 models based on Llama 3.1 with 8B and 70B parameters, offering 2.5x faster inference and alignment via SFT, DPO, and RLVR methods, competing with Claude 3.5 and Llama 3.1 70B. These developments highlight advances in autoregressive training, vision encoders, and multilingual multimodal embeddings.
LMSys killed Model Versioning (gpt 4o 1120, gemini exp 1121)
gpt-4o-2024-11-20 gemini-exp-1121 deepseek-r1 openai google-deepmind anthropic deepseek mistral-ai model-release model-ranking open-source vision coding reasoning market-competition
AI News for 11/21/2024-11/22/2024 highlights the intense frontier lab race with OpenAI's gpt-4o-2024-11-20 and Google DeepMind's gemini-exp-1121 trading top spots on the Lmsys leaderboard. The trend of using date-based model identifiers instead of traditional versioning is noted across leading labs including Anthropic. DeepSeek R1 is gaining attention as a potent open-source alternative, especially in the context of the AI competition between China and the US. Gemini-Exp-1121 is praised for improvements in vision, coding, and reasoning, while MistralAI expands with a new Palo Alto office, signaling growth and hiring.
Pixtral Large (124B) beats Llama 3.2 90B with updated Mistral Large 24.11
pixtral-large mistral-large-24.11 llama-3-2 qwen2.5-7b-instruct-abliterated-v2-gguf qwen2.5-32b-q3_k_m vllm llama-cpp exllamav2 tabbyapi mistral-ai sambanova nvidia multimodality vision model-updates chatbots inference gpu-optimization quantization performance concurrency kv-cache arthur-mensch
Mistral has updated its Pixtral Large vision encoder to 1B parameters and released an update to the 123B parameter Mistral Large 24.11 model, though the update lacks major new features. Pixtral Large outperforms Llama 3.2 90B on multimodal benchmarks despite having a smaller vision adapter. Mistral's Le Chat chatbot received comprehensive feature updates, reflecting a company focus on product and research balance as noted by Arthur Mensch. SambaNova sponsors inference with their RDUs offering faster AI model processing than GPUs. On Reddit, vLLM shows strong concurrency performance on an RTX 3090 GPU, with quantization challenges noted in FP8 kv-cache but better results using llama.cpp with Q8 kv-cache. Users discuss performance trade-offs between vLLM, exllamav2, and TabbyAPI for different model sizes and batching strategies.
Claude 3.5 Sonnet (New) gets Computer Use
claude-3.5-sonnet claude-3.5-haiku llama-3.1 nemotron anthropic zep nvidia coding benchmarks computer-use vision multimodal-memory model-updates ai-integration philschmid swyx
Anthropic announced new Claude 3.5 models: 3.5 Sonnet and 3.5 Haiku, improving coding performance significantly, with Sonnet topping several coding benchmarks like Aider and Vectara. The new Computer Use API enables controlling computers via vision, scoring notably higher than other AI systems, showcasing progress in AI-driven computer interaction. Zep launched a cloud edition for AI agents memory management, highlighting challenges in multimodal memory. The update also mentions Llama 3.1 and Nemotron models from NVIDIA.
The AI Nobel Prize
claude-3.5-sonnet reka-flash got openai anthropic reka-ai zep artificial-neural-networks nobel-prize knowledge-graphs memory-layers real-time-voice-api vision fine-tuning prompt-caching multimodality function-calling ocr open-source single-sign-on software-testing ai-assisted-coding ai-ethics geoff-hinton john-hopfield philschmid alexalbert mervenoyann clementdelangue svpino bindureddy ylecun rohanpaul_ai
Geoff Hinton and John Hopfield won the Nobel Prize in Physics for their work on Artificial Neural Networks. The award citation spans 14 pages highlighting their contributions. Zep released a new community edition of their low-latency memory layer for AI agents, emphasizing knowledge graphs for memory. At OpenAI's DevDay, new features like real-time voice API, vision model fine-tuning, and prompt caching with a 50% discount on reused tokens were introduced. Anthropic's Claude 3.5 Sonnet was recognized as the best model currently. Reka AI Labs updated their Reka Flash model with enhanced multimodal and function calling capabilities. The GOT (Generic OCR Transformer) achieved 98.79% accuracy on OCR benchmarks. Discussions on open-source AI models highlighted their role in fostering competition and decentralization. Software development insights included the importance of Single Sign-On (SSO), thorough testing, and AI-assisted coding workflows. Ethical and societal topics covered critiques of tax policies and the appointment of France's first Minister of AI.
Llama 3.2: On-device 1B/3B, and Multimodal 11B/90B (with AI2 Molmo kicker)
llama-3-2 llama-3-1 claude-3-haiku gpt-4o-mini molmo-72b molmo-7b gemma-2 phi-3-5 llama-3-2-vision llama-3-2-3b llama-3-2-20b meta-ai-fair ai2 qualcomm mediatek arm ollama together-ai fireworks-ai weights-biases cohere weaviate multimodality vision context-windows quantization model-release tokenization model-performance model-optimization rag model-training instruction-following mira-murati daniel-han
Meta released Llama 3.2 with new multimodal versions including 3B and 20B vision adapters on a frozen Llama 3.1, showing competitive performance against Claude Haiku and GPT-4o-mini. AI2 launched multimodal Molmo 72B and 7B models outperforming Llama 3.2 in vision tasks. Meta also introduced new 128k-context 1B and 3B models competing with Gemma 2 and Phi 3.5, with collaborations hinted with Qualcomm, Mediatek, and Arm for on-device AI. The release includes a 9 trillion token count for Llama 1B and 3B. Partner launches include Ollama, Together AI offering free 11B model access, and Fireworks AI. Additionally, a new RAG++ course from Weights & Biases, Cohere, and Weaviate offers systematic evaluation and deployment guidance for retrieval-augmented generation systems based on extensive production experience.
Pixtral 12B: Mistral beats Llama to Multimodality
pixtral-12b mistral-nemo-12b llama-3-1-70b llama-3-1-8b deeps-eek-v2-5 gpt-4-turbo llama-3-1 strawberry claude mistral-ai meta-ai-fair hugging-face arcee-ai deepseek-ai openai anthropic vision multimodality ocr benchmarking model-release model-architecture model-performance fine-tuning model-deployment reasoning code-generation api access-control reach_vb devendra_chapilot _philschmid rohanpaul_ai
Mistral AI released Pixtral 12B, an open-weights vision-language model with a Mistral Nemo 12B text backbone and a 400M vision adapter, featuring a large vocabulary of 131,072 tokens and support for 1024x1024 pixel images. This release notably beat Meta AI in launching an open multimodal model. At the Mistral AI Summit, architecture details and benchmark performances were shared, showing strong OCR and screen understanding capabilities. Additionally, Arcee AI announced SuperNova, a distilled Llama 3.1 70B & 8B model outperforming Meta's Llama 3.1 70B instruct on benchmarks. DeepSeek released DeepSeek-V2.5, scoring 89 on HumanEval, surpassing GPT-4-Turbo, Opus, and Llama 3.1 in coding tasks. OpenAI plans to release Strawberry as part of ChatGPT soon, though its capabilities are debated. Anthropic introduced Workspaces for managing multiple Claude deployments with enhanced access controls.
AIPhone 16: the Visual Intelligence Phone
reflection-70b llama-3-70b qwen-2-72b llama-3-1-405b claude gpt-4 gemini apple openai weights-biases vision video-understanding benchmarking planning model-evaluation privacy ai-integration instruction-following yann-lecun
Apple announced the new iPhone 16 lineup featuring Visual Intelligence, a new AI capability integrated with Camera Control, Apple Maps, and Siri, emphasizing privacy and default service use over third-party AI like OpenAI. Apple Photos now includes advanced video understanding with timestamp recognition. Meanwhile, Reflection-70B claims to be a top open-source model but benchmarks show it performs close to Llama 3 70B and slightly worse than Qwen 2 72B. Yann LeCun highlighted ongoing challenges with LLM planning abilities, noting models like Llama-3.1-405b and Claude show some skill, while GPT-4 and Gemini lag behind. Weights & Biases is sponsoring an event to advance LLM evaluation techniques with prizes and API access.
CogVideoX: Zhipu's Open Source Sora
cogvideox llama-3-1 llama-3-405b moondream phi-3.5 llama-rank zhipu-ai alibaba meta-ai-fair google hugging-face nvidia togethercompute salesforce video-generation serverless-computing vision document-vqa text-vqa mixture-of-experts retrieval-augmented-generation long-context model-routing webgpu background-removal long-form-generation superposition-prompting rohanpaul_ai philschmid vikhyatk algo_diver jayalammar davidsholz
Zhipu AI, Alibaba's AI arm and China's 3rd largest AI lab, released the open 5B video generation model CogVIdeoX, which can run without GPUs via their ChatGLM web and desktop apps. Meta AI announced trust & safety research and CyberSecEval 3 alongside the release of Llama 3.1, with Llama 3 405B now available serverless on Google Cloud Vertex AI and Hugging Face x NVIDIA NIM API. Updates include Moondream, an open vision-language model improving DocVQA and TextVQA tasks, and the lightweight MoE chat model Phi-3.5 with 16x3.8B parameters. Together Compute introduced the Rerank API featuring Salesforce's LlamaRank model for document and code ranking. Research highlights include superposition prompting for RAG without fine-tuning, the AgentWrite pipeline for long-form content generation over 20,000 words, and a comparison showing Long Context methods outperform RAG at higher costs. Tools include Not Diamond, an AI model router, AI command line interfaces, and an open-source WebGPU background removal tool. "You don't even need GPUs to run it," referring to CogVIdeoX.
Ideogram 2 + Berkeley Function Calling Leaderboard V2
llama-3-70b gpt-4 phi-3.5 functionary-llama-3-70b llama-3 ideogram midjourney berkeley openai hugging-face microsoft meta-ai-fair baseten kai claude functionary function-calling benchmarking image-generation model-optimization vision multimodality model-performance fine-tuning context-windows cybersecurity code-analysis ai-assisted-development
Ideogram returns with a new image generation model featuring color palette control, a fully controllable API, and an iOS app, reaching a milestone of 1 billion images created. Meanwhile, Midjourney released a Web UI but still lacks an API. In function calling, the Berkeley Function Calling Leaderboard (BFCL) updated to BFCL V2 • Live, adding 2251 live, user-contributed function documentation and queries to improve evaluation quality. GPT-4 leads the leaderboard, but the open-source Functionary Llama 3-70B finetune from Kai surpasses Claude. On AI model releases, Microsoft launched three Phi-3.5 models with impressive reasoning and context window capabilities, while Meta AI FAIR introduced UniBench, a unified benchmark suite for over 50 vision-language model tasks. Baseten improved Llama 3 inference speed by up to 122% using Medusa. A new cybersecurity benchmark, Cyberbench, featuring 40 CTF tasks, was released. Additionally, Codegen was introduced as a tool for programmatic codebase analysis and AI-assisted development. "Multiple functions > parallel functions" was highlighted as a key insight in function calling.
not much happened today
grok-2 claude-3.5-sonnet claude-3.5 gpt-4 chatgpt-4o-latest anthropic x-ai google-deepmind openai mistral-ai meta-ai-fair salesforce box prompt-caching model-performance vision fine-tuning multilinguality ai-safety design-automation document-processing ai-agents ai-integration ai-job-market ai-acceleration humor demis-hassabis francois-chollet
Anthropic rolled out prompt caching in its API, reducing input costs by up to 90% and latency by 80%, enabling instant fine-tuning with longer prompts. xAI released Grok-2, a new model competing with frontier models from Google DeepMind, OpenAI, Anthropic, Mistral AI, and Meta AI Fair, supporting vision and text inputs and integrating external image generation models. Claude 3.5 Sonnet is reported to outperform GPT-4 in coding and reasoning, while ChatGPT-4o-latest shows reasoning improvements. François Chollet proposed a theory defining intelligence as the efficiency of operationalizing past information for future tasks. The Aya project involves 3000 collaborators building multilingual AI datasets. Demis Hassabis discussed AI hype and safe AI development in a podcast. Tools like Dora AI for Figma and Box's AI API enhance design automation and document processing. Salesforce released DEI, an open AI software engineering agents framework with a 55% resolve rate on SWE-Bench Lite. Industry trends highlight rapid AI integration, networking importance in the AI job market, and potential OpenAI GPT-4 expansion in response to competitors. Memes include humor about Apple Vision Pro.
not much happened today
llama-3 llama-3-1 grok-2 claude-3.5-sonnet gpt-4-turbo nous-research nvidia salesforce goodfire-ai anthropic x-ai google-deepmind box langchain fine-tuning prompt-caching mechanistic-interpretability model-performance multimodality agent-frameworks software-engineering-agents api document-processing text-generation model-releases vision image-generation efficiency scientific-discovery fchollet demis-hassabis
GPT-5 delayed again amid a quiet news day. Nous Research released Hermes 3 finetune of Llama 3 base models, rivaling FAIR's instruct tunes but sparking debate over emergent existential crisis behavior with 6% roleplay data. Nvidia introduced Minitron finetune of Llama 3.1. Salesforce launched a DEI agent scoring 55% on SWE-Bench Lite. Goodfire AI secured $7M seed funding for mechanistic interpretability work. Anthropic rolled out prompt caching in their API, cutting input costs by up to 90% and latency by 80%, aiding coding assistants and large document processing. xAI released Grok-2, matching Claude 3.5 Sonnet and GPT-4 Turbo on LMSYS leaderboard with vision+text inputs and image generation integration. Claude 3.5 Sonnet reportedly outperforms GPT-4 in coding and reasoning. François Chollet defined intelligence as efficient operationalization of past info for future tasks. Salesforce's DEI framework surpasses individual agent performance. Google DeepMind's Demis Hassabis discussed AGI's role in scientific discovery and safe AI development. Dora AI plugin generates landing pages in under 60 seconds, boosting web team efficiency. Box AI API beta enables document chat, data extraction, and content summarization. LangChain updated Python & JavaScript integration docs.
Too Cheap To Meter: AI prices cut 50-70% in last 30 days
gpt-4o gpt-4o-mini llama-3-1-405b mistral-large-2 gemini-1.5-flash deepseek-v2 sonnet-3.5 exaone-3.0 minicpm-v-2.6 claude-3.5 gpt-4o-2024-08-06 llamaindex together-ai deepinfra deepseek-ai mistral-ai google-deepmind lg-ai-research llamaindex llamaindex llamaindex price-cuts context-caching instruction-tuning vision benchmarks pytorch attention-mechanisms reinforcement-learning-from-human-feedback compute-optimal-scaling rohanpaul_ai akhaliq mervenoyann sophiamyang chhillee karpathy
Gemini 1.5 Flash has cut prices by approximately 70%, offering a highly competitive free tier of 1 million tokens per minute at $0.075/mtok, intensifying the AI model price war. Other significant price reductions include GPT-4o (~50% cut to $2.50/mtok), GPT-4o mini (70-98.5% cut to $0.15/mtok), Llama 3.1 405b (46% cut to $2.7/mtok), and Mistral Large 2 (62% cut to $3/mtok). Deepseek v2 introduced context caching, reducing input token costs by up to 90% to $0.014/mtok. New model releases include Llama 3.1 405b, Sonnet 3.5, EXAONE-3.0 (7.8B instruction-tuned by LG AI Research), and MiniCPM V 2.6 (vision-language model combining SigLIP 400M and Qwen2-7B). Benchmarks show Mistral Large performing well on ZebraLogic and Claude-3.5 leading LiveBench. FlexAttention, a new PyTorch API, simplifies and optimizes attention mechanisms. Andrej Karpathy analyzed RLHF, highlighting its limitations compared to traditional reinforcement learning. Google DeepMind research on compute-optimal scaling was also summarized.
not much happened today
gpt-4-0613 gpt-3.5-turbo-0613 gpt-4o-2024-08-06 mistral-large-2 gpt4-turbo claude-3-opus idefics3-llama bigllama-3.1-1t-instruct llama-3-120b-instruct openai mistral-ai meta-ai-fair structured-outputs function-calling json-schema benchmarking multimodality context-windows model-scaling ai-hardware vision speech-processing robotics ai-regulation sama rohanpaul_ai corbtt guillaumelample mervenoyann maximelabonne aidan_mclau adcock_brett ylecun
OpenAI introduced structured outputs in their API with a new "strict" mode and a "response_format" parameter, supporting models like gpt-4-0613, gpt-3.5-turbo-0613, and the new gpt-4o-2024-08-06. They also halved the price of gpt-4o to $2.50 per million tokens. Mistral Large 2 outperforms gpt4-turbo and claude-3-opus on hard benchmarks and coding tasks. Idefics3-Llama offers multimodal capabilities with a 10k token context window. BigLlama-3.1-1T-Instruct is an upscaled version of llama-3-120b-instruct. New benchmark "big_model_smell" measures creativity and reliability. Figure 02 robot features advanced AI hardware with onboard vision language model, enhanced battery, and speech-to-speech reasoning. Yann LeCun expressed concerns about California's SB1047 regulation.
GPT4o August + 100% Structured Outputs for All (GPT4o August edition)
gpt-4o-2024-08-06 llama-3-1-405b llama-3 claude-3.5-sonnet gemini-1.5-pro gpt-4o yi-large-turbo openai meta-ai-fair google-deepmind yi-large nvidia groq langchain jamai langsmith structured-output context-windows model-pricing benchmarking parameter-efficient-expert-retrieval retrieval-augmented-generation mixture-of-experts model-performance ai-hardware model-deployment filtering multi-lingual vision john-carmack jonathan-ross rohanpaul_ai
OpenAI released the new gpt-4o-2024-08-06 model with 16k context window and 33-50% lower pricing than the previous 4o-May version, featuring a new Structured Output API that improves output quality and reduces retry costs. Meta AI launched Llama 3.1, a 405-billion parameter model surpassing GPT-4 and Claude 3.5 Sonnet on benchmarks, alongside expanding the Llama Impact Grant program. Google DeepMind quietly released Gemini 1.5 Pro, outperforming GPT-4o, Claude-3.5, and Llama 3.1 on LMSYS benchmarks and leading the Vision Leaderboard. Yi-Large Turbo was introduced as a cost-effective upgrade priced at $0.19 per million tokens. In hardware, NVIDIA H100 GPUs were highlighted by John Carmack for their massive AI workload power, and Groq announced plans to deploy 108,000 LPUs by Q1 2025. New AI tools and techniques include RAG (Retrieval-Augmented Generation), the JamAI Base platform for Mixture of Agents systems, and LangSmith's enhanced filtering capabilities. Google DeepMind also introduced PEER (Parameter Efficient Expert Retrieval) architecture.
not much happened today
sam-2 gemini-1.5-pro chatgpt midjourney-v6.1 meta-ai-fair google-deepmind scale-ai apple canva hugging-face object-segmentation quantization web-development-framework adversarial-robustness on-device-ai open-source robotics voice vision jeremyphoward demis-hassabis ylecun maartengrootendorst jimfan
Meta released SAM 2, a unified model for real-time object segmentation with a new dataset 4.5x larger and 53x more annotated than previous ones. FastHTML, a new Python web framework by Jeremy Howard, enables easy creation and deployment of interactive web apps. Scale AI launched the SEAL Leaderboard on adversarial robustness, topped by Gemini 1.5 Pro from Google DeepMind. Apple published a technical report on their Intelligence Foundation Language Models for on-device and server use. Yann LeCun emphasized the importance of open source AI in an article co-authored with Martin Casado and Ion Stoica. Maarten Grootendorst's "Visual Guide to Quantization" on efficient LLM inference went viral. ChatGPT started rolling out advanced voice and vision-enabled modes to select users. Leonardo AI was acquired by Canva. Jim Fan shared insights on Project Groot augmenting human demonstration data for robotics. Midjourney v6.1 was released.
We Solved Hallucinations
gpt-2 flashattention-3 lynx meta-ai-fair nvidia princeton colfax patronus-ai databricks mosaic-ai openai compute-hardware gpu-optimization flashattention llm-evaluation hallucination-detection vision benchmarking synthetic-data model-training karpathy tri_dao giffmana vikhyatk dbrxmosaicai
Reddit's URL structure causes link errors in AI-generated summaries, especially with NSFW content affecting models like Claude and GPT-4. The team fixed this glitch while still leveraging LLMs for summarizing Reddit content. GPT-2 training costs have dramatically dropped to ~$672 using H100 GPUs and software improvements like CUDA and FlashAttention. FlashAttention-3 was released, achieving up to 740 TFLOPS on H100 GPUs, with FP8 nearing 1.2 PFLOPS, developed collaboratively by Meta, NVIDIA, Princeton, and Colfax. Hopper GPUs enable major speedups with new hardware features. Synthetic data may not improve vision tasks, as shown in recent research. The Avocado360 benchmark evaluates vision-language models' ability to detect avocados in images. Lynx, a hallucination detection model for LLMs, was introduced for real-world healthcare and fintech applications, trained by Patronus AI on Databricks Mosaic AI using Composer.
FlashAttention 3, PaliGemma, OpenAI's 5 Levels to Superintelligence
flashattention-3 paligemma-3b gemma-2b numinamath-7b deepseekmath-7b codellama-34b wizardcoder-python-34b-v1.0 chatgpt-3.5 openai together-ai google hugging-face deepseek code-llama attention-mechanisms fp8-training vision prefix-lm superintelligence fine-tuning chain-of-thought tool-integrated-reasoning self-consistency-decoding python coding-capabilities elo-ratings ilya-sutskever lucas-giffman
FlashAttention-3 introduces fast and accurate attention optimized for H100 GPUs, advancing native FP8 training. PaliGemma, a versatile 3B Vision-Language Model (VLM) combining a SigLIP-So400m ViT encoder with the Gemma-2B language model, emphasizes a prefix-LM architecture for improved image-query interaction. OpenAI reveals a framework on levels of superintelligence, signaling progress toward Level 2 and highlighting internal safety disagreements. On Reddit, NuminaMath 7B, fine-tuned from DeepSeekMath-7B, wins the AI Math Olympiad by solving 29 problems using iterative supervised fine-tuning and tool-integrated reasoning. Open-source LLMs like CodeLlama-34b and WizardCoder-Python-34B-V1.0 are closing the coding performance gap with closed models such as ChatGPT-3.5.
Qdrant's BM42: "Please don't trust us"
claude-3.5-sonnet gemma-2 nano-llava-1.5 qdrant cohere stripe anthropic hugging-face stablequan_ai semantic-search benchmarking dataset-quality model-evaluation model-optimization vision fine-tuning context-windows nils-reimers jeremyphoward hamelhusain rohanpaul_ai
Qdrant attempted to replace BM25 and SPLADE with a new method called "BM42" combining transformer attention and collection-wide statistics for semantic and keyword search, but their evaluation using the Quora dataset was flawed. Nils Reimers from Cohere reran BM42 on better datasets and found it underperformed. Qdrant acknowledged the errors but still ran a suboptimal BM25 implementation. This highlights the importance of dataset choice and evaluation sanity checks in search model claims. Additionally, Stripe faced criticism for AI/ML model failures causing account and payment issues, prompting calls for alternatives. Anthropic revealed that Claude 3.5 Sonnet suppresses some answer parts with backend tags, sparking debate. Gemma 2 model optimizations allow 2x faster fine-tuning with 63% less memory and longer context windows, running up to 34B parameters on consumer GPUs. nanoLLaVA-1.5 was announced as a compact 1B parameter vision model with significant improvements.
That GPT-4o Demo
gpt-4o gemma-2 meta-code-llama openai google-deepmind meta-ai-fair voice-generation ocr screen-sharing vision code-understanding model-customization efficiency textual-intelligence multimodal-agents sft distillation rlhf model-merging model-optimization safety romain-huet fchollet
Romain Huet demonstrated an unreleased version of GPT-4o on ChatGPT Desktop showcasing capabilities like low latency voice generation, whisper tone moderation, camera mode streaming video to GPT-4o, rapid OCR, screen sharing with ChatGPT for programming help, clipboard reading, and vision-based code conversation. OpenAI's four investment areas highlighted include textual intelligence, efficiency/cost, model customization, and multimodal agents. Google DeepMind released Gemma 2 models in 9B and 27B sizes trained on 8T and 13T tokens respectively, using SFT, distillation, RLHF, and model merging, optimized for TPUv5e with strong performance and safety measures. Meta AI announced the Meta LLM Compiler built on Meta Code Llama with enhanced code optimization and compiler features.
There's Ilya!
chameleon-7b chameleon-34b deepseek-coder-v2 gpt-4-turbo claude-3-opus voco-llama safe-superintelligence-inc openai anthropic meta deepseek google-deepmind parallel-decoding code-generation quantization training-dynamics vision benchmarks datasets image-captioning reasoning memory-optimization ilya-sutskever jan-leike ylecun akhaliq philschmid rohanpaul_ai mervenoyann fchollet
Ilya Sutskever has co-founded Safe Superintelligence Inc shortly after leaving OpenAI, while Jan Leike moved to Anthropic. Meta released new models including Chameleon 7B and 34B with mixed-modal input and unified token space quantization. DeepSeek-Coder-V2 shows code capabilities comparable to GPT-4 Turbo, supporting 338 programming languages and 128K context length. Consistency Large Language Models (CLLMs) enable parallel decoding generating multiple tokens per step. Grokked Transformers demonstrate reasoning through training dynamics affecting memory formation and generalization. VoCo-LLaMA compresses vision tokens with LLMs improving video temporal correlation understanding. The BigCodeBench benchmark evaluates LLMs on 1,140 coding tasks across 139 Python libraries, topped by DeepSeek-Coder-V2 and Claude 3 Opus. PixelProse is a large 16M image-caption dataset with reduced toxicity.
Ways to use Anthropic's Tool Use GA
claude-3-opus haiku opus convnext anthropic amazon google tool-use function-calling agentic-ai streaming vision parallelization delegation debate specialization open-science superintelligence convolutional-networks self-attention ai-research yann-lecun alex-albert sainingxie
Anthropic launched general availability of tool use/function calling with support for streaming, forced use, and vision, alongside Amazon and Google. Alex Albert shared five architectures for agentic tool use: delegation, parallelization, debate, specialization, and tool suite experts. Anthropic also introduced a self-guided course on tool use. Yann LeCun emphasized ethical open science funding, gradual emergence of superintelligence with safety guardrails, and convolutional networks for image/video processing as competitive with vision transformers. He also noted growth in AI researchers across industry, academia, and government.
Chameleon: Meta's (unreleased) GPT4o-like Omnimodal Model
chameleon gpt-4o gemini-1.5-flash claude-3 meta-ai-fair openai google-deepmind anthropic reddit multimodality early-fusion benchmarking model-training tokenization streaming tool-use vision coding hallucination-detection model-performance armen-aghajanyan sama alexandr-wang abacaj alexalbert__
Meta AI FAIR introduced Chameleon, a new multimodal model family with 7B and 34B parameter versions trained on 10T tokens of interleaved text and image data enabling "early fusion" multimodality that can natively output any modality. While reasoning benchmarks are modest, its "omnimodality" approach competes well with pre-GPT4o multimodal models. OpenAI launched GPT-4o, a model excelling in benchmarks like MMLU and coding tasks, with strong multimodal capabilities but some regression in ELO scores and hallucination issues. Google DeepMind announced Gemini 1.5 Flash, a small model with 1M context window and flash performance, highlighting convergence trends between OpenAI and Google models. Anthropic updated Claude 3 with streaming support, forced tool use, and vision tool integration for multimodal knowledge extraction. OpenAI also partnered with Reddit, raising industry attention.
Google I/O in 60 seconds
gemini-1.5-pro gemini-flash gemini-ultra gemini-pro gemini-nano gemma-2 llama-3-70b paligemma imagen-3 veo google google-deepmind youtube tokenization model-performance fine-tuning vision multimodality model-release model-training model-optimization ai-integration image-generation watermarking hardware-optimization voice video-understanding
Google announced updates to the Gemini model family, including Gemini 1.5 Pro with 2 million token support, and the new Gemini Flash model optimized for speed with 1 million token capacity. The Gemini suite now includes Ultra, Pro, Flash, and Nano models, with Gemini Nano integrated into Chrome 126. Additional Gemini features include Gemini Gems (custom GPTs), Gemini Live for voice conversations, and Project Astra, a live video understanding assistant. The Gemma model family was updated with Gemma 2 at 27B parameters, offering near-llama-3-70b performance at half the size, plus PaliGemma, a vision-language open model inspired by PaLI-3. Other launches include DeepMind's Veo, Imagen 3 for photorealistic image generation, and a Music AI Sandbox collaboration with YouTube. SynthID watermarking now extends to text, images, audio, and video. The Trillium TPUv6 codename was revealed. Google also integrated AI across its product suite including Workspace, Email, Docs, Sheets, Photos, Search, and Lens. "The world awaits Apple's answer."
GPT-4o: the new SOTA-EVERYTHING Frontier model (GPT4O version)
gpt-4o gpt-4-turbo openai lmsys multion adept multimodality vision speech-recognition tokenization real-time-processing coding model-performance model-optimization desktop-agents sama gdb
OpenAI has released GPT-4o, a new multimodal model capable of reasoning across text, audio, and video in real time with low latency (~300ms). It features voice and vision capabilities, improved non-English language performance with an expanded 200k vocabulary tokenizer, and is available to all ChatGPT users including free plans. GPT-4o is half the price and twice as fast as GPT-4-turbo with 5x rate limits. The model supports real-time voice and video input/output and shows strong coding capabilities. The release includes a new desktop app that can read screen and clipboard history, challenging existing desktop agent startups. The announcement was accompanied by demos including image generation and 3D object handling, with OpenAI achieving state-of-the-art performance in ASR and vision tasks. The update was widely discussed on social media, with comparisons to GPT-4T highlighting GPT-4o's speed and versatility. "GPT-4o is smart, fast, natively multimodal, and a step towards more natural human-computer interaction" and "extremely versatile and fun to play with".
AdamW -> AaronD?
claude-3-opus llama-3 llama-3-300m bert-large stable-diffusion-1.5 wdxl openai hugging-face optimizer machine-learning-benchmarks vision time-series-forecasting image-generation prompt-injection policy-enforcement aaron-defazio
Aaron Defazio is gaining attention for proposing a potential tuning-free replacement of the long-standing Adam optimizer, showing promising experimental results across classic machine learning benchmarks like ImageNet ResNet-50 and CIFAR-10/100. On Reddit, Claude 3 Opus has surpassed all OpenAI models on the LMSys leaderboard, while a user pretrained a LLaMA-based 300M model outperforming bert-large on language modeling tasks with a modest budget. The new MambaMixer architecture demonstrates promising results in vision and time series forecasting. In image generation, Stable Diffusion 1.5 with LoRAs achieves realistic outputs, and the WDXL release showcases impressive capabilities. AI applications include an AI-generated Nike spec ad and a chatbot built with OpenAI models that may resist prompt injections. OpenAI is reportedly planning a ban wave targeting policy violators and jailbreak users. "The high alpha seems to come from Aaron Defazio," highlighting his impactful work in optimizer research.
Welcome /r/LocalLlama!
cerebrum-8x7b mixtral-7b gpt-3.5-turbo gemini-pro moistral-11b-v1 claude-opus qwen-vl-chat sakana openinterpreter reddit aether-research mistral-ai nvidia lmdeploy model-merging benchmarking quantization performance-optimization deployment vision fine-tuning training-data synthetic-data rag gui
Sakana released a paper on evolutionary model merging. OpenInterpreter launched their O1 devkit. Discussions highlight Claude Haiku's underrated performance with 10-shot examples. On Reddit's IPO, AINews introduces Reddit summaries starting with /r/LocalLlama, covering upcoming subreddits like r/machinelearning and r/openai. Aether Research released Cerebrum 8x7b based on Mixtral, matching GPT-3.5 Turbo and Gemini Pro on reasoning tasks, setting a new open-source reasoning SOTA. Moistral 11B v1 finetuned model from Cream-Phi-2 creators was released. A creative writing benchmark uses Claude Opus as judge. Hobbyists explore 1.58 BitNet ternary quantization and 1-bit LLMs training. Nvidia's Blackwell (h200) chip supports FP4 precision quantization. LMDeploy v0.2.6+ enables efficient vision-language model deployment with models like Qwen-VL-Chat. Users seek GUIs for LLM APIs with plugin and RAG support. Pipelines for synthetic training data generation and fine-tuning language models for chat are discussed.
Claude 3 just destroyed GPT 4 (see for yourself)
claude-3 claude-3-opus claude-3-sonnet claude-3-haiku gpt-4 anthropic amazon google claude-ai multimodality vision long-context model-alignment model-evaluation synthetic-data structured-output instruction-following model-speed cost-efficiency benchmarking safety mmitchell connor-leahy
Claude 3 from Anthropic launches in three sizes: Haiku (small, unreleased), Sonnet (medium, default on claude.ai, AWS, and GCP), and Opus (large, on Claude Pro). Opus outperforms GPT-4 on key benchmarks like GPQA, impressing benchmark authors. All models support multimodality with advanced vision capabilities, including converting a 2-hour video into a blog post. Claude 3 offers improved alignment, fewer refusals, and extended context length up to 1 million tokens with near-perfect recall. Haiku is noted for speed and cost-efficiency, processing dense research papers in under three seconds. The models excel at following complex instructions and producing structured outputs like JSON. Safety improvements reduce refusal rates, though some criticism remains from experts. Claude 3 is trained on synthetic data and shows strong domain-specific evaluation results in finance, medicine, and philosophy.
CodeLLama 70B beats GPT4 on HumanEval
codellama miqu mistral-medium llama-2-70b aphrodite-engine mixtral flatdolphinmaid noromaid rpcal chatml mistral-7b activation-beacon eagle-7b rwkv-v5 openhermes2.5 nous-hermes-2-mixtral-8x7b-dpo imp-v1-3b bakllava moondream qwen-vl meta-ai-fair ollama nous-research mistral-ai hugging-face ai-ethics alignment gpu-optimization direct-prompt-optimization fine-tuning cuda-programming optimizer-technology quantization multimodality context-length dense-retrieval retrieval-augmented-generation multilinguality model-performance open-source code-generation classification vision
Meta AI surprised the community with the release of CodeLlama, an open-source model now available on platforms like Ollama and MLX for local use. The Miqu model sparked debate over its origins, possibly linked to Mistral Medium or a fine-tuned Llama-2-70b, alongside discussions on AI ethics and alignment risks. The Aphrodite engine showed strong performance on A6000 GPUs with specific configurations. Role-playing AI models such as Mixtral and Flatdolphinmaid faced challenges with repetitiveness, while Noromaid and Rpcal performed better, with ChatML and DPO recommended for improved responses. Learning resources like fast.ai's course were highlighted for ML/DL beginners, and fine-tuning techniques with optimizers like Paged 8bit lion and adafactor were discussed.
At Nous Research AI, the Activation Beacon project introduced a method for unlimited context length in LLMs using "global state" tokens, potentially transforming retrieval-augmented models. The Eagle-7B model, based on RWKV-v5, outperformed Mistral in benchmarks with efficiency and multilingual capabilities. OpenHermes2.5 was recommended for consumer hardware due to its quantization methods. Multimodal and domain-specific models like IMP v1-3b, Bakllava, Moondream, and Qwen-vl were explored for classification and vision-language tasks. The community emphasized centralizing AI resources for collaborative research.
12/28/2023: Smol Talk updates
tinyllama-1.1b mixtral tinygpt-v nous-research tyrannosaurus latex benchmarking knowledge-graphs model-finetuning tokenization decentralized-computation philosophy-of-ai multimodality vision open-source-models gary-marcus
Nous Research AI Discord discussions covered topics such as AI placement charts, ChatGPT's issues with Latex math format compatibility with Obsidian, and performance metrics of the TinyLlama 1.1B model on various benchmarks. Users shared resources including the math-centric corpus MathPile, knowledge graph building methods, and open-source large language model repositories. Technical discussions included decentralized computation feasibility for models like Mixtral, philosophical debates on AI sentience, and strategies for model finetuning and token counting. The community also discussed the Obsidian model, vision model training, and the release of the multimodal TinyGPT-V model by Tyrannosaurus. "ChatGPT not generating Latex math format compatible with Obsidian" and "optimistic about human-level AI within our lifetime" were notable quotes.
12/20/2023: Project Obsidian - Multimodal Mistral 7B from Nous
gpt-4 gpt-3.5 dall-e-3 nous-research teknim openai multimodality image-detection security-api bias facial-recognition healthcare-ai gpu-optimization prompt-engineering vision
Project Obsidian is a multimodal model being trained publicly, tracked by Teknium on the Nous Discord. Discussions include 4M: Massively Multimodal Masked Modeling and Reason.dev, a TypeScript framework for LLM applications. The OpenAI Discord community discussed hardware specs for running TensorFlow JS for image detection, security API ideas for filtering inappropriate images, and concerns about racial and cultural bias in AI, especially in facial recognition and healthcare. Challenges with GPT-3.5 and GPT-4 in word puzzle games were noted, along with GPU recommendations prioritizing VRAM for AI inference. Users also debated GPT-4's vision capabilities, limitations of DALL·E 3, platform access issues, and prompting strategies for better outputs.
12/13/2023 SOLAR10.7B upstages Mistral7B?
solar-10.7b llama-2 mistral-7b phi-2 gpt-4 gemini upstage nous-research openai mistral-ai microsoft depth-up-scaling pretraining synthetic-data gpu-training api-usage model-integration agi asi chat-models vision model-performance fine-tuning
Upstage released the SOLAR-10.7B model, which uses a novel Depth Up-Scaling technique built on the llama-2 architecture and integrates mistral-7b weights, followed by continued pre-training. The Nous community finds it promising but not exceptional. Additionally, weights for the phi-2 base model were released, trained on 1.4 trillion tokens including synthetic texts created by GPT-3 and filtered by GPT-4, using 96 A100 GPUs over 14 days. On OpenAI's Discord, users discussed challenges with various GPT models, including incoherent outputs, API usage limitations, and issues with GPT-4 Vision API. Conversations also covered understanding AGI and ASI, concerns about OpenAI's partnership with Axel Springer, and pricing changes for GPT Plus. Discussions included the Gemini chat model integrated into Bard and comparisons with GPT-4 performance.