All tags
Company: "meta-ai-fair"
Execuhires Round 2: Scale-Meta, Lamini-AMD, and Instacart-OpenAI
o3-pro o3 o1-pro gpt-4o gpt-4.1 gpt-4.1-mini gpt-4.1-nano meta-ai-fair scale-ai lamini amd openai gemini google anthropic model-release benchmarking reasoning fine-tuning pricing model-performance direct-preference-optimization complex-problem-solving alexandr_wang sharon_zhou fidji_simo sama jack_rae markchen90 kevinweil gdb gregkamradt lechmazur wesrothmoney paul_cal imjaredz cto_junior johnowhitaker polynoamial scaling01
Meta hires Scale AI's Alexandr Wang to lead its new "Superintelligence" division following a $15 billion investment for a 49% stake in Scale. Lamini's Sharon Zhou joins AMD as VP of AI under Lisa Su, while Instacart's Fidji Simo becomes CEO of Applications at OpenAI under Sam Altman. Meta is offering compensation packages above $10 million/year to top researchers, successfully recruiting Jack Rae from Gemini. OpenAI releases the o3-pro model to ChatGPT Pro users and the API; it outperforms o3 and sets new highs on benchmarks such as Extended NYT Connections and SnakeBench. Despite being slower than o1-pro, o3-pro excels at reasoning and complex problem-solving. OpenAI also cuts o3 pricing by 80%, making it cheaper than GPT-4o and pressuring competitors like Google and Anthropic to lower prices. Users can now fine-tune the GPT-4.1 family using direct preference optimization (DPO) for subjective tasks.
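For readers who want to try the DPO option, here is a minimal sketch of a preference-tuning job against the OpenAI fine-tuning API, assuming the documented DPO method shape; the model snapshot name and beta value are illustrative.

```python
from openai import OpenAI

client = OpenAI()

# Each JSONL row pairs a prompt with a preferred and a non-preferred completion:
# {"input": {"messages": [{"role": "user", "content": "..."}]},
#  "preferred_output": [{"role": "assistant", "content": "..."}],
#  "non_preferred_output": [{"role": "assistant", "content": "..."}]}
train_file = client.files.create(file=open("prefs.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-4.1-2025-04-14",  # illustrative snapshot name
    method={"type": "dpo", "dpo": {"hyperparameters": {"beta": 0.1}}},  # beta assumed
)
print(job.id, job.status)
```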
AI Engineer World's Fair Talks Day 1
gemini-2.5 gemma claude-code mistral cursor anthropic openai aie google-deepmind meta-ai-fair agent-based-architecture open-source model-memorization scaling-laws quantization mixture-of-experts language-model-memorization model-generalization langgraph model-architecture
Mistral launched Mistral Code, and Cursor released version 1.0. Anthropic improved Claude Code plans, while OpenAI announced expanded ChatGPT connections. The day was dominated by AIE keynotes and tracks including GraphRAG, RecSys, and Tiny Teams. On Reddit, Google open-sourced the DeepSearch stack for building AI agents with Gemini 2.5 and LangGraph, enabling flexible agent architectures and integration with local LLMs like Gemma. A new Meta paper analyzed language model memorization, showing GPT-style transformers store about 3.5–4 bits/parameter, and explored the transition from memorization to generalization, with implications for mixture-of-experts models and quantization effects.
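A minimal sketch of the kind of agent graph the DeepSearch stack wires up with LangGraph, with placeholder node functions standing in for real Gemini 2.5 or local Gemma calls:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    draft: str

def search(state: State) -> dict:
    # Placeholder for a web-search step; swap in a real tool call.
    return {"draft": f"notes on: {state['question']}"}

def answer(state: State) -> dict:
    # Placeholder for a Gemini 2.5 (or local Gemma) generation call.
    return {"draft": state["draft"] + " -> final answer"}

graph = StateGraph(State)
graph.add_node("search", search)
graph.add_node("answer", answer)
graph.add_edge(START, "search")
graph.add_edge("search", "answer")
graph.add_edge("answer", END)
app = graph.compile()
print(app.invoke({"question": "what changed in Gemini 2.5?", "draft": ""}))
```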
not much happened today
deepseek-r1-0528 o3 gemini-2.5-pro claude-opus-4 deepseek_ai openai gemini meta-ai-fair anthropic x-ai ollama hugging-face alibaba bytedance xiaomi reasoning reinforcement-learning benchmarking quantization local-inference model-evaluation open-weights transparency post-training agentic-benchmarks long-context hallucination-detection teortaxestex wenfeng danielhanchen awnihannun reach_vb abacaj
DeepSeek R1-0528 release brings major improvements in reasoning, hallucination reduction, JSON output, and function calling, matching or surpassing closed models like OpenAI o3 and Gemini 2.5 Pro on benchmarks such as Artificial Analysis Intelligence Index, LiveBench, and GPQA Diamond. The model ranks #2 globally in open weights intelligence, surpassing Meta AI, Anthropic, and xAI. Open weights and technical transparency have fueled rapid adoption across platforms like Ollama and Hugging Face. Chinese AI labs including DeepSeek, Alibaba, ByteDance, and Xiaomi now match or surpass US labs in model releases and intelligence, driven by open weights strategies. Reinforcement learning post-training is critical for intelligence gains, mirroring trends seen at OpenAI. Optimized quantization techniques (1-bit, 4-bit) and local inference enable efficient experimentation on consumer hardware. New benchmarks like LisanBench test knowledge, planning, memory, and long-context reasoning, with OpenAI o3 and Claude Opus 4 leading. Discussions highlight concerns about benchmark contamination and overemphasis on RL-tuned gains.
DeepSeek-R1-0528 - Gemini 2.5 Pro-level model, SOTA Open Weights release
deepseek-r1-0528 gemini-2.5-pro qwen-3-8b qwen-3-235b deepseek-ai anthropic meta-ai-fair nvidia alibaba google-deepmind reinforcement-learning benchmarking model-performance open-weights reasoning quantization post-training model-comparison artificialanlys scaling01 cline reach_vb zizhpan andrewyng teortaxestex teknim1 lateinteraction abacaj cognitivecompai awnihannun
DeepSeek R1-0528 marks a significant upgrade, closing the gap with proprietary models like Gemini 2.5 Pro and surpassing models from Anthropic, Meta, NVIDIA, and Alibaba on several benchmarks. The Chinese open-weights model's gains come from reinforcement learning post-training rather than architecture changes, with markedly increased reasoning token usage (~23K tokens per question). The China-US AI race intensifies as Chinese labs accelerate innovation through transparency and an open research culture. Key benchmarks include AIME 2024, LiveCodeBench, and GPQA Diamond.
Mistral's Agents API and the 2025 LLM OS
qwen claude-4 chatgpt o3 o4 mistral-ai langchain-ai openai meta-ai-fair agent-frameworks multi-agent-systems tool-use code-execution web-search model-context-protocol persistent-memory function-calling open-source no-code reinforcement-learning model-performance agent-orchestration omarsar0 simonw swyx scaling01
The LLM OS concept has evolved since 2023, with Mistral AI releasing a new Agents API that includes code execution, web search, persistent memory, and agent orchestration. LangChainAI introduced the Open Agent Platform (OAP), an open-source no-code platform for intelligent agents. OpenAI plans to develop ChatGPT into a super-assistant by H1 2025, competing with Meta. Discussions around Qwen models focus on reinforcement learning effects, while Claude 4 performance is also noted. The AI Engineer World's Fair is calling for volunteers.
not much happened today
kernelllm-8b gpt-4o deepseek-v3 mistral-medium-3 qwen3 blip3-o xgen-small anisora stable-audio-open-small alphaevolve meta-ai-fair mistral-ai qwen deepseek salesforce bilibili stability-ai google benchmarking model-performance multilinguality hardware-optimization multimodality image-generation video-generation text-to-audio model-parallelism chain-of-thought instruction-following reasoning mitigation-strategies reach_vb lmarena_ai theadimeline adcock_brett jxmnop dair_ai omarsar0
Meta released KernelLLM 8B, outperforming GPT-4o and DeepSeek V3 on KernelBench-Triton Level 1. Mistral Medium 3 debuted strongly in multiple benchmarks. Qwen3 models introduced a unified framework with multilingual support. DeepSeek-V3 features hardware-aware co-design. The BLIP3-o family was released for multimodal tasks using diffusion transformers. Salesforce launched xGen-Small models excelling in long-context and math benchmarks. Bilibili released AniSORA for anime video generation. Stability AI open-sourced Stable Audio Open Small, optimized for Arm devices. Google's AlphaEvolve coding agent produced the first improvement over Strassen's 1969 matrix-multiplication algorithm. Research shows "chain-of-thought (CoT) reasoning can harm a model's ability to follow instructions"; mitigation strategies "such as few-shot in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning can counteract reasoning-induced failures", with classifier-selective reasoning most effective, though reasoning techniques show high variance and limited generalization.
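A hedged sketch of the best-performing mitigation, classifier-selective reasoning, as the summary describes it; all function names here are placeholders:

```python
def selective_answer(question: str, classifier, direct_model, cot_model) -> str:
    """Route each instruction to chain-of-thought or plain decoding, based on a
    classifier's prediction of whether reasoning will help or hurt compliance."""
    if classifier(question):  # True -> reasoning predicted to help
        return cot_model(f"Think step by step, then answer:\n{question}")
    return direct_model(question)

# toy stand-ins: reason only for questions that look like math
print(selective_answer(
    "What is 17 * 24?",
    classifier=lambda q: any(c.isdigit() for c in q),
    direct_model=lambda q: "direct: ...",
    cot_model=lambda q: "cot: ...",
))
```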
Granola launches team notes, while Notion launches meeting transcription
gpt-4.1 gpt-4o-mini gpt-4.1-mini claude-opus claude-sonnet claude-o3 qwen3 seed1.5-vl llama-4 am-thinking-v1 openai anthropic alibaba meta-ai-fair huggingface granola coding instruction-following benchmarking model-releases reasoning image-generation collaborative-software model-performance kevinweil scaling01 steph_palazzolo andersonbcdefg reach_vb yuchenj_uw qtnx_ _akhaliq risingsayak
GPT-4.1 is now available in ChatGPT for Plus, Pro, and Team users, focusing on coding and instruction following, with GPT-4.1 mini replacing GPT-4o mini. Anthropic is releasing new Claude models, including Claude Opus and Claude Sonnet, while criticism of hallucinations in OpenAI's o3 was also noted. Alibaba shared the Qwen3 Technical Report, and ByteDance's Seed1.5-VL posted strong benchmark results. Meta FAIR announced new models and datasets but faced criticism over Llama 4. AM-Thinking-v1 launched on Hugging Face as a 32B-scale reasoning model. Granola raised $43M in Series B and launched Granola 2.0 with a Notion-like UI. The AI ecosystem shows rapid iteration and cloning of ideas, putting the emphasis on execution and distribution.
not much happened today
hunyuan-turbos qwen3-235b-a22b o3 gpt-4.1-nano grok-3 gemini-2.5-pro seed1.5-vl kling-2.0 tencent openai bytedance meta-ai-fair nvidia deepseek benchmarking model-performance moe reasoning vision video-understanding vision-language multimodality model-evaluation model-optimization lmarena_ai artificialanlys gdb _jasonwei iScienceLuvr _akhaliq _philschmid teortaxesTex mervenoyann reach_vb
Tencent's Hunyuan-Turbos has risen to #8 on the LMArena leaderboard, showing strong performance across major categories and significant improvement since February. The Qwen3 model family, especially the Qwen3 235B-A22B (Reasoning) model, is noted for its intelligence and efficient parameter usage. OpenAI introduced HealthBench, a new health evaluation benchmark developed with input from over 250 physicians, where models like o3, GPT-4.1 nano, and Grok 3 showed strong results. ByteDance released Seed1.5-VL, a vision-language model with a 532M-parameter vision encoder and a 20B active parameter MoE LLM, achieving state-of-the-art results on 38 public benchmarks. In vision-language, Kling 2.0 leads image-to-video generation, and Gemini 2.5 Pro excels in video understanding with advanced multimodal capabilities. Meta's Vision-Language-Action framework and updates on VLMs for 2025 were also highlighted.
Prime Intellect's INTELLECT-2 and PRIME-RL advance distributed reinforcement learning
intellect-2 dreamo qwen gemini-2.5-pro dynamic-byte-latent-transformer gen-4-references mistral-medium-3 le-chat-enterprise primeintellect bytedance qwen gemma meta-ai-fair runwayml mistral-ai google distributed-training reinforcement-learning gpu-clusters model-optimization quantization multimodality agentic-ai video-understanding fine-tuning _akhaliq reach_vb osanseviero aiatmeta c_valenzuelab lmarena_ai adcock_brett
Prime Intellect released INTELLECT-2, a decentralized GPU training and RL framework with a vision for distributed AI training overcoming colocation limits. ByteDance launched DreamO, a unified image customization model on Hugging Face. Qwen released models optimized for GPTQ, GGUF, and AWQ quantization. Gemma surpassed 150 million downloads on Hugging Face. Meta released weights for the Dynamic Byte Latent Transformer and the Collaborative Reasoner framework to improve language model efficiency and reasoning. RunwayML introduced Gen-4 References, a near-realtime model requiring no fine-tuning. Mistral AI released Mistral Medium 3, a strong multimodal model, and Le Chat Enterprise, an agentic AI assistant for business. Google updated Gemini 2.5 Pro Preview with video understanding and UI improvements. "Airbnb for spare GPUs from all over the world" highlights the ongoing challenges and potential of distributed GPU training.
not much happened today
phi-4 phi-4-mini-reasoning qwen3-235b qwen3-moe-235b qwen3-moe-30b qwen3-dense-32b qwen3-dense-14b qwen3-dense-8b qwen3-dense-4b qwen3-dense-0.6b qwen2.5-omni-3b deepseek-prover-v2 llama llama-guard-4 prompt-guard-2 mimo-7b microsoft anthropic cursor alibaba togethercompute deepseek meta-ai-fair xiaomi openrouterai cohere reasoning model-fine-tuning model-evaluation benchmarking model-popularity open-source math model-scaling model-filtering jailbreak-prevention cline reach_vb vipulved akhaliq omarsar0 zhs05232838 huajian_xin mervenoyann karpathy random_walker sarahookr blancheminerva clefourrier
Microsoft released Phi-4-reasoning, a finetuned 14B reasoning model slightly behind QwQ but limited by data transparency and token-efficiency issues. Anthropic introduced remote MCP server support and a 45-minute Research mode in Claude. Cursor published a model popularity list. Alibaba launched Qwen3-235B and other Qwen3 variants, highlighting budget-friendly coding and reasoning capabilities, with availability on the Together AI API. Microsoft also released Phi-4-Mini-Reasoning with benchmark performance on AIME 2025 and OmniMath. DeepSeek announced DeepSeek-Prover V2 with state-of-the-art math problem solving, scaling to 671B parameters. Meta AI's Llama models hit 1.2 billion downloads, with new Llama Guard 4 and Prompt Guard 2 for input/output filtering and jailbreak prevention. Xiaomi released the open-source reasoning model MiMo-7B, trained on 25 trillion tokens. Discussions on AI model evaluation highlighted issues with the LMArena leaderboard, data-access biases favoring proprietary models, and challenges in maintaining fair benchmarking, with suggestions for alternatives like OpenRouterAI rankings; complaints that LMArena is "slop and biased" and that "61.3% of all data [goes] to proprietary model providers" were noted concerns.
ChatGPT responds to GlazeGate + LMArena responds to Cohere
qwen3-235b-a22b qwen3 qwen3-moe llama-4 openai cohere lm-arena deepmind x-ai meta-ai-fair alibaba vllm llamaindex model-releases model-benchmarking performance-evaluation open-source multilinguality model-integration fine-tuning model-optimization joannejang arankomatsuzaki karpathy sarahookr reach_vb
OpenAI faced backlash after a controversial ChatGPT update, leading to an official retraction admitting they "focused too much on short-term feedback." Researchers from Cohere published a paper criticizing LMArena for unfair practices favoring incumbents like OpenAI, DeepMind, X.ai, and Meta AI Fair. The Qwen3 family by Alibaba was released, featuring models up to 235B MoE, supporting 119 languages and trained on 36 trillion tokens, with integration into vLLM and support in tools like llama.cpp. Meta announced the second round of Llama Impact Grants to promote open-source AI innovation. Discussions on AI Twitter highlighted concerns about leaderboard overfitting and fairness in model benchmarking, with notable commentary from karpathy and others.
LlamaCon: Meta AI gets into the Llama API platform business
llama-4 qwen3 qwen3-235b-a22b qwen3-30b-a3b qwen3-4b qwen2-5-72b-instruct o3-mini meta-ai-fair cerebras groq alibaba vllm ollama llamaindex hugging-face llama-cpp model-release fine-tuning reinforcement-learning moe multilingual-models model-optimization model-deployment coding benchmarking apache-license reach_vb huybery teortaxestex awnihannun thezachmueller
Meta celebrated progress in the Llama ecosystem at LlamaCon, launching an AI Developer platform with finetuning and fast inference powered by Cerebras and Groq hardware, though it remains waitlisted. Meanwhile, Alibaba released the Qwen3 family of large language models, including two MoE models and six dense models ranging from 0.6B to 235B parameters, with the flagship Qwen3-235B-A22B achieving competitive benchmark results and supporting 119 languages and dialects. The Qwen3 models are optimized for coding and agentic capabilities, are Apache 2.0 licensed, and have broad deployment support including local usage with tools like vLLM, Ollama, and llama.cpp. Community feedback highlights Qwen3's scalable performance and superiority over models like OpenAI's o3-mini.
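Local deployment is straightforward; a minimal vLLM sketch, assuming the Apache-2.0 Qwen3 MoE checkpoint published on Hugging Face:

```python
from vllm import LLM, SamplingParams

# Qwen/Qwen3-30B-A3B is the smaller MoE release; swap in a dense variant if needed.
llm = LLM(model="Qwen/Qwen3-30B-A3B")
params = SamplingParams(temperature=0.6, max_tokens=512)

out = llm.generate(["Write a binary search in Python."], params)
print(out[0].outputs[0].text)
```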
Cognition's DeepWiki, a free encyclopedia of all GitHub repos
o4-mini perception-encoder qwen-2.5-vl dia-1.6b grok-3 gemini-2.5-pro claude-3.7 gpt-4.1 cognition meta-ai-fair alibaba hugging-face openai perplexity-ai vllm vision text-to-speech reinforcement-learning ocr model-releases model-integration open-source frameworks chatbots model-selector silas-alberti mervenoyann reach_vb aravsrinivas vikparuchuri lioronai
Silas Alberti of Cognition announced DeepWiki, a free encyclopedia of all GitHub repos providing Wikipedia-like descriptions and Devin-backed chatbots for public repos. Meta released Perception Encoders (PE) under an Apache 2.0 license, outperforming InternVL3 and Qwen2.5-VL on vision tasks. Alibaba launched the Qwen Chat App for iOS and Android. Hugging Face integrated the Dia 1.6B SoTA text-to-speech model via FAL. OpenAI expanded deep research usage with a lightweight version powered by the o4-mini model, now available to free users. Perplexity AI updated their model selector with Grok 3 Beta and o4-mini, plus support for models like Gemini 2.5 Pro, Claude 3.7, and GPT-4.1. The OpenRLHF framework, built on vLLM, supports reinforcement learning from human feedback. The Surya OCR alpha model supports 90+ languages and LaTeX. MegaParse, an open-source library for LLM-ready data formats, was introduced.
Google's Agent2Agent Protocol (A2A)
kimi-vl-a3b gpt-4o llama-4-scout llama-4-maverick llama-4-behemoth deepcoder-14b o3-mini o1 llama-3.1-nemotron-ultra-253b deepseek-r1 google google-deepmind moonshot-ai meta-ai-fair uc-berkeley openai nvidia hugging-face togethercompute deepseek agent-interoperability multimodality vision math reinforcement-learning coding model-training open-source model-benchmarking context-windows streaming push-notifications enterprise-authentication model-release reach_vb _akhaliq epochairesearch artificialanlys winglian danielhanchen yuchenj_uw jeremyphoward
Google Cloud Next announcements featured full MCP support from Google and DeepMind and a new Agent2Agent (A2A) protocol designed for agent interoperability, launched with multiple partners. The protocol includes components like the Agent Card, Task communication channels, Enterprise Auth and Observability, and Streaming and Push Notification support. On the model front, Moonshot AI released Kimi-VL-A3B, a multimodal model with 128K context and strong vision and math benchmark performance, outperforming GPT-4o. Meta AI introduced the smaller members of the Llama 4 family, Llama 4 Scout and Llama 4 Maverick, with a larger Behemoth model still in training. DeepCoder 14B from UC Berkeley is an open-source coding model rivaling OpenAI's o3-mini and o1 models, trained with reinforcement learning on 24K coding problems. Nvidia released Llama-3.1-Nemotron-Ultra-253B on Hugging Face, noted for beating Llama 4 Behemoth and Maverick and competing with DeepSeek-R1.
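A hedged sketch of an A2A Agent Card, the discovery document an agent serves at /.well-known/agent.json; the field names follow the protocol announcement and the values are illustrative:

```python
import json

agent_card = {
    "name": "expense-agent",
    "description": "Files and approves expense reports.",
    "url": "https://agents.example.com/expense",          # where A2A tasks are sent
    "version": "1.0.0",
    "capabilities": {"streaming": True, "pushNotifications": True},
    "skills": [
        {"id": "file_expense", "name": "File expense",
         "description": "Create an expense report from a receipt."}
    ],
}
print(json.dumps(agent_card, indent=2))
```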
DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level
deepcoder-14b o3-mini o1 gemini-2.5-pro kimi-vl-a3b gpt-4o llama-4-scout maverick behemoth gen-4-turbo imagen-3 together-ai agentica openai bytedance google-deepmind moonshot-ai meta-ai-fair runway open-source reinforcement-learning code-generation multimodality model-training mixture-of-experts l2-normalization image-generation model-performance context-windows philschmid lepikhin reach_vb akhaliq yuchenj_uw epochairesearch danielhanchen c_valenzuelab
Together AI and Agentica released DeepCoder-14B, an open-source 14B parameter coding model rivaling OpenAI's o3-mini and o1 on coding benchmarks, trained with an open-source RL framework from ByteDance and costing about $26,880. Google DeepMind launched Gemini 2.5 Pro with experimental "Flash" versions available to subscribers. Moonshot AI introduced Kimi-VL-A3B, a multimodal model with 128K context outperforming gpt-4o on vision and math benchmarks. Meta AI released Llama 4 Scout and Maverick, with a larger Behemoth model in training, featuring mixture-of-experts and L2 norm techniques. Runway launched Gen-4 Turbo with 10x better results than Gen-3 at the same cost. Google announced Imagen 3, a high-quality text-to-image model now in Vertex AI, enabling easier object removal. The report highlights open-source contributions, reinforcement learning training optimizations, and significant model performance improvements across coding, multimodal, and image generation domains.
not much happened today
o3 o4-mini gpt-5 sonnet-3.7 gemma-3 qwen-2.5-vl gemini-2.5-pro gemma-7b llama-3-1-405b openai deepseek anthropic google meta-ai-fair inference-scaling reward-modeling coding-models ocr model-preview rate-limiting model-pricing architectural-advantage benchmarking long-form-reasoning attention-mechanisms mixture-of-experts gpu-throughput sama akhaliq nearcyan fchollet reach_vb philschmid teortaxestex epochairesearch omarsar0
OpenAI announced that o3 and o4-mini models will be released soon, with GPT-5 expected in a few months, delayed for quality improvements and capacity planning. DeepSeek introduced Self-Principled Critique Tuning (SPCT) to enhance inference-time scalability for generalist reward models. Anthropic's Sonnet 3.7 remains a top coding model. Google's Gemma 3 is available on KerasHub, and Qwen 2.5 VL powers a new Apache 2.0 licensed OCR model. Gemini 2.5 Pro entered public preview with increased rate limits and announced pricing, becoming a preferred model for many tasks except image generation. Meta's architectural advantages were debated, and the FrontierMath benchmark continues to challenge AI's long-form reasoning and worldview development. Research reveals LLMs focus attention on the first token as an "attention sink," preserving representation diversity, demonstrated in Gemma 7B and LLaMa 3.1 models. MegaScale-Infer offers efficient serving of large-scale Mixture-of-Experts models with up to 1.90x higher per-GPU throughput.
lots of little things happened this week
llama-3-3-nemotron-super-49b-v1 claude anthropic nvidia sakana-ai meta-ai-fair reinforcement-learning reasoning benchmarks multi-turn-collaboration instruction-following dataset-release model-evaluation percy-liang
Anthropic introduced a novel 'think' tool enhancing instruction adherence and multi-step problem solving in agents, with combined reasoning and tool use demonstrated by Claude. NVIDIA's Llama-3.3-Nemotron-Super-49B-v1 ranked #14 on LMArena, noted for strong math reasoning and a 15M post-training dataset. Sakana AI launched a Sudoku-based reasoning benchmark to advance AI problem-solving capabilities. Meta AI released SWEET-RL, a reinforcement learning algorithm improving long-horizon multi-turn tasks by 6%, and introduced CollaborativeAgentBench, a benchmark for collaborative LLM agents working with humans on programming and design tasks. Percy Liang relaunched the HELM benchmark with 5 challenging datasets evaluating 22 top language models.
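A minimal sketch of the 'think' tool pattern as Anthropic described it: a deliberately no-op tool whose only effect is giving the model a scratchpad entry in the transcript; the model id is illustrative:

```python
import anthropic

# The tool does nothing by design: calling it just logs intermediate reasoning,
# which Anthropic reported improves policy adherence in multi-step agent tasks.
THINK_TOOL = {
    "name": "think",
    "description": "Use the tool to think about something. It will not obtain "
                   "new information or change anything; it only logs the thought.",
    "input_schema": {
        "type": "object",
        "properties": {"thought": {"type": "string"}},
        "required": ["thought"],
    },
}

client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-3-7-sonnet-latest",  # illustrative model id
    max_tokens=1024,
    tools=[THINK_TOOL],
    messages=[{"role": "user", "content": "Refund order #123 following policy."}],
)
print(resp.content)
```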
Every 7 Months: The Moore's Law for Agent Autonomy
claude-3-7-sonnet llama-4 phi-4-multimodal gpt-2 cosmos-transfer1 gr00t-n1-2b orpheus-3b metr nvidia hugging-face canopy-labs meta-ai-fair microsoft agent-autonomy task-completion multimodality text-to-speech robotics foundation-models model-release scaling-laws fine-tuning zero-shot-learning latency reach_vb akhaliq drjimfan scaling01
METR published a paper measuring AI agent autonomy progress, showing it has doubled every 7 months since 2019 (GPT-2). They introduced a new metric, the 50%-task-completion time horizon, where models like Claude 3.7 Sonnet achieve 50% success in about 50 minutes. Projections estimate 1 day autonomy by 2028 and 1 month autonomy by late 2029. Meanwhile, Nvidia released Cosmos-Transfer1 for conditional world generation and GR00T-N1-2B, an open foundation model for humanoid robot reasoning with 2B parameters. Canopy Labs introduced Orpheus 3B, a high-quality text-to-speech model with zero-shot voice cloning and low latency. Meta reportedly delayed Llama-4 release due to performance issues. Microsoft launched Phi-4-multimodal.
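The headline projections follow from simple exponential arithmetic; a sketch, with the base horizon, dates, and working-time conversions treated as rough assumptions:

```python
from datetime import date

DOUBLING_MONTHS = 7            # reported doubling period
BASE_HORIZON_MIN = 50          # ~Claude 3.7 Sonnet's 50%-success horizon, minutes
BASE_DATE = date(2025, 3, 1)   # rough paper date; illustrative

def horizon_minutes(on: date) -> float:
    """Project the 50%-task-completion horizon, assuming clean exponential growth."""
    months = (on.year - BASE_DATE.year) * 12 + (on.month - BASE_DATE.month)
    return BASE_HORIZON_MIN * 2 ** (months / DOUBLING_MONTHS)

# ~8h workdays (480 min) and ~170h work-months (10,200 min); both assumed
print(horizon_minutes(date(2028, 3, 1)) / 480)     # roughly "days" scale by 2028
print(horizon_minutes(date(2029, 12, 1)) / 10200)  # roughly "a month" by late 2029
```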
not much happened today
gemini-2.0-flash-thinking command-a qwq-32b gemma-3-27b gemma-3 shieldgemma-2 llama-3-70b deepseek-r1 o1-mini deepseek-v3 google-deepmind cohere meta-ai-fair alibaba hugging-face model-updates model-performance benchmarking reinforcement-learning transformers normalization-layers image-generation vision memory-efficiency context-windows fine-tuning yann-lecun
Google DeepMind announced updates to Gemini 2.0, including an upgraded Flash Thinking model with stronger reasoning and native image generation capabilities. Cohere launched Command A, a 111B parameter dense model with a 256K context window and competitive pricing, available on Hugging Face. Meta AI proposed Dynamic Tanh (DyT) as a replacement for normalization layers in Transformers, supported by Yann LeCun. Alibaba released QwQ-32B, a 32.5B parameter model excelling in math and coding, fine-tuned with reinforcement learning and freely available under Apache 2.0 license. Google DeepMind also released Gemma 3 models ranging from 1B to 27B parameters with a 128K token context window and over 140 language support, plus ShieldGemma 2, an image safety checker. Benchmarking shows Gemma 3 27B has strong vision and memory efficiency but is outperformed by larger models like Llama 3.3 70B and DeepSeek V3 671B. The Hugging Face LLM leaderboard history was shared by @_lewtun.
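A sketch of Dynamic Tanh as the paper describes it, y = gamma * tanh(alpha * x) + beta, a drop-in replacement for LayerNorm; the init value is an assumption:

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: replaces normalization with a learnable squashing function."""
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * init_alpha)  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))             # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))             # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

print(DyT(64)(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```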
not much happened today
grok-3 deepseek-r1 siglip-2 o3-mini-high r1-1776 llamba-1b llamba-3b llamba-8b llama-3 alphamaze audiobox-aesthetics xai nvidia google-deepmind anthropic openai bytedance ollama meta-ai-fair benchmarking model-releases performance reasoning multimodality semantic-understanding ocr multilinguality model-distillation recurrent-neural-networks visual-reasoning audio-processing scaling01 iscienceluvr philschmid arankomatsuzaki reach_vb mervenoyann wightmanr lmarena_ai ollama akhaliq
Grok-3, a new family of LLMs from xAI trained for advanced reasoning on 200,000 Nvidia H100 GPUs, outperforms models from Google, Anthropic, and OpenAI on math, science, and coding benchmarks. DeepSeek-R1 achieves top accuracy on SuperGPQA, a challenging dataset from ByteDance Research. SigLIP 2 from Google DeepMind improves semantic understanding and OCR with flexible resolutions and multilingual capabilities, available on Hugging Face. OpenAI's o3-mini-high ranks #1 in coding and math prompts. Perplexity's R1 1776, a post-trained version of DeepSeek R1, is available on Ollama. The Llamba family distills Llama-3.x into efficient recurrent models with higher throughput. AlphaMaze combines DeepSeek R1 with GRPO for visual reasoning on ARC-AGI puzzles. Audiobox Aesthetics from Meta AI offers unified quality assessment for audio. The community notes that Grok 3's compute increase yields only modest performance gains.
not much happened today
zonos-v0.1 audiobox-aesthetics moshi sonar llama-3-70b gpt-4o-mini claude-3.5-haiku gpt-4o claude-3.5-sonnet deepseek-r1-distilled-qwen-1.5b reasonflux-32b o1-preview zyphra-ai meta-ai-fair kyutai-labs perplexity-ai cerebras uc-berkeley brilliant-labs google-deepmind text-to-speech speech-to-speech benchmarking model-performance reinforcement-learning math real-time-processing open-source cross-platform-integration multilinguality zero-shot-learning danhendrycks
Zyphra AI launched Zonos-v0.1, a leading open-weight text-to-speech model supporting multiple languages and zero-shot voice cloning. Meta FAIR released the open-source Audiobox Aesthetics model trained on 562 hours of audio data. Kyutai Labs introduced Moshi, a real-time speech-to-speech system with low latency. Perplexity AI announced the Sonar model based on Llama 3.3 70b, outperforming top models like GPT-4o and Claude 3.5 Sonnet with 1200 tokens/second speed, powered by Cerebras infrastructure. UC Berkeley open-sourced a 1.5B model trained with reinforcement learning that beats o1-preview on math tasks. ReasonFlux-32B achieved 91.2% on the MATH benchmark, outperforming OpenAI o1-preview. CrossPoster, an AI agent for cross-platform posting, was released using LlamaIndex workflows. Brilliant Labs integrated the Google DeepMind Gemini Live API into smart glasses for real-time translation and object identification.
TinyZero: Reproduce DeepSeek R1-Zero for $30
deepseek-r1 qwen o1 claude-3-sonnet claude-3 prime ppo grpo llama-stack deepseek berkeley hugging-face meta-ai-fair openai deeplearningai reinforcement-learning fine-tuning chain-of-thought multi-modal-benchmark memory-management model-training open-source agentic-workflow-automation model-performance jiayi-pan saranormous reach_vb lmarena_ai nearcyan omarsar0 philschmid hardmaru awnihannun winglian
DeepSeek Mania continues to reshape the frontier model landscape with Jiayi Pan from Berkeley reproducing the OTHER result from the DeepSeek R1 paper, R1-Zero, in a cost-effective Qwen model fine-tune for two math tasks. A key finding is a lower bound to the distillation effect at 1.5B parameters, with RLCoT reasoning emerging as an intrinsic property. Various RL techniques like PPO, DeepSeek's GRPO, or PRIME show similar outcomes, and starting from an Instruct model speeds convergence. The Humanity’s Last Exam (HLE) Benchmark introduces a challenging multi-modal test with 3,000 expert-level questions across 100+ subjects, where models perform below 10%, with DeepSeek-R1 achieving 9.4%. DeepSeek-R1 excels in chain-of-thought reasoning, outperforming models like o1 while being 20x cheaper and MIT licensed. The WebDev Arena Leaderboard ranks DeepSeek-R1 #2 in technical domains and #1 under Style Control, closing in on Claude 3.5 Sonnet. OpenAI's Operator is deployed to 100% of Pro users in the US, enabling tasks like ordering meals and booking reservations, and functions as a research assistant for AI paper searches and summaries. Hugging Face announces a leadership change after significant growth, while Meta AI releases the first stable version of Llama Stack with streamlined upgrades and automated verification. DeepSeek-R1's open-source success is celebrated, and technical challenges like memory management on macOS 15+ are addressed with residency sets in MLX for stability.
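The group-relative trick behind DeepSeek's GRPO fits in a few lines; a sketch, assuming a scalar reward per sampled completion (e.g. 1.0 when the math answer verifies):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: each completion is scored against the mean/std
    of its own sampling group, removing the need for a learned value critic."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid div-by-zero on uniform groups
    return [(r - mu) / sigma for r in rewards]

# 8 completions for one prompt: correct answers get positive advantage
print(grpo_advantages([1, 0, 0, 1, 1, 0, 0, 0]))
```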
not much happened today
deepseek-v3 llama-3-1-405b gpt-4o gpt-5 minimax-01 claude-3-haiku cosmos-nemotron-34b openai deep-learning-ai meta-ai-fair google-deepmind saama langchain nvidia mixture-of-experts coding math scaling visual-tokenizers diffusion-models inference-time-scaling retrieval-augmented-generation ai-export-restrictions security-vulnerabilities prompt-injection gpu-optimization fine-tuning personalized-medicine clinical-trials ai-agents persistent-memory akhaliq
DeepSeek-V3, a 671 billion parameter mixture-of-experts model, surpasses Llama 3.1 405B and GPT-4o in coding and math benchmarks. Reports circulated about an upcoming GPT-5 release from OpenAI. MiniMax-01 Coder mode in ai-gradio enables building a chess game in one shot. Meta research highlights trade-offs in scaling visual tokenizers. Google DeepMind improves diffusion model quality via inference-time scaling. The RA-DIT method fine-tunes LLMs and retrievers for better RAG responses. The U.S. proposes a three-tier export restriction system on AI chips and models, excluding countries like China and Russia. Security vulnerabilities in AI chatbots involving CSRF and prompt injection were revealed. Concerns about superintelligence and weapons-grade AI models were expressed. ai-gradio updates include NVIDIA NIM compatibility and new models like cosmos-nemotron-34b. LangChain integrates with Claude-3-haiku for AI agents with persistent memory. Triton Warp specialization optimizes GPU usage for matrix multiplication. Saama's OpenBioLLM-8B and OpenBioLLM-70B, fine-tuned from Llama, target personalized medicine and clinical trials.
not much happened today
oute-tts-0.3-1b oute-tts-0.3-500m olm-1b qwen-2.5-0.5b hover gpt-4o deepseek-v3 harvey meta-ai-fair stability-ai alibaba deepseek hugging-face text-to-speech zero-shot-learning multilinguality emotion-control motor-control reinforcement-learning local-ai distributed-inference pipeline-parallelism mathematical-reasoning process-reward-models legal-ai education-ai ai-security humor reach_vb drjimfan vikhyatk mervenoyann aiatmeta iscienceluvr alibaba_qwen awnihannun ajeya_cotra emollick qtnx_ designerx
Harvey secured a new $300M funding round. OuteTTS 0.3 1B & 500M text-to-speech models were released featuring zero-shot voice cloning, multilingual support (en, jp, ko, zh, fr, de), and emotion control, powered by OLMo-1B and Qwen 2.5 0.5B. The HOVER model, a 1.5M-parameter neural net for agile motor control, was introduced, leveraging human motion capture datasets and massively parallel reinforcement learning. kokoro.js enables running AI models locally in browsers with minimal dependencies. Meta AI awarded $200K LLM evaluation grants for projects on regional language understanding, complex reasoning, and interactive programming environments. Stability AI's Twitter account was hacked, prompting security warnings. Alibaba Qwen improved Process Reward Models (PRMs) for better mathematical reasoning using a consensus filtering mechanism. DeepSeek V3 uses pipeline parallelism to enhance distributed inference and long-context generation efficiency. Discussions on AI policy in legal frameworks and AI's role in democratizing education were highlighted. Lighthearted AI-related humor was also shared.
Titans: Learning to Memorize at Test Time
minimax-01 gpt-4o claude-3.5-sonnet internlm3-8b-instruct transformer2 google meta-ai-fair openai anthropic langchain long-context mixture-of-experts self-adaptive-models prompt-injection agent-authentication diffusion-models zero-trust-architecture continuous-adaptation vision agentic-systems omarsar0 hwchase17 abacaj hardmaru rez0__ bindureddy akhaliq saranormous
Google released a new paper on "Neural Memory", integrating persistent memory directly into transformer architectures at test time and showing promising long-context utilization. MiniMax-01, highlighted by @omarsar0, features a 4 million token context window with 456B parameters and 32 experts, outperforming GPT-4o and Claude-3.5-Sonnet. InternLM3-8B-Instruct is an open-source model trained on 4 trillion tokens with state-of-the-art results. Transformer² introduces self-adaptive LLMs that dynamically adjust weights for continuous adaptation. Advances in AI security highlight the need for agent authentication, prompt injection defenses, and zero-trust architectures. Tools like Micro Diffusion enable budget-friendly diffusion model training, while LeagueGraph and Agent Recipes support open-source social media agents.
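A hedged sketch of the test-time memorization idea in Titans: a small memory module takes gradient steps on an associative loss while the model runs, so surprising key-value pairs write themselves into its weights; sizes and learning rate are assumptions:

```python
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    """Test-time memory sketch: inner SGD on ||M(k) - v||^2 during inference."""
    def __init__(self, dim: int, lr: float = 0.1):
        super().__init__()
        self.mem = nn.Linear(dim, dim, bias=False)
        self.lr = lr

    def write(self, k: torch.Tensor, v: torch.Tensor) -> None:
        loss = (self.mem(k) - v).pow(2).mean()       # "surprise" of this association
        grad, = torch.autograd.grad(loss, self.mem.weight)
        with torch.no_grad():
            self.mem.weight -= self.lr * grad        # one inner step at test time

    def read(self, q: torch.Tensor) -> torch.Tensor:
        return self.mem(q)

mem = NeuralMemory(dim=16)
k, v = torch.randn(16), torch.randn(16)
for _ in range(3):
    mem.write(k, v)       # the pair is gradually memorized
print(mem.read(k))
```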
not much happened this weekend
o3 o1 opus sonnet octave openai langchain hume x-ai amd nvidia meta-ai-fair hugging-face inference-time-scaling model-ensembles small-models voice-cloning fine-math-dataset llm-agent-framework benchmarking software-stack large-concept-models latent-space-reasoning mechanistic-interpretability planning speech-language-models lisa-su clementdelangue philschmid neelnanda5
o3 model gains significant attention with discussions around its capabilities and implications, including an OpenAI board member referencing "AGI." LangChain released their State of AI 2024 survey. Hume announced OCTAVE, a 3B parameter API-only speech-language model with voice cloning. x.ai secured a $6B Series C funding round. Discussions highlight inference-time scaling, model ensembles, and the surprising generalization ability of small models. New tools and datasets include FineMath, the best open math dataset on Hugging Face, and frameworks for LLM agents. Industry updates cover a 5-month benchmarking of AMD MI300X vs Nvidia H100 + H200, insights from a meeting with Lisa Su on AMD's software stack, and open AI engineering roles. Research innovations include Large Concept Models (LCM) from Meta AI, Chain of Continuous Thought (Coconut) for latent space reasoning, and mechanistic interpretability initiatives.
ModernBERT: small new Retriever/Classifier workhorse, 8k context, 2T tokens
modernbert gemini-2.0-flash-thinking o1 llama answerdotai lightonio hugging-face google-deepmind openai meta-ai-fair figure encoder-only-models long-context alternating-attention natural-language-understanding reasoning robotics-simulation physics-engine humanoid-robots model-performance model-releases jeremyphoward alec-radford philschmid drjimfan bindureddy
Answer.ai/LightOn released ModernBERT, an updated encoder-only model with 8k token context, trained on 2 trillion tokens including code, available in 139M and 395M parameter sizes with state-of-the-art performance on retrieval, NLU, and code tasks. It features alternating attention layers mixing global and local attention. Gemini 2.0 Flash Thinking debuted at #1 in Chatbot Arena, and the o1 model scored top marks in reasoning benchmarks. Llama downloads surpassed 650 million, doubling in 3 months. OpenAI launched desktop app integrations with voice capabilities. Figure delivered its first humanoid robots commercially. Advances in robotics simulation were highlighted, including Genesis, a new physics engine claiming 430,000x faster-than-real-time simulation.
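A sketch of the alternating-attention layout described for ModernBERT, with the window size and global-layer period treated as assumptions:

```python
import torch

def attention_mask(layer: int, seq_len: int,
                   window: int = 128, global_every: int = 3) -> torch.Tensor:
    """Alternating attention sketch: periodic layers attend globally, the rest
    use a local sliding window (True = may attend)."""
    if layer % global_every == 0:
        return torch.ones(seq_len, seq_len, dtype=torch.bool)     # full attention
    pos = torch.arange(seq_len)
    return (pos[:, None] - pos[None, :]).abs() <= window // 2     # local band

print(attention_mask(0, 6, window=2))  # global layer
print(attention_mask(1, 6, window=2))  # local banded layer
```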
Genesis: Generative Physics Engine for Robotics (o1-mini version)
o1 o1-preview gpt-4o claude-3.5-sonnet gemini-2.0-pro llama-3-3b llama-3-70b openai google-deepmind meta-ai-fair hugging-face function-calling structured-outputs vision performance-benchmarks sdk webrtc reasoning math code-generation transformer-architecture model-training humanoid-robots search model-efficiency dataset-sharing aidan_mclau sundarpichai adcock_brett
OpenAI launched the o1 model API featuring function calling, structured outputs, vision support, and developer messages, achieving 60% fewer reasoning tokens than its preview. The model excels in math and code with a 0.76 LiveBench Coding score, outperforming Sonnet 3.5. Beta SDKs for Go and Java and WebRTC support with 60% lower prices were also released. Google Gemini 2.0 Pro (Gemini Exp 1206) deployment accelerated, showing improved coding, math, and reasoning performance. Meta AI FAIR introduced research on training transformers directly on raw bytes using dynamic entropy-based patching. Commercial humanoid robots were successfully deployed by an industry player. Hugging Face researchers demonstrated that their 3B Llama model can outperform the 70B Llama model on MATH-500 accuracy using search techniques, highlighting efficiency gains with smaller models. Concerns about reproducibility and domain-specific limitations were noted.
OpenAI Voice Mode Can See Now - After Gemini Does
gemini-2.0-flash claude claude-3.5-sonnet llama-3-70b llama-3 mistral-large gpt-4o openai google-deepmind anthropic togethercompute scale-ai meta-ai-fair mistral-ai multimodality real-time-streaming roleplay prompt-handling model-comparison model-training creative-writing model-censorship code-execution developer-ecosystem ai-humor bindureddy
OpenAI launched Realtime Video shortly after Gemini, blunting its impact since Gemini arrived earlier with lower cost and looser rate limits. Google DeepMind released Gemini 2.0 Flash featuring enhanced multimodal capabilities and real-time streaming. Anthropic introduced Clio, a system analyzing real-world usage of Claude models. Together AI acquired CodeSandbox to launch a code interpreter tool. Discussions highlighted Meta's Llama 3.3-70B for its advanced roleplay and prompt handling, outperforming models like Mistral Large and GPT-4o in expressiveness and lighter censorship. The AI community also engaged in humorous takes on AI outages and model competition, with ChatGPT adding a Santa mode for holiday interactions. "Anthropic is capturing the developer ecosystem, Gemini has AI enthusiast mindshare, ChatGPT reigns over AI dabblers" was a noted observation from the community.
Meta Apollo - Video Understanding up to 1 hour, SOTA Open Weights
apollo-1b apollo-3b apollo-7b veo-2 imagen-3 llama-3-70b llama-3b command-r7b llama-1b llama-8b chatgpt meta-ai-fair hugging-face google-deepmind openai figure-ai klarna cohere notion video-understanding scaling-consistency benchmarking temporal-ocr egocentric-perception spatial-perception reasoning video-generation physics-simulation voice-features map-integration language-expansion test-time-compute-scaling humanoid-robots ai-integration search-optimization self-recognition self-preference-bias akhaliq _lewtun clementdelangue adcock_brett rohanpaul_ai swyx shaneguML
Meta released Apollo, a new family of state-of-the-art video-language models available in 1B, 3B, and 7B sizes, featuring "Scaling Consistency" for efficient scaling and introducing ApolloBench, which speeds up video understanding evaluation by 41× across five temporal perception categories. Google Deepmind launched Veo 2, a 4K video generation model with improved physics and camera control, alongside an enhanced Imagen 3 image model. OpenAI globally rolled out ChatGPT search with advanced voice and map features and discussed a potential $2,000/month "ChatGPT Max" tier. Research highlights include achieving Llama 70B performance using Llama 3B via test-time compute scaling and expanding Command R7B language support from 10 to 23 languages. Industry updates feature Figure AI delivering humanoid robots commercially and Klarna reducing workforce through AI. Notion integrated Cohere Rerank for better search. Studies reveal LLMs can recognize their own writing style and show self-preference bias. Discussions note video processing progress outpacing text due to better signal-per-compute and data evaluation.
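The test-time compute result (Llama 3B matching 70B on MATH-500) rests on sampling many candidates and aggregating; a minimal weighted-majority-voting sketch, assuming a reward-model score per sample:

```python
from collections import defaultdict

def weighted_majority(samples: list[tuple[str, float]]) -> str:
    """Sample N solutions from a small model, score each with a reward model,
    and return the final answer with the highest total score."""
    totals: dict[str, float] = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    return max(totals, key=totals.get)

# four sampled solutions to one problem, (extracted answer, verifier score)
print(weighted_majority([("42", 0.9), ("41", 0.4), ("42", 0.7), ("43", 0.2)]))
```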
Meta BLT: Tokenizer-free, Byte-level LLM
byte-latent-transformer llama-3 phi-4 gpt-4o command-r7b meta-ai-fair llamaindex microsoft deepseek-ai openai cohere anthropic tokenization transformer-architecture model-efficiency benchmarking multimodality vision reinforcement-learning model-scaling jailbreaking model-optimization
Meta AI introduces the Byte Latent Transformer (BLT), a tokenizer-free architecture that dynamically forms byte patches for efficient compute allocation, outperforming Llama 3 on benchmarks including the CUTE benchmark. The model was trained on approximately 1 trillion tokens and features a three-block transformer design with local and global components. This approach challenges traditional tokenization and may enable new multimodal capabilities such as direct file interaction without retrieval-augmented generation. Additionally, Microsoft announced the Phi-4 14B parameter model achieving state-of-the-art results on STEM and reasoning benchmarks, surpassing GPT-4o. DeepSeek AI launched new vision-language models based on their MoE architecture with sizes ranging from 1.0B to 27B parameters. OpenAI released a new Projects feature for ChatGPT, and Cohere introduced their smallest and fastest Command R7B model. Anthropic published research on "Best-of-N Jailbreaking" vulnerabilities across text, vision, and audio models. Industry discussion highlights a trend of decreasing frontier LLM sizes, with GPT-4 at approximately 1.8 trillion parameters compared to newer models.
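A sketch of the entropy-based patching idea behind BLT: a small byte LM scores each position, and patch boundaries fall where next-byte entropy spikes, so compute concentrates on hard-to-predict regions; the threshold value is an assumption:

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy (bits) of a next-byte distribution from a small byte LM."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def patch_starts(next_byte_dists: list[list[float]], threshold: float = 1.5) -> list[int]:
    """Start a new patch wherever next-byte entropy exceeds the threshold."""
    return [i for i, dist in enumerate(next_byte_dists) if entropy(dist) > threshold]

# a predictable run, then two surprising positions
dists = [[0.9, 0.1], [0.95, 0.05], [0.25, 0.25, 0.25, 0.25],
         [0.9, 0.1], [0.2, 0.2, 0.2, 0.2, 0.2]]
print(patch_starts(dists))  # -> [2, 4]
```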
ChatGPT Canvas GA
llama-3-70b llama-3-1-8b tgi-v3 deepseek-v2.5-1210 coconut openai deepseek-ai meta-ai-fair huggingface cognition-labs hyperbolic google-deepmind code-execution gpt-integration model-finetuning gradient-checkpointing context-length latent-space-reasoning performance-optimization gpu-memory-optimization kubernetes gpu-marketplace ai-capabilities employment-impact neurips-2024 ai-scaling humor arav_srinivas sama jonathan-frankle dylan
OpenAI launched ChatGPT Canvas to all users, featuring code execution and GPT integration, effectively replacing Code Interpreter with a Google Docs-like interface. DeepSeek AI announced their V2.5-1210 update, improving performance on MATH-500 (82.8%) and LiveCodeBench. Meta AI FAIR introduced COCONUT, a new continuous latent space reasoning paradigm. Hugging Face released TGI v3, processing 3x more tokens and running 13x faster than vLLM on long prompts. Cognition Labs released Devin, an AI developer building Kubernetes operators. Hyperbolic raised a $12M Series A to build an open AI platform with an H100 GPU marketplace. Discussions included AI capabilities and employment impact, and NeurIPS 2024 announcements with Google DeepMind demos and a debate on AI scaling. On Reddit, Llama 3.3-70B supports 90K-context finetuning using Unsloth with gradient checkpointing and Apple's Cut Cross Entropy (CCE) algorithm, fitting on 41GB of VRAM. Llama 3.1-8B reaches 342K context lengths with Unsloth, surpassing native limits.
Meta Llama 3.3: 405B/Nova Pro performance at 70B price
llama-3-70b llama-3.3-70b gpt-4o gemini-exp-1206 meta-ai-fair openai google-deepmind hugging-face llamacloud reinforcement-learning fine-tuning model-performance document-processing pricing-models alignment online-rl sama steven-heidel aidan_mclau lmarena_ai oriolvinyalsml jerryjliu0
Meta AI released Llama 3.3 70B, matching the performance of the 405B model with improved efficiency using "a new alignment process and progress in online RL techniques". OpenAI announced Reinforcement Fine-Tuning (RFT) for building expert models with limited data, offering alpha access to researchers and enterprises. Google DeepMind's Gemini-Exp-1206 leads benchmarks, tying with GPT-4o in coding performance. LlamaCloud enhanced document processing with table extraction and analytics. Discussions on OpenAI's pricing plans continue in the community.
Stripe lets Agents spend money with StripeAgentToolkit
gpt-4o gemini-exp-1114 stripe openai anthropic meta-ai-fair ai-computer-interfaces agentic-ai model-overfitting benchmarks scaling-laws agi chain-of-thought image-captioning dialogue-systems memory-efficient-fine-tuning diffusion-models mixture-of-experts adaptive-decoding creativity-optimization factuality-optimization pair-programming document-parsing retrieval-augmented-generation abacaj francois-fleuret lmarena_ai goodside jxmnop jaseweston stevenheidel
Stripe has pioneered an AI SDK specifically designed for agents that handle payments, integrating with models like gpt-4o to enable financial transactions and token-based charging. The AI developer tooling trend emphasizes better "AI-Computer Interfaces" for improved agent reliability, with tools like E2B and the `llms.txt` documentation trend gaining traction, notably adopted by Anthropic. In AI model news, Gemini-Exp-1114 topped the Vision Leaderboard and improved in Math Arena, while discussions continue around model overfitting and the limits of scaling laws for AGI. OpenAI released a ChatGPT desktop app for macOS with integrations for VS Code, Xcode, and Terminal, enhancing developer workflows and pair programming. Anthropic introduced a prompt improver using chain-of-thought reasoning, and Meta AI shared top research from EMNLP 2024 on image captioning, dialogue systems, and memory-efficient fine-tuning. Highlights from ICLR 2025 include diffusion-based illumination harmonization, open mixture-of-experts language models, and hyperbolic vision-language models. A new adaptive decoding method optimizes creativity and factuality per token. Tools like LlamaParse and RAGformation were also introduced for document parsing and retrieval-augmented generation.
Gemini (Experimental-1114) retakes #1 LLM rank with 1344 Elo
claude-3-sonnet gpt-4 gemini-1.5 claude-3.5-sonnet anthropic openai langchain meta-ai-fair benchmarking prompt-engineering rag visuotactile-perception ai-governance theoretical-alignment ethical-alignment jailbreak-robustness model-releases alignment richardmcngo andrewyng philschmid
Anthropic released jailbreak-robustness benchmarking for Claude 3.5 Sonnet, emphasizing adaptive defenses. OpenAI enhanced GPT-4 with a new RAG technique for contiguous chunk retrieval. LangChain launched Promptim for prompt optimization. Meta AI introduced NeuralFeels, using neural fields for visuotactile perception. RichardMCNgo resigned from OpenAI, highlighting concerns about AI governance and theoretical alignment. Discussions emphasized the importance of truthful public information and ethical alignment in AI deployment. The latest Gemini update marks a new #1 LLM amid alignment challenges. The AI community continues to focus on benchmarking, prompt engineering, and alignment issues.
not much happened today
llama-3-2-vision gpt-2 meta-ai-fair ollama amd llamaindex gemini gitpod togethercompute langchainai weights-biases stanfordnlp deeplearningai model-scaling neural-networks multi-gpu-support skip-connections transformers healthcare-ai automated-recruitment zero-trust-security small-language-models numerical-processing chain-of-thought optical-character-recognition multi-agent-systems agent-memory interactive-language-learning bindureddy fstichler stasbekman jxmnop bindureddy omarsar0 giffmana rajammanabrolu
This week in AI news highlights Ollama 0.4 supporting Meta's Llama 3.2 Vision models (11B and 90B), with applications like handwriting recognition. Self-Consistency Preference Optimization (ScPO) was introduced to improve model consistency without human labels. Discussions covered model scaling, the resurgence of neural networks, and AMD's multi-GPU bandwidth challenges. The importance of skip connections in Transformers was emphasized. In healthcare, lighter regulation plus AI could revolutionize disease treatment and aging research. Tools like LlamaParse and Gemini aid automated resume insights. Gitpod Flex demonstrated zero-trust architecture for secure development environments. Research includes surveys on Small Language Models (SLMs), number understanding in LLMs, and DTrOCR, which uses a GPT-2 decoder for OCR. Multi-agent systems in prediction markets were discussed by TogetherCompute and LangChainAI. Community events include a NeurIPS Happy Hour, NLP seminars, and courses on Agent Memory with LLMs as operating systems.
not much happened today
grok-beta llama-3-1-70b claude-3-5-haiku claude-3-opus llama-3 chatgpt gemini meta-ai-fair scale-ai anthropic perplexity-ai langchainai weights-biases qwen pricing national-security defense open-source agentic-ai retrieval-augmented-generation election-predictions real-time-updates annotation ai-ecosystem memes humor alexandr_wang svpino aravsrinivas bindureddy teortaxestex jessechenglyu junyang-lin cte_junior jerryjliu0
Grok Beta surpasses Llama 3.1 70B in intelligence but is less competitive due to its pricing at $5/1M input tokens and $15/1M output tokens. Defense Llama, developed with Meta AI and Scale AI, targets American national security applications. SWE-Kit, an open-source framework, supports building customizable AI software engineers compatible with Llama 3, ChatGPT, and Claude. LangChainAI and Weights & Biases integrate to improve retrievers and reduce hallucinations in RAG applications using Gemini. Perplexity AI offers enhanced election tracking tools for the 2024 elections, including live state results and support for Claude 3.5 Haiku. AI Talk launched featuring discussions on Chinese AI labs with guests from Qwen. Memes highlight Elon Musk and humorous AI coding mishaps.
Tencent's Hunyuan-Large claims to beat DeepSeek-V2 and Llama3-405B with LESS Data
claude-3.5-haiku llama-3-1 llama-3-2 mlx-lm tencent anthropic meta-ai-fair togethercompute llamaindex mixture-of-experts synthetic-data model-scaling model-architecture model-optimization kv-cache-quantization react fine-tuning scaling-laws model-efficiency model-deployment multimodality
Tencent released a notable >300B parameter MoE model pretrained on 7T tokens, including 1.5T synthetic tokens generated via Evol-Instruct. The model introduces novel techniques like "recycle routing" and expert-specific learning rates, alongside a compute-efficient scaling law for MoE active parameters. However, its custom license restricts use in the EU and by companies with over 100M MAU, and it avoids China-sensitive queries. Meanwhile, Anthropic launched Claude 3.5 Haiku, now available on multiple platforms, praised for intelligence and speed but criticized for a 10x price increase. Meta opened Llama AI to the U.S. defense sector, and a Llama Impact Hackathon offers a $15K prize for projects using Llama 3.1 & 3.2 Vision. LlamaIndex released a React chat UI component with Tailwind CSS and LLM backend integrations. MLX LM advances text generation speed and efficiency with KV cache quantization.
OpenAI beats Anthropic to releasing Speculative Decoding
claude-3-sonnet mrt5 openai anthropic nvidia microsoft boston-dynamics meta-ai-fair runway elevenlabs etched osmo physical-intelligence langchain speculative-decoding prompt-lookup cpu-inference multimodality retrieval-augmented-generation neural-networks optimization ai-safety governance model-architecture inference-economics content-generation adcock_brett vikhyatk dair_ai rasbt bindureddy teortaxestex svpino c_valenzuelab davidsholz
Prompt lookup and speculative decoding techniques are gaining traction, with implementations from Cursor and Fireworks and teased features from Anthropic. OpenAI has introduced faster response times and file edits using these methods, offering roughly 50% efficiency improvements. The community is actively exploring AI engineering use cases for these advancements. Recent updates highlight progress from companies like NVIDIA, OpenAI, Anthropic, Microsoft, Boston Dynamics, and Meta. Key technical insights include CPU inference capabilities, multimodal retrieval-augmented generation (RAG), and neural network fundamentals. New AI products include fully AI-generated games and advanced content generation tools. Challenges in AI research labs such as bureaucracy and resource allocation were also discussed, alongside AI safety and governance concerns.
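Prompt lookup is the simplest of these tricks: instead of a separate draft model, the prompt itself supplies the draft, which the full model then verifies in one pass. A sketch:

```python
def prompt_lookup_draft(tokens: list[int], ngram: int = 3, k: int = 5) -> list[int]:
    """Find the last n-gram earlier in the context and propose the tokens that
    followed it as speculative draft tokens (useful for edits and RAG, where
    output copies long spans of input)."""
    tail = tokens[-ngram:]
    # scan right-to-left so the most recent earlier occurrence wins
    for start in range(len(tokens) - 2 * ngram, -1, -1):
        if tokens[start:start + ngram] == tail:
            return tokens[start + ngram:start + ngram + k]
    return []

print(prompt_lookup_draft([5, 6, 7, 9, 1, 5, 6, 7]))  # -> [9, 1, 5, 6, 7]
```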
not much happened today
smollm2 llama-3-2 stable-diffusion-3.5 claude-3.5-sonnet gemini openai anthropic google meta-ai-fair suno-ai perplexity-ai on-device-ai model-performance robotics multimodality ai-regulation model-releases natural-language-processing prompt-engineering agentic-ai ai-application model-optimization sam-altman akhaliq arav-srinivas labenz loubnabenallal1 alexalbert fchollet stasbekman svpino rohanpaul_ai hamelhusain
ChatGPT Search was launched by Sam Altman, who called it his favorite feature since ChatGPT's original launch, saying it doubled his usage. Comparisons were made between ChatGPT Search and Perplexity, with improvements noted in Perplexity's web navigation. Google introduced a "Grounding" feature in the Gemini API & AI Studio, enabling Gemini models to access real-time web information. Despite Gemini's leaderboard performance, developer adoption lags behind OpenAI and Anthropic. SmolLM2, a new small, powerful on-device language model, outperforms Meta's Llama 3.2 1B. A Claude desktop app was released for Mac and Windows. Meta AI announced robotics advancements including Meta Sparsh, Meta Digit 360, and Meta Digit Plexus. Stable Diffusion 3.5 Medium, a 2B parameter model with a permissive license, was released. Insights on AGI development suggest early systems will start out inferior but improve rapidly. Anthropic advocates for early, targeted AI regulation. Discussions on ML specialization predict training will concentrate among a few companies while inference becomes commoditized. New AI tools include Suno AI Personas for music creation, PromptQL for natural language querying over data, and Agent S for desktop task automation. Humor was shared about Python environment upgrades.
not much happened today
llama-3.1-nemotron-70b golden-gate-claude embed-3 liquid-ai anthropic cohere openai meta-ai-fair nvidia perplexity-ai langchain kestra ostrisai llamaindex feature-steering social-bias multimodality model-optimization workflow-orchestration inference-speed event-driven-workflows knowledge-backed-agents economic-impact ai-national-security trust-dynamics sam-altman lmarena_ai aravsrinivas svpino richardmcngo ajeya_cotra tamaybes danhendrycks jerryjliu0
Liquid AI held a launch event introducing new foundation models. Anthropic shared follow-up research on social bias and feature steering from their "Golden Gate Claude" work. Cohere released multimodal Embed 3 embeddings models following Aya Expanse. Misinformation about GPT-5/Orion was debunked by Sam Altman. Meta AI FAIR announced Open Materials 2024 with new models and datasets for inorganic materials discovery using the EquiformerV2 architecture. Anthropic AI demonstrated feature steering to balance social bias and model capabilities. NVIDIA's Llama-3.1-Nemotron-70B ranked highly on the Arena leaderboard with style control. Perplexity AI expanded to 100M weekly queries with new finance and reasoning modes. LangChain emphasized real application integration with interactive frame interpolation. Kestra highlighted scalable event-driven workflows with open-source YAML-based orchestration. OpenFLUX doubled inference speed through guidance LoRA training. Discussions on AI safety included trust dynamics between humans and AI, economic impacts of AI automation, and the White House AI National Security memo addressing cyber and biological risks. LlamaIndex showcased knowledge-backed agents for enhanced AI applications.
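A hedged sketch of feature steering in the "Golden Gate Claude" spirit: add a scaled feature direction (e.g. an SAE decoder vector) to a layer's output and let generation proceed; the module and vector here are toy stand-ins:

```python
import torch
import torch.nn as nn

def steering_hook(direction: torch.Tensor, strength: float):
    """Return a forward hook that adds a scaled feature direction to the output."""
    def hook(module, inputs, output):
        return output + strength * direction
    return hook

layer = nn.Linear(8, 8)                       # stand-in for a transformer block
feature = torch.randn(8)                      # stand-in for an SAE feature vector
handle = layer.register_forward_hook(steering_hook(feature, 4.0))
print(layer(torch.randn(1, 8)))               # output shifted along the feature
handle.remove()
```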
DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing
bitnet-b1.58 llama-3.1-nemotron-70b-instruct gpt-4o claude-3.5-sonnet uc-berkeley deepmind openai microsoft nvidia archetype-ai boston-dynamics toyota-research google adobe openai mistral tesla meta-ai-fair model-optimization on-device-ai fine-tuning large-corpus-processing gpu-acceleration frameworks model-benchmarking rohanpaul_ai adcock_brett david-patterson
UC Berkeley's EPIC lab introduces innovative LLM data operators with projects like LOTUS and DocETL, focusing on effective programming and computation over large data corpora; the approach positions GPU-poor compound AI systems against GPU-rich big labs like DeepMind and OpenAI. Microsoft open-sourced BitNet b1.58, a ternary-parameter (~1.58-bit) LLM enabling 4-20x faster training and on-device inference at human reading speeds. Nvidia released Llama-3.1-Nemotron-70B-Instruct, a fine-tuned open-source model outperforming GPT-4o and Claude-3.5-sonnet. These developments highlight advances in model optimization, on-device AI, and fine-tuning.
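As a rough illustration of what "ternary parameters" means in practice, here is a minimal sketch of the absmean quantizer described in the BitNet b1.58 paper; training-time details such as straight-through gradient estimation are omitted:

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Absmean quantization: scale by mean |w|, then round into {-1, 0, +1}."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale                      # dequantize as w_q * scale

w_q, scale = ternary_quantize(torch.randn(4, 4))
print(w_q)                                 # entries are -1., 0., or 1.
```

With ternary weights, most multiply-accumulates degenerate into additions and sign flips, which is where the claimed speed and memory gains come from.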
DeepSeek Janus and Meta SpiRit-LM: Decoupled Image and Expressive Voice Omnimodality
nemotron-70b claude claude-3.5-sonnet gpt-4o deepseek meta-ai-fair wandb nvidia anthropic hugging-face perplexity-ai multimodality image-generation speech-synthesis fine-tuning model-merging benchmarking open-source model-optimization reinforcement-learning bindureddy aravsrinivas danielhanchen clementdelangue cwolferesearch
DeepSeek Janus and Meta SpiRit-LM are two notable multimodality AI models recently released, showcasing advances in image generation and speech synthesis respectively. DeepSeek Janus separates vision encoders for image understanding and generation, achieving better results in both tasks. Meta's SpiRit-LM introduces an expressive speech and writing model generating pitch and style units, improving over standard TTS. Additionally, W&B Weave offers comprehensive LLM observability and multimodality fine-tuning tools. Industry updates include Nvidia's Nemotron 70b model underperforming, Meta open-sourcing Movie Gen Bench for media generation benchmarking, Perplexity launching internal search with multi-step reasoning, and Anthropic updating Claude apps. Open source progress includes Hugging Face's gradient accumulation fix in transformers and advocacy for open source AI to prevent Big Tech dominance. "Model merging for combining skills of multiple models" is also highlighted.
not much happened today
claudette llama-3-1 yi-lightning gpt-4o claude-3.5-sonnet answer-ai tencent notebooklm motherduck perplexity dropbox openai meta-ai-fair yi-ai zyphra-ai anthropic langchain openai synthetic-data fine-tuning sql audio-processing on-device-ai dataset-release transformer llm-reasoning ai-safety code-generation ai-pricing ai-job-market fchollet aravsrinivas svpino swyx
Answer.ai launched fastdata, a synthetic data generation library built on `claudette` and Tencent's Billion Persona paper. NotebookLM became customizable, and MotherDuck introduced LLMs-in-SQL capabilities. Perplexity and Dropbox announced competitors to Glean. OpenAI unveiled audio chat completions priced at 24 cents per minute. Meta AI released Llama 3.1, powering Lenovo AI Now's on-device agent. Yi-Lightning ranked #6 globally, surpassing GPT-4o. Zyphra AI released the large Zyda-2 dataset with 5 trillion tokens. François Chollet clarified that the transformer architecture is set-processing, not sequence-processing. Research suggests memorization aids LLM reasoning. Anthropic updated its Responsible Scaling Policy for AI safety. Tools like Perplexity Finance, Open Canvas by LangChain, and the AlphaCodium code generation tool were highlighted. Approximately $500 million was raised for AI agent startups, with ongoing discussions of AI's job-market impact. Combining prompt caching with the Batches API can yield a 95% discount on Claude 3.5 Sonnet tokens (see the sketch below).
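The quoted 95% figure is simply the two discounts compounding. A back-of-the-envelope check, using Anthropic's advertised rates at the time (cache reads at ~10% of the input price, a further 50% off for batched requests) against the $3/M list price:

```python
base = 3.00                  # Claude 3.5 Sonnet list price, $/M input tokens
cache_read = base * 0.10     # prompt caching: cache reads cost ~10% of base
batched = cache_read * 0.50  # Batches API: a further 50% off
print(1 - batched / base)    # 0.95 -> a 95% discount on those tokens
```

Not much (in AI) happened this weekend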
llama-3.1-8b llama-3.2 chatgpt movie-gen openai meta-ai-fair google-deepmind microsoft x-ai spacex harvard nvidia long-context feature-prediction-loss ai-agents privacy text-to-video text-to-image humanoid-robots gpu-deployment media-foundation-models ai-research-labs sam-altman yann-lecun rasbt bindureddy andrej-karpathy soumithchintala svpino adcock_brett rohanpaul_ai
OpenAI introduced an "edit this area" feature for image generation, praised by Sam Altman. Yann LeCun highlighted a NYU paper improving pixel generation with feature prediction loss using pre-trained visual encoders like DINOv2. Long-context LLMs such as llama-3.1-8b and llama-3.2 variants now support up to 131k tokens, offering alternatives to RAG systems. Bindu Reddy announced AI agents capable of building and deploying code from English instructions, suggesting AI could displace SQL and eventually affect Python. SpaceX's successful Starship rocket catch was celebrated by Andrej Karpathy and others, with Soumith Chintala praising SpaceX's efficient, low-bureaucracy research approach. Privacy concerns arose from Harvard students' AI glasses, I-XRAY, which can reveal personal information. Meta AI FAIR's Movie Gen model advances media foundation models with high-quality text-to-image and video generation, including synced audio. Humanoid robots like Ameca and Azi now engage in expressive conversations using ChatGPT. xAI rapidly deployed 100K Nvidia H100 GPUs in 19 days, with Nvidia CEO Jensen Huang commending Elon Musk. Leading AI research labs compared include Meta-FAIR, Google DeepMind, and Microsoft Research. Skepticism about LLM intelligence was voiced by Santiago Valdarrama (svpino), emphasizing limitations in novel problem-solving despite strong memorization.
not much happened today
aria o1-preview o1-mini gemini-1.5-pro gemini-1.5-flash gemini-1.5 claude-3.5-sonnet rhymes-ai openai anthropic google meta-ai-fair oxylabs multimodality mixture-of-experts long-context retrieval-augmented-generation benchmarking software-engineering llm-evaluation prompt-engineering web-scraping python production-applications mervenoyann osanseviero dbrxmosaicai ylecun ofirpress clefourrier omarsar0 rohanpaul_ai svpino finbarrtimbers _philschmid
Rhymes AI released Aria, a new 25.3B parameter multimodal MoE model supporting text, code, image, and video with a 64k token context window and Apache-2.0 license. OpenAI's o1-preview and o1-mini models show consistent improvement over Anthropic and Google Gemini 1.5 Pro/Flash on long context RAG benchmarks up to 128k tokens, while Google Gemini 1.5 models excel at extreme context lengths up to 2 million tokens. Meta AI expanded rollout to 21 countries with new language support but remains unavailable in the EU. The one-year anniversary of SWE-bench benchmark for software engineering tasks was celebrated, alongside the introduction of SWE-bench Multimodal. New AI tools include OxyCopilot by Oxylabs for web scraping, Taipy for Python-based production apps, and Latitude for prompt engineering. Industry insights highlight changing AI funding dynamics and OpenAI's strategic focus on consumer products like ChatGPT. "all recaps done by Claude 3.5 Sonnet, best of 4 runs."
State of AI 2024
llama-3-2 bitnet cerebras daily pipecat meta-ai-fair anthropic multimodality synthetic-data protein-structure-prediction neural-networks statistical-mechanics conversational-ai voice-ai hackathon ipo model-release geoffrey-hinton john-hopfield demis-hassabis john-jumper david-baker
Nathan Benaich's State of AI Report, now in its 7th year, provides a comprehensive overview of AI research and industry trends, including highlights like BitNet and the synthetic data debate. Cerebras is preparing for an IPO, reflecting growth in AI compute. A hackathon hosted by Daily and the Pipecat community focuses on conversational voice AI and multimodal experiences with $20,000 in prizes. Nobel Prizes in Physics and Chemistry were awarded for AI research: Geoffrey Hinton and John Hopfield for neural networks and statistical mechanics, and Demis Hassabis, John Jumper, and David Baker for AlphaFold and protein structure prediction. Meta released Llama 3.2 with multimodal capabilities, accompanied by educational resources and performance updates. "This recognizes the impact of deep neural networks on society" and "tremendous impact of AlphaFold and ML-powered protein structure prediction" were noted by experts.
not much happened today
flux-schnell meta-ai-fair anthropic togethercompute hugging-face audio-generation quantization prompt-caching long-term-memory llm-serving-framework hallucination-detection ai-safety ai-governance geoffrey-hinton john-hopfield demis-hassabis rohanpaul_ai svpino hwchase17 shreyar philschmid mmitchell_ai bindureddy
Geoffrey Hinton and John Hopfield won the Nobel Prize in Physics for foundational work on neural networks linking AI and physics. Meta AI introduced a 13B parameter audio generation model as part of Meta Movie Gen for video-synced audio. Anthropic launched the Message Batches API enabling asynchronous processing of up to 10,000 queries at half the cost. Together Compute made the Flux Schnell model free for 3 months. New techniques like PrefixQuant quantization and prompt caching for low-latency inference were highlighted by rohanpaul_ai. LangGraph added long-term memory support for persistent document storage. Hex-LLM, a framework for TPU-based low-cost, high-throughput LLM serving from Hugging Face models, was introduced. Discussions on AI safety emphasized gender equality in science, and concerns were raised about premature AI regulation driven by media and Hollywood narratives.
not much happened this weekend
o1-preview claude-3.5-sonnet 21b-flash-model openai meta-ai-fair reka langchainai entropix prompting-techniques finetuning entropy-based-sampling temporal-understanding native-audio tool-use instruction-chaining multimodality retrieval-augmented-generation synthetic-data-generation rnn parallel-training biologically-inspired-ai-safety text-to-video-generation video-editing lex-fridman imrat jjitsev giffmana _philschmid karpathy rasbt adcock_brett glennko rohanpaul_ai labenz
AI news from 10/4/2024 to 10/7/2024 highlights several developments: OpenAI's o1-preview shows strong performance on complex tasks but struggles with simpler ones, while Claude 3.5 Sonnet can match its reasoning through advanced prompting techniques. Meta introduced Movie Gen, a cutting-edge media foundation model for text-to-video generation and editing. Reka updated their 21B Flash Model with temporal video understanding, native audio, and tool use capabilities. Interest grows in "open o1" reproductions focusing on prompting and finetuning, with Entropix exploring entropy-based sampling. LangChainAI demonstrated a Retrieval Agent for complex Q&A, and synthetic data generation research surveyed 417 models. A resurgence in RNNs shows efficient parallel training making them competitive with Transformers. Biologically-inspired AI safety approaches were also noted. "A quiet weekend and air conditioning is all you need."
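For the curious, the core idea behind entropy-based sampling can be sketched in a few lines; the thresholds and the branching policy below are illustrative assumptions, not Entropix's actual values:

```python
import torch
import torch.nn.functional as F

def entropy_aware_sample(logits: torch.Tensor,
                         low: float = 0.5, high: float = 3.0) -> int:
    """Pick a decoding strategy based on the entropy of the next-token
    distribution (thresholds here are made up for illustration)."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    if entropy < low:                       # confident: commit to the argmax
        return int(probs.argmax())
    if entropy > high:                      # uncertain: resample at higher temperature
        probs = F.softmax(logits / 1.5, dim=-1)
    return int(torch.multinomial(probs, 1))

token = entropy_aware_sample(torch.randn(32000))
```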
Contextual Document Embeddings: `cde-small-v1`
llama-3 cde-small-v1 gemini-1.5-flash-8b chatgpt meta-ai-fair openai google-deepmind weights-biases togethercompute contextual-embeddings contextual-batching video-generation synthetic-data model-efficiency training-techniques rag algorithmic-efficiency jack-morris sasha-rush tim-brooks demis-hassabis karina-nguyen
Meta announced a new text-to-video model, Movie Gen, claiming its adaptation of Llama 3 to video generation outperforms OpenAI's Sora diffusion transformers, though nothing has been released yet. Researchers Jack Morris and Sasha Rush introduced the cde-small-v1 model with a novel contextual batching training technique and contextual embeddings, achieving strong performance with only 143M parameters. OpenAI launched Canvas, a collaborative interface for ChatGPT trained with synthetic data. Google DeepMind welcomed Tim Brooks to work on video generation and world simulators. Google released Gemini 1.5 Flash-8B, improving cost and rate limits through algorithmic efficiency.
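The contextual batching trick can be sketched simply: build each contrastive batch from documents in the same cluster, so in-batch negatives share a domain and the model must learn finer, corpus-relative distinctions. The clustering and vector choices below are placeholders, not the actual cde-small-v1 recipe:

```python
import random
import numpy as np
from sklearn.cluster import KMeans

def contextual_batches(doc_vecs: np.ndarray, batch_size: int, n_clusters: int):
    """Yield contrastive batches drawn from a single cluster each, so
    in-batch negatives come from the same 'context'."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(doc_vecs)
    for c in range(n_clusters):
        idxs = np.where(labels == c)[0].tolist()
        random.shuffle(idxs)
        for i in range(0, len(idxs) - batch_size + 1, batch_size):
            yield idxs[i:i + batch_size]

docs = np.random.randn(1000, 64)            # stand-in document vectors
for batch in contextual_batches(docs, batch_size=32, n_clusters=8):
    pass                                    # feed batch to a contrastive loss
```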
Not much technical happened today
whisper-v3-turbo llama-3 llamaindex openai poolside liquidai perplexity-ai meta-ai-fair cohere fujitsu mixture-of-experts context-windows model-optimization fine-tuning quantization model-training alignment synthetic-data model-architecture agentic-ai nick-turley arav-srinivas francois-fleuret finbarr-timbers lewtun francois-chollet jerry-j-liu mmitchell-ai jxnlco
OpenAI announced raising $6.6B in new funding at a $157B valuation, with ChatGPT reaching 250M weekly active users. Poolside raised $500M to advance AGI development. LiquidAI introduced three new MoE models (1B, 3B, 40B) with a 32k context window and efficient token handling. OpenAI released Whisper V3 Turbo, an open-source multilingual model with significant speed improvements. Meta AI FAIR is hiring research interns focusing on LLM reasoning, alignment, synthetic data, and novel architectures. Cohere partnered with Fujitsu to launch Takane, a custom Japanese model. Technical discussions included challenges in LoRA fine-tuning, float8 quantization in Keras, and new tools like create-llama for agent templates. Industry commentary raised concerns about AI development priorities and highlighted freelancing opportunities in AI.
Liquid Foundation Models: A New Transformers alternative + AINews Pod 2
llama-3-2 gemini-1.5-pro-002 gemini-1.5-flash-002 liquid-ai meta-ai-fair google-deepmind openai reinforcement-learning multimodality model-efficiency foundation-models audio-processing model-deployment open-source ylecun svpino
Liquid.ai emerged from stealth with three subquadratic foundation models demonstrating superior efficiency compared to state space models and Apple's on-device and server models, backed by a $37M seed round. Meta AI announced Llama 3.2 with multimodal vision-enabled models and lightweight text-only variants for mobile. Google DeepMind introduced production-ready Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002 models with improved pricing and rate limits, alongside AlphaChip, an AI-driven chip design system using reinforcement learning for rapid superhuman layouts. OpenAI enhanced ChatGPT Plus and Teams with Advanced Voice Mode featuring Custom Instructions, Memory, and new nature-inspired voices. California's Governor vetoed the SB-1047 AI regulation bill, celebrated by AI community figures like ylecun and svpino as a win for open-source AI. Google upgraded NotebookLM with audio overviews supporting YouTube and audio files, turning documents into AI-generated podcasts. "Open source in AI is thriving," noted ylecun, highlighting 1 million models on GitHub and Hugging Face.
not much happened today
llama-3-2 llama-3 molmo meta-ai-fair google-deepmind hugging-face on-device-ai multimodality chip-design retrieval-augmented-generation rag benchmarking reliability ai-regulation free-speech pytorch-optimization demis-hassabis clementdelangue svpino awnihannun osanseviero omarsar0 sarahookr ylecun
Meta released Llama 3.2, including lightweight 1B and 3B models for on-device AI with capabilities like summarization and retrieval-augmented generation. Molmo, a new multimodal model, was introduced with a large dense captioning dataset. Google DeepMind announced AlphaChip, an AI-driven chip design method improving TPU and CPU designs. Hugging Face surpassed 1 million free public models, highlighting the value of smaller specialized models. Discussions covered challenges in scaling RAG applications, the future of on-device AI running ChatGPT-level models, reliability issues in larger LLMs, and new Elo benchmarking accepted at NeurIPS 2024. AI ethics and regulation topics included free speech responsibilities and California's SB-1047 bill potentially affecting open-source AI. "AlphaChip transformed computer chip design," and "ChatGPT-level AI on mobile devices predicted within a year."
not much happened today
llama-3-2 llama-3 gemma-2 phi-3-5-mini claude-3-haiku gpt-4o-mini molmo gemini-1.5 gemini meta-ai-fair openai allenai google-deepmind multimodality model-optimization benchmarks ai-safety model-distillation pruning adapter-layers open-source-models performance context-windows mira-murati demis-hassabis ylecun sama
Meta AI released Llama 3.2 models including 1B, 3B text-only and 11B, 90B vision variants with 128K token context length and adapter layers for image-text integration. These models outperform competitors like Gemma 2 and Phi 3.5-mini, and are supported on major platforms including AWS, Azure, and Google Cloud. OpenAI CTO Mira Murati announced her departure. Allen AI released Molmo, an open-source multimodal model family outperforming proprietary systems. Google improved Gemini 1.5 with Flash and Pro models. Meta showcased Project Orion AR glasses and hinted at a Quest 3S priced at $300. Discussions covered new benchmarks for multimodal models, model optimization, and AI safety and alignment.
Llama 3.2: On-device 1B/3B, and Multimodal 11B/90B (with AI2 Molmo kicker)
llama-3-2 llama-3-1 claude-3-haiku gpt-4o-mini molmo-72b molmo-7b gemma-2 phi-3-5 llama-3-2-vision llama-3-2-3b llama-3-2-20b meta-ai-fair ai2 qualcomm mediatek arm ollama together-ai fireworks-ai weights-biases cohere weaviate multimodality vision context-windows quantization model-release tokenization model-performance model-optimization rag model-training instruction-following mira-murati daniel-han
Meta released Llama 3.2 with new multimodal versions including 3B and 20B vision adapters on a frozen Llama 3.1, showing competitive performance against Claude Haiku and GPT-4o-mini. AI2 launched multimodal Molmo 72B and 7B models outperforming Llama 3.2 in vision tasks. Meta also introduced new 128k-context 1B and 3B models competing with Gemma 2 and Phi 3.5, with collaborations hinted with Qualcomm, Mediatek, and Arm for on-device AI. The 1B and 3B models were trained on 9 trillion tokens. Partner launches include Ollama, Together AI offering free 11B model access, and Fireworks AI. Additionally, a new RAG++ course from Weights & Biases, Cohere, and Weaviate offers systematic evaluation and deployment guidance for retrieval-augmented generation systems based on extensive production experience.
not much happened today
llama-3 o1 deepseek-2.5 gpt-4 claude-3.5-sonnet 3dtopia-xl cogvideox anthropic meta-ai-fair openai deepseek-ai llamaindex langchainai retrieval-augmented-generation prompt-caching multimodality multi-agent-systems reasoning diffusion-models image-to-video prompting enterprise-ai agentic-ai long-context model-evaluation caching model-cost-efficiency
Anthropic introduced a RAG technique called Contextual Retrieval that reduces retrieval failure rates by 67% using prompt caching. Meta is teasing multimodal Llama 3 ahead of Meta Connect. OpenAI is hiring for a multi-agent research team focusing on improved AI reasoning with their o1 models, which have sparked mixed reactions. DeepSeek 2.5 is noted as a cost-effective alternative to GPT-4 and Claude 3.5 sonnet. New models like 3DTopia-XL for 3D asset generation and CogVideoX for image-to-video conversion were highlighted. Techniques to boost reasoning by re-reading questions and combining retrieval with prompt caching were shared. Industry insights emphasize the necessity of AI adoption in enterprises and the disruption of traditional ML businesses. Tools like LangChainAI's LangGraph Templates and LlamaIndex's LlamaParse Premium enhance agentic applications and multimodal content extraction. Discussions on LLM evals and caching highlight production challenges and improvements. "Companies not allowing developers to use AI are unlikely to succeed" was a key sentiment.
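A minimal sketch of the Contextual Retrieval idea, assuming hypothetical `llm` and `embed` callables: each chunk is prefixed with an LLM-written blurb situating it in the full document before embedding, and prompt caching keeps the repeated full-document prefix cheap:

```python
from typing import Callable, List

PROMPT = (
    "<document>\n{doc}\n</document>\n"
    "Here is a chunk from the document:\n<chunk>\n{chunk}\n</chunk>\n"
    "Write a short context that situates this chunk within the document."
)

def contextualize_chunks(doc: str, chunks: List[str],
                         llm: Callable[[str], str],
                         embed: Callable[[str], List[float]]) -> List[List[float]]:
    vectors = []
    for chunk in chunks:
        context = llm(PROMPT.format(doc=doc, chunk=chunk))  # doc prefix is cacheable
        vectors.append(embed(context + "\n\n" + chunk))     # index context + chunk
    return vectors
```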
not much happened today
o1-preview o1-mini qwen-2.5 gpt-4o deepseek-v2.5 gpt-4-turbo-2024-04-09 grin llama-3-1-405b veo kat openai qwen deepseek-ai microsoft kyutai-labs perplexity-ai together-ai meta-ai-fair google-deepmind hugging-face google anthropic benchmarking math coding instruction-following model-merging model-expressiveness moe voice voice-models generative-video competition open-source model-deployment ai-agents hyung-won-chung noam-brown bindureddy akhaliq karpathy aravsrinivas fchollet cwolferesearch philschmid labenz ylecun
OpenAI's o1-preview and o1-mini models lead benchmarks in Math, Hard Prompts, and Coding. Qwen 2.5 72B model shows strong performance close to GPT-4o. DeepSeek-V2.5 tops Chinese LLMs, rivaling GPT-4-Turbo-2024-04-09. Microsoft's GRIN MoE achieves good results with 6.6B active parameters. Moshi voice model from Kyutai Labs runs locally on Apple Silicon Macs. Perplexity app introduces voice mode with push-to-talk. LlamaCoder by Together.ai uses Llama 3.1 405B for app generation. Google DeepMind's Veo is a new generative video model for YouTube Shorts. The 2024 ARC-AGI competition increases prize money and plans a university tour. A survey on model merging covers 50+ papers for LLM alignment. The Kolmogorov–Arnold Transformer (KAT) paper proposes replacing MLP layers with KAN layers for better expressiveness. Hugging Face Hub integrates with Google Cloud Vertex AI Model Garden for easier open-source model deployment. Agent.ai is introduced as a professional network for AI agents. "Touching grass is all you need."
Pixtral 12B: Mistral beats Llama to Multimodality
pixtral-12b mistral-nemo-12b llama-3-1-70b llama-3-1-8b deepseek-v2-5 gpt-4-turbo llama-3-1 strawberry claude mistral-ai meta-ai-fair hugging-face arcee-ai deepseek-ai openai anthropic vision multimodality ocr benchmarking model-release model-architecture model-performance fine-tuning model-deployment reasoning code-generation api access-control reach_vb devendra_chapilot _philschmid rohanpaul_ai
Mistral AI released Pixtral 12B, an open-weights vision-language model with a Mistral Nemo 12B text backbone and a 400M vision adapter, featuring a large vocabulary of 131,072 tokens and support for 1024x1024 pixel images. This release notably beat Meta AI in launching an open multimodal model. At the Mistral AI Summit, architecture details and benchmark performances were shared, showing strong OCR and screen understanding capabilities. Additionally, Arcee AI announced SuperNova, a distilled Llama 3.1 70B & 8B model outperforming Meta's Llama 3.1 70B instruct on benchmarks. DeepSeek released DeepSeek-V2.5, scoring 89 on HumanEval, surpassing GPT-4-Turbo, Opus, and Llama 3.1 in coding tasks. OpenAI plans to release Strawberry as part of ChatGPT soon, though its capabilities are debated. Anthropic introduced Workspaces for managing multiple Claude deployments with enhanced access controls.
not much happened today
llama-3-1 claude-3-5-sonnet llama-3-1-405b ltm-2-mini qwen2-vl gpt-4o-mini meta-ai-fair hugging-face magic-ai-labs lmsys alibaba openai long-context style-control multimodality ai-safety model-evaluation web-crawling pdf-processing ai-hype-cycles call-center-automation sam-altman ajeya-cotra fchollet rohanpaul_ai philschmid
Meta announced significant adoption of LLaMA 3.1 with nearly 350 million downloads on Hugging Face. Magic AI Labs introduced LTM-2-Mini, a long context model with a 100 million token context window, and a new evaluation method called HashHop. LMSys added style control to their Chatbot Arena leaderboard, improving rankings for models like Claude 3.5 Sonnet and LLaMA 3.1 405B. Alibaba released Qwen2-VL, a multimodal LLM under Apache 2.0 license, competitive with GPT-4o mini. OpenAI CEO Sam Altman announced collaboration with the US AI Safety Institute for pre-release model testing. Discussions on AI safety and potential AI takeover risks were highlighted by Ajeya Cotra. Tools like firecrawl for web crawling and challenges in PDF processing were noted. AI hype cycles and market trends were discussed by François Chollet, and potential AI disruption in call centers was shared by Rohan Paul.
CogVideoX: Zhipu's Open Source Sora
cogvideox llama-3-1 llama-3-405b moondream phi-3.5 llama-rank zhipu-ai alibaba meta-ai-fair google hugging-face nvidia togethercompute salesforce video-generation serverless-computing vision document-vqa text-vqa mixture-of-experts retrieval-augmented-generation long-context model-routing webgpu background-removal long-form-generation superposition-prompting rohanpaul_ai philschmid vikhyatk algo_diver jayalammar davidsholz
Zhipu AI, the Tsinghua-affiliated lab and China's 3rd-largest AI lab, released the open 5B video generation model CogVideoX, which can run without GPUs via their ChatGLM web and desktop apps. Meta AI announced trust & safety research and CyberSecEval 3 alongside the release of Llama 3.1, with Llama 3 405B now available serverless on Google Cloud Vertex AI and the Hugging Face x NVIDIA NIM API. Updates include Moondream, an open vision-language model improving DocVQA and TextVQA tasks, and the lightweight MoE chat model Phi-3.5 with 16x3.8B parameters. Together Compute introduced the Rerank API featuring Salesforce's LlamaRank model for document and code ranking. Research highlights include superposition prompting for RAG without fine-tuning, the AgentWrite pipeline for long-form content generation over 20,000 words, and a comparison showing long-context methods outperform RAG at higher cost. Tools include Not Diamond, an AI model router; AI command-line interfaces; and an open-source WebGPU background removal tool. "You don't even need GPUs to run it," referring to CogVideoX.
not much happened this weekend
jamba-1.5 dream-machine-1.5 ideogram-v2 mistral-nemo-minitron-8b mistral-7b llama-3-8b nous-research cursor-ai gdm george-hotz agibot unitree eth-zurich disney uc-san-diego ai21-labs luma-labs ideogram nvidia mistral-ai meta-ai-fair distributed-ai optimizer inter-gpu-communication low-latency-training open-source humanoid-robots robotics physics-based-motion teleoperation multilingual-models long-context text-to-video text-to-image model-performance george-hotz adcock_brett aman
Nous Research announced DisTrO, a new optimizer that drastically reduces inter-GPU communication by 1000x to 10,000x, enabling efficient training over slow networks and offering an alternative to Google DeepMind's DiLoCo. Cursor AI gained viral attention from an 8-year-old user and announced a new fundraise, with co-host Aman returning to their podcast. George Hotz launched tinybox for sale. In robotics, AGIBOT revealed 5 new humanoid robots with open-source plans, and Unitree showcased its G1 humanoid robot nearing mass production at $16,000. ETH Zurich and Disney developed an AI system for physics-based robot motion generation from text or images. UC San Diego released ACE, an open-source teleoperation system for controlling multiple robots. AI21 Labs unveiled Jamba 1.5, a multilingual model with 256k context length and permissive licensing. Luma Labs released Dream Machine 1.5 for improved text-to-video generation. Ideogram launched v2 of its text-to-image model with near-perfect text generation. Nvidia and Mistral released Mistral-NeMo-Minitron 8B, a small model outperforming Mistral-7B and Llama-3-8B on the Open LLM leaderboard.
Nvidia Minitron: LLM Pruning and Distillation updated for Llama 3.1
llama-3-1-8b llama-3-1 jamba-1.5 claude-3 dracarys-70b dracarys-72b mistral-nemo-minitron-8b mistral-7b nvidia meta-ai-fair ai21-labs anthropic hugging-face pruning knowledge-distillation weight-pruning activation-based-pruning width-pruning kl-divergence teacher-correction prompt-optimization multilinguality long-context mixture-of-experts model-fine-tuning
Nvidia and Meta researchers updated their Llama 3 results with a paper demonstrating the effectiveness of combining weight pruning and knowledge distillation to reduce training costs by training only the largest model from scratch and deriving smaller models via pruning and distillation. The process involves teacher correction, activation-based pruning (favoring width pruning), and retraining with distillation using KL Divergence loss, resulting in better-performing models at comparable sizes. However, distillation incurs some accuracy tradeoffs. Additionally, AI21 Labs launched Jamba 1.5, a hybrid SSM-Transformer MoE model with large context windows and multilingual support. Anthropic updated Claude 3 with LaTeX rendering and prompt caching. An open-source coding-focused LLM, Dracarys, was released in 70B and 72B sizes, showing improved coding performance. The Mistral Nemo Minitron 8B model outperforms Llama 3.1 8B and Mistral 7B on the Hugging Face leaderboard, highlighting pruning and distillation benefits. Research on prompt optimization reveals the complexity of prompt search spaces and the surprising effectiveness of simple algorithms like AutoPrompt/GCG.
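A minimal sketch of the distillation step: the pruned student is retrained against the (corrected) teacher with a KL-divergence loss over temperature-softened logits. The temperature and scaling below are conventional illustrative choices, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 T: float = 2.0) -> torch.Tensor:
    """KL(teacher || student) over temperature-softened distributions,
    scaled by T^2 as is conventional."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)

loss = distill_loss(torch.randn(8, 32000), torch.randn(8, 32000))
```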
Ideogram 2 + Berkeley Function Calling Leaderboard V2
llama-3-70b gpt-4 phi-3.5 functionary-llama-3-70b llama-3 ideogram midjourney berkeley openai hugging-face microsoft meta-ai-fair baseten kai claude functionary function-calling benchmarking image-generation model-optimization vision multimodality model-performance fine-tuning context-windows cybersecurity code-analysis ai-assisted-development
Ideogram returns with a new image generation model featuring color palette control, a fully controllable API, and an iOS app, reaching a milestone of 1 billion images created. Meanwhile, Midjourney released a Web UI but still lacks an API. In function calling, the Berkeley Function Calling Leaderboard (BFCL) updated to BFCL V2 • Live, adding 2,251 live, user-contributed function documents and queries to improve evaluation quality. GPT-4 leads the leaderboard, but the open-source Functionary Llama 3-70B finetune from Kai surpasses Claude. On model releases, Microsoft launched three Phi-3.5 models with impressive reasoning and context-window capabilities, while Meta AI FAIR introduced UniBench, a unified benchmark suite covering over 50 vision-language model tasks. Baseten improved Llama 3 inference speed by up to 122% using Medusa. A new cybersecurity benchmark, Cyberbench, featuring 40 CTF tasks, was released. Additionally, Codegen was introduced as a tool for programmatic codebase analysis and AI-assisted development. "Multiple functions > parallel functions" was highlighted as a key insight in function calling.
not much happened today
gpt-4o claude-3.5-sonnet phi-3.5-mini phi-3.5-moe phi-3.5-vision llama-3-1-405b qwen2-math-72b openai anthropic microsoft meta-ai-fair hugging-face langchain box fine-tuning benchmarking model-comparison model-performance diffusion-models reinforcement-learning zero-shot-learning math model-efficiency ai-regulation ai-safety ai-engineering prompt-engineering swyx ylecun
OpenAI launched GPT-4o finetuning with a case study on Cosine. Anthropic released Claude 3.5 Sonnet with 8k token output. Microsoft Phi team introduced Phi-3.5 in three variants: Mini (3.8B), MoE (16x3.8B), and Vision (4.2B), noted for sample efficiency. Meta released Llama 3.1 405B, deployable on Google Cloud Vertex AI, offering GPT-4 level capabilities. Qwen2-Math-72B achieved state-of-the-art math benchmark performance with a Gradio demo. Discussions included model comparisons like ViT vs CNN and Mamba architecture. Tools updates featured DSPy roadmap, Flux Schnell improving diffusion speed on M1 Max, and LangChain community events. Research highlights zero-shot DUP prompting for math reasoning and fine-tuning best practices. AI ethics covered California's AI Safety Bill SB 1047 and regulatory concerns from Yann LeCun. Commentary on AI engineer roles by Swyx. "Chat with PDF" feature now available for Box Enterprise Plus users.
not much happened today
grok-2 claude-3.5-sonnet claude-3.5 gpt-4 chatgpt-4o-latest anthropic x-ai google-deepmind openai mistral-ai meta-ai-fair salesforce box prompt-caching model-performance vision fine-tuning multilinguality ai-safety design-automation document-processing ai-agents ai-integration ai-job-market ai-acceleration humor demis-hassabis francois-chollet
Anthropic rolled out prompt caching in its API, reducing input costs by up to 90% and latency by 80%, making very long prompts practical as an alternative to fine-tuning. xAI released Grok-2, a new model competing with frontier models from Google DeepMind, OpenAI, Anthropic, Mistral AI, and Meta AI FAIR, supporting vision and text inputs and integrating external image generation models. Claude 3.5 Sonnet is reported to outperform GPT-4 in coding and reasoning, while ChatGPT-4o-latest shows reasoning improvements. François Chollet proposed a theory defining intelligence as the efficiency of operationalizing past information for future tasks. The Aya project involves 3000 collaborators building multilingual AI datasets. Demis Hassabis discussed AI hype and safe AI development in a podcast. Tools like Dora AI for Figma and Box's AI API enhance design automation and document processing. Salesforce released DEI, an open framework of AI software-engineering agents with a 55% resolve rate on SWE-Bench Lite. Industry trends highlight rapid AI integration, the importance of networking in the AI job market, and potential OpenAI GPT-4 expansion in response to competitors. Memes include humor about Apple Vision Pro.
not much happened today
gpt-4-0613 gpt-3.5-turbo-0613 gpt-4o-2024-08-06 mistral-large-2 gpt4-turbo claude-3-opus idefics3-llama bigllama-3.1-1t-instruct llama-3-120b-instruct openai mistral-ai meta-ai-fair structured-outputs function-calling json-schema benchmarking multimodality context-windows model-scaling ai-hardware vision speech-processing robotics ai-regulation sama rohanpaul_ai corbtt guillaumelample mervenoyann maximelabonne aidan_mclau adcock_brett ylecun
OpenAI introduced structured outputs in their API with a new "strict" mode and a "response_format" parameter, supporting models like gpt-4-0613, gpt-3.5-turbo-0613, and the new gpt-4o-2024-08-06. They also halved the price of gpt-4o to $2.50 per million tokens. Mistral Large 2 outperforms gpt4-turbo and claude-3-opus on hard benchmarks and coding tasks. Idefics3-Llama offers multimodal capabilities with a 10k token context window. BigLlama-3.1-1T-Instruct is an upscaled version of llama-3-120b-instruct. New benchmark "big_model_smell" measures creativity and reliability. Figure 02 robot features advanced AI hardware with onboard vision language model, enhanced battery, and speech-to-speech reasoning. Yann LeCun expressed concerns about California's SB1047 regulation.
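The call shape, per OpenAI's announcement at the time (worth checking against current docs), looks roughly like this:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Extract: Jane, 34, Lisbon"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,                      # enforce the schema exactly
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "city": {"type": "string"},
                },
                "required": ["name", "age", "city"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)  # schema-conforming JSON string
```

With `strict: true`, the output is constrained to the supplied JSON Schema, removing the retry-and-validate loop that free-form JSON mode required.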
GPT4o August + 100% Structured Outputs for All (GPT4o August edition)
gpt-4o-2024-08-06 llama-3-1-405b llama-3 claude-3.5-sonnet gemini-1.5-pro gpt-4o yi-large-turbo openai meta-ai-fair google-deepmind yi-large nvidia groq langchain jamai langsmith structured-output context-windows model-pricing benchmarking parameter-efficient-expert-retrieval retrieval-augmented-generation mixture-of-experts model-performance ai-hardware model-deployment filtering multi-lingual vision john-carmack jonathan-ross rohanpaul_ai
OpenAI released the new gpt-4o-2024-08-06 model with a 16k output-token limit and 33-50% lower pricing than the previous 4o-May version, featuring a new Structured Output API that improves output quality and reduces retry costs. Meta AI launched Llama 3.1, a 405-billion parameter model surpassing GPT-4 and Claude 3.5 Sonnet on benchmarks, alongside expanding the Llama Impact Grant program. Google DeepMind quietly released Gemini 1.5 Pro, outperforming GPT-4o, Claude-3.5, and Llama 3.1 on LMSYS benchmarks and leading the Vision Leaderboard. Yi-Large Turbo was introduced as a cost-effective upgrade priced at $0.19 per million tokens. In hardware, NVIDIA H100 GPUs were highlighted by John Carmack for their massive AI workload power, and Groq announced plans to deploy 108,000 LPUs by Q1 2025. New AI tools and techniques include RAG (Retrieval-Augmented Generation), the JamAI Base platform for Mixture of Agents systems, and LangSmith's enhanced filtering capabilities. Google DeepMind also introduced PEER (Parameter Efficient Expert Retrieval) architecture.
How Carlini Uses AI
gemma-2-2b gpt-3.5-turbo-0613 mixtral-8x7b gen-3-alpha segment-anything-model-2 stable-fast-3d groq intel deepmind box figure-ai openai google meta-ai-fair nvidia stability-ai runway benchmarking adversarial-attacks large-language-models text-generation multimodality robotics emotion-detection structured-data-extraction real-time-processing teleoperation 3d-generation text-to-video nicholas-carlini chris-dixon rasbt
Groq shareholders saw their net worth rise while others fell, with Intel's CEO expressing concern. Nicholas Carlini of DeepMind gains recognition and criticism for his extensive AI writings, including an 80,000-word treatise on AI use and a benchmark for large language models. Chris Dixon comments on AI Winter skepticism, emphasizing long-term impact. Box introduces an AI API for extracting structured data from documents, highlighting the potential and risks of LLM-driven solutions. Recent AI developments include Figure AI launching the advanced humanoid robot Figure 02, OpenAI rolling out Advanced Voice Mode for ChatGPT with emotion detection, Google open-sourcing the Gemma 2 2B model matching GPT-3.5-Turbo-0613 performance, Meta AI FAIR releasing Segment Anything Model 2 (SAM 2) for real-time object tracking, NVIDIA showcasing Project GR00T for humanoid teleoperation with Apple Vision Pro, Stability AI launching Stable Fast 3D for rapid 3D asset generation, and Runway unveiling Gen-3 Alpha for AI text-to-video generation.
Execuhires: Tempting The Wrath of Khan
gemini-1.5-pro gpt-4o claude-3.5 flux-1 llama-3-1-405b character.ai google adept amazon inflection microsoft stability-ai black-forest-labs schelling google-deepmind openai anthropic meta-ai-fair lmsys langchainai execuhire model-benchmarking multilinguality math coding text-to-image agent-ide open-source-models post-training data-driven-performance noam-shazeer mostafa-mostaque david-friedman rob-rombach alexandr-wang svpino rohanpaul_ai
Character.ai's $2.5b execuhire to Google marks a significant leadership move alongside Adept's $429m execuhire to Amazon and Inflection's $650m execuhire to Microsoft. Despite strong user growth and content momentum, Character.ai's CEO Noam Shazeer returns to Google, signaling shifting vibes in the AI industry. Google DeepMind's Gemini 1.5 Pro tops Chatbot Arena benchmarks, outperforming GPT-4o and Claude-3.5, excelling in multilingual, math, and coding tasks. The launch of Black Forest Labs' FLUX.1 text-to-image model and LangGraph Studio agent IDE highlight ongoing innovation. Llama 3.1 405B is released as the largest open-source model, fostering developer use and competition with closed models. The industry is focusing increasingly on post-training and data as key competitive factors, raising questions about acquisition practices and regulatory scrutiny.
Gemma 2 2B + Scope + Shield
gemma-2b gemma-2-9b gemma-2-27b llama-3-1-405b sam-2 gpt-3.5 vicuna alpacaeval g-eval google-deepmind anthropic meta-ai-fair openai perplexity-ai nvidia lmsys knowledge-distillation leaderboards model-interpretability finetuning harm-detection video-segmentation voice publishers-program robotics-data-scaling quantization llm-evaluation prompt-engineering
Gemma 2B, a 2 billion parameter model trained on 2 trillion tokens and distilled from a larger unnamed LLM, has been released by Google DeepMind and shows strong leaderboard performance despite weaknesses in math. The Gemma series, including 9B and 27B models, has gained popularity since its June release. The team also released 400 SAEs for interpretability, inspired by Anthropic's research. A finetuned classifier called ShieldGemma outperforms Meta's LlamaGuard in harm detection. Meanwhile, Meta AI announced Llama-3.1-405B reaching #3 on the Overall Arena leaderboard, and released SAM 2, a video and image segmentation model with significant speed improvements. OpenAI is rolling out an advanced Voice Mode to Plus users. Perplexity AI launched a Publishers Program with major media partners and a status page. NVIDIA introduced Project GR00T for scaling robot data using Apple Vision Pro and generative simulation. Interest in quantization for compressing LLMs is growing, and LLM-as-a-Judge implementations from Vicuna, AlpacaEval, and G-Eval highlight the effectiveness of simple prompts and domain-specific evaluation.
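The "simple prompts" point is easy to make concrete; a minimal pairwise judge in the Vicuna/AlpacaEval style looks roughly like this, with the wording illustrative and `llm` a hypothetical callable:

```python
JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
decide which answer is better. Respond with exactly "A", "B", or "tie".

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Verdict:"""

def judge(question: str, a: str, b: str, llm) -> str:
    # In practice, judge twice with A/B swapped to control for position bias.
    verdict = llm(JUDGE_PROMPT.format(question=question, answer_a=a, answer_b=b))
    return verdict.strip()
```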
not much happened today
sam-2 gemini-1.5-pro chatgpt midjourney-v6.1 meta-ai-fair google-deepmind scale-ai apple canva hugging-face object-segmentation quantization web-development-framework adversarial-robustness on-device-ai open-source robotics voice vision jeremyphoward demis-hassabis ylecun maartengrootendorst jimfan
Meta released SAM 2, a unified model for real-time object segmentation with a new dataset 4.5x larger and 53x more annotated than previous ones. FastHTML, a new Python web framework by Jeremy Howard, enables easy creation and deployment of interactive web apps. Scale AI launched the SEAL Leaderboard on adversarial robustness, topped by Gemini 1.5 Pro from Google DeepMind. Apple published a technical report on their Intelligence Foundation Language Models for on-device and server use. Yann LeCun emphasized the importance of open source AI in an article co-authored with Martin Casado and Ion Stoica. Maarten Grootendorst's "Visual Guide to Quantization" on efficient LLM inference went viral. ChatGPT started rolling out advanced voice and vision-enabled modes to select users. Leonardo AI was acquired by Canva. Jim Fan shared insights on Project Groot augmenting human demonstration data for robotics. Midjourney v6.1 was released.
Apple Intelligence Beta + Segment Anything Model 2
llama-3-405b llama-3 segment-anything-model meta-ai-fair apple image-segmentation memory-attention video-processing pretraining cloud-tpus post-training synthetic-data instruction-following reasoning writing benchmarking bindureddy maximelabonne reach_vb
Meta advanced its open source AI with a sequel to the Segment Anything Model, enhancing image segmentation with memory attention for video applications using minimal data and compute. Apple Intelligence delayed its official release to iOS 18.1 in October but launched developer previews on MacOS Sequoia, iOS 18, and iPadOS 18, accompanied by a detailed 47-page paper revealing extensive pretraining on 6.3T tokens and use of Cloud TPUs rather than Apple Silicon. The paper highlights improvements in instruction following, reasoning, and writing through post-training and synthetic data. Benchmarks show Apple’s model scores lower than Llama 3, but with trusted human evaluations. Additionally, Meta released Llama 3.1 with a 405B parameter model, marking a significant open-source frontier model release.
AlphaProof + AlphaGeometry2 reach 1 point short of IMO Gold
gemini alphageometry-2 alphaproof llama-3-1-405b llama-3-70b llama-3-8b mistral-large-2 google-deepmind meta-ai-fair mistral-ai neurosymbolic-ai mathematical-reasoning synthetic-data knowledge-sharing model-fine-tuning alpha-zero multilinguality context-windows model-scaling benchmarking performance-comparison tim-gowers guillaume-lample osanseviero
Search+verifier approaches highlight advances in neurosymbolic AI at the 2024 International Mathematical Olympiad. Google DeepMind's combination of AlphaProof and AlphaGeometry 2 solved four of six IMO problems, with AlphaProof being a finetuned Gemini model using an AlphaZero approach, and AlphaGeometry 2 trained on significantly more synthetic data with a novel knowledge-sharing mechanism. Despite the impressive results, human judges noted the AI required far longer than human competitors. Meanwhile, Meta AI released Llama 3.1 with a 405B parameter model and smaller variants, and Mistral AI launched Mistral Large 2 with 123B parameters and a 128k context window, outperforming Llama 3.1 on coding tasks and multilingual benchmarks. This marks significant progress in AI mathematical reasoning, model scaling, and multilingual capabilities.
Mistral Large 2 + RIP Mistral 7B, 8x7B, 8x22B
mistral-large-2 mistral-nemo-12b llama-3.1-8b llama-3.1-70b llama-3.1 llama-3-405b yi-34b-200k gpt-4o mistral-ai meta-ai-fair groq togethercompute code-generation math function-calling reasoning context-windows model-deprecation pretraining posttraining benchmarking
Mistral Large 2 introduces 123B parameters with Open Weights under a Research License, focusing on code generation, math performance, and a massive 128k context window, improving over Mistral Large 1's 32k context. It claims better function calling capabilities than GPT-4o and enhanced reasoning. Meanwhile, Meta officially released Llama-3.1 models including Llama-3.1-70B and Llama-3.1-8B with detailed pre-training and post-training insights. The Llama-3.1 8B model's 128k context performance was found underwhelming compared to Mistral Nemo and Yi 34B 200K. Mistral is deprecating older Apache open-source models, focusing on Large 2 and Mistral Nemo 12B. The news also highlights community discussions and benchmarking comparisons.
Llama 3.1: The Synthetic Data Model
llama-3-405b llama-3-1 llama-3 meta-ai-fair groq fireworks synthetic-data fine-tuning reinforcement-learning multilinguality long-context tool-use code-generation math model-licensing inference-speed model-deployment bindureddy thomas
Meta AI has released Llama 3.1, including a 405B parameter model that triggers regulatory considerations like the EU AI Act and SB 1047. The model incorporates extensive synthetic data techniques for code, math, multilinguality, long context, and tool use fine-tuning, with RLHF using synthetic preference data from Llama 2. The launch was coordinated across major inference providers, with Groq demonstrating 750 tokens per second inference speed and Fireworks leading in pricing. The updated license explicitly allows synthetic data generation, marking a significant step in open frontier-class LLMs and cost-efficiency improvements since March.
Llama 3.1 Leaks: big bumps to 8B, minor bumps to 70b, and SOTA OSS 405b model
llama-3-1-405b llama-3-8b llama-3-70b llama-3-1-8b gpt-4o gpt-4o-mini claude-3-5 qwen-2 meta-ai-fair openai alibaba multilinguality code-generation context-windows model-training synthetic-data benchmarking reasoning fine-tuning model-performance dataset-release swyx philschmid jjitsev lewtun teknium1 adcock_brett
Llama 3.1 leaks reveal a 405B dense model with 128k context length, trained on 39.3M GPU hours using H100-80GB GPUs, and fine-tuned with over 25M synthetic examples. The model shows significant benchmark improvements, especially for the 8B and 70B variants, with some evals suggesting the 70B outperforms GPT-4o. GPT-4o Mini launched as a cost-efficient variant with strong performance but some reasoning weaknesses. Synthetic datasets like NuminaMath enable models such as Alibaba Qwen 2 to surpass GPT-4o and Claude 3.5 in math competitions. Discussions include reasoning task benchmarks and dataset building for improved reasoning.
Mini, Nemo, Turbo, Lite - Smol models go brrr (GPT4o-mini version)
gpt-4o-mini deepseek-v2-0628 mistral-nemo llama-8b openai deepseek-ai mistral-ai nvidia meta-ai-fair hugging-face langchain keras cost-efficiency context-windows open-source benchmarking neural-networks model-optimization text-generation fine-tuning developer-tools gpu-support parallelization cuda-integration multilinguality long-context article-generation liang-wenfeng
OpenAI launched the GPT-4o Mini, a cost-efficient small model priced at $0.15 per million input tokens and $0.60 per million output tokens, aiming to replace GPT-3.5 Turbo with enhanced intelligence but some performance limitations. DeepSeek open-sourced DeepSeek-V2-0628, topping the LMSYS Chatbot Arena Leaderboard and emphasizing their commitment to contributing to the AI ecosystem. Mistral AI and NVIDIA released the Mistral NeMo, a 12B parameter multilingual model with a record 128k token context window under an Apache 2.0 license, sparking debates on benchmarking accuracy against models like Meta Llama 8B. Research breakthroughs include the TextGrad framework for optimizing compound AI systems via textual feedback differentiation and the STORM system improving article writing by 25% through simulating diverse perspectives and addressing source bias. Developer tooling trends highlight LangChain's evolving context-aware reasoning applications and the Modular ecosystem's new official GPU support, including discussions on Mojo and Keras 3.0 integration.
We Solved Hallucinations
gpt-2 flashattention-3 lynx meta-ai-fair nvidia princeton colfax patronus-ai databricks mosaic-ai openai compute-hardware gpu-optimization flashattention llm-evaluation hallucination-detection vision benchmarking synthetic-data model-training karpathy tri_dao giffmana vikhyatk dbrxmosaicai
Reddit's URL structure causes link errors in AI-generated summaries, especially with NSFW content affecting models like Claude and GPT-4. The team fixed this glitch while still leveraging LLMs for summarizing Reddit content. GPT-2 training costs have dramatically dropped to ~$672 using H100 GPUs and software improvements like CUDA and FlashAttention. FlashAttention-3 was released, achieving up to 740 TFLOPS on H100 GPUs, with FP8 nearing 1.2 PFLOPS, developed collaboratively by Meta, NVIDIA, Princeton, and Colfax. Hopper GPUs enable major speedups with new hardware features. Synthetic data may not improve vision tasks, as shown in recent research. The Avocado360 benchmark evaluates vision-language models' ability to detect avocados in images. Lynx, a hallucination detection model for LLMs, was introduced for real-world healthcare and fintech applications, trained by Patronus AI on Databricks Mosaic AI using Composer.
Nothing much happened today
chameleon-7b chameleon-30b xlam-1b gpt-3.5 phi-3-mini mistral-7b-v3 huggingface truth_terminal microsoft apple openai meta-ai-fair yi axolotl amd salesforce function-calling multimodality model-releases model-updates model-integration automaticity procedural-memory text-image-video-generation
HuggingFace released a browser-based timestamped Whisper using transformers.js. A Twitter bot by truth_terminal became the first "semiautonomous" bot to secure VC funding. Microsoft and Apple abruptly left the OpenAI board amid regulatory scrutiny. Meta is finalizing a major upgrade to Reddit comments addressing hallucination issues. The Yi model gained popularity on GitHub with 7.4K stars and 454 forks, with potential integration with Axolotl for pregeneration and preprocessing. AMD technologies enable household/small business AI appliances. Meta released Chameleon-7b and Chameleon-30b models on HuggingFace supporting unified text and image tokenization. Salesforce's xLAM-1b model outperforms GPT-3.5 in function calling despite its smaller size. Anole pioneered open-source multimodal text-image-video generation up to 720p 144fps. Phi-3 Mini expanded from 3.8B to 4.7B parameters with function calling, competing with Mistral-7b v3. "System 2 distillation" in humans relates to automaticity and procedural memory.
Test-Time Training, MobileLLM, Lilian Weng on Hallucination (Plus: Turbopuffer)
llama-2-7b codegeex4-all-9b mamba facebook-research meta-ai-fair tsinghua-university hallucination-detection anti-hallucination-methods on-device-ai model-architecture rnn long-context-modeling model-scaling expressive-hidden-states code-generation lilian-weng yann-lecun
Lilian Weng released a comprehensive literature review on hallucination detection and anti-hallucination methods including techniques like FactualityPrompt, SelfCheckGPT, and WebGPT. Facebook AI Research (FAIR) published MobileLLM, a sub-billion parameter on-device language model architecture achieving performance comparable to llama-2-7b with innovations like thin and deep models and shared weights. A new RNN-based LLM architecture with expressive hidden states was introduced, replacing attention mechanisms and scaling better than Mamba and Transformer models for long-context modeling. Additionally, Tsinghua University open sourced CodeGeeX4-ALL-9B, a multilingual code generation model excelling in code assistance.
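The "expressive hidden states" idea is easiest to see in code: the recurrent state is itself a tiny model, updated by one gradient step of a self-supervised loss at every token. The dimensions, the plain reconstruction loss, and the constants folded into the learning rate are simplifying assumptions, not the paper's exact formulation:

```python
import torch

def ttt_scan(xs: torch.Tensor, lr: float = 0.1) -> torch.Tensor:
    """xs: (seq_len, d). The 'hidden state' W is a weight matrix updated by a
    manual gradient step on a reconstruction loss at every token."""
    d = xs.shape[-1]
    W = torch.zeros(d, d)
    outs = []
    for x in xs:
        err = W @ x - x                       # self-supervised target: rebuild x
        W = W - lr * torch.outer(err, x)      # grad of ||Wx - x||^2, constants in lr
        outs.append(W @ x)                    # output uses the updated state
    return torch.stack(outs)

y = ttt_scan(torch.randn(16, 8))
```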
Problems with MMLU-Pro
mmlu-pro llama-3-8b-q8 gpt4all-3.0 chatgpt claude llama gemini mobilellm runway-gen-3-alpha meta-3d-gen huggingface meta-ai-fair salesforce runway nomic-ai pineapple argil-ai benchmarking prompt-engineering model-evaluation model-performance multimodality automated-dataset-generation video-generation open-source-models ai-assistants text-to-3d deepfake transformers reasoning wenhu-chen danhendrycks clementine ylecun adcock_brett svpino rohanpaul_ai
MMLU-Pro is gaining attention as the successor to MMLU on the Open LLM Leaderboard V2 by HuggingFace, despite community concerns about evaluation discrepancies and prompt sensitivity affecting model performance, notably a 10-point improvement in Llama-3-8b-q8 with simple prompt tweaks. Meta's MobileLLM research explores running sub-billion parameter LLMs on smartphones using shared weights and deeper architectures. Salesforce's APIGen introduces an automated dataset generation system for function-calling tasks outperforming larger models. Runway Gen-3 Alpha launches an AI video generator for paid users creating realistic 10-second clips. Nomic AI's GPT4All 3.0 offers an open-source desktop app supporting thousands of local models. AI assistants with multimodal capabilities and affordable access to multiple LLMs like ChatGPT, Claude, Llama, and Gemini are emerging. Meta 3D Gen advances text-to-3D asset generation, while Argil AI enables deepfake video creation from text threads. Research on transformer grokking and reasoning highlights advances in robust reasoning capabilities.
That GPT-4o Demo
gpt-4o gemma-2 meta-code-llama openai google-deepmind meta-ai-fair voice-generation ocr screen-sharing vision code-understanding model-customization efficiency textual-intelligence multimodal-agents sft distillation rlhf model-merging model-optimization safety romain-huet fchollet
Romain Huet demonstrated an unreleased version of GPT-4o on ChatGPT Desktop showcasing capabilities like low latency voice generation, whisper tone moderation, camera mode streaming video to GPT-4o, rapid OCR, screen sharing with ChatGPT for programming help, clipboard reading, and vision-based code conversation. OpenAI's four investment areas highlighted include textual intelligence, efficiency/cost, model customization, and multimodal agents. Google DeepMind released Gemma 2 models in 9B and 27B sizes trained on 8T and 13T tokens respectively, using SFT, distillation, RLHF, and model merging, optimized for TPUv5e with strong performance and safety measures. Meta AI announced the Meta LLM Compiler built on Meta Code Llama with enhanced code optimization and compiler features.
Gemini launches context caching... or does it?
nemotron llama-3-70b chameleon-7b chameleon-34b gemini-1.5-pro deepseek-coder-v2 gpt-4-turbo claude-3-opus gemini-1.5-pro nvidia meta-ai-fair google deepseek hugging-face context-caching model-performance fine-tuning reinforcement-learning group-relative-policy-optimization large-context model-training coding model-release rohanpaul_ai _philschmid aman-sanger
Nvidia's Nemotron ranks #1 open model on LMsys and #11 overall, surpassing Llama-3-70b. Meta AI released Chameleon 7B/34B models after further post-training. Google's Gemini introduced context caching, offering a cost-efficient middle ground between RAG and finetuning, with a minimum input token count of 33k and no upper limit on cache duration. DeepSeek launched DeepSeek-Coder-V2, a 236B parameter model outperforming GPT-4 Turbo, Claude-3-Opus, and Gemini-1.5-Pro in coding tasks, supporting 338 programming languages and extending context length to 128K. It was trained on 6 trillion tokens using the Group Relative Policy Optimization (GRPO) algorithm and is available on Hugging Face with a commercial license. These developments highlight advances in model performance, context caching, and large-scale coding models.
Qwen 2 beats Llama 3 (and we don't know how)
qwen-2 llama-3 llama-3-70b gpt-4 nllb alibaba groq meta-ai-fair multilinguality benchmarking inference-speed sparse-autoencoders scaling-laws post-training instruction-following rejection-sampling execution-feedback model-release multilingual-models model-training philschmid huybery jonathanross321 awnihannun gdb nabla_theta ylecun
Alibaba released Qwen 2 models under Apache 2.0 license, claiming to outperform Llama 3 in open models with multilingual support in 29 languages and strong benchmark scores like MMLU 82.3 and HumanEval 86.0. Groq demonstrated ultra-fast inference speed on Llama-3 70B at 40,792 tokens/s and running 4 Wikipedia articles in 200ms. Research on sparse autoencoders (SAEs) for interpreting GPT-4 neural activity showed new training methods, metrics, and scaling laws. Meta AI announced the No Language Left Behind (NLLB) model capable of high-quality translations between 200 languages, including low-resource ones. "Our post-training phase is designed with the principle of scalable training with minimal human annotation," highlighting techniques like rejection sampling for math and execution feedback for coding.
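The quoted rejection-sampling recipe reduces to a short loop: sample several candidates at nonzero temperature and keep only those that pass a verifier (unit tests for code, an answer checker for math). `generate` and `run_tests` below are hypothetical stand-ins:

```python
from typing import Callable, List

def rejection_sample(prompt: str,
                     generate: Callable[[str], str],
                     run_tests: Callable[[str], bool],
                     n: int = 8) -> List[str]:
    kept = []
    for _ in range(n):
        candidate = generate(prompt)       # sample with temperature > 0
        if run_tests(candidate):           # execution feedback as the filter
            kept.append(candidate)         # accepted pairs become training data
    return kept
```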
Contextual Position Encoding (CoPE)
cope gemini-1.5-flash gemini-1.5-pro claude gpt-3 meta-ai-fair google-deepmind anthropic perplexity-ai langchain openai positional-encoding transformers counting copying language-modeling coding external-memory tool-use model-evaluation inference-speed model-benchmarking scaling research-synthesis jason-weston alexandr-wang karpathy arav-srinivas
Meta AI researcher Jason Weston introduced CoPE, a novel positional-encoding method for transformers that computes context-dependent, learnable gates so position is measured in contextually relevant tokens rather than raw token count, improving counting and copying tasks as well as language modeling and coding performance. The approach can potentially be extended with external memory for gate calculation. Google DeepMind released Gemini 1.5 Flash and Pro models optimized for fast inference. Anthropic announced general availability of tool use for Claude, enhancing its ability to orchestrate tools for complex tasks. Alexandr Wang launched SEAL Leaderboards for private, expert evaluations of frontier models. Karpathy reflected on the 4th anniversary of GPT-3, emphasizing scaling and practical improvements. Perplexity AI launched Perplexity Pages to convert research into visually appealing articles, described as an "AI Wikipedia" by Arav Srinivas.
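The core of CoPE fits in a few lines: each query computes a sigmoid gate against every earlier key, a token's "position" is the running sum of those gates, and the resulting fractional positions index into integer position embeddings by interpolation. A single-head PyTorch sketch of the idea (not the paper's exact code):

```python
import torch

def cope_positions(q, k):
    """Contextual positions per CoPE, sketched. q, k: (seq, dim).

    Returns fractional positions p[i, j] = sum of sigmoid gates over
    tokens j..i, so 'position' counts only contextually gated tokens.
    """
    seq = q.size(0)
    gates = torch.sigmoid(q @ k.T)                       # (seq, seq)
    mask = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
    gates = gates * mask                                 # causal: only j <= i
    # Reversed cumulative sum gives p[i, j] = gates[i, j] + ... + gates[i, i].
    return gates.flip(-1).cumsum(-1).flip(-1) * mask

def interpolate_pos_embedding(pos, emb):
    """Fractional positions index integer embeddings by linear interpolation."""
    low = pos.floor().long().clamp(max=emb.size(0) - 1)
    high = (low + 1).clamp(max=emb.size(0) - 1)
    frac = (pos - low).unsqueeze(-1)
    return (1 - frac) * emb[low] + frac * emb[high]
```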
Somebody give Andrej some H100s already
gpt-2 openai fineweb meta-ai-fair nvidia tesla cuda fine-tuning training-time gpu-acceleration convolutional-neural-networks real-time-processing ai-safety ai-regulation andrej-karpathy yann-lecun elon-musk francois-chollet svpino mervenoyann
OpenAI's GPT-2 sparked controversy five years ago for being "too dangerous to release." Now, with FineWeb and llm.c, a tiny GPT-2 model can be trained in 90 minutes for $20 on 8xA100 GPUs, with the full 1.6B model estimated to take one week and $2.5k. The project is notable for its heavy use of CUDA (75.8% of the codebase), aiming to simplify the training stack. Meanwhile, a Twitter debate between Yann LeCun and Elon Musk highlighted the importance of convolutional neural networks (CNNs) in real-time image processing for autonomous driving, with LeCun emphasizing scientific research's role in technological progress. LeCun also criticized AI doomsday scenarios, arguing for cautious optimism about AI safety and regulation.
Life after DPO (RewardBench)
gpt-3 gpt-4 gpt-5 gpt-6 llama-3-8b llama-3 claude-3 gemini x-ai openai mistral-ai anthropic cohere meta-ai-fair hugging-face nvidia reinforcement-learning-from-human-feedback direct-preference-optimization reward-models rewardbench language-model-history model-evaluation alignment-research preference-datasets personalization transformer-architecture nathan-lambert chris-manning elon-musk bindureddy rohanpaul_ai nearcyan
xAI raised $6 billion at a $24 billion valuation, positioning it among the most highly valued AI startups, with the funding expected to pay for GPT-5- and GPT-6-class models. The RewardBench tool, developed by Nathan Lambert, evaluates reward models (RMs) for language models and shows Cohere's RMs outperforming open-source alternatives. The discussion traces the evolution of language models from Claude Shannon's 1948 model to GPT-3 and beyond, emphasizing the roles of RLHF (Reinforcement Learning from Human Feedback) and the newer DPO (Direct Preference Optimization) method. Notably, some Llama 3 8B reward-model-focused models currently outperform GPT-4, Cohere, Gemini, and Claude on the RewardBench leaderboard, raising questions about reward hacking. Future alignment research directions include improving preference datasets, DPO techniques, and personalization in language models. The report also compares xAI's valuation with OpenAI, Mistral AI, and Anthropic, noting speculation about xAI's spending on Nvidia hardware.
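For reference, the DPO objective under discussion replaces an explicit reward model with a beta-scaled log-ratio between the policy and a frozen reference model; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Inputs are summed log-probs of the chosen/rejected responses under
    the policy and the frozen reference model. The implicit reward is
    the beta-scaled log ratio; the loss pushes chosen above rejected.
    """
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```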
Clémentine Fourrier on LLM evals
claude-3-opus huggingface meta-ai-fair llm-evaluation automated-benchmarking human-evaluation model-bias data-contamination elo-ranking systematic-annotations preference-learning evaluation-metrics prompt-sensitivity clem_fourrier
Clémentine Fourrier from Hugging Face presented GAIA (a collaboration with Meta) at ICLR and shared insights on LLM evaluation methods. The blog outlines three main evaluation approaches: automated benchmarking using sample inputs/outputs and metrics; human judges, who grade and rank via vibe-checks, arena-style Elo battles, and systematic annotations; and models-as-judges using generalist or specialist models, which carry known biases. Challenges include data contamination, subjectivity, and bias in scoring. These evaluations help prevent regressions, rank models, and track progress in the field.
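Arena-style human evaluation reduces pairwise votes to ratings via the standard Elo update; a minimal sketch (the K-factor and ratings here are illustrative):

```python
def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """One Elo update, as used in arena-style pairwise model rankings.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# e.g., a 1200-rated model beating a 1250-rated one gains about 18.3 points:
# elo_update(1200, 1250, 1.0) -> (~1218.3, ~1231.7)
```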
Chameleon: Meta's (unreleased) GPT4o-like Omnimodal Model
chameleon gpt-4o gemini-1.5-flash claude-3 meta-ai-fair openai google-deepmind anthropic reddit multimodality early-fusion benchmarking model-training tokenization streaming tool-use vision coding hallucination-detection model-performance armen-aghajanyan sama alexandr-wang abacaj alexalbert__
Meta AI FAIR introduced Chameleon, a new multimodal model family with 7B and 34B parameter versions trained on 10T tokens of interleaved text and image data enabling "early fusion" multimodality that can natively output any modality. While reasoning benchmarks are modest, its "omnimodality" approach competes well with pre-GPT4o multimodal models. OpenAI launched GPT-4o, a model excelling in benchmarks like MMLU and coding tasks, with strong multimodal capabilities but some regression in ELO scores and hallucination issues. Google DeepMind announced Gemini 1.5 Flash, a small model with 1M context window and flash performance, highlighting convergence trends between OpenAI and Google models. Anthropic updated Claude 3 with streaming support, forced tool use, and vision tool integration for multimodal knowledge extraction. OpenAI also partnered with Reddit, raising industry attention.
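The "early fusion" design can be sketched abstractly: an image tokenizer maps each image to discrete codes that are spliced into the same stream as text tokens, so a single autoregressive transformer models, and can emit, both modalities. All names below are illustrative placeholders, not Chameleon's actual interfaces:

```python
# Hypothetical sketch of early-fusion token interleaving.
# `text_tokenizer` and `image_tokenizer` are placeholders: the former maps
# a string to token ids, the latter maps an image to discrete VQ codes.
def interleave(segments, text_tokenizer, image_tokenizer,
               boi: int, eoi: int) -> list[int]:
    """Flatten mixed segments, e.g. [("text", "A cat"), ("image", img)],
    into one token stream with begin/end-of-image sentinels."""
    tokens: list[int] = []
    for kind, payload in segments:
        if kind == "text":
            tokens += text_tokenizer(payload)
        else:
            tokens += [boi] + image_tokenizer(payload) + [eoi]
    return tokens
```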
Evals: The Next Generation
gpt-4 gpt-5 gpt-3.5 phi-3 mistral-7b llama-3 scale-ai mistral-ai reka-ai openai moderna sanctuary-ai microsoft mit meta-ai-fair benchmarking data-contamination multimodality fine-tuning ai-regulation ai-safety ai-weapons neural-networks model-architecture model-training model-performance robotics activation-functions long-context sam-altman jim-fan
Scale AI highlighted data contamination in benchmarks like MMLU and GSM8K, proposing a new benchmark on which Mistral overfits while Phi-3 performs well. Reka released the VibeEval benchmark for multimodal models, addressing the limitations of multiple-choice benchmarks. Sam Altman of OpenAI called GPT-4 "dumb" and hinted at GPT-5 with AI agents as a major breakthrough. Researchers jailbroke GPT-3.5 via fine-tuning. Global calls emerged to ban AI-powered weapons, with US officials urging human control over nuclear arms. Ukraine launched an AI consular avatar, while Moderna partnered with OpenAI on medical AI. Sanctuary AI and Microsoft are collaborating on AI for general-purpose robots. MIT introduced Kolmogorov-Arnold networks with improved neural-network efficiency. Meta AI is training Llama 3 models with over 400 billion parameters, featuring multimodality and longer context.
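A crude version of the contamination checks behind such critiques is n-gram overlap between training corpora and benchmark items, in the spirit of the 13-gram decontamination used for GPT-3- and Llama-style pipelines; a toy sketch (production pipelines hash and normalize far more carefully):

```python
def ngram_overlap(train_text: str, test_text: str, n: int = 13) -> float:
    """Fraction of test n-grams that also appear in the training text.

    A high fraction suggests the benchmark item leaked into training data.
    """
    def ngrams(text: str) -> set[tuple[str, ...]]:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    test = ngrams(test_text)
    if not test:
        return 0.0
    return len(test & ngrams(train_text)) / len(test)
```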
Apple's OpenELM beats OLMo with 50% of its dataset, using DeLighT
openelm llama-3 llama-3-8b-instruct llama-3-70b apple meta-ai-fair google layer-wise-scaling context-length quantization ai-alignment open-source ai-regulation eric-schmidt sebastian-raschka
Apple advances its AI presence with OpenELM, its first relatively open large language model, available in sizes from 270M to 3B parameters and featuring a novel layer-wise scaling architecture inspired by the DeLighT paper. Meanwhile, Meta's Llama 3 family pushes context-length boundaries, with community extensions supporting over 160K tokens and an 8B-Instruct variant with 262K context released on Hugging Face, alongside performance improvements in quantized versions. A new paper on AI alignment highlights KTO as the best-performing method, noting sensitivity to training-data volume. In AI ethics and regulation, former Google CEO Eric Schmidt warns about the risks of open-source AI empowering bad actors and geopolitical rivals, while a U.S. proposal aims to enforce "Know Your Customer" rules to end anonymous cloud usage.
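Layer-wise scaling is easy to sketch: rather than giving every transformer block identical width, head counts and FFN dimensions vary linearly with depth. The ranges below are illustrative, not OpenELM's published values:

```python
def layerwise_dims(num_layers: int, dim: int, head_dim: int = 64,
                   alpha=(0.5, 1.0), beta=(0.5, 4.0)):
    """Sketch of OpenELM-style layer-wise scaling: attention heads and
    FFN width grow linearly with depth instead of being uniform.

    alpha scales head count, beta scales FFN width; both interpolate
    from their first to second value across the layer stack.
    """
    configs = []
    for i in range(num_layers):
        t = i / max(num_layers - 1, 1)
        a = alpha[0] + t * (alpha[1] - alpha[0])   # head-count scaler
        b = beta[0] + t * (beta[1] - beta[0])      # FFN-width scaler
        n_heads = max(1, round(a * dim / head_dim))
        ffn_dim = int(b * dim)
        configs.append({"layer": i, "n_heads": n_heads, "ffn_dim": ffn_dim})
    return configs

# e.g., layerwise_dims(4, 1024) gives shallow layers fewer heads and
# narrower FFNs than deep layers, spending parameters where they help most.
```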
Perplexity, the newest AI unicorn
llama-3-8b llama-3-70b llama-3 llava-llama-3-8b-v1_1 phi-3 gpt-3.5 perplexity-ai meta-ai-fair hugging-face groq context-length fine-tuning quantization instruction-following model-comparison multimodality benchmarking memory-optimization model-performance daniel-gross aravind-srinivas
Perplexity doubled its valuation shortly after its Series B with a Series B-1 funding round. Significant developments around Llama 3 include context-length extension to 16K tokens, new multimodal LLaVA models outperforming their Llama 2-based predecessors, and fine-tuning improvements like QDoRA surpassing QLoRA. The Llama-3-70B model is praised for instruction following and performance across quantization formats. Microsoft's Phi-3 models, released in multiple sizes, show competitive benchmark results, with the 14B model achieving 78% on MMLU and the 3.8B model nearing GPT-3.5 performance.
FineWeb: 15T Tokens, 12 years of CommonCrawl (deduped and filtered, you're welcome)
llama-3-70b llama-3 wizardlm-2-8x22b claude-opus mistral-8x7b gpt-4 huggingface meta-ai-fair dbrx reka-ai mistral-ai lmsys openai datasets benchmarking quantization zero-shot-learning reasoning code-error-detection token-generation security
2024 has seen a significant increase in dataset sizes for training large language models: Redpajama 2 offers up to 30T tokens, DBRX used 12T, Reka Core/Flash/Edge 5T, and Llama 3 was trained on 15T. Hugging Face released FineWeb, an open dataset of 15T tokens from 12 years of filtered and deduplicated CommonCrawl data, enabling training of Llama 3-class models given sufficient compute. On Reddit, WizardLM-2-8x22b outperformed other open LLMs, including Llama-3-70b-instruct, on reasoning and math benchmarks. Claude Opus demonstrated strong zero-shot code-error spotting, surpassing Llama 3. Benchmarks revealed limitations in the LMSYS chatbot leaderboard due to instruction-tuned models gaming the system, and a new RAG benchmark showed Llama 3 70B underperforming GPT-4 while Mistral 8x7B remained strong. Efficient quantized versions of Llama 3 models are available on Hugging Face, with users reporting generation limits of around 9600 tokens on a 3090 GPU. Safety concerns include a UK sex offender banned from using AI tools and GPT-4 demonstrating an 87% success rate at exploiting real vulnerabilities.
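Deduplication at this scale typically relies on MinHash-based locality-sensitive hashing rather than exact matching; a toy sketch with the `datasketch` library (FineWeb's actual pipeline is considerably more elaborate):

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """MinHash signature over 5-character shingles of the document."""
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(len(text) - 4)}:
        m.update(shingle.encode("utf-8"))
    return m

# Index documents, flagging near-duplicates above ~80% Jaccard similarity.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
docs = {"a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumped over the lazy dog"}
for key, text in docs.items():
    sig = minhash(text)
    dupes = lsh.query(sig)
    if dupes:
        print(f"{key} near-duplicates: {dupes}")  # drop instead of indexing
    else:
        lsh.insert(key, sig)
```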
Llama-3-70b is GPT-4-level Open Model
llama-3-70b llama-3-8b llama-3 llama-2-70b mistral-7b grok-3 stable-diffusion-3 vasa-1 meta-ai-fair groq nvidia amazon microsoft benchmarking model-performance fine-tuning function-calling arithmetic image-generation video-generation energy-usage gpu-demand political-bias ai-safety scaling context-windows tokenization elon-musk
Meta has released Llama 3, its most capable open large language model, with 8B and 70B parameter versions supporting 8K context length and outperforming previous models including Llama 2 and Mistral 7B. Groq serves Llama 3 70B at 500-800 tokens/second, making it the fastest GPT-4-level token source. Discussions highlight AI scaling challenges: Elon Musk stated that training Grok 3 will require 100,000 Nvidia H100 GPUs, and AWS plans to acquire 20,000 B200 GPUs for a 27-trillion-parameter model. Microsoft unveiled VASA-1 for lifelike talking-face generation, while Stable Diffusion 3 and its extensions received mixed impressions. Concerns about AI energy usage and political bias in AI were also discussed.
Meta Llama 3 (8B, 70B)
llama-3-8b llama-3-70b llama-3-400b stable-diffusion-3 mixtral-8x22b-instruct-v0.1 vasa-1 meta-ai-fair stability-ai boston-dynamics microsoft mistral-ai hugging-face transformer tokenization model-training benchmarking robotics natural-language-processing real-time-processing synthetic-data dataset-cleaning behavior-trees ai-safety model-accuracy api model-release humor helen-toner
Meta partially released Llama 3, including 8B and 70B variants with a 400B variant still in training, touted as the first GPT-4-level open-source model. Stability AI launched the Stable Diffusion 3 API, with model weights coming soon, showing competitive realism against Midjourney V6. Boston Dynamics unveiled an electric humanoid Atlas robot, and Microsoft introduced VASA-1, generating lifelike talking faces at 40fps on an RTX 4090. Mistral AI, a European OpenAI rival, is seeking $5B in funding, with its Mixtral-8x22B-Instruct-v0.1 model achieving 100% accuracy on 64K-context benchmarks. AI safety discussions include calls from former OpenAI board member Helen Toner for audits of top AI companies, and the Mormon Church released AI usage principles. New AI development tools include Ctrl-Adapter for diffusion models, Distilabel 1.0.0 for synthetic dataset pipelines, Data Bonsai for data cleaning with LLMs, and Dendron for building LLM agents with behavior trees. Memes highlight AI development humor and cultural references. The Llama 3 release itself features improved reasoning, a 128K-token vocabulary, 8K-token sequences, and grouped query attention.
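Grouped query attention, one of the architectural changes called out above, shares each key/value head across a group of query heads to shrink the KV cache; a minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_kv_heads: int):
    """Grouped query attention: several query heads share one K/V head.

    q: (batch, n_heads, seq, head_dim)
    k, v: (batch, n_kv_heads, seq, head_dim)
    """
    n_heads = q.size(1)
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)   # share each KV head across its group
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# e.g., Llama-3-8B uses 32 query heads and 8 KV heads (group size 4),
# cutting the KV cache to a quarter of full multi-head attention.
```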
Mergestral, Meta MTIAv2, Cohere Rerank 3, Google Infini-Attention
mistral-8x22b command-r-plus rerank-3 infini-attention llama-3 sd-1.5 cosxl meta-ai-fair mistral-ai cohere google stability-ai hugging-face ollama model-merging training-accelerators retrieval-augmented-generation linear-attention long-context foundation-models image-generation rag-pipelines model-benchmarking context-length model-performance aidan_gomez ylecun swyx
Meta announced its new MTIAv2 chips, designed for training and inference acceleration with improved architecture and PyTorch 2.0 integration. Mistral released the 8x22B Mixtral model, which was merged back into a dense model to effectively create a 22B Mistral model (hence "Mergestral"). Cohere launched Rerank 3, a foundation model enhancing enterprise search and retrieval-augmented generation (RAG) systems with support for 100+ languages. Google published a paper on Infini-attention, an ultra-scalable linear-attention mechanism demonstrated on 1B and 8B models at 1-million-token sequence length. Additionally, Meta's Llama 3 is expected to start rolling out soon. Other notable updates include Command R+, an open model surpassing GPT-4 in chatbot performance with 128k context length, and advancements in Stable Diffusion models and RAG pipelines.
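Infini-attention's compressive memory is a linear-attention accumulator carried across segments: each segment first retrieves from the running memory, then folds its own keys and values into it. A per-head sketch of the paper's basic (non-delta) update, with batch dimensions omitted:

```python
import torch
import torch.nn.functional as F

def infini_memory_step(M, z, K, V, Q):
    """One segment of Infini-attention's compressive memory (sketch).

    M: (d_k, d_v) running memory (init zeros), z: (d_k,) normalizer
    (init zeros), K, Q: (seq, d_k) and V: (seq, d_v) for this segment.
    Returns the memory-retrieved values plus the updated (M, z).
    """
    sigma_q = F.elu(Q) + 1.0                 # positive feature map
    sigma_k = F.elu(K) + 1.0
    # Retrieve from memory written by all previous segments.
    A_mem = (sigma_q @ M) / (sigma_q @ z).clamp(min=1e-6).unsqueeze(-1)
    # Fold this segment's KV into memory for future segments.
    M = M + sigma_k.T @ V
    z = z + sigma_k.sum(dim=0)
    return A_mem, M, z
```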
Gemini Pro and GPT4T Vision go GA on the same day by complete coincidence
gemini-1.5-pro gpt-4-turbo llama-3 orca-2.5-7b functionary-v2.4 cosxl google openai meta-ai-fair hugging-face cohere million-token-context-window audio-processing file-api text-embedding function-calling reasoning direct-nash-optimization contrastive-learning code-interpreter diffusion-models neural-odes inference-speed multilingual-dataset image-editing no-code-development
At Google Cloud Next, Gemini 1.5 Pro was released with a million-token context window, available in 180+ countries, featuring 9.5 hours of audio understanding, a new File API for nearly unlimited free uploads, and the Gecko-1b-256/768 embedding model. GPT-4 Turbo with Vision became generally available in the API, with a major update improving reasoning capabilities. Meta plans to launch smaller versions of Llama 3 next week. The Orca 2.5 7B model, trained with Direct Nash Optimization, outperforms older GPT-4 versions on AlpacaEval. New releases include Functionary-V2.4 with enhanced function calling and code interpretation, and CosXL models for image editing. Research highlights include continuous U-Nets for diffusion models achieving up to 80% faster inference and a massive multilingual dataset of ~5.6 trillion word tokens. Creative applications include a no-code touch-screen game made with Gemini 1.5 and AI-generated novel trailers.
Cohere Command R+, Anthropic Claude Tool Use, OpenAI Finetuning
c4ai-command-r-plus claude-3 gpt-3.5-turbo gemini mistral-7b gemma-2 claude-3-5 llama-3 vicuna cohere anthropic openai microsoft stability-ai opera-software meta-ai-fair google-deepmind mistral-ai tool-use multilingual-models rag fine-tuning quantum-computing audio-generation local-inference context-windows model-size-analysis model-comparison
Cohere launched Command R+, a 104B dense model with 128k context length focused on RAG, tool use, and multilingual capabilities across 10 key languages; it supports multi-step tool use and offers open weights for research. Anthropic introduced tool use in beta for Claude, supporting over 250 tools, with new cookbooks for practical applications. OpenAI enhanced its fine-tuning API with new upgrades and case studies from Indeed, SK Telecom, and Harvey, promoting DIY fine-tuning and custom model training. Microsoft achieved a quantum computing breakthrough with an 800x error-rate improvement and the most usable qubits to date. Stability AI released Stable Audio 2.0, improving audio generation quality and control. The Opera browser added local inference support for large language models like Meta's Llama, Google's Gemma, and Vicuna. Discussions on Reddit covered Gemini's large context window, analysis of GPT-3.5-Turbo's model size, and a battle simulation between Claude 3 and ChatGPT using local 7B models like Mistral and Gemma.
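In Claude's tool-use flow, the model emits a structured `tool_use` block that the caller executes before returning a `tool_result`; a minimal sketch with the Anthropic Python SDK (the weather tool is illustrative):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    tools=[{
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
)

# If Claude decides to call the tool, the response contains a tool_use
# block with the arguments; the caller runs the tool and sends back a
# tool_result message to continue the conversation.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```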
DeepMind SIMA: one AI, 9 games, 600 tasks, vision+language ONLY
llama-3 claude-3-opus claude-3 gpt-3.5-turbo deepmind cognition-labs deepgram modal-labs meta-ai-fair anthropic multimodality transformer software-engineering ai-agents ai-infrastructure training text-to-speech speech-to-text real-time-processing model-architecture benchmarking andrej-karpathy arav-srinivas francois-chollet yann-lecun soumith-chintala john-carmack
DeepMind SIMA is a generalist AI agent for 3D virtual environments evaluated on 600 tasks across 9 games using only screengrabs and natural language instructions, achieving 34% success compared to humans' 60%. The model uses a multimodal Transformer architecture. Andrej Karpathy outlines AI autonomy progression in software engineering, while Arav Srinivas praises Cognition Labs' AI agent demo. François Chollet expresses skepticism about automating software engineering fully. Yann LeCun suggests moving away from generative models and reinforcement learning towards human-level AI. Meta's Llama-3 training infrastructure with 24k H100 Cluster Pods is shared by Soumith Chintala and Yann LeCun. Deepgram's Aura offers low-latency speech APIs, and Modal Labs' Devin AI demonstrates document navigation and interaction with ComfyUI. Memes and humor circulate in the AI community.
FSDP+QLoRA: the Answer to 70b-scale AI for desktop class GPUs
qlora fsdp inflection-2.5 gpt-4 answer.ai hugging-face meta-ai-fair nvidia inflectionai model-training quantization memory-optimization gradient-checkpointing cpu-offloading fine-tuning model-sharding reinforcement-learning chain-of-thought benchmarking jeremy_howard tim_dettmers yann_lecun
Jeremy Howard and collaborators released a new tool combining FSDP, QLoRA, and HQQ to enable training 70b-parameter models on affordable consumer GPUs like RTX 4090s with only 24GB of VRAM, overcoming memory constraints that previously required data-center GPUs costing over $150k. The approach shards quantized models across multiple GPUs and uses techniques like gradient checkpointing and CPU offloading to train efficiently on desktop-class hardware. The blogpost details the challenges and solutions in integrating these methods, highlighting a cost reduction from $150k to under $2.5k for training large language models. Additionally, Twitter recaps mention Inflection AI's Inflection-2.5 rivaling GPT-4 on benchmarks with less compute, and Grok improving speed by 3x. Yann LeCun discusses multi-step reasoning training for LLMs.
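The shape of the recipe, sketched below under the strong assumption that the bitsandbytes/FSDP integration issues the blogpost describes have been patched (stock library versions may not compose this way out of the box): quantize the base model to 4-bit, attach LoRA adapters, then shard across GPUs with FSDP using activation checkpointing and CPU offload.

```python
# Sketch only; assumes a patched bitsandbytes/FSDP stack per Answer.AI's
# release and that torch.distributed was initialized (e.g. via torchrun).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", quantization_config=bnb)
model = get_peft_model(
    model, LoraConfig(r=16, target_modules=["q_proj", "v_proj"]))
model.gradient_checkpointing_enable()            # trade compute for memory
model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))
```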
Qwen 1.5 Released
qwen-1.5 mistral-7b sparsetral-16x7b-v2 bagel-7b-v0.4 deepseek-math-7b-instruct deepseek qwen mistral-ai hugging-face meta-ai-fair quantization token-context multilinguality retrieval-augmented-generation agent-planning code-generation sparse-moe model-merging fine-tuning direct-preference-optimization character-generation ascii-art kanji-generation vr retinal-resolution light-field-passthrough frozen-networks normalization-layers
Chinese AI models Yi, DeepSeek, and Qwen are gaining attention for strong performance, with Qwen 1.5 offering up to 32k-token context and compatibility with Hugging Face transformers and quantized models. The TheBloke Discord discussed quantization of a 70B LLM, the introduction of Sparsetral, a sparse MoE model based on Mistral, debates on merging vs. fine-tuning, and Direct Preference Optimization (DPO) for character generation. The Nous Research AI Discord covered challenges in Japanese Kanji generation, AI scams on social media, and Meta's VR headset prototypes showcased at SIGGRAPH 2023. Discussions also included fine-tuning frozen networks and new models like bagel-7b-v0.4, DeepSeek-Math-7b-instruct, and Sparsetral-16x7B-v2.
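The sparse-MoE pattern behind models like Sparsetral routes each token to its top-k experts and mixes their outputs with renormalized gate weights; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def topk_moe(x, gate, experts, k: int = 2):
    """Minimal top-k sparse MoE routing.

    x: (tokens, dim); gate: nn.Linear(dim, n_experts);
    experts: list of dim-preserving modules (typically FFN blocks).
    """
    scores = F.softmax(gate(x), dim=-1)              # (tokens, n_experts)
    weights, idx = scores.topk(k, dim=-1)            # route to top-k experts
    weights = weights / weights.sum(dim=-1, keepdim=True)
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out

# usage: experts = [torch.nn.Linear(64, 64) for _ in range(8)]
#        gate = torch.nn.Linear(64, 8)
#        y = topk_moe(torch.randn(10, 64), gate, experts)
```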
CodeLLama 70B beats GPT4 on HumanEval
codellama miqu mistral-medium llama-2-70b aphrodite-engine mixtral flatdolphinmaid noromaid rpcal chatml mistral-7b activation-beacon eagle-7b rwkv-v5 openhermes2.5 nous-hermes-2-mixtral-8x7b-dpo imp-v1-3b bakllava moondream qwen-vl meta-ai-fair ollama nous-research mistral-ai hugging-face ai-ethics alignment gpu-optimization direct-prompt-optimization fine-tuning cuda-programming optimizer-technology quantization multimodality context-length dense-retrieval retrieval-augmented-generation multilinguality model-performance open-source code-generation classification vision
Meta AI surprised the community with the release of CodeLlama 70B, an open-source model now available on platforms like Ollama and MLX for local use. The Miqu model sparked debate over its origins, possibly linked to Mistral Medium or a fine-tuned Llama-2-70b, alongside discussions of AI ethics and alignment risks. The Aphrodite engine showed strong performance on A6000 GPUs with specific configurations. Role-playing AI models such as Mixtral and Flatdolphinmaid faced challenges with repetitiveness, while Noromaid and Rpcal performed better, with ChatML and DPO recommended for improved responses. Learning resources like fast.ai's course were highlighted for ML/DL beginners, and fine-tuning with optimizers like paged 8-bit Lion and Adafactor was discussed.
At Nous Research AI, the Activation Beacon project introduced a method for effectively unlimited context length in LLMs using "global state" tokens, potentially transforming retrieval-augmented models. The Eagle-7B model, based on RWKV-v5, outperformed Mistral on benchmarks with greater efficiency and multilingual capability. OpenHermes2.5 was recommended for consumer hardware thanks to its quantization options. Multimodal and domain-specific models like IMP v1-3b, Bakllava, Moondream, and Qwen-VL were explored for classification and vision-language tasks. The community emphasized centralizing AI resources for collaborative research.
RIP Latent Diffusion, Hello Hourglass Diffusion
gpt-4 latent-diffusion stable-diffusion meta-ai-fair openai hugging-face diffusion-models transformers image-generation model-efficiency fine-tuning quantization prompt-engineering roleplay training-optimization katherine-crowson lucidrains
Katherine Crowson, of Stable Diffusion fame, introduces the Hourglass Diffusion Transformer, a hierarchical pure-transformer backbone for diffusion-based image generation that scales efficiently to megapixel resolutions with under 600 million parameters, improving on the original ~900M-parameter model. The architecture processes local and global image structure separately, improving efficiency and resolution without a latent stage. Additionally, Meta's Self-Rewarding LM paper has inspired lucidrains to begin an implementation. Discord summaries highlight GPT-4's robustness against quantification tricks, discussions of open-source GPT-0 alternatives, challenges of DPO training on limited VRAM with suggestions like QLoRA and rmsprop, and efforts to improve role-play model consistency through fine-tuning and merging. Philosophical debates on AI sentience and GPT-4 customization for markdown and translation tasks were also noted.
1/2/2024: Smol tweaks to Smol Talk
claude-2 bard copilot meta-ai gemini-ultra chatgpt openai meta-ai-fair perplexity-ai prompt-engineering api json yaml markdown chatbot image-generation vpn browser-compatibility personality-tuning plugin-issues
OpenAI Discord discussions compare AI search assistants including Perplexity, Copilot, Bard, and Claude 2, with Bard and Claude 2 trailing behind. Meta introduced its Meta AI chatbot, available on Instagram and WhatsApp, with an image-generation feature one user "likened to a free version of GPT." Users report multiple browser issues with ChatGPT, including persistent captchas when using VPNs and plugin malfunctions. Debates cover prompt engineering, API usage, and data formats like JSON, YAML, and Markdown. Discussions also touch on ChatGPT's personality tuning and variations in model capability.
12/26/2023: not much happened today
llava exllama2 meta-ai-fair google-deepmind gpu-offloading vram-utilization model-conversion moe-models multimodality model-performance hardware-configuration model-saving chatml installation-issues music-generation
LM Studio users extensively discussed performance, installation issues on macOS, and upcoming features like Exllama2 support and multimodality via the LLaVA model. Conversations covered GPU offloading, VRAM utilization, expert selection in MoE models, and model-conversion compatibility. The community also addressed inefficient help requests, referencing the blog post 'Don't Ask to Ask, Just Ask'. Technical challenges with the ChromaDB plugin, server-vs-desktop hardware performance, and saving model states with Autogen were highlighted. Discussions included comparisons with other chatbots and mentions of Meta AI's AudioCraft and Google DeepMind's MusicLM for music generation.