All tags
Model: "llama-3-70b"
not much happened today
gemini-2.0-flash-thinking command-a qwq-32b gemma-3-27b gemma-3 shieldgemma-2 llama-3-70b deepseek-r1 o1-mini deepseek-v3 google-deepmind cohere meta-ai-fair alibaba hugging-face model-updates model-performance benchmarking reinforcement-learning transformers normalization-layers image-generation vision memory-efficiency context-windows fine-tuning yann-lecun
Google DeepMind announced updates to Gemini 2.0, including an upgraded Flash Thinking model with stronger reasoning and native image generation capabilities. Cohere launched Command A, a 111B parameter dense model with a 256K context window and competitive pricing, available on Hugging Face. Meta AI proposed Dynamic Tanh (DyT) as a replacement for normalization layers in Transformers, supported by Yann LeCun. Alibaba released QwQ-32B, a 32.5B parameter model excelling in math and coding, fine-tuned with reinforcement learning and freely available under the Apache 2.0 license. Google DeepMind also released Gemma 3 models ranging from 1B to 27B parameters with a 128K token context window and support for over 140 languages, plus ShieldGemma 2, an image safety checker. Benchmarking shows Gemma 3 27B has strong vision and memory efficiency but is outperformed by larger models like Llama 3.3 70B and DeepSeek V3 671B. The Hugging Face LLM leaderboard history was shared by @_lewtun.
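As a rough illustration of the Dynamic Tanh idea mentioned above, the sketch below applies an elementwise `gamma * tanh(alpha * x) + beta` to one feature vector, with `alpha` a single scalar and `gamma`/`beta` per-feature scale and shift. In a real model these would be learned parameters; the function name, the input values, and the fixed parameter values here are assumptions chosen only for demonstration.

```python
import math

def dyt(x, alpha, gamma, beta):
    """Elementwise Dynamic Tanh: gamma * tanh(alpha * x) + beta.

    alpha is one scalar shared by all features; gamma and beta are
    per-feature scale and shift (fixed here, learned in practice).
    """
    return [g * math.tanh(alpha * xi) + b for xi, g, b in zip(x, gamma, beta)]

# Illustrative inputs (assumed values, not taken from any model).
features = [-4.0, -0.5, 0.0, 0.5, 4.0]
gamma = [1.0] * len(features)
beta = [0.0] * len(features)

out = dyt(features, alpha=0.5, gamma=gamma, beta=beta)
print(out)
```

Unlike a normalization layer, this computes no mean or variance over the features; large activations are simply squashed by the tanh.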
not much happened today
zonos-v0.1 audiobox-aesthetics moshi sonar llama-3-70b gpt-4o-mini claude-3.5-haiku gpt-4o claude-3.5-sonnet deepseek-r1-distilled-qwen-1.5b reasonflux-32b o1-preview zyphra-ai meta-ai-fair kyutai-labs perplexity-ai cerebras uc-berkeley brilliant-labs google-deepmind text-to-speech speech-to-speech benchmarking model-performance reinforcement-learning math real-time-processing open-source cross-platform-integration multilinguality zero-shot-learning danhendrycks
Zyphra AI launched Zonos-v0.1, a leading open-weight text-to-speech model supporting multiple languages and zero-shot voice cloning. Meta FAIR released the open-source Audiobox Aesthetics model trained on 562 hours of audio data. Kyutai Labs introduced Moshi, a real-time speech-to-speech system with low latency. Perplexity AI announced the Sonar model based on Llama 3.3 70b, outperforming top models like GPT-4o and Claude 3.5 Sonnet with 1200 tokens/second speed, powered by Cerebras infrastructure. UC Berkeley open-sourced a 1.5B model trained with reinforcement learning that beats o1-preview on math tasks. ReasonFlux-32B achieved 91.2% on the MATH benchmark, outperforming OpenAI o1-preview. CrossPoster, an AI agent for cross-platform posting, was released using LlamaIndex workflows. Brilliant Labs integrated the Google DeepMind Gemini Live API into smart glasses for real-time translation and object identification.
small little news items
r7b llama-3-70b minicpm-o-2.6 gpt-4v qwen2.5-math-prm ollama cohere togethercompute openbmb qwen langchain openai rag tool-use-tasks quality-of-life new-engine multimodality improved-reasoning math-capabilities process-reward-models llm-reasoning mathematical-reasoning beta-release task-scheduling ambient-agents email-assistants ai-software-engineering codebase-analysis test-case-generation security-infrastructure llm-scaling-laws power-law plateauing-improvements gans-revival
Ollama enhanced its models by integrating Cohere's R7B, optimized for RAG and tool use tasks, and released Ollama v0.5.5 with quality updates and a new engine. Together AI launched the Llama 3.3 70B multimodal model with improved reasoning and math capabilities, while OpenBMB introduced the MiniCPM-o 2.6, outperforming GPT-4V on visual tasks. Insights into Process Reward Models (PRM) were shared to boost LLM reasoning, alongside Qwen2.5-Math-PRM models excelling in mathematical reasoning. LangChain released a beta for ChatGPT Tasks enabling scheduling of reminders and summaries, and introduced open-source ambient agents for email assistance. OpenAI rolled out Tasks for scheduling actions in ChatGPT for Plus, Pro, and Teams users. AI software engineering is rapidly advancing, predicted to match human capabilities within 18 months. Research on LLM scaling laws highlights power law relationships and plateauing improvements, while GANs are experiencing a revival.
Genesis: Generative Physics Engine for Robotics (o1-mini version)
o1 o1-preview gpt-4o claude-3.5-sonnet gemini-2.0-pro llama-3-3b llama-3-70b openai google-deepmind meta-ai-fair hugging-face function-calling structured-outputs vision performance-benchmarks sdk webrtc reasoning math code-generation transformer-architecture model-training humanoid-robots search model-efficiency dataset-sharing aidan_mclau sundarpichai adcock_brett
OpenAI launched the o1 model API featuring function calling, structured outputs, vision support, and developer messages, using 60% fewer reasoning tokens than o1-preview. The model excels in math and code with a 0.76 LiveBench Coding score, outperforming Sonnet 3.5. Beta SDKs for Go and Java and WebRTC support with 60% lower prices were also released. Google Gemini 2.0 Pro (Gemini Exp 1206) deployment accelerated, showing improved coding, math, and reasoning performance. Meta AI FAIR introduced research on training transformers directly on raw bytes using dynamic entropy-based patching. Commercial humanoid robots were successfully deployed by an industry player. Hugging Face researchers demonstrated that their 3B Llama model can outperform the 70B Llama model on MATH-500 accuracy using search techniques, highlighting efficiency gains with smaller models. Concerns about reproducibility and domain-specific limitations were noted.
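The summary above does not say which search techniques were used. One common family is best-of-N sampling with a verifier, sketched below under stated assumptions: the candidate answers and their verifier scores are made-up values, and both function names are hypothetical. Plain best-of-N keeps the single highest-scored sample, while the weighted variant sums scores across samples that agree on the same final answer.

```python
from collections import defaultdict

def best_of_n(candidates):
    """Return the final answer of the single highest-scored sample."""
    answer, _score = max(candidates, key=lambda pair: pair[1])
    return answer

def weighted_best_of_n(candidates):
    """Return the final answer with the highest summed score across samples."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)

# (final_answer, verifier_score) pairs; values are illustrative assumptions.
samples = [("42", 0.9), ("41", 0.6), ("41", 0.5), ("7", 0.2)]

plain_choice = best_of_n(samples)
weighted_choice = weighted_best_of_n(samples)
print(plain_choice, weighted_choice)
```

The two strategies can disagree, as they do on this toy input: one confident sample wins under plain best-of-N, while two moderately scored samples that agree win under the weighted variant.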
OpenAI Voice Mode Can See Now - After Gemini Does
gemini-2.0-flash claude claude-3.5-sonnet llama-3-70b llama-3 mistral-large gpt-4o openai google-deepmind anthropic togethercompute scale-ai meta-ai-fair mistral-ai multimodality real-time-streaming roleplay prompt-handling model-comparison model-training creative-writing model-censorship code-execution developer-ecosystem ai-humor bindureddy
OpenAI launched Realtime Video shortly after Gemini, which blunted its impact: Gemini had arrived earlier, at lower cost and with fewer rate limits. Google DeepMind released Gemini 2.0 Flash featuring enhanced multimodal capabilities and real-time streaming. Anthropic introduced Clio, a system analyzing real-world usage of Claude models. Together Computing acquired CodeSandbox to launch a code interpreter tool. Discussions highlighted Meta's Llama 3.3-70B for its advanced roleplay and prompt handling abilities, with users rating it above models like Mistral Large and GPT-4o on expressiveness and censorship. The AI community also engaged in humorous takes on AI outages and model competition, with ChatGPT adding a Santa mode for holiday interactions. "Anthropic is capturing the developer ecosystem, Gemini has AI enthusiast mindshare, ChatGPT reigns over AI dabblers" was a noted observation from the community.
Meta Apollo - Video Understanding up to 1 hour, SOTA Open Weights
apollo-1b apollo-3b apollo-7b veo-2 imagen-3 llama-3-70b llama-3b command-r7b llama-1b llama-8b chatgpt meta-ai-fair hugging-face google-deepmind openai figure-ai klarna cohere notion video-understanding scaling-consistency benchmarking temporal-ocr egocentric-perception spatial-perception reasoning video-generation physics-simulation voice-features map-integration language-expansion test-time-compute-scaling humanoid-robots ai-integration search-optimization self-recognition self-preference-bias akhaliq _lewtun clementdelangue adcock_brett rohanpaul_ai swyx shaneguML
Meta released Apollo, a new family of state-of-the-art video-language models available in 1B, 3B, and 7B sizes, featuring "Scaling Consistency" for efficient scaling and introducing ApolloBench, which speeds up video understanding evaluation by 41× across five temporal perception categories. Google DeepMind launched Veo 2, a 4K video generation model with improved physics and camera control, alongside an enhanced Imagen 3 image model. OpenAI globally rolled out ChatGPT search with advanced voice and map features and discussed a potential $2,000/month "ChatGPT Max" tier. Research highlights include achieving Llama 70B performance using Llama 3B via test-time compute scaling and expanding Command R7B language support from 10 to 23 languages. Industry updates feature Figure AI delivering humanoid robots commercially and Klarna reducing workforce through AI. Notion integrated Cohere Rerank for better search. Studies reveal LLMs can recognize their own writing style and show self-preference bias. Discussions note video processing progress outpacing text due to better signal-per-compute and data evaluation.
ChatGPT Canvas GA
llama-3-70b llama-3-1-8b tgi-v3 deepseek-v2.5-1210 coconut openai deepseek-ai meta-ai-fair huggingface cognition-labs hyperbolic google-deepmind code-execution gpt-integration model-finetuning gradient-checkpointing context-length latent-space-reasoning performance-optimization gpu-memory-optimization kubernetes gpu-marketplace ai-capabilities employment-impact neurips-2024 ai-scaling humor arav_srinivas sama jonathan-frankle dylan
OpenAI launched ChatGPT Canvas to all users, featuring code execution and GPT integration, effectively replacing Code Interpreter with a Google Docs-like interface. DeepSeek AI announced their V2.5-1210 update improving performance on MATH-500 (82.8%) and LiveCodeBench. Meta AI FAIR introduced COCONUT, a new continuous latent space reasoning paradigm. Hugging Face released TGI v3, processing 3x more tokens and running 13x faster than vLLM on long prompts. Cognition Labs released Devin, an AI developer building Kubernetes operators. Hyperbolic raised $12M Series A to build an open AI platform with an H100 GPU marketplace. Discussions included AI capabilities and employment impact, and NeurIPS 2024 announcements with Google DeepMind demos and a debate on AI scaling. On Reddit, Llama 3.3-70B supports 90K context length finetuning using Unsloth with gradient checkpointing and Apple's Cut Cross Entropy (CCE) algorithm, fitting on 41GB VRAM. Llama 3.1-8B reaches 342K context lengths with Unsloth, surpassing native limits.
Meta Llama 3.3: 405B/Nova Pro performance at 70B price
llama-3-70b llama-3.3-70b gpt-4o gemini-exp-1206 meta-ai-fair openai google-deepmind hugging-face llamacloud reinforcement-learning fine-tuning model-performance document-processing pricing-models alignment online-rl sama steven-heidel aidan_mclau lmarena_ai oriolvinyalsml jerryjliu0
Meta AI released Llama 3.3 70B, matching the performance of the 405B model with improved efficiency using "a new alignment process and progress in online RL techniques". OpenAI announced Reinforcement Fine-Tuning (RFT) for building expert models with limited data, offering alpha access to researchers and enterprises. Google DeepMind's Gemini-Exp-1206 leads benchmarks, tying with GPT-4o in coding performance. LlamaCloud enhanced document processing with table extraction and analytics. Discussions on OpenAI's pricing plans continue in the community.
Olympus has dropped (aka, Amazon Nova Micro|Lite|Pro|Premier|Canvas|Reel)
amazon-nova claude-3 llama-3-70b gemini-1.5-flash gpt-4o amazon anthropic google-deepmind sakana-ai-labs multimodality benchmarking model-merging model-performance model-architecture model-optimization population-based-learning philschmid bindureddy
Amazon announced the Amazon Nova family of multimodal foundation models at AWS re:Invent, available immediately with no waitlist in configurations like Micro, Lite, Pro, Canvas, and Reel, with Premier and speech-to-speech coming next year. These models offer 2-4x faster token speeds and are 25%-400% cheaper than competitors like Anthropic Claude models, positioning Nova as a serious contender in AI engineering. Pricing undercuts models such as Google DeepMind Gemini Flash 8B, and some Nova models extend context length up to 300k tokens. However, the benchmarks are disputed, as some evaluations show Nova scoring below Llama-3 70B in LiveBench AI metrics. Separately, CycleQD was introduced by Sakana AI Labs, using evolutionary computation for population-based model merging to develop niche LLM agents.
s{imple|table|calable} Consistency Models
llama-3-70b llama-3-405b llama-3-1 stable-diffusion-3.5 gpt-4 stability-ai tesla cerebras cohere langchain model-distillation diffusion-models continuous-time-consistency-models image-generation ai-hardware inference-speed multilingual-models yang-song
Model distillation significantly accelerates diffusion models, enabling near real-time image generation with only 1-4 sampling steps, as seen in BlinkShot and Flux Schnell. Research led by Yang Song introduced simplified continuous-time consistency models (sCMs), achieving under 10% FID difference in just 2 steps and scaling up to 1.5B parameters for higher quality. On AI hardware, Tesla is deploying a 50k H100 cluster potentially capable of completing GPT-4 training in under three weeks, while Cerebras Systems set a new inference speed record on Llama 3.1 70B with their wafer-scale AI chips. Stability AI released Stable Diffusion 3.5 and its Turbo variant, and Cohere launched new multilingual models supporting 23 languages with state-of-the-art performance. LangChain also announced ecosystem updates.
AIPhone 16: the Visual Intelligence Phone
reflection-70b llama-3-70b qwen-2-72b llama-3-1-405b claude gpt-4 gemini apple openai weights-biases vision video-understanding benchmarking planning model-evaluation privacy ai-integration instruction-following yann-lecun
Apple announced the new iPhone 16 lineup featuring Visual Intelligence, a new AI capability integrated with Camera Control, Apple Maps, and Siri, emphasizing privacy and default service use over third-party AI like OpenAI. Apple Photos now includes advanced video understanding with timestamp recognition. Meanwhile, Reflection-70B claims to be a top open-source model but benchmarks show it performs close to Llama 3 70B and slightly worse than Qwen 2 72B. Yann LeCun highlighted ongoing challenges with LLM planning abilities, noting models like Llama-3.1-405b and Claude show some skill, while GPT-4 and Gemini lag behind. Weights & Biases is sponsoring an event to advance LLM evaluation techniques with prizes and API access.
Ideogram 2 + Berkeley Function Calling Leaderboard V2
llama-3-70b gpt-4 phi-3.5 functionary-llama-3-70b llama-3 ideogram midjourney berkeley openai hugging-face microsoft meta-ai-fair baseten kai claude functionary function-calling benchmarking image-generation model-optimization vision multimodality model-performance fine-tuning context-windows cybersecurity code-analysis ai-assisted-development
Ideogram returns with a new image generation model featuring color palette control, a fully controllable API, and an iOS app, reaching a milestone of 1 billion images created. Meanwhile, Midjourney released a Web UI but still lacks an API. In function calling, the Berkeley Function Calling Leaderboard (BFCL) updated to BFCL V2 • Live, adding 2251 live, user-contributed function documentation and queries to improve evaluation quality. GPT-4 leads the leaderboard, but the open-source Functionary Llama 3-70B finetune from Kai surpasses Claude. On AI model releases, Microsoft launched three Phi-3.5 models with impressive reasoning and context window capabilities, while Meta AI FAIR introduced UniBench, a unified benchmark suite for over 50 vision-language model tasks. Baseten improved Llama 3 inference speed by up to 122% using Medusa. A new cybersecurity benchmark, Cyberbench, featuring 40 CTF tasks, was released. Additionally, Codegen was introduced as a tool for programmatic codebase analysis and AI-assisted development. "Multiple functions > parallel functions" was highlighted as a key insight in function calling.
AlphaProof + AlphaGeometry2 reach 1 point short of IMO Gold
gemini alphageometry-2 alphaproof llama-3-1-405b llama-3-70b llama-3-8b mistral-large-2 google-deepmind meta-ai-fair mistral-ai neurosymbolic-ai mathematical-reasoning synthetic-data knowledge-sharing model-fine-tuning alpha-zero multilinguality context-windows model-scaling benchmarking performance-comparison tim-gowers guillaume-lample osanseviero
Search+Verifier highlights advances in neurosymbolic AI at the 2024 International Mathematical Olympiad (IMO). Google DeepMind's combination of AlphaProof and AlphaGeometry 2 solved four out of six IMO problems, with AlphaProof being a finetuned Gemini model using an AlphaZero approach, and AlphaGeometry 2 trained on significantly more synthetic data with a novel knowledge-sharing mechanism. Despite impressive results, human judges noted the AI required much more time than human competitors. Meanwhile, Meta AI released Llama 3.1 with a 405B parameter model and smaller variants, and Mistral AI launched Mistral Large 2 with 123B parameters and 128k context windows, outperforming Llama 3.1 on coding tasks and multilingual benchmarks. This marks significant progress in AI mathematical reasoning, model scaling, and multilingual capabilities.
Llama 3.1 Leaks: big bumps to 8B, minor bumps to 70b, and SOTA OSS 405b model
llama-3-1-405b llama-3-8b llama-3-70b llama-3-1-8b gpt-4o gpt-4o-mini claude-3-5 qwen-2 meta-ai-fair openai alibaba multilinguality code-generation context-windows model-training synthetic-data benchmarking reasoning fine-tuning model-performance dataset-release swyx philschmid jjitsev lewtun teknium1 adcock_brett
Llama 3.1 leaks reveal a 405B dense model with 128k context length, trained on 39.3M GPU hours using H100-80GB GPUs, and fine-tuned with over 25M synthetic examples. The model shows significant benchmark improvements, especially for the 8B and 70B variants, with some evals suggesting the 70B outperforms GPT-4o. GPT-4o Mini launched as a cost-efficient variant with strong performance but some reasoning weaknesses. Synthetic datasets like NuminaMath enable models such as Alibaba Qwen 2 to surpass GPT-4o and Claude 3.5 in math competitions. Discussions include reasoning task benchmarks and dataset building for improved reasoning.
GraphRAG: The Marriage of Knowledge Graphs and RAG
gemma-2 llama-3-70b claude-3.5-sonnet nemotron-340b qwen2-72b llama-3 microsoft-research anthropic nvidia hugging-face retrieval-augmented-generation knowledge-graphs token-usage inference-time attention-mechanisms instruction-following coding math long-range-reasoning synthetic-data dataset-release fine-tuning context-windows function-calling travis-fischer rasbt alexandr-wang osanseviero rohanpaul_ai hamelhusain svpino aaaazzam omarsar0
Microsoft Research open sourced GraphRAG, a retrieval augmented generation (RAG) technique that extracts knowledge graphs from sources and clusters them for improved LLM answers, though it increases token usage and inference time. Gemma 2 models were released focusing on efficient small LLMs with innovations like sliding window attention and RMS norm, nearly matching the larger Llama 3 70B. Anthropic's Claude 3.5 Sonnet leads in instruction following and coding benchmarks, while Nvidia's Nemotron 340B model was released in June. Qwen2-72B tops the HuggingFace Open LLM leaderboard excelling in math and long-range reasoning. Discussions on RAG highlighted its limitations and improvements in context usage via function calls. A persona-driven synthetic data generation approach introduced 1 billion personas, with a fine-tuned model matching GPT-4 performance on math benchmarks at 7B scale. The 200GB AutoMathText dataset was also noted for math data synthesis.
Gemini launches context caching... or does it?
nemotron llama-3-70b chameleon-7b chameleon-34b gemini-1.5-pro deepseek-coder-v2 gpt-4-turbo claude-3-opus nvidia meta-ai-fair google deepseek hugging-face context-caching model-performance fine-tuning reinforcement-learning group-relative-policy-optimization large-context model-training coding model-release rohanpaul_ai _philschmid aman-sanger
Nvidia's Nemotron ranks #1 open model on LMsys and #11 overall, surpassing Llama-3-70b. Meta AI released Chameleon 7B/34B models after further post-training. Google's Gemini introduced context caching, offering a cost-efficient middle ground between RAG and finetuning, with a minimum input token count of 33k and no upper limit on cache duration. DeepSeek launched DeepSeek-Coder-V2, a 236B parameter model outperforming GPT-4 Turbo, Claude-3-Opus, and Gemini-1.5-Pro in coding tasks, supporting 338 programming languages and extending context length to 128K. It was trained on 6 trillion tokens using the Group Relative Policy Optimization (GRPO) algorithm and is available on Hugging Face with a commercial license. These developments highlight advances in model performance, context caching, and large-scale coding models.
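For readers unfamiliar with GRPO, its core step is usually described as scoring each sampled completion relative to the other completions drawn for the same prompt, instead of against a learned value function. The sketch below shows only that normalization step. The reward values, the function name, and the choice of population standard deviation are assumptions for illustration, and the sketch says nothing about how DeepSeek-Coder-V2 was actually trained.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is an assumption here; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rewards for four sampled completions of one prompt (assumed values,
# e.g. 1.0 if the generated code passed its tests, else 0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat the group average get a positive advantage and the rest get a negative one, so the signal depends only on comparisons within the group.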
Qwen 2 beats Llama 3 (and we don't know how)
qwen-2 llama-3 llama-3-70b gpt-4 nllb alibaba groq meta-ai-fair multilinguality benchmarking inference-speed sparse-autoencoders scaling-laws post-training instruction-following rejection-sampling execution-feedback model-release multilingual-models model-training philschmid huybery jonathanross321 awnihannun gdb nabla_theta ylecun
Alibaba released Qwen 2 models under the Apache 2.0 license, claiming to outperform Llama 3 among open models, with multilingual support in 29 languages and strong benchmark scores like MMLU 82.3 and HumanEval 86.0. Groq demonstrated ultra-fast inference speed on Llama-3 70B at 40,792 tokens/s, processing 4 Wikipedia articles in 200ms. Research on sparse autoencoders (SAEs) for interpreting GPT-4 neural activity showed new training methods, metrics, and scaling laws. Meta AI announced the No Language Left Behind (NLLB) model capable of high-quality translations between 200 languages, including low-resource ones. On post-training, the Qwen 2 release states: "Our post-training phase is designed with the principle of scalable training with minimal human annotation," highlighting techniques like rejection sampling for math and execution feedback for coding.
Mamba-2: State Space Duality
mamba-2 mamba transformer++ llama-3-70b gpt-3 hugging-face state-space-models perplexity training-efficiency data-pruning benchmarking multimodality video-analysis _albertgu tri_dao arankomatsuzaki _akhaliq clementdelangue karpathy
Mamba-2, a new state space model (SSM), outperforms previous models like Mamba and Transformer++ in perplexity and wall-clock time, featuring 8x larger states and 50% faster training. It introduces the concept of state space duality (SSD) connecting SSMs and linear attention. The FineWeb-Edu dataset, a high-quality subset of the 15 trillion token FineWeb dataset, filtered using llama-3-70b for educational quality, enables better and faster LLM learning, potentially reducing tokens needed to surpass GPT-3 performance. Additionally, perplexity-based data pruning using a 125M parameter model improves downstream performance and reduces pretraining steps by up to 1.45x. The Video-MME benchmark evaluates multi-modal LLMs on video analysis across multiple visual domains and video lengths.
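The perplexity-based pruning idea above can be sketched in a few lines: score every document with a small reference model, then keep only part of the corpus based on those scores. In the sketch below, an add-one-smoothed unigram model stands in for the 125M parameter reference model. Keeping the lowest-perplexity half is an arbitrary choice made for illustration, and all names and sample texts are hypothetical.

```python
import math
from collections import Counter

def unigram_perplexity(text, counts, total, vocab_size):
    """Perplexity of text under an add-one-smoothed unigram model."""
    tokens = text.lower().split()
    log_prob = 0.0
    for tok in tokens:
        p = (counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(tokens), 1))

def prune_by_perplexity(docs, reference_corpus, keep_fraction=0.5):
    """Keep the keep_fraction of docs with the lowest reference perplexity."""
    counts = Counter(tok for d in reference_corpus for tok in d.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra bucket for unseen tokens
    ranked = sorted(
        docs, key=lambda d: unigram_perplexity(d, counts, total, vocab_size)
    )
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

# Toy data (assumed): the reference corpus stands in for the small model.
reference = ["the cat sat on the mat", "the dog sat on the rug"]
docs = ["the cat sat on the rug", "zxq vbn qwp lkj", "the dog sat", "asdf the qwer"]

kept = prune_by_perplexity(docs, reference, keep_fraction=0.5)
print(kept)
```

Which perplexity range to keep (low, middle, or high) is a design decision that this sketch does not settle; only the score-then-filter structure is the point.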
Google I/O in 60 seconds
gemini-1.5-pro gemini-flash gemini-ultra gemini-pro gemini-nano gemma-2 llama-3-70b paligemma imagen-3 veo google google-deepmind youtube tokenization model-performance fine-tuning vision multimodality model-release model-training model-optimization ai-integration image-generation watermarking hardware-optimization voice video-understanding
Google announced updates to the Gemini model family, including Gemini 1.5 Pro with 2 million token support, and the new Gemini Flash model optimized for speed with 1 million token capacity. The Gemini suite now includes Ultra, Pro, Flash, and Nano models, with Gemini Nano integrated into Chrome 126. Additional Gemini features include Gemini Gems (custom GPTs), Gemini Live for voice conversations, and Project Astra, a live video understanding assistant. The Gemma model family was updated with Gemma 2 at 27B parameters, offering near-llama-3-70b performance at half the size, plus PaliGemma, a vision-language open model inspired by PaLI-3. Other launches include DeepMind's Veo, Imagen 3 for photorealistic image generation, and a Music AI Sandbox collaboration with YouTube. SynthID watermarking now extends to text, images, audio, and video. The Trillium TPUv6 codename was revealed. Google also integrated AI across its product suite including Workspace, Email, Docs, Sheets, Photos, Search, and Lens. "The world awaits Apple's answer."
Quis promptum ipso promptiet?
llama-3-70b llama-3-120b llama-3 llama-cpp anthropic openai zoominfo neuralink prompt-engineering chain-of-thought rag quantization cuda-graphs gpu-optimization thought-controlled-devices modeling-consciousness conference sama gdb bindureddy svpino rohanpaul_ai alexalbert__ abacaj
Anthropic released upgrades to their Workbench Console, introducing new prompt engineering features like chain-of-thought reasoning and prompt generators that significantly reduce development time, exemplified by their customer Zoominfo. OpenAI teased a "magic" new development coming soon, speculated to be a new LLM replacing GPT-3.5 in the free tier or a search competitor. The open-source community highlighted Llama 3 70B as "game changing" with new quantized weights for Llama 3 120B and CUDA graph support for llama.cpp improving GPU performance. Neuralink demonstrated a thought-controlled mouse, sparking interest in modeling consciousness from brain signals. The ICLR 2024 conference is being held in Asia for the first time, generating excitement.
LMSys advances Llama 3 eval analysis
llama-3-70b llama-3 claude-3-sonnet alphafold-3 lmsys openai google-deepmind isomorphic-labs benchmarking model-behavior prompt-complexity model-specification molecular-structure-prediction performance-analysis leaderboards demis-hassabis sam-altman miranda-murati karina-nguyen joanne-jang john-schulman
LMSys is enhancing LLM evaluation by categorizing performance across 8 query subcategories and 7 prompt complexity levels, revealing uneven strengths in models like Llama-3-70b. DeepMind released AlphaFold 3, advancing molecular structure prediction with holistic modeling of protein-DNA-RNA complexes, impacting biology and genetics research. OpenAI introduced the Model Spec, a public standard to clarify model behavior and tuning, inviting community feedback and aiming for models to learn directly from it. Llama 3 has reached top leaderboard positions on LMSys, nearly matching Claude-3-sonnet in performance, with notable variations on complex prompts. The analysis highlights the evolving landscape of model benchmarking and behavior shaping.
$100k to predict LMSYS human preferences in a Kaggle contest
llama-3-70b llama-3 gpt-4 claude-3-opus prometheus-2 groq openai lmsys scale-ai ai2 nvidia benchmarking datasets fine-tuning reinforcement-learning model-alignment hallucination parameter-efficient-fine-tuning scalable-training factuality chatbot-performance bindureddy drjimfan percyliang seungonekim mobicham clefourrier
Llama 3 models are making breakthroughs with Groq's 70B model achieving record low costs per million tokens. A new Kaggle competition offers a $100,000 prize to develop models predicting human preferences from a dataset of over 55,000 user-LLM conversations. Open source evaluator LLMs like Prometheus 2 outperform proprietary models such as GPT-4 and Claude 3 Opus in judgment tasks. New datasets like WildChat1M provide over 1 million ChatGPT interaction logs with diverse and toxic examples. Techniques like LoRA fine-tuning show significant performance gains, and NVIDIA's NeMo-Aligner toolkit enables scalable LLM alignment across hundreds of GPUs. Factuality-aware alignment methods are proposed to reduce hallucinations in LLM outputs.
A quiet weekend
llama-3 dolphin-2.9 pixart-sigma llama-3-70b microsoft coca-cola uber lmsys nous-research mistral-ai ar-interfaces transformers algorithmic-tasks turing-test graph-algorithms embeddings generative-ai model-optimization llm-inference quantization model-deployment yann-lecun
Yann LeCun predicts a shift to AR interfaces with AI assistants in 10-15 years, moving away from smartphones. The Dolphin-2.9 model based on Llama-3 was released, addressing quality issues. PixArt Sigma, a 0.6B parameter model, achieves Stable Diffusion 3.0 level performance with complete prompt adherence and local usability. Research shows transformers can use meaningless filler tokens for algorithmic tasks with dense supervision. AI-generated restaurant reviews can pass the Turing test, fooling humans and AI detectors. Uber uses graph algorithms and learned embeddings for ETA prediction. Coca-Cola and Microsoft announced a 5-year AI partnership to accelerate cloud and generative AI initiatives. The Llama-3 70B model can run on a single 4GB GPU using AirLLM optimization without quantization, but it is slow. Mistral.rs is introduced as a fast LLM inference platform with quantization and OpenAI API compatibility. Only 5% of LLMs make it from prototype to production due to challenges, especially in enterprise. EXL2 and GGUF quantization methods for Llama models show similar perplexity vs model size, with Llama-3 and Llama-2 degrading more under quantization compared to full precision.
Apple's OpenELM beats OLMo with 50% of its dataset, using DeLighT
openelm llama-3 llama-3-8b-instruct llama-3-70b apple meta-ai-fair google layer-wise-scaling context-length quantization ai-alignment open-source ai-regulation eric-schmidt sebastian-raschka
Apple advances its AI presence with the release of OpenELM, its first relatively open large language model available in sizes from 270M to 3B parameters, featuring a novel layer-wise scaling architecture inspired by the DeLighT paper. Meanwhile, Meta's Llama 3 family pushes context length boundaries with models supporting over 160K tokens and an 8B-Instruct model with 262K context length released on Hugging Face, alongside performance improvements in quantized versions. A new paper on AI alignment highlights KTO as the best-performing method, with sensitivity to training data volume noted. In AI ethics and regulation, former Google CEO Eric Schmidt warns about the risks of open-source AI empowering bad actors and geopolitical rivals, while a U.S. proposal aims to enforce "Know Your Customer" rules to end anonymous cloud usage.
Snowflake Arctic: Fully Open 10B+128x4B Dense-MoE Hybrid LLM
snowflake-arctic phi-3 llama-3-70b llama-3 stable-diffusion-3 sd3-turbo gpt-3.5-turbo snowflake databricks deepseek deepspeed nvidia stable-diffusion adobe apple llamaindex lmsys openai mixture-of-experts curriculum-learning model-release image-generation video-upscaling quantization inference-speed benchmarking model-comparison open-source on-device-ai
Snowflake Arctic is a notable new foundation language model released under Apache 2.0, claiming superiority over Databricks in data warehouse AI applications and adopting a mixture-of-experts architecture inspired by DeepSeekMOE and DeepSpeedMOE. The model employs a 3-stage curriculum training strategy similar to the recent Phi-3 paper. In AI image and video generation, Nvidia introduced the Align Your Steps technique improving image quality at low step counts, while Stable Diffusion 3 and SD3 Turbo models were compared for prompt understanding and image quality. Adobe launched an AI video upscaling project enhancing blurry videos to HD, though with some high-resolution artifacts. Apple released open-source on-device language models with code and training logs, diverging from typical weight-only releases. The Llama-3-70b model ties for first place on the LMSYS leaderboard for English queries, and Phi-3 (4B params) outperforms GPT-3.5 Turbo in the banana logic benchmark. Fast inference and quantization of Llama 3 models were demonstrated on MacBook devices.
OpenAI's Instruction Hierarchy for the LLM OS
phi-3-mini openelm claude-3-opus gpt-4-turbo gpt-3.5-turbo llama-3-70b rho-1 mistral-7b llama-3-8b llama-3 openai microsoft apple deepseek mistral-ai llamaindex wendys prompt-injection alignment benchmarking instruction-following context-windows model-training model-deployment inference performance-optimization ai-application career-advice drive-thru-ai
OpenAI published a paper introducing the concept of privilege levels for LLMs to address prompt injection vulnerabilities, improving defenses by 20-30%. Microsoft released the lightweight Phi-3-mini model with 4K and 128K context lengths. Apple open-sourced the OpenELM language model family with an open training and inference framework. An instruction accuracy benchmark compared 12 models, with Claude 3 Opus, GPT-4 Turbo, and Llama 3 70B performing best. The Rho-1 method enables training state-of-the-art models using only 3% of tokens, boosting models like Mistral. Wendy's deployed AI-powered drive-thru ordering, and a study found Gen Z workers prefer generative AI for career advice. Tutorials on deploying Llama 3 models on AWS EC2 highlight hardware requirements and inference server use.
Perplexity, the newest AI unicorn
llama-3-8b llama-3-70b llama-3 llava-llama-3-8b-v1_1 phi-3 gpt-3.5 perplexity-ai meta-ai-fair hugging-face groq context-length fine-tuning quantization instruction-following model-comparison multimodality benchmarking memory-optimization model-performance daniel-gross aravind-srinivas
Perplexity doubles its valuation shortly after its Series B with a Series B-1 funding round. Significant developments around Llama 3 include context length extension to 16K tokens, new multimodal LLaVA models outperforming Llama 2, and fine-tuning improvements like QDoRA surpassing QLoRA. The Llama-3-70B model is praised for instruction following and performance across quantization formats. Microsoft's Phi-3 models, released in multiple sizes, show competitive benchmark results, with the 14B model achieving 78% on MMLU and the 3.8B model nearing GPT-3.5 performance.
FineWeb: 15T Tokens, 12 years of CommonCrawl (deduped and filtered, you're welcome)
llama-3-70b llama-3 wizardlm-2-8x22b claude-opus mistral-8x7b gpt-4 huggingface meta-ai-fair dbrx reka-ai mistral-ai lmsys openai datasets benchmarking quantization zero-shot-learning reasoning code-error-detection token-generation security
2024 has seen a significant increase in dataset sizes for training large language models, with RedPajama 2 offering up to 30T tokens, DBRX at 12T tokens, Reka Core/Flash/Edge with 5T tokens, and Llama 3 trained on 15T tokens. Hugging Face released FineWeb, an open dataset containing 15T tokens from 12 years of deduplicated and filtered CommonCrawl data, enabling training of models like Llama 3 if compute resources are available. On Reddit, WizardLM-2-8x22b outperformed other open LLMs including Llama-3-70b-instruct in reasoning and math benchmarks. Claude Opus demonstrated strong zero-shot code error spotting, surpassing Llama 3. Benchmarks revealed limitations in the LMSYS chatbot leaderboard due to instruction-tuned models gaming the system, and a new RAG benchmark showed Llama 3 70B underperforming compared to GPT-4, while Mistral 8x7B remained strong. Efficient quantized versions of Llama 3 models are available on Hugging Face, with users reporting token generation limits around 9600 tokens on a 3090 GPU. Safety and security concerns include a UK sex offender banned from AI tool usage and GPT-4 demonstrating an 87% success rate exploiting real vulnerabilities.
Llama-3-70b is GPT-4-level Open Model
llama-3-70b llama-3-8b llama-3 llama-2-70b mistral-7b grok-3 stable-diffusion-3 vasa-1 meta-ai-fair groq nvidia amazon microsoft benchmarking model-performance fine-tuning function-calling arithmetic image-generation video-generation energy-usage gpu-demand political-bias ai-safety scaling context-windows tokenization elon-musk
Meta has released Llama 3, their most capable open large language model with 8B and 70B parameter versions supporting 8K context length and outperforming previous models including Llama 2 and Mistral 7B. Groq serves the Llama 3 70B model at 500-800 tokens/second, making it the fastest GPT-4-level token source. Discussions highlight AI scaling challenges with Elon Musk stating that training Grok 3 will require 100,000 Nvidia H100 GPUs, and AWS planning to acquire 20,000 B200 GPUs for a 27 trillion parameter model. Microsoft unveiled VASA-1 for lifelike talking face generation, while Stable Diffusion 3 and its extensions received mixed impressions. Concerns about AI energy usage and political bias in AI were also discussed.
Meta Llama 3 (8B, 70B)
llama-3-8b llama-3-70b llama-3-400b stable-diffusion-3 mixtral-8x22b-instruct-v0.1 vasa-1 meta-ai-fair stability-ai boston-dynamics microsoft mistral-ai hugging-face transformer tokenization model-training benchmarking robotics natural-language-processing real-time-processing synthetic-data dataset-cleaning behavior-trees ai-safety model-accuracy api model-release humor helen-toner
Meta partially released Llama 3, with 8B and 70B variants available and a 400B variant still in training; Llama 3 is touted as the first GPT-4 level open-source model. Stability AI launched Stable Diffusion 3 API with model weights coming soon, showing competitive realism against Midjourney V6. Boston Dynamics unveiled an electric humanoid robot Atlas, and Microsoft introduced the VASA-1 model generating lifelike talking faces at 40fps on RTX 4090. Mistral AI, a European OpenAI rival, is seeking $5B funding with its Mixtral-8x22B-Instruct-v0.1 model achieving 100% accuracy on 64K context benchmarks. AI safety discussions include calls from former OpenAI board member Helen Toner for audits of top AI companies, and the Mormon Church released AI usage principles. New AI development tools include Ctrl-Adapter for diffusion models, Distilabel 1.0.0 for synthetic dataset pipelines, Data Bonsai for data cleaning with LLMs, and Dendron for building LLM agents with behavior trees. Memes highlight AI development humor and cultural references. The release of Llama 3 models features improved reasoning, a 128K token vocabulary, 8K token sequences, and grouped query attention.