Company: lmsys
Anthropic's $61.5B Series E
gpt-4.5 claude-3.7-sonnet deepseek-r1 anthropic openai deepseek lmsys perplexity-ai deutsche-telekom model-performance benchmarking style-control coding multi-turn funding partnerships workflow lmarena_ai teortaxestex casper_hansen_ omarsar0 aidan_mclau willdepue vikhyatk teknim1 reach_vb _aidan_clark_ cto_junior aravsrinivas
Anthropic raised a $3.5 billion Series E funding round at a $61.5 billion valuation, signaling strong financial backing for the Claude AI model. GPT-4.5 achieved #1 rank across all categories on the LMArena leaderboard, excelling in multi-turn conversations, coding, math, creative writing, and style control. DeepSeek R1 tied with GPT-4.5 for top performance on hard prompts with style control. Discussions highlighted comparisons between GPT-4.5 and Claude 3.7 Sonnet in coding and workflow applications. The importance of the LMSYS benchmark was emphasized, though some questioned the relevance of benchmarks versus user acquisition. Additionally, Perplexity AI partnered with Deutsche Telekom to integrate the Perplexity Assistant into a new AI phone.
nothing much happened today
o1 chatgpt-4o llama-3-1-405b openai lmsys scale-ai cognition langchain qdrant rohanpaul_ai reinforcement-learning model-merging embedding-models toxicity-detection image-editing dependency-management automated-code-review visual-search benchmarking denny_zhou svpino alexandr_wang cwolferesearch rohanpaul_ai _akhaliq kylebrussell
OpenAI's o1 model faces skepticism about open-source replication due to its extreme restrictions and unique training advances like RL on CoT. ChatGPT-4o shows significant performance improvements across benchmarks. Llama-3.1-405b fp8 and bf16 versions perform similarly with cost benefits for fp8. A new open-source benchmark "Humanity's Last Exam" offers $500K in prizes to challenge LLMs. Model merging benefits from neural network sparsity and linear mode connectivity. Embedding-based toxic prompt detection achieves high accuracy with low compute. InstantDrag enables fast, optimization-free drag-based image editing. LangChain v0.3 releases with improved dependency management. Automated code review tool CodeRabbit adapts to team coding styles. Visual search advances integrate multimodal data for better product search. Experts predict AI will be default software by 2030.
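The embedding-based toxic prompt detection mentioned above can be sketched as follows. This is a minimal illustration, not the paper's method: `embed()` is a hypothetical stand-in for a real embedding model, the prompts and labels are toy data, and a simple logistic regression stands in for whatever classifier the actual work used.

```python
# Minimal sketch of embedding-based toxic prompt detection:
# embed each prompt, then train a lightweight classifier on the
# embeddings, so inference needs only one embedding call plus a
# cheap linear model.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(prompt: str) -> np.ndarray:
    # Placeholder: a real system would call an embedding model here.
    # We hash the prompt into a deterministic pseudo-embedding so the
    # sketch is self-contained.
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).normal(size=64)

# Toy labeled data: 1 = toxic, 0 = benign.
prompts = ["how do I bake bread", "friendly greeting",
           "harmful request A", "harmful request B"]
labels = [0, 0, 1, 1]

X = np.stack([embed(p) for p in prompts])
clf = LogisticRegression().fit(X, labels)

def is_toxic(prompt: str, threshold: float = 0.5) -> bool:
    # Score a new prompt with the trained linear classifier.
    return bool(clf.predict_proba(embed(prompt)[None, :])[0, 1] >= threshold)
```

The appeal of this design is the cost profile: the classifier adds negligible compute on top of an embedding call, which is why the summary describes it as high accuracy at low compute.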
not much happened today + AINews Podcast?
superforecaster-ai llama-3 reflection-70b glean sambanova cerebras stanford google apple hugging-face lmsys prompt-engineering research-ideas inference-speed retrieval-augmented-generation evaluation-methods visual-intelligence on-device-ai model-performance benchmarking novelty-detection danhendrycks benjamin-clavie bclavie bindureddy swyx borismpower corbtt drjimfan clementdelangue rohanpaul_ai
Glean doubled its valuation again. Dan Hendrycks' Superforecaster AI generates plausible election forecasts with interesting prompt engineering. A Stanford study found that LLM-generated research ideas are statistically more novel than those by expert humans. SambaNova announced faster inference for llama-3 models, surpassing Cerebras. Benjamin Clavie gave a notable talk on retrieval-augmented generation techniques. Strawberry is reported to launch in two weeks. Google Illuminate offers AI-generated podcast discussions about papers and books. Apple unveiled new AI features in iOS 18, including visual intelligence and improved Siri, with on-device and cloud processing for camera-based event additions. The Reflection 70B model sparked controversy over performance claims. Experts highlighted the unreliability of traditional benchmarks like MMLU and HumanEval, recommending alternative evaluation methods such as LMSys Chatbot Arena and Hugging Face's open-sourced Lighteval suite. The AI research community continues to explore AI's role in generating novel research ideas and improving benchmarking.
not much happened today
llama-3-1 claude-3-5-sonnet llama-3-1-405b ltm-2-mini qwen2-vl gpt-4o-mini meta-ai-fair hugging-face magic-ai-labs lmsys alibaba openai long-context style-control multimodality ai-safety model-evaluation web-crawling pdf-processing ai-hype-cycles call-center-automation sam-altman ajeya-cotra fchollet rohanpaul_ai philschmid
Meta announced significant adoption of LLaMA 3.1 with nearly 350 million downloads on Hugging Face. Magic AI Labs introduced LTM-2-Mini, a long context model with a 100 million token context window, and a new evaluation method called HashHop. LMSys added style control to their Chatbot Arena leaderboard, improving rankings for models like Claude 3.5 Sonnet and LLaMA 3.1 405B. Alibaba released Qwen2-VL, a multimodal LLM under Apache 2.0 license, competitive with GPT-4o mini. OpenAI CEO Sam Altman announced collaboration with the US AI Safety Institute for pre-release model testing. Discussions on AI safety and potential AI takeover risks were highlighted by Ajeya Cotra. Tools like firecrawl for web crawling and challenges in PDF processing were noted. AI hype cycles and market trends were discussed by François Chollet, and potential AI disruption in call centers was shared by Rohan Paul.
Execuhires: Tempting The Wrath of Khan
gemini-1.5-pro gpt-4o claude-3.5 flux-1 llama-3-1-405b character.ai google adept amazon inflection microsoft stability-ai black-forest-labs schelling google-deepmind openai anthropic meta-ai-fair lmsys langchainai execuhire model-benchmarking multilinguality math coding text-to-image agent-ide open-source-models post-training data-driven-performance noam-shazeer mostafa-mostaque david-friedman rob-rombach alexandr-wang svpino rohanpaul_ai
Character.ai's $2.5b execuhire to Google marks a significant leadership move alongside Adept's $429m execuhire to Amazon and Inflection's $650m execuhire to Microsoft. Despite strong user growth and content momentum, Character.ai's CEO Noam Shazeer returns to Google, signaling shifting vibes in the AI industry. Google DeepMind's Gemini 1.5 Pro tops Chatbot Arena benchmarks, outperforming GPT-4o and Claude-3.5, excelling in multilingual, math, and coding tasks. The launch of Black Forest Labs' FLUX.1 text-to-image model and LangGraph Studio agent IDE highlight ongoing innovation. Llama 3.1 405B is released as the largest open-source model, fostering developer use and competition with closed models. The industry is focusing increasingly on post-training and data as key competitive factors, raising questions about acquisition practices and regulatory scrutiny.
Gemma 2 2B + Scope + Shield
gemma-2b gemma-2-9b gemma-2-27b llama-3-1-405b sam-2 gpt-3.5 vicuna alpacaeval g-eval google-deepmind anthropic meta-ai-fair openai perplexity-ai nvidia lmsys knowledge-distillation leaderboards model-interpretability finetuning harm-detection video-segmentation voice publishers-program robotics-data-scaling quantization llm-evaluation prompt-engineering
Gemma 2B, a 2 billion parameter model trained on 2 trillion tokens and distilled from a larger unnamed LLM, has been released by Google DeepMind and shows strong leaderboard performance despite weaknesses in math. The Gemma series, including 9B and 27B models, has gained popularity since its June release. The team also released 400 sparse autoencoders (SAEs) for interpretability, inspired by Anthropic's research. A finetuned classifier called ShieldGemma outperforms Meta's LlamaGuard in harm detection. Meanwhile, Meta AI announced Llama-3.1-405B reaching #3 on the Overall Arena leaderboard, and released SAM 2, a video and image segmentation model with significant speed improvements. OpenAI is rolling out an advanced Voice Mode to Plus users. Perplexity AI launched a Publishers Program with major media partners and a status page. NVIDIA introduced Project GR00T for scaling robot data using Apple Vision Pro and generative simulation. Interest in quantization for compressing LLMs is growing, and LLM-as-a-Judge implementations from Vicuna, AlpacaEval, and G-Eval highlight the effectiveness of simple prompts and domain-specific evaluation.
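The LLM-as-a-Judge point above can be made concrete with a small sketch in the spirit of the Vicuna/AlpacaEval-style pairwise prompts: a simple comparison template plus a parser for the judge's verdict. The template wording and helper names here are illustrative assumptions, not the exact prompts those projects use, and the call to the judge model itself is left out.

```python
# Sketch of a simple pairwise LLM-as-a-Judge prompt: the judge model
# (not called here) receives both answers and replies "A", "B", or "tie".
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two answers
to the user question below and reply with exactly "A", "B", or "tie".

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}

Verdict:"""

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    # Fill the template; this string is what gets sent to the judge model.
    return JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b)

def parse_verdict(reply: str) -> str:
    # Normalize the judge's free-text reply to one of "A", "B", "tie".
    text = reply.strip()
    if not text:
        return "tie"
    token = text.split()[0].strip('."').upper()
    return token if token in ("A", "B") else "tie"
```

The point the summary makes is that prompts this simple, paired with domain-specific criteria in the template, already correlate well with human preference judgments.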
Mini, Nemo, Turbo, Lite - Smol models go brrr (GPT4o version)
gpt-4o-mini mistral-nemo llama-3 llama-3-400b deepseek-v2 openai nvidia mistral-ai togethercompute deepseek-ai lmsys model-quantization context-windows instruction-following model-performance cost-efficiency multimodality benchmarking open-source model-release sam-altman
GPT-4o-mini launches with a 99% price reduction compared to text-davinci-003, costing just 3.5% of the price of GPT-4o while matching Opus-level benchmarks. It supports 16k output tokens, is faster than previous models, and will soon support text, image, video, and audio inputs and outputs. Mistral Nemo, a 12B parameter model developed with Nvidia, features a 128k token context window, an FP8 checkpoint, and strong benchmark performance. Together Lite and Turbo offer fp8/int4 quantizations of Llama 3 with up to 4x throughput and significantly reduced costs. DeepSeek V2 is now open-sourced. Upcoming releases include at least 5 unreleased models, and Llama 4 leaks surfaced ahead of ICML 2024.
RouteLLM: RIP Martian? (Plus: AINews Structured Summaries update)
gpt-4 gemma-2-27b gemma-2-9b lmsys openai llm-routing cost-efficiency model-performance model-optimization data-augmentation syntax-based-routing mixture-of-experts inference-throughput software-2.0 computer-vision karpathy bindureddy armand-joulin
LMSys introduces RouteLLM, an open-source router framework trained on preference data from Chatbot Arena, achieving cost reductions of over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K while maintaining 95% of GPT-4's performance. This approach surpasses previous task-specific routing by using syntax-based Mixture of Experts (MoE) routing and data augmentation, beating commercial solutions by 40%. The update highlights advances in LLM routing, cost-efficiency, and model performance optimization across multiple models rather than single-model or MoE-level improvements. Additionally, the AI Twitter recap notes the Gemma 2 model family as a top open model, the Block Transformer architecture for improved inference throughput, and a proposal by karpathy for a fully Software 2.0 computer vision system.
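The cost-quality routing idea behind RouteLLM can be sketched as follows. This is a toy illustration of the concept only: the scoring heuristic below is invented for the example, whereas RouteLLM's actual routers are learned from Chatbot Arena preference data.

```python
# Sketch of strong/weak model routing: a router scores each query for
# how likely it is to need the strong model; cheap queries go to the
# weak model, cutting cost while preserving most of the quality.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Router:
    score: Callable[[str], float]  # estimated need for the strong model, in [0, 1]
    threshold: float = 0.5

    def route(self, query: str) -> str:
        # Above the threshold, pay for the strong model; otherwise save cost.
        return "strong" if self.score(query) >= self.threshold else "weak"

def toy_score(query: str) -> float:
    # Invented difficulty proxy: longer or code/math-flavored prompts
    # score higher. A real router is a model trained on preference data.
    hard_markers = ("prove", "implement", "debug", "derive")
    base = min(len(query) / 200, 0.5)
    return base + (0.5 if any(m in query.lower() for m in hard_markers) else 0.0)

router = Router(score=toy_score, threshold=0.5)
```

Tuning the threshold trades cost against quality: lowering it sends more traffic to the strong model, which is how the reported "95% of GPT-4 performance at a fraction of the cost" operating points are chosen.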
Shall I compare thee to a Sonnet's day?
claude-3.5-sonnet claude-3.5 gpt-4o gemini-1.5-pro anthropic lmsys glif comfyui hard-prompts json json-extraction meme-generation instruction-following app-development fusion-energy nuclear-fission productivity fchollet mustafasuleyman
Claude 3.5 Sonnet from Anthropic achieves top rankings in the coding and hard prompt arenas, surpassing GPT-4o and competing with Gemini 1.5 Pro at lower cost. Glif demonstrates a fully automated Wojak meme generator using Claude 3.5 for JSON generation and ComfyUI for images, showcasing new JSON extractor capabilities. Anthropic's Artifacts feature enables rapid creation of niche apps, exemplified by a dual monitor visualizer made in under 5 minutes. François Chollet highlights that fusion energy is not a near-term solution compared to existing nuclear fission plants. Mustafa Suleyman notes that 75% of desk workers now use AI, marking a shift toward AI-assisted productivity.
Not much happened today
gemini-1.5-flashmodel gemini-pro mixtral mamba-2 phi-3-medium phi-3-small gpt-3.5-turbo-0613 llama-3-8b llama-2-70b mistral-finetune twelve-labs livekit groq openai nea nvidia lmsys mistral-ai model-performance prompt-engineering data-curation ai-safety model-benchmarking model-optimization training sequence-models state-space-models daniel-kokotajlo rohanpaul_ai _arohan_ tri_dao _albertgu _philschmid sarahcat21 hamelhusain jachiam0 willdepue teknium1
Twelve Labs raised $50m in Series A funding co-led by NEA and NVIDIA's NVentures to advance multimodal AI. LiveKit secured $22m in funding. Groq announced inference throughput of 800k tokens/second. OpenAI saw a resignation from Daniel Kokotajlo. Twitter users highlighted Gemini 1.5 Flash for high performance at low cost and Gemini Pro ranking #2 in Japanese language tasks. Mixtral models can run up to 8x faster on NVIDIA RTX GPUs using TensorRT-LLM. The Mamba-2 model architecture introduces state space duality for larger states and faster training, outperforming previous models. Phi-3 Medium (14B) and Small (7B) models benchmark near GPT-3.5-Turbo-0613 and Llama 3 8B. Prompt engineering is emphasized for unlocking LLM capabilities. Data quality is critical for model performance, with upcoming masterclasses on data curation. Discussions on AI safety include a frontier AI lab employee letter advocating whistleblower protections and debates on aligning AI to user intent versus broader humanity interests.
GPT-4o: the new SOTA-EVERYTHING Frontier model (GPT4O version)
gpt-4o gpt-4-turbo openai lmsys multion adept multimodality vision speech-recognition tokenization real-time-processing coding model-performance model-optimization desktop-agents sama gdb
OpenAI has released GPT-4o, a new multimodal model capable of reasoning across text, audio, and video in real time with low latency (~300ms). It features voice and vision capabilities, improved non-English language performance with an expanded 200k vocabulary tokenizer, and is available to all ChatGPT users including free plans. GPT-4o is half the price and twice as fast as GPT-4-turbo with 5x rate limits. The model supports real-time voice and video input/output and shows strong coding capabilities. The release includes a new desktop app that can read screen and clipboard history, challenging existing desktop agent startups. The announcement was accompanied by demos including image generation and 3D object handling, with OpenAI achieving state-of-the-art performance in ASR and vision tasks. The update was widely discussed on social media, with comparisons to GPT-4T highlighting GPT-4o's speed and versatility, and reactions such as "GPT-4o is smart, fast, natively multimodal, and a step towards more natural human-computer interaction" and "extremely versatile and fun to play with".
LMSys advances Llama 3 eval analysis
llama-3-70b llama-3 claude-3-sonnet alphafold-3 lmsys openai google-deepmind isomorphic-labs benchmarking model-behavior prompt-complexity model-specification molecular-structure-prediction performance-analysis leaderboards demis-hassabis sam-altman miranda-murati karina-nguyen joanne-jang john-schulman
LMSys is enhancing LLM evaluation by categorizing performance across 8 query subcategories and 7 prompt complexity levels, revealing uneven strengths in models like Llama-3-70b. DeepMind released AlphaFold 3, advancing molecular structure prediction with holistic modeling of protein-DNA-RNA complexes, impacting biology and genetics research. OpenAI introduced the Model Spec, a public standard to clarify model behavior and tuning, inviting community feedback and aiming for models to learn directly from it. Llama 3 has reached top leaderboard positions on LMSys, nearly matching Claude-3-sonnet in performance, with notable variations on complex prompts. The analysis highlights the evolving landscape of model benchmarking and behavior shaping.
$100k to predict LMSYS human preferences in a Kaggle contest
llama-3-70b llama-3 gpt-4 claude-3-opus prometheus-2 groq openai lmsys scale-ai ai2 nvidia benchmarking datasets fine-tuning reinforcement-learning model-alignment hallucination parameter-efficient-fine-tuning scalable-training factuality chatbot-performance bindureddy drjimfan percyliang seungonekim mobicham clefourrier
Llama 3 models are making breakthroughs, with Groq serving the 70B model at record-low cost per million tokens. A new Kaggle competition offers a $100,000 prize to develop models predicting human preferences from a dataset of over 55,000 user-LLM conversations. Open-source evaluator LLMs like Prometheus 2 outperform proprietary models such as GPT-4 and Claude 3 Opus in judgment tasks. New datasets like WildChat1M provide over 1 million ChatGPT interaction logs with diverse and toxic examples. Techniques like LoRA fine-tuning show significant performance gains, and NVIDIA's NeMo-Aligner toolkit enables scalable LLM alignment across hundreds of GPUs. Factuality-aware alignment methods are proposed to reduce hallucinations in LLM outputs.
A quiet weekend
llama-3 dolphin-2.9 pixart-sigma llama-3-70b microsoft coca-cola uber lmsys nous-research mistral-ai ar-interfaces transformers algorithmic-tasks turing-test graph-algorithms embeddings generative-ai model-optimization llm-inference quantization model-deployment yann-lecun
Yann LeCun predicts a shift to AR interfaces with AI assistants in 10-15 years, moving away from smartphones. The Dolphin-2.9 model based on Llama-3 was released, improving quality issues. PixArt Sigma, a 0.6B parameter model, achieves Stable Diffusion 3.0 level performance with complete prompt adherence and local usability. Research shows transformers can use meaningless filler tokens for algorithmic tasks with dense supervision. AI-generated restaurant reviews can pass the Turing test, fooling humans and AI detectors. Uber uses graph algorithms and learned embeddings for ETA prediction. Coca-Cola and Microsoft announced a 5-year AI partnership to accelerate cloud and generative AI initiatives. The Llama-3 70B model can run on a single 4GB GPU using AirLLM optimization without quantization, though slowly. Mistral.rs is introduced as a fast LLM inference platform with quantization support and OpenAI API compatibility. Only 5% of LLM applications make it from prototype to production, with enterprise deployments facing particular challenges. EXL2 and GGUF quantization methods for Llama models show similar perplexity vs model size, with Llama-3 and Llama-2 degrading more under quantization compared to full precision.
Snowflake Arctic: Fully Open 10B+128x4B Dense-MoE Hybrid LLM
snowflake-arctic phi-3 llama-3-70b llama-3 stable-diffusion-3 sd3-turbo gpt-3.5-turbo snowflake databricks deepseek deepspeed nvidia stable-diffusion adobe apple llamaindex lmsys openai mixture-of-experts curriculum-learning model-release image-generation video-upscaling quantization inference-speed benchmarking model-comparison open-source on-device-ai
Snowflake Arctic is a notable new foundation language model released under Apache 2.0, claiming superiority over Databricks in data warehouse AI applications and adopting a mixture-of-experts architecture inspired by DeepSeekMOE and DeepSpeedMOE. The model employs a 3-stage curriculum training strategy similar to the recent Phi-3 paper. In AI image and video generation, Nvidia introduced the Align Your Steps technique improving image quality at low step counts, while Stable Diffusion 3 and SD3 Turbo models were compared for prompt understanding and image quality. Adobe launched an AI video upscaling project enhancing blurry videos to HD, though with some high-resolution artifacts. Apple released open-source on-device language models with code and training logs, diverging from typical weight-only releases. The Llama-3-70b model ties for first place on the LMSYS leaderboard for English queries, and Phi-3 (4B params) outperforms GPT-3.5 Turbo in the banana logic benchmark. Fast inference and quantization of Llama 3 models were demonstrated on MacBook devices.
FineWeb: 15T Tokens, 12 years of CommonCrawl (deduped and filtered, you're welcome)
llama-3-70b llama-3 wizardlm-2-8x22b claude-opus mistral-8x7b gpt-4 huggingface meta-ai-fair dbrx reka-ai mistral-ai lmsys openai datasets benchmarking quantization zero-shot-learning reasoning code-error-detection token-generation security
2024 has seen a significant increase in dataset sizes for training large language models, with Redpajama 2 offering up to 30T tokens, DBRX at 12T tokens, Reka Core/Flash/Edge with 5T tokens, and Llama 3 trained on 15T tokens. Huggingface released an open dataset containing 15T tokens from 12 years of filtered CommonCrawl data, enabling training of models like Llama 3 if compute resources are available. On Reddit, WizardLM-2-8x22b outperformed other open LLMs including Llama-3-70b-instruct in reasoning and math benchmarks. Claude Opus demonstrated strong zero-shot code error spotting, surpassing Llama 3. Benchmarks revealed limitations in the LMSYS chatbot leaderboard due to instruction-tuned models gaming the system, and a new RAG benchmark showed Llama 3 70B underperforming compared to GPT-4, while Mistral 8x7B remained strong. Efficient quantized versions of Llama 3 models are available on Huggingface, with users reporting token generation limits around 9600 tokens on a 3090 GPU. On the safety front, a UK sex offender was banned from using AI tools, and GPT-4 demonstrated an 87% success rate at exploiting real-world vulnerabilities, raising security concerns.
Music's Dall-E moment
griffin command-r-plus gpt-4-0613 gpt-4-0314 mistral-8x22b codegemma stable-diffusion-1.5 command-r gemini-1.5 google mistral-ai lmsys cohere model-architecture benchmarking open-source model-quantization memory-optimization inference-speed multimodality finetuning performance-optimization audio-processing andrej-karpathy
Google's Griffin architecture outperforms transformers with faster inference and lower memory usage on long contexts. Command R+ climbs to 6th place on the LMSYS Chatbot Arena leaderboard, surpassing GPT-4-0613 and GPT-4-0314. Mistral AI releases an open-source 8x22B model with a 64K context window and around 130B total parameters. Google open-sources CodeGemma models with pre-quantized 4-bit versions for faster downloads. ELLA weights enhance Stable Diffusion 1.5 with an LLM for better semantic alignment. Unsloth enables 4x larger context windows and 80% memory reduction for finetuning. Andrej Karpathy releases LLMs implemented in pure C for potential performance gains. Command R+ runs in realtime on an M2 Max MacBook using iMat q1 quantization. Cohere's Command R model offers low API costs and strong leaderboard performance. Gemini 1.5 impresses with audio capabilities, recognizing speech tone and identifying speakers from audio clips.
RWKV "Eagle" v5: Your move, Mamba
rwkv-v5 mistral-7b miqu-1-70b mistral-medium llama-2 mistral-instruct-v0.2 mistral-tuna llama-2-13b kunoichi-dpo-v2-7b gpt-4 eleutherai mistral-ai hugging-face llamaindex nous-research rwkv lmsys fine-tuning multilinguality rotary-position-embedding model-optimization model-performance quantization speed-optimization prompt-engineering model-benchmarking reinforcement-learning andrej-karpathy
RWKV v5 Eagle was released with evaluation results surpassing Mistral-7B, trading some English performance for multilingual capabilities. The mysterious miqu-1-70b model sparked debate about its origins, possibly a leak or distillation of Mistral Medium or a fine-tuned Llama 2. Discussions highlighted fine-tuning techniques, including the effectiveness of 1,000 high-quality prompts over larger mixed-quality datasets, and tools like Deepspeed, Axolotl, and QLoRA. The Nous Research AI community emphasized the impact of Rotary Position Embedding (RoPE) theta settings on LLM extrapolation, improving models like Mistral Instruct v0.2. Speed improvements in Mistral Tuna kernels reduced token processing costs, enhancing efficiency. The launch of Eagle 7B with 7.52B parameters showcased strong multilingual performance, surpassing other 7B class models.
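The RoPE theta point above can be illustrated with a short sketch: in rotary embeddings, each pair of embedding dimensions rotates at rate theta**(-2i/d), so raising the base theta slows the rotation and stretches the effective context, which is why tuning it helps extrapolation. The dimensions and theta values below are illustrative, not any model's actual configuration.

```python
# Sketch of RoPE frequencies and rotation: larger theta -> lower
# per-position rotation rates -> positions further apart remain
# distinguishable, aiding long-context extrapolation.
import numpy as np

def rope_freqs(dim: int, theta: float) -> np.ndarray:
    # One frequency per pair of embedding dimensions.
    i = np.arange(0, dim, 2)
    return theta ** (-i / dim)

def rotate(x: np.ndarray, pos: int, theta: float) -> np.ndarray:
    # Apply the RoPE rotation at position `pos` to a vector of even dim.
    f = rope_freqs(x.shape[-1], theta)
    ang = pos * f
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    # Rotate each (x1, x2) pair by its angle; norms are preserved.
    return np.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1).reshape(-1)

# Illustrative comparison: a larger base slows every non-trivial frequency.
f_small = rope_freqs(64, 10_000.0)     # common default base
f_large = rope_freqs(64, 1_000_000.0)  # raised base for longer context
```

Because the rotation is norm-preserving and only relative angles matter to attention scores, raising theta is a cheap knob: the same weights see slower-moving positions, which is the intuition behind the RoPE-theta tuning discussed above.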
12/15/2023: Mixtral-Instruct beats Gemini Pro (and matches GPT3.5)
mixtral gemini-pro gpt-3.5 gpt-4.5 gpt-4 chatgpt lmsys openai deepseek cloudflare huggingface performance context-window prompt-engineering privacy local-gpu cloud-gpu code-generation model-comparison model-usage api-errors karpathy
Thanks to a Karpathy shoutout, LMSYS now has enough data to rank Mixtral and Gemini Pro. The discussion highlighted the impressive performance of state-of-the-art open-source models like Mixtral that can run on laptops. In the OpenAI Discord, users compared AI tools like Perplexity and ChatGPT's browsing tool, favoring Perplexity for its superior data gathering, pricing, and usage limits. Interest was shown in AI's ability to convert large code files, with DeepSeek Coder recommended. Debates on privacy implications for AI advancement and challenges of running LLMs on local and cloud GPUs were prominent. Users reported issues with ChatGPT including performance problems, loss of access to custom GPTs, and unauthorized access. Discussions also covered prompt engineering for large context windows and speculation about future GPT-4.5 and GPT-4 developments.