Topic: "benchmarking"

    Execuhires Round 2: Scale-Meta, Lamini-AMD, and Instacart-OpenAI
    Reasoning Price War 2: Mistral Magistral + o3's 80% price cut + o3-pro
    Gemini 2.5 Pro (06-05) launched at AI Engineer World's Fair
    not much happened today
    DeepSeek-R1-0528 - Gemini 2.5 Pro-level model, SOTA Open Weights release
    not much happened today
    not much happened today
    not much happened today
    Granola launches team notes, while Notion launches meeting transcription
    not much happened today
    Gemini 2.5 Pro Preview 05-06 (I/O edition) - the SOTA vision+coding model
    not much happened today
    not much happened today
    LlamaCon: Meta AI gets into the Llama API platform business
    Qwen 3: 0.6B to 235B MoE full+base models that beat R1 and o1
    gpt-image-1 - ChatGPT's imagegen model, confusingly NOT 4o, now available in API
    Gemini 2.5 Flash completes the total domination of the Pareto Frontier
    Llama 4's Controversial Weekend Release
    not much happened today
    not much happened today
    not much happened today
    not much happened today
    Cohere's Command A claims #3 open model spot (after DeepSeek and Gemma)
    not much happened today
    not much happened today
    DeepSeek's Open Source Stack
    not much happened today
    not much happened today
    Anthropic's $61.5B Series E
    AI Engineer Summit Day 1
    not much happened today
    X.ai Grok 3 and Mira Murati's Thinking Machines
    LLaDA: Large Language Diffusion Models
    not much happened today
    small news items
    not much happened today
    not much happened today
    not much happened today
    OpenAI takes on Gemini's Deep Research
    o3-mini launches, OpenAI on "wrong side of history"
    OpenAI launches Operator, its first Agent
    not much happened today
    not much happened this weekend
    o3 solves AIME, GPQA, Codeforces, makes 11 years of progress in ARC-AGI and 25% in FrontierMath
    Meta Apollo - Video Understanding up to 1 hour, SOTA Open Weights
    Meta BLT: Tokenizer-free, Byte-level LLM
    Google wakes up: Gemini 2.0 et al
    $200 ChatGPT Pro and o1-full/pro, with vision, without API, and mixed reviews
    Olympus has dropped (aka, Amazon Nova Micro|Lite|Pro|Premier|Canvas|Reel)
    not much happened to end the week
    Qwen with Questions: 32B open weights reasoning model nears o1 in GPQA/AIME/Math500
    DeepSeek-R1 claims to beat o1-preview AND will be open sourced
    Gemini (Experimental-1114) retakes #1 LLM rank with 1344 Elo
    FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
    The AI Search Wars Have Begun — SearchGPT, Gemini Grounding, and more
    DeepSeek Janus and Meta SpiRit-LM: Decoupled Image and Expressive Voice Omnimodality
    Did Nvidia's Nemotron 70B train on test?
    not much happened today
    not much happened today
    ChatGPT Advanced Voice Mode
    not much happened today
    nothing much happened today
    o1: OpenAI's new general reasoning models
    Pixtral 12B: Mistral beats Llama to Multimodality
    not much happened today + AINews Podcast?
    AIPhone 16: the Visual Intelligence Phone
    Cerebras Inference: Faster, Better, AND Cheaper
    super quiet day
    Ideogram 2 + Berkeley Function Calling Leaderboard V2
    not much happened today
    Grok 2! and ChatGPT-4o-latest confuses everybody
    Gemini Live
    not much happened today
    GPT4o August + 100% Structured Outputs for All (GPT4o August edition)
    How Carlini Uses AI
    Apple Intelligence Beta + Segment Anything Model 2
    AlphaProof + AlphaGeometry2 reach 1 point short of IMO Gold
    Mistral Large 2 + RIP Mistral 7B, 8x7B, 8x22B
    Llama 3.1 Leaks: big bumps to 8B, minor bumps to 70b, and SOTA OSS 405b model
    DataComp-LM: the best open-data 7B model/benchmark/dataset
    Mini, Nemo, Turbo, Lite - Smol models go brrr (GPT4o-mini version)
    Mini, Nemo, Turbo, Lite - Smol models go brrr (GPT4o version)
    We Solved Hallucinations
    Problems with MMLU-Pro
    Qdrant's BM42: "Please don't trust us"
    Gemini Nano: 50-90% of Gemini Pro, <100ms inference, on device, in Chrome Canary
    Claude Crushes Code - 92% HumanEval and Claude.ai Artifacts
    Hybrid SSM/Transformers > Pure SSMs/Pure Transformers
    Francois Chollet launches $1m ARC Prize
    Qwen 2 beats Llama 3 (and we don't know how)
    Mamba-2: State Space Duality
    Ten Commandments for Deploying Fine-Tuned Models
    Chameleon: Meta's (unreleased) GPT4o-like Omnimodal Model
    Cursor reaches >1000 tok/s finetuning Llama3-70b for fast file editing
    LMSys advances Llama 3 eval analysis
    DeepSeek-V2 beats Mixtral 8x22B with >160 experts at HALF the cost
    $100k to predict LMSYS human preferences in a Kaggle contest
    Evals: The Next Generation
    Not much happened today
    Snowflake Arctic: Fully Open 10B+128x4B Dense-MoE Hybrid LLM
    OpenAI's Instruction Hierarchy for the LLM OS
    Perplexity, the newest AI unicorn
    FineWeb: 15T Tokens, 12 years of CommonCrawl (deduped and filtered, you're welcome)
    Llama-3-70b is GPT-4-level Open Model
    Meta Llama 3 (8B, 70B)
    Music's Dall-E moment
    DBRX: Best open model (just not most efficient)
    Claude 3 is officially America's Next Top Model
    Welcome /r/LocalLlama!
    Grok-1 in Bio
    DeepMind SIMA: one AI, 9 games, 600 tasks, vision+language ONLY
    Fixing Gemma
    FSDP+QLoRA: the Answer to 70b-scale AI for desktop class GPUs
    Inflection-2.5 at 94% of GPT4, and Pi at 6m MAU
    Stable Diffusion 3 — Rombach & Esser did it again!
    Claude 3 just destroyed GPT 4 (see for yourself)
    Mistral Large disappoints
    Google AI: Win some (Gemma, 1.5 Pro), Lose some (Image gen)
    Adept Fuyu-Heavy: Multimodal model for Agents
    Nightshade poisons AI art... kinda?
    1/4/2024: Jeff Bezos backs Perplexity's $520m Series B.
    12/30/2023: Mega List of all LLMs
    12/28/2023: Smol Talk updates
    12/22/2023: Anyscale's Benchmark Criticisms
    12/9/2023: The Mixtral Rush