Model: "claude-3-sonnet"
OpenAI o3, o4-mini, and Codex CLI
o3 o4-mini gemini-2.5-pro claude-3-sonnet chatgpt openai reinforcement-learning performance vision tool-use open-source coding-agents model-benchmarking multimodality scaling inference sama aidan_mclau markchen90 gdb aidan_clark_ kevinweil swyx polynoamial scaling01
OpenAI launched the o3 and o4-mini models, emphasizing improvements in reinforcement-learning scaling and overall efficiency, making o4-mini cheaper and better across prioritized metrics. These models showcase enhanced vision and tool use capabilities, though API access for these features is pending. The release includes Codex CLI, an open-source coding agent that integrates with these models to convert natural language into working code. Accessibility extends to ChatGPT Plus, Pro, and Team users, with o3 being notably more expensive than Gemini 2.5 Pro. Performance benchmarks highlight the intelligence gains from scaling inference, with comparisons against models like Sonnet and Gemini. The launch has been well received despite some less favorable evaluation results.
TinyZero: Reproduce DeepSeek R1-Zero for $30
deepseek-r1 qwen o1 claude-3-sonnet claude-3 prime ppo grpo llama-stack deepseek berkeley hugging-face meta-ai-fair openai deeplearningai reinforcement-learning fine-tuning chain-of-thought multi-modal-benchmark memory-management model-training open-source agentic-workflow-automation model-performance jiayi-pan saranormous reach_vb lmarena_ai nearcyan omarsar0 philschmid hardmaru awnihannun winglian
DeepSeek Mania continues to reshape the frontier model landscape with Jiayi Pan from Berkeley reproducing the OTHER result from the DeepSeek R1 paper, R1-Zero, in a cost-effective Qwen model fine-tune for two math tasks. A key finding is a lower bound to the distillation effect at 1.5B parameters, with RLCoT reasoning emerging as an intrinsic property. Various RL techniques like PPO, DeepSeek's GRPO, or PRIME show similar outcomes, and starting from an Instruct model speeds convergence. The Humanity’s Last Exam (HLE) Benchmark introduces a challenging multi-modal test with 3,000 expert-level questions across 100+ subjects, where models perform below 10%, with DeepSeek-R1 achieving 9.4%. DeepSeek-R1 excels in chain-of-thought reasoning, outperforming models like o1 while being 20x cheaper and MIT licensed. The WebDev Arena Leaderboard ranks DeepSeek-R1 #2 in technical domains and #1 under Style Control, closing in on Claude 3.5 Sonnet. OpenAI's Operator is deployed to 100% of Pro users in the US, enabling tasks like ordering meals and booking reservations, and functions as a research assistant for AI paper searches and summaries. Hugging Face announces a leadership change after significant growth, while Meta AI releases the first stable version of Llama Stack with streamlined upgrades and automated verification. DeepSeek-R1's open-source success is celebrated, and technical challenges like memory management on macOS 15+ are addressed with residency sets in MLX for stability.
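As a rough illustration of the GRPO idea mentioned above, the sketch below computes group-relative advantages: several completions are sampled for the same prompt, each is scored, and every reward is normalized against the group's own mean and standard deviation instead of a learned value function. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for this example; it is not taken from the TinyZero or DeepSeek code.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the other samples for the same prompt.

    Assumption: population standard deviation plus a small eps so that a
    group with identical rewards yields zero advantages instead of dividing by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one math prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct answers get a positive advantage and incorrect ones a negative advantage, so the policy update pushes toward the better half of each group without training a separate critic.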
Bespoke-Stratos + Sky-T1: The Vicuna+Alpaca moment for reasoning
sky-t1-32b-preview qwen-2.5-32b r1 o1-preview gpt-4o claude-3-sonnet bespoke-stratos-32b gemini-2.0-flash-thinking berkeley usc deepseek bespoke-labs google llmsys stanford lm-sys reasoning supervised-finetuning reinforcement-learning multimodality model-distillation context-windows code-execution model-repeatability behavioral-self-awareness rlhf teortaxestex cwolferesearch madiator chakraai philschmid abacaj omarsar0
Reasoning Distillation has emerged as a key technique, with Berkeley/USC researchers releasing Sky-T1-32B-Preview, a fine-tune of Qwen 2.5 32B on 17k reasoning traces that cost just $450 and matches o1-preview on benchmarks. DeepSeek introduced R1, a model surpassing o1-preview and enabling distillation to smaller models like a 1.5B Qwen to match gpt-4o and claude-3-sonnet levels. Bespoke Labs further distilled R1 on Qwen, outperforming o1-preview with fewer samples. This progress suggests that "SFT is all you need" for reasoning without major architecture changes. Additionally, DeepSeek-R1 uses pure reinforcement learning with supervised finetuning to accelerate convergence and shows strong reasoning and multimodal capabilities. Google's Gemini 2.0 Flash Thinking model boasts a 1 million token context window and code execution, and excels in math, science, and multimodal reasoning. Critiques highlight challenges in model repeatability, behavioral self-awareness, and RLHF limitations in reasoning robustness.
Gemini (Experimental-1114) retakes #1 LLM rank with 1344 Elo
claude-3-sonnet gpt-4 gemini-1.5 claude-3.5-sonnet anthropic openai langchain meta-ai-fair benchmarking prompt-engineering rag visuotactile-perception ai-governance theoretical-alignment ethical-alignment jailbreak-robustness model-releases alignment richardmcngo andrewyng philschmid
Anthropic released the 3.5 Sonnet benchmark for jailbreak robustness, emphasizing adaptive defenses. OpenAI enhanced GPT-4 with a new RAG technique for contiguous chunk retrieval. LangChain launched Promptim for prompt optimization. Meta AI introduced NeuralFeels with neural fields for visuotactile perception. Richard Ngo (@RichardMCNgo) resigned from OpenAI, highlighting concerns about AI governance and theoretical alignment. Discussions emphasized the importance of truthful public information and ethical alignment in AI deployment. The latest Gemini update marks a new #1 LLM amid alignment challenges. The AI community continues to focus on benchmarking, prompt engineering, and alignment issues.
OpenAI beats Anthropic to releasing Speculative Decoding
claude-3-sonnet mrt5 openai anthropic nvidia microsoft boston-dynamics meta-ai-fair runway elevenlabs etched osmo physical-intelligence langchain speculative-decoding prompt-lookup cpu-inference multimodality retrieval-augmented-generation neural-networks optimization ai-safety governance model-architecture inference-economics content-generation adcock_brett vikhyatk dair_ai rasbt bindureddy teortaxestex svpino c_valenzuelab davidsholz
Prompt lookup and Speculative Decoding techniques are gaining traction with implementations from Cursor, Fireworks, and teased features from Anthropic. OpenAI has introduced faster response times and file edits with these methods, offering about 50% efficiency improvements. The community is actively exploring AI engineering use cases with these advancements. Recent updates highlight progress from companies like NVIDIA, OpenAI, Anthropic, Microsoft, Boston Dynamics, and Meta. Key technical insights include CPU inference capabilities, multimodal retrieval-augmented generation (RAG), and neural network fundamentals. New AI products include fully AI-generated games and advanced content generation tools. Challenges in AI research labs such as bureaucracy and resource allocation were also discussed, alongside AI safety and governance concerns.
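To make the prompt lookup idea concrete, here is a minimal sketch under stated assumptions: instead of running a small draft model, the trailing n-gram of the sequence is matched against earlier text and the tokens that followed the match are proposed as a draft, which the target model then verifies. The names `prompt_lookup_draft`, `accept_draft`, and `verify_next` are hypothetical and do not correspond to any vendor's API. A real system verifies the whole draft in one batched forward pass, whereas this toy checks tokens one at a time.

```python
from typing import Callable, List, Sequence


def prompt_lookup_draft(tokens: Sequence[int], ngram: int = 3, max_draft: int = 5) -> List[int]:
    """Propose draft tokens by copying what followed the most recent earlier
    occurrence of the trailing n-gram. Returns [] when there is no match."""
    if len(tokens) <= ngram:
        return []
    tail = list(tokens[-ngram:])
    # Scan candidate start positions right to left, skipping the tail itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == tail:
            follow = start + ngram
            return list(tokens[follow:follow + max_draft])
    return []


def accept_draft(draft: Sequence[int], verify_next: Callable[[List[int]], int]) -> List[int]:
    """Keep the longest draft prefix the target model agrees with.

    verify_next is a stand-in for the target model (an assumption of this
    sketch): given the tokens accepted so far, it returns the model's next token.
    """
    accepted: List[int] = []
    for tok in draft:
        if verify_next(accepted) != tok:
            break
        accepted.append(tok)
    return accepted


# Toy usage: the sequence ends with [1, 2, 3], which also appears at the start.
history = [1, 2, 3, 4, 5, 6, 1, 2, 3]
draft = prompt_lookup_draft(history, ngram=3, max_draft=3)
target_tokens = [4, 5, 9]  # what a hypothetical target model would emit
print(draft, accept_draft(draft, lambda acc: target_tokens[len(acc)]))
```

This is why the technique suits file edits: most of the output repeats text already in the prompt, so long drafts are accepted and fewer sequential decoding steps are needed.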
ALL of AI Engineering in One Place
claude-3-sonnet claude-3 openai google-deepmind anthropic mistral-ai cohere hugging-face adept midjourney character-ai microsoft amazon nvidia salesforce mastercard palo-alto-networks axa novartis discord twilio tinder khan-academy sourcegraph mongodb neo4j hasura modular cognition anysphere perplexity-ai groq mozilla nous-research galileo unsloth langchain llamaindex instructor weights-biases lambda-labs neptune datastax crusoe covalent qdrant baseten e2b octo-ai gradient-ai lancedb log10 deepgram outlines crew-ai factory-ai interpretability feature-steering safety multilinguality multimodality rag evals-ops open-models code-generation gpus agents ai-leadership
The upcoming AI Engineer World's Fair in San Francisco from June 25-27 will feature a significantly expanded format with booths, talks, and workshops from top model labs like OpenAI, DeepMind, Anthropic, Mistral, Cohere, HuggingFace, and Character.ai. It includes participation from Microsoft Azure, Amazon AWS, Google Vertex, and major companies such as Nvidia, Salesforce, Mastercard, Palo Alto Networks, and more. The event covers 9 tracks including RAG, multimodality, evals/ops, open models, code generation, GPUs, agents, AI in Fortune 500, and a new AI leadership track. Additionally, Anthropic shared interpretability research on Claude 3 Sonnet, revealing millions of interpretable features that can be steered to modify model behavior, including safety-relevant features related to bias and unsafe content, though more research is needed for practical applications. The event offers a discount code for AI News readers.
Anthropic's "LLM Genome Project": learning & clamping 34m features on Claude Sonnet
claude-3-sonnet claude-3 anthropic scale-ai suno-ai microsoft model-interpretability dictionary-learning neural-networks feature-activation intentional-modifiability scaling mechanistic-interpretability emmanuel-ameisen alex-albert
Anthropic released their third paper in the MechInterp series, Scaling Monosemanticity, scaling interpretability analysis to 34 million features on Claude 3 Sonnet. This work applies dictionary learning to isolate recurring neuron activation patterns, making internal states more interpretable by describing them as combinations of features rather than individual neurons. The paper reveals abstract features related to code, errors, sycophancy, crime, self-representation, and deception, and demonstrates intentional modifiability by clamping feature values. The research marks a significant advance in model interpretability and neural network analysis at frontier scale.
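The clamping idea can be sketched with a toy dictionary: an activation vector is encoded into non-negative feature activations, one feature is overwritten with a chosen value, and the result is decoded back into activation space. Everything below (the tiny hand-written weights, the function names, and the choice to add the reconstruction error back in) is an assumption made for illustration and is not Anthropic's code or exact procedure.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Feature activations: ReLU of an affine map, one row of w_enc per feature."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]


def decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruct the activation as a weighted sum of dictionary directions.
    w_dec[i] is the direction for feature i."""
    return [b_dec[j] + sum(f[i] * w_dec[i][j] for i in range(len(f)))
            for j in range(len(b_dec))]


def clamp_feature(x: Vector, w_enc: Matrix, b_enc: Vector, w_dec: Matrix,
                  b_dec: Vector, index: int, value: float) -> Vector:
    """Return x with feature `index` forced to `value`.
    The part of x the dictionary cannot reconstruct is carried over unchanged."""
    f = encode(x, w_enc, b_enc)
    error = [xi - ri for xi, ri in zip(x, decode(f, w_dec, b_dec))]
    f[index] = value
    return [s + e for s, e in zip(decode(f, w_dec, b_dec), error)]


# Toy dictionary: 2-dimensional activations, 3 features (hypothetical weights).
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_ENC = [0.0, 0.0, -1.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
B_DEC = [0.0, 0.0]

print(clamp_feature([1.0, 0.0], W_ENC, B_ENC, W_DEC, B_DEC, index=1, value=2.0))
```

In a real model the steered vector would replace the original activation during the forward pass, so downstream layers behave as if that feature were strongly active.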
LMSys advances Llama 3 eval analysis
llama-3-70b llama-3 claude-3-sonnet alphafold-3 lmsys openai google-deepmind isomorphic-labs benchmarking model-behavior prompt-complexity model-specification molecular-structure-prediction performance-analysis leaderboards demis-hassabis sam-altman miranda-murati karina-nguyen joanne-jang john-schulman
LMSys is enhancing LLM evaluation by categorizing performance across 8 query subcategories and 7 prompt complexity levels, revealing uneven strengths in models like Llama-3-70b. DeepMind released AlphaFold 3, advancing molecular structure prediction with holistic modeling of protein-DNA-RNA complexes, impacting biology and genetics research. OpenAI introduced the Model Spec, a public standard to clarify model behavior and tuning, inviting community feedback and aiming for models to learn directly from it. Llama 3 has reached top leaderboard positions on LMSys, nearly matching Claude-3-sonnet in performance, with notable variations on complex prompts. The analysis highlights the evolving landscape of model benchmarking and behavior shaping.
Claude 3 is officially America's Next Top Model
claude-3-opus claude-3-sonnet claude-3-haiku gpt-4o-mini mistral-7b qwen-72b anthropic mistral-ai huggingface openrouter stable-diffusion automatic1111 comfyui fine-tuning model-merging alignment ai-ethics benchmarking model-performance long-context cost-efficiency model-evaluation mark_riedl ethanjperez stuhlmueller ylecun aravsrinivas
Claude 3 Opus outperforms GPT4T and Mistral Large in blind Elo rankings, with Claude 3 Haiku marking a new cost-performance frontier. Fine-tuning techniques like QLoRA on Mistral 7B and evolutionary model merging on HuggingFace models are highlighted. Public opinion shows strong opposition to ASI development. Research supervision opportunities in AI alignment are announced. The Stable Diffusion 3 (SD3) release raises workflow concerns for tools like ComfyUI and automatic1111. Opus shows a 5% performance dip on OpenRouter compared to the Anthropic API. A new benchmark stresses LLM recall at long contexts, with Mistral 7B struggling and Qwen 72b performing well.
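For readers unfamiliar with how blind pairwise votes become a ranking, the following is a minimal sketch of the classic Elo formulas. The K-factor of 32 and the function names are assumptions for this example, and the leaderboard referenced above may compute its ratings differently.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Apply one vote: score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Two models start level; model A wins one blind comparison.
print(update(1200.0, 1200.0, score_a=1.0))
```

Because the update is proportional to the gap between the actual and expected result, beating a higher-rated model moves the ratings more than beating a lower-rated one.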
Shipping and Dipping: Inflection + Stability edition
inflection-ai-2.5 stable-diffusion-3 claude-3-haiku claude-3-sonnet claude-3-opus tacticai inflection-ai stability-ai microsoft nvidia google-deepmind anthropic executive-departures gpu-acceleration ai-assistants geometric-deep-learning ai-integration ai-cost-reduction ai-job-displacement ai-healthcare model-release mustafa-suleyman
Inflection AI and Stability AI recently shipped major updates (Inflection AI 2.5 and Stable Diffusion 3) but are now experiencing significant executive departures, signaling potential consolidation in the GPU-rich startup space. Mustafa Suleyman has joined Microsoft AI as CEO, overseeing consumer AI products like Copilot, Bing, and Edge. Microsoft Azure is collaborating with NVIDIA on the Grace Blackwell 200 Superchip. Google DeepMind announced TacticAI, an AI assistant for football tactics developed with Liverpool FC, using geometric deep learning and achieving 90% expert approval in blind tests. Anthropic released Claude 3 Haiku and Claude 3 Sonnet on Google Cloud's Vertex AI, with Claude 3 Opus coming soon. Concerns about AI job displacement arise as NVIDIA introduces AI nurses that outperform humans at bedside manner at 90% lower cost.
MM1: Apple's first Large Multimodal Model
mm1 gemini-1 command-r claude-3-opus claude-3-sonnet claude-3-haiku claude-3 apple cohere anthropic hugging-face langchain multimodality vqa fine-tuning retrieval-augmented-generation open-source robotics model-training react reranking financial-agents yann-lecun francois-chollet
Apple announced the MM1 multimodal LLM family with up to 30B parameters, claiming performance comparable to Gemini-1 and beating larger older models on VQA benchmarks. The paper targets researchers and hints at applications in embodied agents and business/education. Yann LeCun emphasized that human-level AI requires understanding the physical world, memory, reasoning, and hierarchical planning, while François Chollet cautioned that NLP is far from solved despite LLM advances. Cohere released Command-R, a model for Retrieval Augmented Generation, and Anthropic highlighted the Claude 3 family (Opus, Sonnet, Haiku) for various application needs. Open-source hardware DexCap enables dexterous robot manipulation data collection affordably. Tools like CopilotKit simplify AI integration into React apps, and migration to Keras 3 with JAX backend offers faster training. New projects improve reranking for retrieval and add financial agents to LangChain. The content includes insights on AI progress, new models, open-source tools, and frameworks.
Inflection-2.5 at 94% of GPT4, and Pi at 6m MAU
inflection-2.5 claude-3-sonnet claude-3-opus gpt-4 yi-9b mistral inflection anthropic perplexity-ai llamaindex mistral-ai langchain retrieval-augmented-generation benchmarking ocr structured-output video-retrieval knowledge-augmentation planning tool-use evaluation code-benchmarks math-benchmarks mustafa-suleyman amanda-askell jeremyphoward abacaj omarsar0
Mustafa Suleyman announced Inflection 2.5, which achieves more than 94% of the average performance of GPT-4 despite using only 40% of the training FLOPs. Pi's user base is growing about 10% weekly, with new features like realtime web search. The community noted similarities between Inflection 2.5 and Claude 3 Sonnet. Claude 3 Opus outperformed GPT-4 in a 1.5:1 vote and is now the default for Perplexity Pro users. Anthropic added experimental tool calling support for Claude 3 via LangChain. LlamaIndex released LlamaParse JSON Mode for structured PDF parsing and added video retrieval via VideoDB, enabling retrieval-augmented generation (RAG) pipelines. A paper proposed knowledge-augmented planning for LLM agents. New releases include TinyBenchmarks and the Yi-9B model, which shows strong code and math performance, surpassing Mistral.
Not much happened today
claude-3 claude-3-opus claude-3-sonnet gpt-4 gemma-2b anthropic perplexity langchain llamaindex cohere accenture mistral-ai snowflake together-ai hugging-face european-space-agency google gpt4all multimodality instruction-following out-of-distribution-reasoning robustness enterprise-ai cloud-infrastructure open-datasets model-deployment model-discoverability generative-ai image-generation
Anthropic released Claude 3, replacing Claude 2.1 as the default on Perplexity AI, with Claude 3 Opus surpassing GPT-4 in capability. Debate continues on whether Claude 3's performance stems from emergent properties or pattern matching. LangChain and LlamaIndex added support for Claude 3 enabling multimodal and tool-augmented applications. Despite progress, current models still face challenges in out-of-distribution reasoning and robustness. Cohere partnered with Accenture for enterprise AI search, while Mistral AI and Snowflake collaborate to provide LLMs on Snowflake's platform. Together AI Research integrates Deepspeed innovations to accelerate generative AI infrastructure. Hugging Face and the European Space Agency released a large earth observation dataset, and Google open sourced Gemma 2B, optimized for smartphones via the MLC-LLM project. GPT4All improved model discoverability for open models. The AI community balances excitement over new models with concerns about limitations and robustness, alongside growing enterprise adoption and open-source contributions. Memes and humor continue to provide social commentary.
Claude 3 just destroyed GPT 4 (see for yourself)
claude-3 claude-3-opus claude-3-sonnet claude-3-haiku gpt-4 anthropic amazon google claude-ai multimodality vision long-context model-alignment model-evaluation synthetic-data structured-output instruction-following model-speed cost-efficiency benchmarking safety mmitchell connor-leahy
Claude 3 from Anthropic launches in three sizes: Haiku (small, unreleased), Sonnet (medium, default on claude.ai, AWS, and GCP), and Opus (large, on Claude Pro). Opus outperforms GPT-4 on key benchmarks like GPQA, impressing benchmark authors. All models support multimodality with advanced vision capabilities, including converting a 2-hour video into a blog post. Claude 3 offers improved alignment, fewer refusals, and extended context length up to 1 million tokens with near-perfect recall. Haiku is noted for speed and cost-efficiency, processing dense research papers in under three seconds. The models excel at following complex instructions and producing structured outputs like JSON. Safety improvements reduce refusal rates, though some criticism remains from experts. Claude 3 is trained on synthetic data and shows strong domain-specific evaluation results in finance, medicine, and philosophy.