Person: "jeremyphoward"

claude-opus-4.8 gpt-5.5 qwen kimi deepseek anthropic huggingface langchain vllm_project reinforcement-learning tokenization agentic-ai api model-optimization long-context rust performance-optimization multi-agent-systems prompt-engineering jeremyphoward leo_linsky clementdelangue johnschulman2 omarsar0 hwchase17 ofirpress scaling01

Anthropic rolled out Claude Opus 4.8, which shows incremental improvements but mixed benchmark results, including better cooperation and coding behavior but some regressions in document parsing. Platform updates include mid-conversation system instructions enhancing long agent sessions, though API pricing remains a concern. A Hugging Face analysis revealed a critical bug in multi-turn reinforcement learning training loops involving tokenization mismatches, with a proposed "Token-In, Token-Out" fix. Agent harness design is evolving as a key optimization area, with LangChain's Deep Agents v0.6 achieving strong performance at much lower cost, and vllm_project releasing native weight syncing APIs and a Rust BPE tokenizer to improve tokenization efficiency. Debate continues on the value of multi-agent systems, with some seeing them as speedups and others expecting capability breakthroughs.
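
The tokenization mismatch mentioned above can be illustrated with a toy case. This is a minimal sketch, assuming a hypothetical three-piece vocabulary and a greedy longest-match tokenizer; it is not the tokenizer or the training loop from the Hugging Face analysis. It only shows why decoding sampled tokens to text and re-tokenizing them can produce different IDs than the ones the model actually emitted, and why keeping the sampled IDs ("token-in, token-out") avoids that.

```python
# Toy illustration only: VOCAB, encode, and decode are hypothetical helpers,
# not part of any real tokenizer library.
VOCAB = {"a": 0, "b": 1, "ab": 2}
ID_TO_PIECE = {i: p for p, i in VOCAB.items()}


def encode(text: str) -> list[int]:
    """Greedy longest-match encoding over the toy vocabulary."""
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest remaining substring first, then shrink.
        for length in range(len(text) - pos, 0, -1):
            piece = text[pos:pos + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos += length
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return ids


def decode(ids: list[int]) -> str:
    """Concatenate the pieces for a list of token IDs."""
    return "".join(ID_TO_PIECE[i] for i in ids)


# Suppose the policy sampled "a" and then "b" as two separate tokens.
sampled_ids = [0, 1]

# Text round trip: decode to a string, then re-tokenize for the training step.
retokenized_ids = encode(decode(sampled_ids))

# Token-in, token-out: carry the sampled IDs through unchanged.
kept_ids = list(sampled_ids)

print(sampled_ids, retokenized_ids, kept_ids)  # [0, 1] [2] [0, 1]
```

Both ID sequences decode to the same string, but only `kept_ids` matches what the model sampled, so only it lines up with the log-probabilities recorded at sampling time.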

May 04

not much happened today

codex deepseek-v4-pro gemini-3.5-flash gemini-3.1-pro gpt-5.5 claude-opus-4.7 openai claude deepseek gemini qwen model-performance cost-curves agent-products workflow-optimization product-differentiation benchmarking model-optimization gdb dzhng signulll teortaxestex ajambrosino reach_vb theo claudedevs _mohansolo artificialanlys scaling01 yuchenj_uw kimmonismus officiallogank designarena alezander907 giffmana jeremyphoward hamelhusain

AI News for 5/4/2026-5/5/2026 highlights a shift in AI product development emphasizing model + harness + workflow + UI + memory + economics over model quality alone, with notable updates from OpenAI Codex and Claude including new features like Appshots, auto mode, and Sonnet 4.6. DeepSeek made a significant market impact by permanently discounting DeepSeek-V4-Pro by 75%, drastically improving cost/performance ratios compared to Gemini 3.1 Pro, GPT-5.5, and Claude Opus 4.7. Meanwhile, Gemini 3.5 Flash showed benchmark improvements but received mixed feedback on practical utility. The competitive landscape continues to tighten with Qwen and other Chinese frontier models.

Apr 28

not much happened today

vllm-0.20.0 poolside-laguna-xs.2 ling-2.6-flash nemotron-3-nano-omni qwen-3.5 vllm poolside nvidia opensrouter lmstudio ollama unsloth fal fireworks deepinfra togethercompute baseten canonical memory-optimization mixture-of-experts model-optimization inference-speed quantization model-deployment multimodality hardware-optimization model-benchmarking open-models agentic-ai jeremyphoward maharshii teortaxestex aymericroucher piotrz

vLLM v0.20.0 introduces significant improvements in memory and MoE serving efficiency, including TurboQuant 2-bit KV cache for 4× KV capacity and a 2.1% latency improvement. The update adds support for DeepSeek V4 MegaMoE on Blackwell and extends hardware coverage to Jetson Thor, ROCm, Intel XPU, and Grace-Blackwell setups. Early benchmarks show DeepSeek V4 Pro on B300 hardware can be up to 8× faster than on H200. The ecosystem is rapidly adopting day-0 support for new open models such as Poolside Laguna XS.2, Ling-2.6-flash, and NVIDIA Nemotron 3 Nano Omni. Poolside released Laguna XS.2, a 33B total / 3B active MoE coding model under Apache 2.0, capable of running on a single GPU, with hybrid attention and FP8 KV cache, performing near Qwen-3.5. NVIDIA launched Nemotron 3 Nano Omni, a 30B / A3B multimodal MoE with 256K context, supporting text, image, video, audio, and documents, with immediate distribution across multiple platforms. Discussions highlighted tradeoffs in quantization methods and a shift away from CUDA lock-in towards heterogeneous accelerator support.
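
The 4× KV capacity figure is consistent with simple bit-width arithmetic. The sketch below is a back-of-the-envelope check, not vLLM code: it assumes the baseline is an 8-bit (FP8) cache, which the summary does not state, uses a hypothetical model shape, and ignores any per-block scales or metadata a real quantization scheme would store. Against a 16-bit baseline the same arithmetic would give 8×.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bits_per_value: int) -> float:
    """Bytes of KV cache for one token: a K and a V tensor in every layer."""
    values = 2 * num_layers * num_kv_heads * head_dim
    return values * bits_per_value / 8


def tokens_that_fit(budget_bytes: int, **shape) -> int:
    """How many tokens of KV cache fit in a fixed memory budget."""
    return int(budget_bytes // kv_bytes_per_token(**shape))


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
budget = 16 * 2**30  # 16 GiB set aside for the KV cache (assumed)

tokens_8bit = tokens_that_fit(budget, bits_per_value=8, **shape)
tokens_2bit = tokens_that_fit(budget, bits_per_value=2, **shape)

print(tokens_8bit, tokens_2bit, tokens_2bit / tokens_8bit)
# 262144 1048576 4.0
```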

Mar 24

not much happened today

arc-agi-3 claude-code anthropic langchain arcprize primeintellect agentic-reasoning interactive-environments benchmarking efficiency-metrics zero-preparation-generalization agent-infrastructure trainable-agents classifier-approval fchollet mikeknoop scaling01 _rockt mark_k andykonwinski bradenjhancock jeremyphoward togelius bracesproul hwchase17 caspar_br _catwu

The ARC-AGI-3 benchmark, introduced by @arcprize and François Chollet, resets the frontier for general agentic reasoning, with humans solving 100% of tasks versus under 1% for current models. It focuses on zero-preparation generalization and human-like learning efficiency. The scoring protocol sparked debate over its efficiency-based metric, which is harsher than those of prior ARC versions and other benchmarks like NetHack. The community acknowledges that the benchmark highlights weaknesses of current LLM agents in interactive, sparse-feedback environments. Concurrently, agent infrastructure advances with LangChain launching Fleet shareable skills for reusable domain knowledge, and Anthropic revealing Claude Code auto mode, which uses classifier-mediated approval to balance autonomy and manual confirmation. Browser and coding agents are evolving into trainable systems beyond prompt wrappers, exemplified by the BrowserBase and Prime Intellect collaboration.

Mar 23

not much happened today

claude gpt-5.2-pro dgm-h rllm anthropic meta-ai-fair agent-frameworks workflow-automation multi-agent-systems reinforcement-learning reward-models self-improving-agents benchmark-generation operational-efficiency closed-loop-feedback jenny_zhang jase_weston mikhail_parakhin jeremyphoward

Anthropic introduced Claude Cowork and Claude Code enabling desktop control of mouse, keyboard, and screen in a macOS research preview, expanding agent capabilities beyond APIs and browsers. The agent ecosystem is evolving towards long-running, parallel, tool-rich workflows with projects like Hermes Agent, T3 Code, Command Center, and Parchi enhancing multi-agent orchestration and autonomous task management. Operational challenges such as fragility and inefficiency in subagents, including GPT-5.2 Pro and Claude browser/computer use, highlight the need for closed-loop feedback systems. Research from Meta AI advances self-improving agents with Hyperagents / DGM-H enabling meta-level procedural improvements, and unifies reinforcement learning post-training with RLLM (RL + LM-as-RM) to improve reward modeling across task types. Additionally, WebArena-Infinity drastically reduces browser environment construction costs, accelerating benchmark and environment generation.

Dec 05, 2025

not much happened today

vllm-0.12.0 gemma3n qwen3-omni qwen3-vl gpt-5.1-codex-max gemini-3-pro runway-gen-4.5 kling-video-2.6 vllm nvidia huggingface langchain-ai together-ai meta-ai-fair sonarsource openrouter runway gemini arena gpu-programming quantization multimodality agent-platforms reinforcement-learning static-analysis reasoning inference-infrastructure model-optimization economics audio video-generation jeremyphoward mervenoyann sydneyrunkle swyx maximelabonne

vLLM 0.12.0 introduces DeepSeek support, GPU Model Runner V2, and quantization improvements with PyTorch 2.9.0 and CUDA 12.9. NVIDIA launches CUDA Tile IR and cuTile Python for advanced GPU tensor operations targeting Blackwell GPUs. Hugging Face releases Transformers v5 RC with an any-to-any multimodal pipeline supporting models like Gemma3n and Qwen3-Omni. Agent platforms see updates from LangChain with content moderation and cost tracking, Together AI and Meta AI collaborate on RL for long-horizon workflows, and SonarSource integrates static analysis into AI codegen. Economic insights from OpenRouter highlight coding as a key AI application, with reasoning models surpassing 50% usage and market bifurcation between premium and open models. Additionally, Kling Video 2.6 debuts native audio capabilities, and Runway Gen-4.5, Qwen3-TTS, and Gemini 3 Pro advance multimodality.

Nov 26, 2025

not much happened today

claude-opus-4.5 qwen-3-4b qwen-3-8b qwen-3-14b deepseek-r1 anthropic booking.com perplexity-ai langchain claude scaling01 deepseek qwen prefect agent-systems multi-agent-systems reasoning benchmarking cost-efficiency model-optimization long-context memory-management reinforcement-learning model-performance multi-agent-communication latent-representation inference-cost software-integration jeremyphoward alexalbert__ omarsar0 lingyang_pu dair_ai

Anthropic introduces durable agents and MCP tasks for long-running workflows, with practical engineering patterns and integrations like Prefect. Booking.com deploys a large-scale agent system improving customer satisfaction using LangGraph, Kubernetes, GPT-4 Mini, and Weaviate. Perplexity rolls out user-level memory and virtual try-on features. Claude Opus 4.5 leads on LisanBench and Code Arena WebDev benchmarks with mixed community feedback on its "thinking" and "non-thinking" modes, while improving cost-efficiency and UX with batch APIs and context compaction. Research on multi-agent systems shows LatentMAS reduces communication tokens by 70-84% and improves accuracy using Qwen3 models, and reasoning trace distillation achieves significant token reduction with maintained accuracy, highlighting the importance of reasoning trace style.

Oct 17, 2025

The Karpathy-Dwarkesh Interview delays AGI timelines

claude-haiku-4.5 gpt-5 arch-router-1.5b anthropic openai huggingface langchain llamaindex google epoch-ai reasoning long-context sampling benchmarking data-quality agent-frameworks modular-workflows ide-extensions model-routing graph-first-agents real-world-grounding karpathy aakaran31 du_yilun giffmana omarsar0 jeremyphoward claude_code mikeyk alexalbert__ clementdelangue jerryjliu0

The recent AI news highlights the Karpathy interview as a major event, alongside significant discussions on reasoning improvements without reinforcement learning, with test-time sampling achieving GRPO-level performance. Critiques of context window marketing reveal effective limits near 64K tokens, with Claude Haiku 4.5 showing competitive reasoning speed. GPT-5 struggles with advanced math benchmarks, and data quality issues termed "Brain Rot" affect model reasoning and safety. In agent frameworks, Anthropic Skills enable modular coding workflows, OpenAI Codex IDE extensions enhance developer productivity, and HuggingChat Omni introduces meta-routing across 100+ open models using Arch-Router-1.5B. LangChain and LlamaIndex advance graph-first agent infrastructure, while Google Gemini integrates with Google Maps for real-world grounding.

Aug 13, 2025

not much happened today

gpt-5 gpt-oss-120b opus-4.1 sonnet-4 openai anthropic minimax context-windows model-routing model-hosting multi-tool-pipelines prompt-caching model-extraction model-pairing cost-efficiency model-optimization sama jeremyphoward jxmnop _catwu

OpenAI continues small updates to GPT-5, introducing "Auto/Fast/Thinking" modes with 196k token context, 3,000 messages/week, and dynamic routing to cheaper models for cost efficiency. The MiniMax AI Agent Challenge offers $150,000 in prizes for AI agent development by August 25. The community discusses GPT-OSS-120B base model extraction, hosting, and tooling improvements, including multi-tool pipelines and flex-attention. Anthropic announces model pairing in Claude Code with Opus 4.1 for planning and Sonnet 4 for execution, expanding context to 1M tokens and introducing prompt caching. Key figures include @sama, @jeremyphoward, @jxmnop, and @_catwu.

Aug 08, 2025

not much happened today

gpt-5 gpt-4o grok-4 claude-4-sonnet openai microsoft reasoning latency model-routing benchmarking reinforcement-learning hallucination-control creative-writing priority-processing api-traffic model-deprecation user-experience model-selection voice-mode documentation sama nickaturley elaineyale6 scaling01 mustafasuleyman kevinweil omarsar0 jeremyphoward juberti epochairesearch lechmazur gdb

OpenAI launched GPT-5 with a unified user experience removing manual model selection, causing initial routing and access issues for Plus users that are being addressed with fixes including restored model options and increased usage limits. GPT-5 introduces "Priority Processing" for lower latency at higher price tiers, achieving ~750ms median time-to-first-token in some cases. Microsoft reports full Copilot adoption of GPT-5, and API traffic doubled within 24 hours, peaking at 2 billion tokens per minute. Early benchmarks show GPT-5 leading in reasoning tasks like FrontierMath and LiveBench, with improvements in hallucination control and creative writing, though some models like Grok-4 and Claude-4 Sonnet Thinking outperform it in specific RL-heavy reasoning benchmarks. OpenAI also released extensive migration and feature guides but faced some rollout issues including a broken code sample and a problematic Voice Mode launch. "Unified GPT-5" ends model pickers, pushing developers away from manual model selection.

Jul 24, 2025

3x in 3 months: Cursor @ $28b, Cognition + Windsurf @ $10b

qwen3-coder chatgpt-agent claude-code mini cursor cognition windsurf alibaba openai anthropic perplexity agentic-ai fundraising software-engineering ai-coding agentic-economy model-integration community-feedback performance-benchmarking bindureddy xikun_zhang_ aravsrinivas gergelyorosz jeremyphoward

Cursor is reportedly fundraising at a $28 billion valuation with $1 billion ARR, while the combined Cognition+Windsurf entity is fundraising at a $10 billion valuation after acquiring Windsurf remainco for $300 million. The competition between AI coding agents intensifies as Cursor focuses on Async SWE Agents and Cognition+Windsurf acquires an agentic IDE. Alibaba's Qwen3-Coder gains widespread adoption for coding tasks and integration into tools like Claude Code and LM Studio. OpenAI rolls out ChatGPT Agent to all Plus, Pro, and Team users, sparking discussions about an "agentic economy" emphasizing AI literacy. Anthropic's Claude Code is praised as a premier development tool with active community feedback. Perplexity's Comet browser assistant receives positive reviews and new feature showcases. The debate continues on whether AI coding tools will replace developers, with critiques highlighting the ongoing human effort required. A new minimalistic software engineering agent, mini, achieves 65% on SWE-bench with just 100 lines of code.

Jul 15, 2025

Voxtral - Mistral's SOTA ASR model in 3B (mini) and 24B ("small") sizes beats OpenAI Whisper large-v3

voxtal-3b voxtal-24b kimi-k2 mistral-ai moonshot-ai groq together-ai deepinfra huggingface langchain transcription long-context function-calling multilingual-models mixture-of-experts inference-speed developer-tools model-integration jeremyphoward teortaxestex scaling01 zacharynado jonathanross321 reach_vb philschmid

Mistral surprises with the release of Voxtral, a transcription model outperforming Whisper large-v3, GPT-4o mini Transcribe, and Gemini 2.5 Flash. Voxtral models (3B and 24B) support a 32k-token context length, handle audio up to 30-40 minutes long, offer built-in Q&A and summarization, are multilingual, and enable function-calling from voice commands, powered by the Mistral Small 3.1 language model backbone. Meanwhile, Moonshot AI's Kimi K2, a non-reasoning Mixture of Experts (MoE) model built by a team of around 200 people, gains attention for blazing-fast inference on Groq hardware, broad platform availability including Together AI and DeepInfra, and the ability to run locally on a 128GB M4 Max Mac. Developer tool integrations include LangChain and Hugging Face support, highlighting Kimi K2's strong tool use capabilities.

Jul 14, 2025

not much happened today

kimi-k2 grok-4 gpt-5 gemini-2.5 gemini-embedding cognition windsurf moonshot-ai x-ai openai google stanfordnlp huggingface mixture-of-experts model-training model-performance fine-tuning benchmarking agentic-ai model-bugs embedding-models sama hardmaru jeremyphoward akhaliq teortaxestex yuchenj_uw demishassabis

Cognition is acquiring the remaining assets of Windsurf after a significant weekend deal. Moonshot AI released Kimi K2, an open-source, MIT-licensed agentic model with 1 trillion total / 32B active parameters using a Mixture-of-Experts architecture, trained on 15.5 trillion tokens with the MuonClip optimizer, showing top performance on benchmarks like EQ-Bench and Creative Writing. xAI launched Grok-4, ranking 5th on IQ Bench but with notable quirks including a bug causing it to respond only with "Heavy" and a high frequency of Elon Musk mentions. Rumors about OpenAI delaying an open-source model release surfaced, with speculation about CEO sama's PR strategy and a possible GPT-5 launch in September. The Gemini 2.5 paper was released with 3,295 authors, and Google introduced its Gemini Embedding model, topping the MTEB leaderboard.

Jul 01, 2025

not much happened today

chai-2 gemini-2.5-pro deepseek-r1-0528 meta scale-ai anthropic cloudflare grammarly superhuman chai-discovery atlassian notion slack commoncrawl hugging-face sakana-ai inference model-scaling collective-intelligence zero-shot-learning enterprise-deployment data-access science-funding open-source-llms alexandr_wang nat_friedman clementdelangue teortaxestex ylecun steph_palazzolo andersonbcdefg jeremyphoward reach_vb

Meta makes a major AI move by hiring Scale AI founder Alexandr Wang as Chief AI Officer and acquiring a 49% non-voting stake in Scale AI for $14.3 billion, doubling its valuation to about $28 billion. Chai Discovery announces Chai-2, a breakthrough model for zero-shot antibody discovery and optimization. The US government faces budget cuts threatening to eliminate a quarter million science research jobs by 2026. Data access restrictions intensify as companies like Atlassian, Notion, and Slack block web crawlers including Common Crawl, raising concerns about future public internet archives. Hugging Face shuts down HuggingChat after serving over a million users, marking a significant experiment in open-source LLMs. Sakana AI releases AB-MCTS, an inference-time scaling algorithm enabling multiple models like Gemini 2.5 Pro and DeepSeek-R1-0528 to cooperate and outperform individual models.

May 23, 2025

not much happened today

claude-4 claude-4-opus claude-4-sonnet gemini-2.5-pro gemma-3n imagen-4-ultra anthropic google-deepmind openai codebase-understanding coding agentic-performance multimodality text-to-speech video-generation model-integration benchmarking memory-optimization cline amanrsanger ryanpgreenblatt johnschulman2 alexalbert__ nearcyan mickeyxfriedman jeremyphoward gneubig teortaxesTex scaling01 artificialanlys philschmid

Anthropic's Claude 4 models (Opus 4, Sonnet 4) demonstrate strong coding abilities, with Sonnet 4 achieving 72.7% on SWE-bench and Opus 4 at 72.5%. Claude Sonnet 4 excels in codebase understanding and is considered SOTA on large codebases. Criticism arose over Anthropic's handling of ASL-3 security requirements. Demand for Claude 4 is high, with integration into IDEs and support from Cherry Studio and FastHTML. Google DeepMind introduced Gemini 2.5 Pro Deep Think and Gemma 3n, a mobile multimodal model reducing RAM usage by nearly 3x. Google's Imagen 4 Ultra ranks third in the Artificial Analysis Image Arena, available on Vertex AI Studio. Google also promoted Google Beam, an AI video model for immersive 3D experiences, and new text-to-speech models with multi-speaker support. The GAIA benchmark shows Claude 4 Opus and Sonnet leading in agentic performance.

Apr 10, 2025

Google's Agent2Agent Protocol (A2A)

kimi-vl-a3b gpt-4o llama-4-scout llama-4-maverick llama-4-behemoth deepcoder-14b o3-mini o1 llama-3.1-nemotron-ultra-253b deepseek-r1 google google-deepmind moonshot-ai meta-ai-fair uc-berkeley openai nvidia hugging-face togethercompute deepseek agent-interoperability multimodality vision math reinforcement-learning coding model-training open-source model-benchmarking context-windows streaming push-notifications enterprise-authentication model-release reach_vb _akhaliq epochairesearch artificialanlys winglian danielhanchen yuchenj_uw jeremyphoward

Google Cloud Next announcements featured the launch of Google and DeepMind's full MCP support and a new Agent2Agent (A2A) protocol designed for agent interoperability with multiple partners. The protocol includes components like the Agent Card, Task communication channels, Enterprise Auth and Observability, and Streaming and Push Notification support. On the model front, Moonshot AI released Kimi-VL-A3B, a multimodal model with 128K context and strong vision and math benchmark performance, outperforming GPT-4o. Meta AI introduced smaller versions of the Llama 4 family, Llama 4 Scout and Llama 4 Maverick, with a larger Behemoth model still in training. DeepCoder 14B from UC Berkeley is an open-source coding model rivaling OpenAI's o3-mini and o1 models, trained with reinforcement learning on 24K coding problems. Nvidia released Llama-3.1-Nemotron-Ultra-253B on Hugging Face, noted for beating Llama 4 Behemoth and Maverick and competing with DeepSeek-R1.

Mar 18, 2025

not much happened today

gemini-2.0-flash imagen-3 mistral-small-3.1 mistral-3 gpt-4o-mini claude-3.5-haiku olm0-32b qwen-2.5 shieldgemma-2 julian fasttransform nvidia google mistral-ai allen-ai anthropic langchainai perplexity-ai kalshi stripe qodoai multimodality image-generation context-windows model-pricing open-source-models image-classification frameworks python-libraries partnerships jeremyphoward karpathy abacaj mervenoyann

At Nvidia GTC Day 1, several AI updates were highlighted. Google's Gemini 2.0 Flash introduces image input/output but is not recommended for text-to-image tasks, where Imagen 3 is preferred. Mistral AI released Mistral Small 3.1 with a 128k-token context window and competitive pricing. Allen AI launched OLMo-32B, an open LLM outperforming GPT-4o mini and Qwen 2.5. ShieldGemma 2 was introduced for image safety classification. LangChainAI announced multiple updates including Julian powered by LangGraph and integration with AnthropicAI's MCP. Jeremy Howard released fasttransform, a Python library for data transformations. Perplexity AI partnered with Kalshi for NCAA March Madness predictions.

Mar 01, 2025

not much happened today

gpt-4.5 gpt-4 gpt-4o o1 claude-3.5-sonnet claude-3.7 claude-3-opus deepseek-v3 grok-3 openai anthropic perplexity-ai deepseek scaling01 model-performance humor emotional-intelligence model-comparison pricing context-windows model-size user-experience andrej-karpathy jeremyphoward abacaj stevenheidel yuchenj_uw aravsrinivas dylan522p random_walker

GPT-4.5 sparked mixed reactions on Twitter, with @karpathy noting that users preferred GPT-4 in a poll despite his personal preference for GPT-4.5's creativity and humor. Critics like @abacaj highlighted GPT-4.5's slowness and questioned its practical value and pricing compared to other models. Performance-wise, GPT-4.5 ranks above GPT-4o but below o1 and Claude 3.5 Sonnet. Claude 3.7 outperforms it on many tasks, yet GPT-4.5 is praised for its humor and "vibes." Speculation about GPT-4.5's size suggests around 5 trillion parameters. Discussions also touched on pricing disparities, with Perplexity Deep Research at $20/month versus ChatGPT at $200/month. The emotional intelligence and humor of models like Claude 3.7 were also noted.

Feb 13, 2025

small news items

gpt-4.5 gpt-5 deepseek-r1-distilled-qwen-1.5b o1-preview modernbert-0.3b qwen-0.5b o3 openai ollama mistral perplexity cerebras alibaba groq bytedance math benchmarking fine-tuning model-performance reinforcement-learning model-architecture partnerships funding jeremyphoward arankomatsuzaki sama nrehiew_ danhendrycks akhaliq

OpenAI announced plans for GPT-4.5 (Orion) and GPT-5, with GPT-5 integrating the o3 model and offering unlimited chat access in the free tier. DeepSeek R1 Distilled Qwen 1.5B outperforms OpenAI's o1-preview on math benchmarks, while ModernBERT 0.3b surpasses Qwen 0.5b on MMLU without fine-tuning. Mistral and Perplexity adopt Cerebras hardware for 10x performance gains. OpenAI's o3 model won a gold medal at the 2024 International Olympiad in Informatics. Partnerships include Qwen with Groq. Significant RLHF activity is noted in Nigeria and the global south, and Bytedance is expected to rise in AI prominence soon. "GPT5 is all you need."

Feb 11, 2025

not much happened today

gemini-2.0-flash-thinking-experimental-1-21 zonos openr1-math-220k huginn-3.5b deepseek-r1 o1 claude google zyphraai hugging-face anthropic deepseek openai vision multilingual-models text-to-speech voice-cloning math reasoning latent-reasoning chain-of-thought dataset-release fine-tuning model-training model-performance context-windows benchmarking jeremyphoward andrej-karpathy tom-goldstein reach_vb iscienceluvr

Google released Gemini 2.0 Flash Thinking Experimental 1-21, a vision-language reasoning model with a 1 million-token context window and improved accuracy on science, math, and multimedia benchmarks, surpassing DeepSeek-R1 but trailing OpenAI's o1. ZyphraAI launched Zonos, a multilingual Text-to-Speech model with instant voice cloning and controls for speaking rate, pitch, and emotions, running at ~2x real-time speed on RTX 4090. Hugging Face released OpenR1-Math-220k, a large-scale math reasoning dataset with 220K problems and 800K reasoning traces generated on 512 H100 GPUs. Tom Goldstein introduced Huginn-3.5B, an open-source latent reasoning model trained on 800B tokens that outperforms larger models on reasoning tasks like GSM8K. Discussions by Jeremy Howard and iScienceLuvr highlight advances in implicit latent reasoning and debate the future of human-readable reasoning traces. Anthropic launched the Anthropic Economic Index to analyze AI's economic impact using millions of Claude conversations.

Dec 20, 2024

ModernBert: small new Retriever/Classifier workhorse, 8k context, 2T tokens,

modernbert gemini-2.0-flash-thinking o1 llama answerdotai lightonio hugging-face google-deepmind openai meta-ai-fair figure encoder-only-models long-context alternating-attention natural-language-understanding reasoning robotics-simulation physics-engine humanoid-robots model-performance model-releases jeremyphoward alec-radford philschmid drjimfan bindureddy

Answer.ai/LightOn released ModernBERT, an updated encoder-only model with 8k token context, trained on 2 trillion tokens including code, with 139M/395M parameters and state-of-the-art performance on retrieval, NLU, and code tasks. It features Alternating Attention layers mixing global and local attention. Gemini 2.0 Flash Thinking debuted as #1 in Chatbot Arena, and the o1 model scored top in reasoning benchmarks. Llama downloads surpassed 650 million, doubling in 3 months. OpenAI launched desktop app integrations with voice capabilities. Figure delivered its first humanoid robots commercially. Advances in robotics simulation were highlighted, including Genesis, a new physics engine claiming to run 430,000x faster than real time.

Aug 29, 2024

Cerebras Inference: Faster, Better, AND Cheaper

llama-3.1-8b llama-3.1-70b gemini-1.5-flash gemini-1.5-pro cogvideox-5b mamba-2 rene-1.3b llama-3.1 gemini-1.5 claude groq cerebras cursor google-deepmind anthropic inference-speed wafer-scale-chips prompt-caching model-merging benchmarking open-source-models code-editing model-optimization jeremyphoward sam-altman nat-friedman daniel-gross swyx

Groq led early 2024 with superfast LLM inference speeds, achieving ~450 tokens/sec for Mixtral 8x7B and 240 tokens/sec for Llama 2 70B. Cursor introduced a specialized code edit model hitting 1000 tokens/sec. Now, Cerebras claims the fastest inference with their wafer-scale chips, running Llama3.1-8b at 1800 tokens/sec and Llama3.1-70B at 450 tokens/sec at full precision, with competitive pricing and a generous free tier. Google's Gemini 1.5 models showed significant benchmark improvements, especially Gemini-1.5-Flash and Gemini-1.5-Pro. New open-source models like CogVideoX-5B and Mamba-2 (Rene 1.3B) were released, optimized for consumer hardware. Anthropic's Claude now supports prompt caching, improving speed and cost efficiency. "Cerebras Inference runs Llama3.1 20x faster than GPU solutions at 1/5 the price."

Aug 10, 2024

not much happened today

qwen2-math-72b gpt-4o claude-3.5-sonnet gemini-1.5-pro llama-3.1-405b idefics3-llama-8b anthropic google mistral-ai llamaindex math fine-tuning synthetic-data reinforcement-learning bug-bounty visual-question-answering open-source retrieval-augmented-generation agentic-ai ai-safety policy rohanpaul_ai anthropicai mervenoyann jeremyphoward omarsar0 ylecun bindureddy

Qwen2-Math-72B outperforms GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, and Llama-3.1-405B on math benchmarks using synthetic data and advanced optimization techniques. Google AI cuts pricing for Gemini 1.5 Flash by up to 78%. Anthropic expands its bug bounty program targeting universal jailbreaks in next-gen safety systems. A tutorial on QLoRA fine-tuning of IDEFICS3-Llama 8B for visual question answering was released. A Chinese open weights model surpasses previous MATH benchmark records. Surveys on Mamba models and LLM-based agents for software engineering highlight advancements and applications. Open-source tools like the R2R RAG engine and LlamaIndex Workflows simplify building complex AI applications. Mistral AI introduces customizable AI agents. Concerns were raised about California bill SB 1047's focus on existential risk, alongside debates on banning open-source AI. Memes and humor continue in AI communities.

Jul 31, 2024

not much happened today

sam-2 gemini-1.5-pro chatgpt midjourney-v6.1 meta-ai-fair google-deepmind scale-ai apple canva hugging-face object-segmentation quantization web-development-framework adversarial-robustness on-device-ai open-source robotics voice vision jeremyphoward demis-hassabis ylecun maartengrootendorst jimfan

Meta released SAM 2, a unified model for real-time object segmentation with a new dataset 4.5x larger and 53x more annotated than previous ones. FastHTML, a new Python web framework by Jeremy Howard, enables easy creation and deployment of interactive web apps. Scale AI launched the SEAL Leaderboard on adversarial robustness, topped by Gemini 1.5 Pro from Google DeepMind. Apple published a technical report on their Intelligence Foundation Language Models for on-device and server use. Yann LeCun emphasized the importance of open source AI in an article co-authored with Martin Casado and Ion Stoica. Maarten Grootendorst's "Visual Guide to Quantization" on efficient LLM inference went viral. ChatGPT started rolling out advanced voice and vision-enabled modes to select users. Leonardo AI was acquired by Canva. Jim Fan shared insights on Project Groot augmenting human demonstration data for robotics. Midjourney v6.1 was released.

Jul 06, 2024

Qdrant's BM42: "Please don't trust us"

claude-3.5-sonnet gemma-2 nano-llava-1.5 qdrant cohere stripe anthropic hugging-face stablequan_ai semantic-search benchmarking dataset-quality model-evaluation model-optimization vision fine-tuning context-windows nils-reimers jeremyphoward hamelhusain rohanpaul_ai

Qdrant attempted to replace BM25 and SPLADE with a new method called "BM42" combining transformer attention and collection-wide statistics for semantic and keyword search, but their evaluation using the Quora dataset was flawed. Nils Reimers from Cohere reran BM42 on better datasets and found it underperformed. Qdrant acknowledged the errors but still ran a suboptimal BM25 implementation. This highlights the importance of dataset choice and evaluation sanity checks in search model claims. Additionally, Stripe faced criticism for AI/ML model failures causing account and payment issues, prompting calls for alternatives. Anthropic revealed that Claude 3.5 Sonnet suppresses some answer parts with backend tags, sparking debate. Gemma 2 model optimizations allow 2x faster fine-tuning with 63% less memory and longer context windows, running up to 34B parameters on consumer GPUs. nanoLLaVA-1.5 was announced as a compact 1B parameter vision model with significant improvements.

Mar 08, 2024

Inflection-2.5 at 94% of GPT4, and Pi at 6m MAU

inflection-2.5 claude-3-sonnet claude-3-opus gpt-4 yi-9b mistral inflection anthropic perplexity-ai llamaindex mistral-ai langchain retrieval-augmented-generation benchmarking ocr structured-output video-retrieval knowledge-augmentation planning tool-use evaluation code-benchmarks math-benchmarks mustafa-suleyman amanda-askell jeremyphoward abacaj omarsar0

Mustafa Suleyman announced Inflection 2.5, which achieves more than 94% of the average performance of GPT-4 despite using only 40% of the training FLOPs. Pi's user base is growing about 10% weekly, with new features like realtime web search. The community noted similarities between Inflection 2.5 and Claude 3 Sonnet. Claude 3 Opus outperformed GPT-4 in a 1.5:1 vote and is now the default for Perplexity Pro users. Anthropic added experimental tool calling support for Claude 3 via LangChain. LlamaIndex released LlamaParse JSON Mode for structured PDF parsing and added video retrieval via VideoDB, enabling retrieval-augmented generation (RAG) pipelines. A paper proposed knowledge-augmented planning for LLM agents. New benchmarks like TinyBenchmarks and the Yi-9B model release show strong code and math performance, surpassing Mistral.

Mar 05, 2024

Stable Diffusion 3 — Rombach & Esser did it again!

stable-diffusion-3 claude-3 orca dolphincoder-starcoder2-15b stability-ai anthropic microsoft latitude perplexity-ai llamaindex tripo-ai diffusion-models multimodality benchmarking human-evaluation text-generation image-generation 3d-modeling fine-tuning roleplay coding dataset-release soumith-chintala bill-peebles swyx kevinafischer jeremyphoward akhaliq karinanguyen_ aravsrinivas

Over 2500 new community members joined following Soumith Chintala's shoutout, highlighting growing interest in SOTA LLM-based summarization. The major highlight is the detailed paper release of Stable Diffusion 3 (SD3), showcasing advanced text-in-image control and complex prompt handling, with the model outperforming other SOTA image generation models in human-evaluated benchmarks. The SD3 model is based on an enhanced Diffusion Transformer architecture called MMDiT. Meanwhile, Anthropic released Claude 3 models, noted for human-like responses and emotional depth, scoring 79.88% on HumanEval but costing over twice as much as GPT-4. Microsoft launched new Orca-based models and datasets, and Latitude released DolphinCoder-StarCoder2-15b with strong coding capabilities. Integration of image models by Perplexity AI and 3D CAD generation by PolySpectra powered by LlamaIndex were also highlighted. "SD3's win rate beats all other SOTA image gen models (except perhaps Ideogram)" and "Claude 3 models are very good at generating d3 visualizations from text descriptions."