All tags
Person: "jeremyphoward"
3x in 3 months: Cursor @ $28b, Cognition + Windsurf @ $10b
qwen3-coder chatgpt-agent claude-code mini cursor cognition windsurf alibaba openai anthropic perplexity agentic-ai fundraising software-engineering ai-coding agentic-economy model-integration community-feedback performance-benchmarking bindureddy xikun_zhang_ aravsrinivas gergelyorosz jeremyphoward
Cursor is reportedly fundraising at a $28 billion valuation with $1 billion ARR, while the combined Cognition+Windsurf entity is fundraising at a $10 billion valuation after acquiring Windsurf remainco for $300 million. The competition between AI coding agents intensifies as Cursor focuses on Async SWE Agents and Cognition+Windsurf acquires an agentic IDE. Alibaba's Qwen3-Coder gains widespread adoption for coding tasks and integration into tools like Claude Code and LM Studio. OpenAI rolls out ChatGPT Agent to all Plus, Pro, and Team users, sparking discussions about an "agentic economy" emphasizing AI literacy. Anthropic's Claude Code is praised as a premier development tool with active community feedback. Perplexity's Comet browser assistant receives positive reviews and new feature showcases. The debate continues on whether AI coding tools will replace developers, with critiques highlighting the ongoing human effort required. A new minimalistic software engineering agent, mini, achieves 65% on SWE-bench with just 100 lines of code.
Voxtral - Mistral's SOTA ASR model in 3B (mini) and 24B ("small") sizes beats OpenAI Whisper large-v3
voxtal-3b voxtal-24b kimi-k2 mistral-ai moonshot-ai groq together-ai deepinfra huggingface langchain transcription long-context function-calling multilingual-models mixture-of-experts inference-speed developer-tools model-integration jeremyphoward teortaxestex scaling01 zacharynado jonathanross321 reach_vb philschmid
Mistral surprises with the release of Voxtral, a transcription model outperforming Whisper large-v3, GPT-4o mini Transcribe, and Gemini 2.5 Flash. Voxtral models (3B and 24B) support 32k token context length, handle audios up to 30-40 minutes, offer built-in Q&A and summarization, are multilingual, and enable function-calling from voice commands, powered by the Mistral Small 3.1 language model backbone. Meanwhile, Moonshot AI's Kimi K2, a non-reasoning Mixture of Experts (MoE) model built by a team of around 200 people, gains attention for blazing-fast inference on Groq hardware, broad platform availability including Together AI and DeepInfra, and local running on M4 Max 128GB Mac. Developer tool integrations include LangChain and Hugging Face support, highlighting Kimi K2's strong tool use capabilities.
not much happened today
kimi-k2 grok-4 gpt-5 gemini-2.5 gemini-embedding cognition windsurf moonshot-ai x-ai openai google stanfordnlp huggingface mixture-of-experts model-training model-performance fine-tuning benchmarking agentic-ai model-bugs embedding-models sama hardmaru jeremyphoward akhaliq teortaxestex yuchenj_uw demishassabis
Cognition is acquiring the remaining assets of Windsurf after a significant weekend deal. Moonshot AI released Kimi K2, an open-source, MIT-licensed agentic model with 1 Trillion total / 32B active parameters using a Mixture-of-Experts architecture, trained on 15.5 Trillion tokens with the MuonClip optimizer, showing top performance on benchmarks like EQ-Bench and Creative Writing. xAI launched Grok-4, ranking 5th on IQ Bench but with notable quirks including a bug causing it to respond only with "Heavy" and a high frequency of Elon Musk mentions. Rumors about OpenAI delaying an open-source model release surfaced, with speculation about CEO sama's PR strategy and a possible GPT-5 launch in September. The Gemini 2.5 paper was released with 3,295 authors, and Google introduced its Gemini Embedding model, topping the MTEB leaderboard.
not much happened today
chai-2 gemini-2.5-pro deepseek-r1-0528 meta scale-ai anthropic cloudflare grammarly superhuman chai-discovery atlassian notion slack commoncrawl hugging-face sakana-ai inference model-scaling collective-intelligence zero-shot-learning enterprise-deployment data-access science-funding open-source-llms alexandr_wang nat_friedman clementdelangue teortaxestex ylecun steph_palazzolo andersonbcdefg jeremyphoward reach_vb
Meta makes a major AI move by hiring Scale AI founder Alexandr Wang as Chief AI Officer and acquiring a 49% non-voting stake in Scale AI for $14.3 billion, doubling its valuation to about $28 billion. Chai Discovery announces Chai-2, a breakthrough model for zero-shot antibody discovery and optimization. The US government faces budget cuts threatening to eliminate a quarter million science research jobs by 2026. Data access restrictions intensify as companies like Atlassian, Notion, and Slack block web crawlers including Common Crawl, raising concerns about future public internet archives. Hugging Face shuts down HuggingChat after serving over a million users, marking a significant experiment in open-source LLMs. Sakana AI releases AB-MCTS, an inference-time scaling algorithm enabling multiple models like Gemini 2.5 Pro and DeepSeek-R1-0528 to cooperate and outperform individual models.
not much happened today
claude-4 claude-4-opus claude-4-sonnet gemini-2.5-pro gemma-3n imagen-4-ultra anthropic google-deepmind openai codebase-understanding coding agentic-performance multimodality text-to-speech video-generation model-integration benchmarking memory-optimization cline amanrsanger ryanpgreenblatt johnschulman2 alexalbert__ nearcyan mickeyxfriedman jeremyphoward gneubig teortaxesTex scaling01 artificialanlys philschmid
Anthropic's Claude 4 models (Opus 4, Sonnet 4) demonstrate strong coding abilities, with Sonnet 4 achieving 72.7% on SWE-bench and Opus 4 at 72.5%. Claude Sonnet 4 excels in codebase understanding and is considered SOTA on large codebases. Criticism arose over Anthropic's handling of ASL-3 security requirements. Demand for Claude 4 is high, with integration into IDEs and support from Cherry Studio and FastHTML. Google DeepMind introduced Gemini 2.5 Pro Deep Think and Gemma 3n, a mobile multimodal model reducing RAM usage by nearly 3x. Google's Imagen 4 Ultra ranks third in the Artificial Analysis Image Arena, available on Vertex AI Studio. Google also promoted Google Beam, an AI video model for immersive 3D experiences, and new text-to-speech models with multi-speaker support. The GAIA benchmark shows Claude 4 Opus and Sonnet leading in agentic performance.
Google's Agent2Agent Protocol (A2A)
kimi-vl-a3b gpt-4o llama-4-scout llama-4-maverick llama-4-behemoth deepcoder-14b o3-mini o1 llama-3.1-nemotron-ultra-253b deepseek-r1 google google-deepmind moonshot-ai meta-ai-fair uc-berkeley openai nvidia hugging-face togethercompute deepseek agent-interoperability multimodality vision math reinforcement-learning coding model-training open-source model-benchmarking context-windows streaming push-notifications enterprise-authentication model-release reach_vb _akhaliq epochairesearch artificialanlys winglian danielhanchen yuchenj_uw jeremyphoward
Google Cloud Next announcements featured the launch of Google and DeepMind's full MCP support and a new Agent to Agent protocol designed for agent interoperability with multiple partners. The protocol includes components like the Agent Card, Task communication channels, Enterprise Auth and Observability, and Streaming and Push Notification support. On the model front, Moonshot AI released Kimi-VL-A3B, a multimodal model with 128K context and strong vision and math benchmark performance, outperforming gpt-4o. Meta AI introduced smaller versions of llama-4 family models: llama-4-scout and llama-4-maverick, with a larger Behemoth model still in training. DeepCoder 14B from UC Berkeley is an open-source coding model rivaling openai's o3-mini and o1 models, trained with reinforcement learning on 24K coding problems. Nvidia released llama-3.1-nemotron-ultra-253b on Hugging Face, noted for beating llama-4-behemoth and maverick and competing with deepseek-r1.
not much happened today
gemini-2.0-flash imagen-3 mistral-small-3.1 mistral-3 gpt-4o-mini claude-3.5-haiku olm0-32b qwen-2.5 shieldgemma-2 julian fasttransform nvidia google mistral-ai allen-ai anthropic langchainai perplexity-ai kalshi stripe qodoai multimodality image-generation context-windows model-pricing open-source-models image-classification frameworks python-libraries partnerships jeremyphoward karpathy abacaj mervenoyann
At Nvidia GTC Day 1, several AI updates were highlighted: Google's Gemini 2.0 Flash introduces image input/output but is not recommended for text-to-image tasks, with Imagen 3 preferred for that. Mistral AI released Mistral Small 3.1 with 128k token context window and competitive pricing. Allen AI launched OLMo-32B, an open LLM outperforming GPT-4o mini and Qwen 2.5. ShieldGemma 2 was introduced for image safety classification. LangChainAI announced multiple updates including Julian powered by LangGraph and integration with AnthropicAI's MCP. Jeremy Howard released fasttransform, a Python library for data transformations. Perplexity AI partnered with Kalshi for NCAA March Madness predictions.
not much happened today
gpt-4.5 gpt-4 gpt-4o o1 claude-3.5-sonnet claude-3.7 claude-3-opus deepseek-v3 grok-3 openai anthropic perplexity-ai deepseek scaling01 model-performance humor emotional-intelligence model-comparison pricing context-windows model-size user-experience andrej-karpathy jeremyphoward abacaj stevenheidel yuchenj_uw aravsrinivas dylan522p random_walker
GPT-4.5 sparked mixed reactions on Twitter, with @karpathy noting users preferred GPT-4 in a poll despite his personal favor for GPT-4.5's creativity and humor. Critics like @abacaj highlighted GPT-4.5's slowness and questioned its practical value and pricing compared to other models. Performance-wise, GPT-4.5 ranks above GPT-4o but below o1 and Claude 3.5 Sonnet, with Claude 3.7 outperforming it on many tasks yet GPT-4.5 praised for its humor and "vibes." Speculation about GPT-4.5's size suggests around 5 trillion parameters. Discussions also touched on pricing disparities, with Perplexity Deep Research at $20/month versus ChatGPT at $200/month. The emotional intelligence and humor of models like Claude 3.7 were also noted.
small news items
gpt-4.5 gpt-5 deepseek-r1-distilled-qwen-1.5b o1-preview modernbert-0.3b qwen-0.5b o3 openai ollama mistral perplexity cerebras alibaba groq bytedance math benchmarking fine-tuning model-performance reinforcement-learning model-architecture partnerships funding jeremyphoward arankomatsuzaki sama nrehiew_ danhendrycks akhaliq
OpenAI announced plans for GPT-4.5 (Orion) and GPT-5, with GPT-5 integrating the o3 model and offering unlimited chat access in the free tier. DeepSeek R1 Distilled Qwen 1.5B outperforms OpenAI's o1-preview on math benchmarks, while ModernBERT 0.3b surpasses Qwen 0.5b at MMLU without fine-tuning. Mistral and Perplexity adopt Cerebras hardware for 10x performance gains. OpenAI's o3 model won a gold medal at the 2024 International Olympiad in Informatics. Partnerships include Qwen with Groq. Significant RLHF activity is noted in Nigeria and the global south, and Bytedance is expected to rise in AI prominence soon. "GPT5 is all you need."
not much happened today
gemini-2.0-flash-thinking-experimental-1-21 zonos openr1-math-220k huginn-3.5b deepseek-r1 o1 claude google zyphraai hugging-face anthropic deepseek openai vision multilingual-models text-to-speech voice-cloning math reasoning latent-reasoning chain-of-thought dataset-release fine-tuning model-training model-performance context-windows benchmarking jeremyphoward andrej-karpathy tom-goldstein reach_vb iscienceluvr
Google released Gemini 2.0 Flash Thinking Experimental 1-21, a vision-language reasoning model with a 1 million-token context window and improved accuracy on science, math, and multimedia benchmarks, surpassing DeepSeek-R1 but trailing OpenAI's o1. ZyphraAI launched Zonos, a multilingual Text-to-Speech model with instant voice cloning and controls for speaking rate, pitch, and emotions, running at ~2x real-time speed on RTX 4090. Hugging Face released OpenR1-Math-220k, a large-scale math reasoning dataset with 220K problems and 800K reasoning traces generated on 512 H100 GPUs. Tom Goldstein introduced Huginn-3.5B, an open-source latent reasoning model trained on 800B tokens that outperforms larger models on reasoning tasks like GSM8K. Discussions by Jeremy Howard and iScienceLuvr highlight advances in implicit latent reasoning and debate the future of human-readable reasoning traces. Anthropic launched the Anthropic Economic Index to analyze AI's economic impact using millions of Claude conversations.
ModernBert: small new Retriever/Classifier workhorse, 8k context, 2T tokens,
modernbert gemini-2.0-flash-thinking o1 llama answerdotai lightonio hugging-face google-deepmind openai meta-ai-fair figure encoder-only-models long-context alternating-attention natural-language-understanding reasoning robotics-simulation physics-engine humanoid-robots model-performance model-releases jeremyphoward alec-radford philschmid drjimfan bindureddy
Answer.ai/LightOn released ModernBERT, an updated encoder-only model with 8k token context, trained on 2 trillion tokens including code, with 139M/395M parameters and state-of-the-art performance on retrieval, NLU, and code tasks. It features Alternating Attention layers mixing global and local attention. Gemini 2.0 Flash Thinking debuted as #1 in Chatbot Arena, and the O1 model scored top in reasoning benchmarks. Llama downloads surpassed 650 million, doubling in 3 months. OpenAI launched desktop app integrations with voice capabilities. Figure delivered its first humanoid robots commercially. Advances in robotics simulation and a new physics engine Genesis claiming 430,000x faster than real-time were highlighted.
Cerebras Inference: Faster, Better, AND Cheaper
llama-3.1-8b llama-3.1-70b gemini-1.5-flash gemini-1.5-pro cogvideox-5b mamba-2 rene-1.3b llama-3.1 gemini-1.5 claude groq cerebras cursor google-deepmind anthropic inference-speed wafer-scale-chips prompt-caching model-merging benchmarking open-source-models code-editing model-optimization jeremyphoward sam-altman nat-friedman daniel-gross swyx
Groq led early 2024 with superfast LLM inference speeds, achieving ~450 tokens/sec for Mixtral 8x7B and 240 tokens/sec for Llama 2 70B. Cursor introduced a specialized code edit model hitting 1000 tokens/sec. Now, Cerebras claims the fastest inference with their wafer-scale chips, running Llama3.1-8b at 1800 tokens/sec and Llama3.1-70B at 450 tokens/sec at full precision, with competitive pricing and a generous free tier. Google's Gemini 1.5 models showed significant benchmark improvements, especially Gemini-1.5-Flash and Gemini-1.5-Pro. New open-source models like CogVideoX-5B and Mamba-2 (Rene 1.3B) were released, optimized for consumer hardware. Anthropic's Claude now supports prompt caching, improving speed and cost efficiency. "Cerebras Inference runs Llama3.1 20x faster than GPU solutions at 1/5 the price."
not much happened today
qwen2-math-72b gpt-4o claude-3.5-sonnet gemini-1.5-pro llama-3.1-405b idefics3-llama-8b anthropic google mistral-ai llamaindex math fine-tuning synthetic-data reinforcement-learning bug-bounty visual-question-answering open-source retrieval-augmented-generation agentic-ai ai-safety policy rohanpaul_ai anthropicai mervenoyann jeremyphoward omarsar0 ylecun bindureddy
Qwen2-Math-72B outperforms GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, and Llama-3.1-405B on math benchmarks using synthetic data and advanced optimization techniques. Google AI cuts pricing for Gemini 1.5 Flash by up to 78%. Anthropic expands its bug bounty program targeting universal jailbreaks in next-gen safety systems. Tutorial on QLoRA fine-tuning of IDEFICS3-Llama 8B for visual question answering released. A Chinese open weights model surpasses previous MATH benchmark records. Surveys on Mamba models and LLM-based agents for software engineering highlight advancements and applications. Open-source tools like R2R RAG engine and LlamaIndex Workflows simplify building complex AI applications. Mistral AI introduces customizable AI agents. Concerns raised about California bill SB 1047's focus on existential risk and debates on banning open-source AI. Memes and humor continue in AI communities.
not much happened today
sam-2 gemini-1.5-pro chatgpt midjourney-v6.1 meta-ai-fair google-deepmind scale-ai apple canva hugging-face object-segmentation quantization web-development-framework adversarial-robustness on-device-ai open-source robotics voice vision jeremyphoward demis-hassabis ylecun maartengrootendorst jimfan
Meta released SAM 2, a unified model for real-time object segmentation with a new dataset 4.5x larger and 53x more annotated than previous ones. FastHTML, a new Python web framework by Jeremy Howard, enables easy creation and deployment of interactive web apps. Scale AI launched the SEAL Leaderboard on adversarial robustness, topped by Gemini 1.5 Pro from Google DeepMind. Apple published a technical report on their Intelligence Foundation Language Models for on-device and server use. Yann LeCun emphasized the importance of open source AI in an article co-authored with Martin Casado and Ion Stoica. Maarten Grootendorst's "Visual Guide to Quantization" on efficient LLM inference went viral. ChatGPT started rolling out advanced voice and vision-enabled modes to select users. Leonardo AI was acquired by Canva. Jim Fan shared insights on Project Groot augmenting human demonstration data for robotics. Midjourney v6.1 was released.
Qdrant's BM42: "Please don't trust us"
claude-3.5-sonnet gemma-2 nano-llava-1.5 qdrant cohere stripe anthropic hugging-face stablequan_ai semantic-search benchmarking dataset-quality model-evaluation model-optimization vision fine-tuning context-windows nils-reimers jeremyphoward hamelhusain rohanpaul_ai
Qdrant attempted to replace BM25 and SPLADE with a new method called "BM42" combining transformer attention and collection-wide statistics for semantic and keyword search, but their evaluation using the Quora dataset was flawed. Nils Reimers from Cohere reran BM42 on better datasets and found it underperformed. Qdrant acknowledged the errors but still ran a suboptimal BM25 implementation. This highlights the importance of dataset choice and evaluation sanity checks in search model claims. Additionally, Stripe faced criticism for AI/ML model failures causing account and payment issues, prompting calls for alternatives. Anthropic revealed that Claude 3.5 Sonnet suppresses some answer parts with backend tags, sparking debate. Gemma 2 model optimizations allow 2x faster fine-tuning with 63% less memory and longer context windows, running up to 34B parameters on consumer GPUs. nanoLLaVA-1.5 was announced as a compact 1B parameter vision model with significant improvements.
Inflection-2.5 at 94% of GPT4, and Pi at 6m MAU
inflection-2.5 claude-3-sonnet claude-3-opus gpt-4 yi-9b mistral inflection anthropic perplexity-ai llamaindex mistral-ai langchain retrieval-augmented-generation benchmarking ocr structured-output video-retrieval knowledge-augmentation planning tool-use evaluation code-benchmarks math-benchmarks mustafa-suleyman amanda-askell jeremyphoward abacaj omarsar0
Mustafa Suleyman announced Inflection 2.5, which achieves more than 94% the average performance of GPT-4 despite using only 40% the training FLOPs. Pi's user base is growing about 10% weekly, with new features like realtime web search. The community noted similarities between Inflection 2.5 and Claude 3 Sonnet. Claude 3 Opus outperformed GPT-4 in a 1.5:1 vote and is now the default for Perplexity Pro users. Anthropic added experimental tool calling support for Claude 3 via LangChain. LlamaIndex released LlamaParse JSON Mode for structured PDF parsing and added video retrieval via VideoDB, enabling retrieval-augmented generation (RAG) pipelines. A paper proposed knowledge-augmented planning for LLM agents. New benchmarks like TinyBenchmarks and the Yi-9B model release show strong code and math performance, surpassing Mistral.
Stable Diffusion 3 — Rombach & Esser did it again!
stable-diffusion-3 claude-3 orca dolphincoder-starcoder2-15b stability-ai anthropic microsoft latitude perplexity-ai llamaindex tripo-ai diffusion-models multimodality benchmarking human-evaluation text-generation image-generation 3d-modeling fine-tuning roleplay coding dataset-release soumith-chintala bill-peebles swyx kevinafischer jeremyphoward akhaliq karinanguyen_ aravsrinivas
Over 2500 new community members joined following Soumith Chintala's shoutout, highlighting growing interest in SOTA LLM-based summarization. The major highlight is the detailed paper release of Stable Diffusion 3 (SD3), showcasing advanced text-in-image control and complex prompt handling, with the model outperforming other SOTA image generation models in human-evaluated benchmarks. The SD3 model is based on an enhanced Diffusion Transformer architecture called MMDiT. Meanwhile, Anthropic released Claude 3 models, noted for human-like responses and emotional depth, scoring 79.88% on HumanEval but costing over twice as much as GPT-4. Microsoft launched new Orca-based models and datasets, and Latitude released DolphinCoder-StarCoder2-15b with strong coding capabilities. Integration of image models by Perplexity AI and 3D CAD generation by PolySpectra powered by LlamaIndex were also highlighted. "SD3's win rate beats all other SOTA image gen models (except perhaps Ideogram)" and "Claude 3 models are very good at generating d3 visualizations from text descriptions."