Person: "karpathy"

dflash nemo-automodel claude openai broadcom qualcomm modular nvidia skypilot modal anthropic hugging-face hardware inference performance-optimization model-training agent-ux security capability-based-security open-source fine-tuning infrastructure model-optimization gdb kimmonismus scaling01 clattner_llvm karpathy gallabytes dabit3 kentonvarda random_walker jubbaonjeans victormustar

OpenAI announced Jalapeño, its first custom AI chip for LLM inference, built with Broadcom, aiming to control more of the AI stack and improve compute economics with a fast 9-month design cycle. Community analysis suggests Jalapeño features 216GB HBM3E, ~7.1–7.4 TB/s bandwidth, and ~10 PFLOPS FP4 performance, signaling hyperscaler-style inference silicon as a new standard. Meanwhile, Qualcomm is acquiring Modular, with Mojo open-sourcing on track, indicating rising competition in vertically integrated inference stacks beyond NVIDIA/CUDA. On infrastructure, NVIDIA's NeMo AutoModel boosts training throughput for MoE models by 3.4–3.7x, and startups like SkyPilot and Modal advance unified and open-source inference solutions. Custom training of DFLASH models yields 30–50% decode gains. In UX, Anthropic's Slack-native Claude agent shifts agent interaction from tools to coworkers, raising new security and cost concerns around identity, permissions, and lock-in, with debates on capability-based security and attribution. Hugging Face responded with its self-hosted Slack coding agent Moon Bot.

Mar 11

not much happened today

nemotron-3-super gpt-oss-120b qwen3.5-122b-a10b nvidia perplexity replit base44 vllm llama.cpp ollama togethercompute baseten wandb langchain unsloth model-architecture model-optimization inference-speed kv-cache multi-token-prediction agent-infrastructure orchestration persistent-agents model-serving product-launches karpathy ctnzr bnjmn_marie artificialanlys

NVIDIA’s Nemotron 3 Super is a 120B parameter / ~12B active open model featuring a hybrid Mamba-Transformer / SSM Latent MoE architecture and 1M context window, delivering up to 2.2x faster inference than GPT-OSS-120B in FP4 with strong throughput gains. It supports agentic workloads and is unusually open with weights, data, and infrastructure details released. The model scored 36 on the AA Intelligence Index, outperforming GPT-OSS-120B but behind Qwen3.5-122B-A10B. Community and infrastructure support from projects like vLLM, llama.cpp, Ollama, Together, Baseten, W&B Inference, LangChain, and Unsloth GGUFs was immediate. Key technical innovations include native multi-token prediction (MTP) and a significant KV-cache efficiency advantage. On the product side, a shift towards persistent agent runtimes and orchestration layers is highlighted, with Andrej Karpathy advocating for a "bigger IDE" concept where agents replace files as the unit of work, enabling legible, forkable agentic organizations with real-time control. New launches fitting this vision include Perplexity’s Personal Computer, an always-on local/cloud hybrid running on Mac mini, and Computer for Enterprise orchestrating 20 specialized models and 400+ apps. Replit Agent 4 offers a collaborative, canvas-like workflow with parallel agents, while Base44 Superagents provide integrated solutions for nontechnical users. The engineering focus is increasingly on the orchestration harness rather than just the model.

Mar 09

Autoresearch: Sparks of Recursive Self Improvement

claude-3 codex anthropic openai cognition automated-machine-learning coding-agents bug-fixing model-autonomy multi-agent-systems pr-review systems-engineering model-verification karpathy yi_tay jakub_pachocki

RSI covers AI developments from 3/5/2026 to 3/9/2026, highlighting the emergence of LLMs autonomously training smaller LLMs, marking a significant "AutoML moment" in AI progress. Karpathy and Yi Tay discuss "vibe training," where AI models fix bugs and improve code autonomously, suggesting models may soon surpass human debugging efficiency. The report anticipates Jakub Pachocki's Automated AI Research Intern system by September 2026 to accelerate human researchers. On AI Twitter, the focus is on coding agents shifting bottlenecks from implementation to review and verification, with Anthropic's Claude Code Review improving PR review effectiveness significantly, and tools like OpenAI Codex Review and Cognition's Devin Review enhancing code review workflows. Harness engineering is evolving into systems engineering, emphasizing decoupling agent storage from compute for collaborative agent teams.

Feb 25

Agentic Engineering: WTF Happened in December 2025?

gpt-5.3-codex claude-code perplexity openai anthropic langchain-ai coding-agents agent-architecture distributed-workflows usage-based-pricing model-routing benchmarking context-length observability software-development karpathy aravsrinivas lioronai denisyarats swyx catwu hwchase17

Perplexity launched Computer, an orchestration-first agent platform featuring multi-model routing, usage-based pricing, and parallel asynchronous sub-agents for distributed workflows. Andrej Karpathy claims a "phase change" in coding agents since December, highlighting sustained long-horizon task completion. OpenAI released GPT-5.3-Codex with ~25% speed improvements and strong benchmark performance, while Claude Code celebrates its first year with ecosystem integrations and scaling challenges. This marks a significant shift in coding workflows and agent-based software development.

Feb 06

not much happened today

gpt-5.3-codex claude-opus-4.6 nanochat-gpt-2 openai anthropic langchain agent-systems ai-engineering benchmarking software-organization sandboxing tracing state-management recursive-language-models context-management karpathy sama swyx omarsar0 hamelhusain deepfates

AI News for early February 2026 highlights a detailed comparison between GPT-5.3-Codex and Claude Opus 4.6, with users noting Codex's strength in detailed scoped tasks and Opus's ergonomic advantage for exploratory work. Benchmarks on Karpathy's nanochat GPT-2 speedrun show Opus 4.6 achieving better wall-clock performance, while Codex-5.3-xhigh sometimes suffers from context issues. Karpathy cautions that current models are not yet reliable for fully autonomous AI engineering. Discussions on agent swarms reveal emerging parallels to software organizational design, with Anthropic-style agent coordination systems and LangChain/LangSmith emphasizing environment engineering through tracing, sandboxing, and state control. The concept of Recursive Language Models (RLM) is introduced as a future direction for agent systems to reduce context rot and improve structured communication.

Jan 30

MoltBook takes over the timeline

claude genie-3 moltbook openclaw anthropic google multi-agent-systems agent-communication security prompt-injection identity alignment observability ai-planning ai-coding emergent-behavior karpathy

Moltbook and OpenClaw showcase emergent multi-agent social networks where AI agents autonomously interact, creating an AI-native forum layer with complex security and identity challenges. Karpathy describes this as "takeoff-adjacent," highlighting bots self-organizing and engaging in prompt-injection and credential theft. Anthropic reports on AI coding tradeoffs with a study of 52 junior engineers and reveals Claude planned a Mars rover drive, marking a milestone in AI-driven space exploration. Google publicly releases Genie 3, sparking debate over its capabilities and latency issues. The rise of agent-to-agent private communications raises concerns about alignment and observability in 2026.

Jan 07

not much happened today

nouscoder-14b deepseek-r1 langchain cursor huggingface openai weights-biases agent-frameworks context-management reinforcement-learning operational-safety model-transparency trajectory-exploration token-optimization coding-agents integration-platforms karpathy _philschmid omarsar0

AI News for 1/6/2026-1/7/2026 highlights a quiet day with key updates on LangChain DeepAgents introducing Ralph Mode for persistent agent loops, Cursor improving context management by reducing token usage by 46.9%, and operational safety measures for coding agents with allow/deny lists. MCP integration is expanding across assistants and robotics, with Hugging Face embedding assistants via HuggingChat + HF MCP server. The DeepSeek-R1 paper has been expanded to 86 pages, emphasizing trajectory exploration and RL shaping behavior. NousCoder-14B shows a +7% improvement on LiveCodeBench after 4 days of RL training, demonstrating advances in RL for coding with small open models. Top tweets also mention a viral "96GB RAM laptop", ChatGPT Health launch by OpenAI, and Karpathy's nanochat scaling-law miniseries.

Oct 20, 2025

DeepSeek-OCR finds vision models can decode 10x more efficiently with ~97% accuracy of text-only, 33/200k pages/day/A100

deepseek-ocr deepseek3b-moe-a570m veo-3.1 deepseek-ai google-deepmind krea ocr vision multimodality model-compression long-context model-architecture video-generation autoregressive-models model-efficiency precision-editing karpathy teortaxestex reach_vb _akhaliq eliebakouch vikhyatk demishassabis

As ICCV 2025 begins, DeepSeek releases a novel DeepSeek-OCR 3B MoE vision-language model that compresses long text as visual context with high accuracy and efficiency, challenging traditional tokenization approaches. The model achieves ~97% decoding precision at <10× compression and processes up to ~33M pages/day on 20 A100-40G nodes, outperforming benchmarks like GOT-OCR2.0. Discussions highlight the potential for unlimited context windows and tokenization-free inputs, with contributions from @karpathy, @teortaxesTex, and others. In video generation, google-deepmind's Veo 3.1 leads community benchmarks with advanced precision editing and scene blending, while Krea open-sources a 14B autoregressive video model enabling realtime long-form generation at ~11 FPS on a single B200 GPU.

Oct 17, 2025

The Karpathy-Dwarkesh Interview delays AGI timelines

claude-haiku-4.5 gpt-5 arch-router-1.5b anthropic openai huggingface langchain llamaindex google epoch-ai reasoning long-context sampling benchmarking data-quality agent-frameworks modular-workflows ide-extensions model-routing graph-first-agents real-world-grounding karpathy aakaran31 du_yilun giffmana omarsar0 jeremyphoward claude_code mikeyk alexalbert__ clementdelangue jerryjliu0

The recent AI news highlights the Karpathy interview as a major event, alongside significant discussions on reasoning improvements without reinforcement learning, with test-time sampling achieving GRPO-level performance. Critiques on context window marketing reveal effective limits near 64K tokens, with Claude Haiku 4.5 showing competitive reasoning speed. GPT-5 struggles with advanced math benchmarks, and data quality issues termed "Brain Rot" affect model reasoning and safety. In agent frameworks, Anthropic Skills enable modular coding workflows, OpenAI Codex IDE extensions enhance developer productivity, and HuggingChat Omni introduces meta-routing across 100+ open models using Arch-Router-1.5B. LangChain and LlamaIndex advance graph-first agent infrastructure, while Google Gemini integrates with Google Maps for real-world grounding.

Oct 14, 2025

not much happened today

qwen3-vl-4b qwen3-vl-8b qwen2.5-vl-72b deepseek-v3.1 alibaba arena runway nvidia togethercompute ollama model-optimization fine-tuning inference-speed video-generation diffusion-models representation-learning local-ai speculative-decoding fp8-quantization context-windows karpathy

Alibaba released compact dense Qwen3-VL models at 4B and 8B sizes with FP8 options, supporting up to 1M context and open vocabulary detection, rivaling larger models like Qwen2.5-VL-72B. Ecosystem support includes MLX-VLM, LM Studio, vLLM, Kaggle models, and Ollama Cloud. In video AI, Arena added Sora 2 models leading in video benchmarks, with Higgsfield Enhancer improving video quality. Runway launched domain-specific workflow apps for creative tasks. Research on Representation Autoencoders for DiTs (RAE-DiT) shows improved diffusion model performance. On local training, NVIDIA DGX Spark enables strong local fine-tuning, while Nanochat by Karpathy offers a minimal stack for training and inference. Together AI introduced ATLAS, a speculative decoding method achieving up to 4× faster inference on DeepSeek-V3.1. These developments highlight advances in efficient model deployment, video AI, local fine-tuning, and inference speed optimization.

Oct 01, 2025

Thinking Machines' Tinker: LoRA based LLM fine-tuning API

qwen-235b-a22b sora-2 thinking-machines openai fine-tuning lora model-training api model-optimization distributed-training post-training-methods research-productivity video-generation content-moderation engagement-patterns karpathy lilianweng sama

Thinking Machines recently raised $2 billion without shipping a product until now, launching their first product Tinker, a managed service API for fine-tuning large and mixture-of-experts models like Qwen-235B-A22B using LoRA for cost-efficient training. The Tinker API offers low-level primitives for post-training methods and is supported by an open-source Tinker Cookbook library. Influential AI figures like Andrej Karpathy and Lilian Weng praised its design for reducing complexity and boosting research productivity. Meanwhile, OpenAI launched Sora 2, a video+audio model integrated into their consumer social app, sparking viral engagement and concerns over misuse and content moderation. Sam Altman emphasized the product's dual focus on delight and revenue alongside AGI research.

Sep 05, 2025

Kimi K2‑0905 and Qwen3‑Max preview: two 1T open weights models launched

kimi-k2-0905 qwen-3-max qwen-3 moonshot-ai alibaba huggingface together-ai groq lmsys openrouter llamaindex long-context agents coding tool-use model-evaluation instruction-following context-windows semantic-search discriminator-models swyx karpathy willdepue levie bebischof andrew_n_carr bigeagle_xd

Moonshot AI updated their Kimi K2-0905 open model with doubled context length to 256k tokens, improved coding and tool-calling, and integration with agent scaffolds. Alibaba released Qwen 3 Max, a 1 trillion parameter model with agent-oriented behavior, available via Qwen Chat, Alibaba Cloud API, and OpenRouter. The community highlights China's dominance in open models and debates around meaningful evaluation methods for code agents, emphasizing long-horizon and domain-specific evals. Influential voices like @swyx and @karpathy discuss the importance of practical evals and discriminator models for ranking outputs.

Jun 25, 2025

Context Engineering: Much More than Prompts

gemini-code openai langchain cognition google-deepmind vercel cloudflare openrouter context-engineering retrieval-augmented-generation tools state-management history-management prompt-engineering software-layer chatgpt-connectors api-integration karpathy walden_yan tobi_lutke hwchase17 rlancemartin kwindla dex_horthy

Context Engineering emerges as a significant trend in AI, highlighted by experts like Andrej Karpathy, Walden Yan from Cognition, and Tobi Lutke. It involves managing an LLM's context window with the right mix of prompts, retrieval, tools, and state to optimize performance, going beyond traditional prompt engineering. LangChain and its tool LangGraph are noted for advancing this approach. Additionally, OpenAI has launched ChatGPT connectors for platforms like Google Drive, Dropbox, SharePoint, and Box, enhancing context integration for Pro users. Other notable news includes the launch of Vercel Sandbox, Cloudflare Containers, the leak and release of Gemini Code by Google DeepMind, and fundraising efforts by OpenRouter.

Jun 16, 2025

Chinese Models Launch - MiniMax-M1, Hailuo 2 "Kangaroo", Moonshot Kimi-Dev-72B

minimax-m1 hailuo-02 kimi-dev-72b deepseek-r1 ale-agent minimax-ai moonshot-ai deepseek bytedance anthropic langchain columbia-university sakana-ai openai microsoft multi-agent-systems attention-mechanisms coding optimization prompt-injection model-performance video-generation model-training task-automation jerryjliu0 hwchase17 omarsar0 gallabytes lateinteraction karpathy

MiniMax AI launched MiniMax-M1, a 456 billion parameter open weights LLM with a 1 million token input and 80k token output using efficient "lightning attention" and a GRPO variant called CISPO. MiniMax AI also announced Hailuo 02 (0616), a video model similar to ByteDance's Seedance. Moonshot AI released Kimi-Dev-72B, a coding model outperforming DeepSeek R1 on SWEBench Verified. Discussions on multi-agent system design from Anthropic and LangChain highlighted improvements in task completion and challenges like prompt injection attacks, as demonstrated by Karpathy and Columbia University research. Sakana AI introduced ALE-Agent, a coding agent that ranked 21st in the AtCoder Heuristic Competition solving NP-hard optimization problems. There is unverified news about an acquisition involving OpenAI, Microsoft, and Windsurf.

May 01, 2025

not much happened today

phi-4 phi-4-mini-reasoning qwen3-235b qwen3-moe-235b qwen3-moe-30b qwen3-dense-32b qwen3-dense-14b qwen3-dense-8b qwen3-dense-4b qwen3-dense-0.6b qwen2.5-omni-3b deepseek-prover-v2 llama llama-guard-4 prompt-guard-2 mimo-7b microsoft anthropic cursor alibaba togethercompute deepseek meta-ai-fair xiaomi openrouterai cohere reasoning model-fine-tuning model-evaluation benchmarking model-popularity open-source math model-scaling model-filtering jailbreak-prevention cline reach_vb vipulved akhaliq omarsar0 zhs05232838 huajian_xin mervenoyann karpathy random_walker sarahookr blancheminerva clefourrier

Microsoft released Phi-reasoning 4, a finetuned 14B reasoning model slightly behind QwQ but limited by data transparency and token efficiency issues. Anthropic introduced remote MCP server support and a 45-minute Research mode in Claude. Cursor published a model popularity list. Alibaba launched Qwen3-235B and other Qwen3 variants, highlighting budget-friendly coding and reasoning capabilities, with availability on Together AI API. Microsoft also released Phi-4-Mini-Reasoning with benchmark performance on AIME 2025 and OmniMath. DeepSeek announced DeepSeek-Prover V2 with state-of-the-art math problem solving, scaling to 671B parameters. Meta AI's Llama models hit 1.2 billion downloads, with new Llama Guard 4 and Prompt Guard 2 for input/output filtering and jailbreak prevention. Xiaomi released the open-source reasoning model MiMo-7B trained on 25 trillion tokens. Discussions on AI model evaluation highlighted issues with the LMArena leaderboard, data access biases favoring proprietary models, and challenges in maintaining fair benchmarking, with suggestions for alternatives like OpenRouterAI rankings. "LMArena slop and biased" and "61.3% of all data going to proprietary model providers" were noted concerns.

Apr 30, 2025

ChatGPT responds to GlazeGate + LMArena responds to Cohere

qwen3-235b-a22b qwen3 qwen3-moe llama-4 openai cohere lm-arena deepmind x-ai meta-ai-fair alibaba vllm llamaindex model-releases model-benchmarking performance-evaluation open-source multilinguality model-integration fine-tuning model-optimization joannejang arankomatsuzaki karpathy sarahookr reach_vb

OpenAI faced backlash after a controversial ChatGPT update, leading to an official retraction admitting they "focused too much on short-term feedback." Researchers from Cohere published a paper criticizing LMArena for unfair practices favoring incumbents like OpenAI, DeepMind, X.ai, and Meta AI Fair. The Qwen3 family by Alibaba was released, featuring models up to 235B MoE, supporting 119 languages and trained on 36 trillion tokens, with integration into vLLM and support in tools like llama.cpp. Meta announced the second round of Llama Impact Grants to promote open-source AI innovation. Discussions on AI Twitter highlighted concerns about leaderboard overfitting and fairness in model benchmarking, with notable commentary from karpathy and others.

Apr 01, 2025

>$41B raised today (OpenAI @ 300b, Cursor @ 9.5b, Etched @ 1.5b)

deepseek-v3-0324 gemini-2.5-pro claude-3.7-sonnet openai deepseek gemini cursor etched skypilot agent-evals open-models model-releases model-performance coding multimodality model-deployment cost-efficiency agent-evaluation privacy kevinweil sama lmarena_ai scaling01 iscienceluvr stevenheidel lepikhin dzhng raizamrtn karpathy

OpenAI is preparing to release a highly capable open language model, their first since GPT-2, with a focus on reasoning and community feedback, as shared by @kevinweil and @sama. DeepSeek V3 0324 has achieved the #5 spot on the Arena leaderboard, becoming the top open model with an MIT license and cost advantages. Gemini 2.5 Pro is noted for outperforming models like Claude 3.7 Sonnet in coding tasks, with upcoming pricing and improvements expected soon. New startups like Sophont are building open multimodal foundation models for healthcare. Significant fundraises include Cursor closing $625M at a $9.6B valuation and Etched raising $85M at $1.5B. Innovations in AI infrastructure include SkyPilot's cost-efficient cloud provisioning and the launch of AgentEvals, an open-source package for evaluating AI agents. Discussions on smartphone privacy highlight iPhone's stronger user defense compared to Android.

Mar 18, 2025

not much happened today

gemini-2.0-flash imagen-3 mistral-small-3.1 mistral-3 gpt-4o-mini claude-3.5-haiku olm0-32b qwen-2.5 shieldgemma-2 julian fasttransform nvidia google mistral-ai allen-ai anthropic langchainai perplexity-ai kalshi stripe qodoai multimodality image-generation context-windows model-pricing open-source-models image-classification frameworks python-libraries partnerships jeremyphoward karpathy abacaj mervenoyann

At Nvidia GTC Day 1, several AI updates were highlighted: Google's Gemini 2.0 Flash introduces image input/output but is not recommended for text-to-image tasks, with Imagen 3 preferred for that. Mistral AI released Mistral Small 3.1 with 128k token context window and competitive pricing. Allen AI launched OLMo-32B, an open LLM outperforming GPT-4o mini and Qwen 2.5. ShieldGemma 2 was introduced for image safety classification. LangChainAI announced multiple updates including Julian powered by LangGraph and integration with AnthropicAI's MCP. Jeremy Howard released fasttransform, a Python library for data transformations. Perplexity AI partnered with Kalshi for NCAA March Madness predictions.

Feb 18, 2025

X.ai Grok 3 and Mira Murati's Thinking Machines

grok-3 grok-3-mini gemini-2-pro gpt-4o o3-mini-high o1 deepseek-r1 anthropic openai thinking-machines benchmarking reasoning reinforcement-learning coding multimodality safety alignment research-publishing model-performance creative-ai mira-murati lmarena_ai karpathy omarsar0 ibab arankomatsuzaki iscienceluvr scaling01

Grok 3 has launched with mixed opinions but strong benchmark performance, notably outperforming models like Gemini 2 Pro and GPT-4o. The Grok-3 mini variant shows competitive and sometimes superior capabilities, especially in reasoning and coding, with reinforcement learning playing a key role. Mira Murati has publicly shared her post-OpenAI plan, founding the frontier lab Thinking Machines, focusing on collaborative, personalizable AI, multimodality, and empirical safety and alignment research, reminiscent of Anthropic's approach.

Feb 14, 2025

Reasoning Models are Near-Superhuman Coders (OpenAI IOI, Nvidia Kernels)

o3 o1 o3-mini deepseek-r1 qwen-2.5 openthinker openai nvidia ollama elevenlabs sakana-ai apple reinforcement-learning gpu-kernel-optimization fine-tuning knowledge-distillation scaling-laws chain-of-thought-reasoning model-accessibility alex-wei karpathy abacaj awnihannun

o3 model achieved a gold medal at the 2024 IOI and ranks in the 99.8 percentile on Codeforces, outperforming most humans with reinforcement learning (RL) methods proving superior to inductive bias approaches. Nvidia's DeepSeek-R1 autonomously generates GPU kernels that surpass some expert-engineered kernels, showcasing simple yet effective AI-driven optimization. OpenAI updated o1 and o3-mini models to support file and image uploads in ChatGPT and released DeepResearch, a powerful research assistant based on the o3 model with RL for deep chain-of-thought reasoning. Ollama introduced OpenThinker models fine-tuned from Qwen2.5, outperforming some DeepSeek-R1 distillation models. ElevenLabs grew into a $3.3 billion company specializing in AI voice synthesis without open-sourcing their technology. Research highlights include Sakana AI Labs' TAID knowledge distillation method receiving a Spotlight at ICLR 2025, and Apple's work on scaling laws for mixture-of-experts (MoEs). The importance of open-source AI for scientific discovery was also emphasized.

Nov 12, 2024

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

o1 claude-3.5-haiku gpt-4o epoch-ai openai microsoft anthropic x-ai langchainai benchmarking math moravecs-paradox mixture-of-experts chain-of-thought agent-framework financial-metrics-api pdf-processing few-shot-learning code-generation karpathy philschmid adcock_brett dylan522p

Epoch AI collaborated with over 60 leading mathematicians to create the FrontierMath benchmark, a fresh set of hundreds of original math problems with easy-to-verify answers, aiming to challenge current AI models. The benchmark reveals that all tested models, including o1, perform poorly, highlighting the difficulty of complex problem-solving and Moravec's paradox in AI. Key AI developments include the introduction of Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture reducing computational costs, and improvements in Chain-of-Thought (CoT) prompting through incorrect reasoning and explanations. Industry news covers OpenAI acquiring the chat.com domain, Microsoft launching the Magentic-One agent framework, Anthropic releasing Claude 3.5 Haiku outperforming gpt-4o on some benchmarks, and xAI securing 150MW grid power with support from Elon Musk and Trump. LangChain AI introduced new tools including a Financial Metrics API, Document GPT with PDF upload and Q&A, and LangPost AI agent for LinkedIn posts. xAI also demonstrated the Grok Engineer compatible with OpenAI and Anthropic APIs for code generation.

Oct 08, 2024

not much happened this weekend

o1-preview claude-3.5-sonnet 21b-flash-model openai meta-ai-fair reka langchainai entropix prompting-techniques finetuning entropy-based-sampling temporal-understanding native-audio tool-use instruction-chaining multimodality retrieval-augmented-generation synthetic-data-generation rnn parallel-training biologically-inspired-ai-safety text-to-video-generation video-editing lex-fridman imrat jjitsev giffmana _philschmid karpathy rasbt adcock_brett glennko rohanpaul_ai labenz

AI news from 10/4/2024 to 10/7/2024 highlights several developments: OpenAI's o1-preview shows strong performance on complex tasks but struggles with simpler ones, while Claude 3.5 Sonnet can match its reasoning through advanced prompting techniques. Meta introduced Movie Gen, a cutting-edge media foundation model for text-to-video generation and editing. Reka updated their 21B Flash Model with temporal video understanding, native audio, and tool use capabilities. Interest grows in "open o1" reproductions focusing on prompting and finetuning, with Entropix exploring entropy-based sampling. LangChainAI demonstrated a Retrieval Agent for complex Q&A, and synthetic data generation research surveyed 417 models. A resurgence in RNNs shows efficient parallel training making them competitive with Transformers. Biologically-inspired AI safety approaches were also noted. "A quiet weekend and air conditioning is all you need."

Sep 20, 2024

not much happened today

o1-preview o1-mini qwen-2.5 gpt-4o deepseek-v2.5 gpt-4-turbo-2024-04-09 grin llama-3-1-405b veo kat openai qwen deepseek-ai microsoft kyutai-labs perplexity-ai together-ai meta-ai-fair google-deepmind hugging-face google anthropic benchmarking math coding instruction-following model-merging model-expressiveness moe voice voice-models generative-video competition open-source model-deployment ai-agents hyung-won-chung noam-brown bindureddy akhaliq karpathy aravsrinivas fchollet cwolferesearch philschmid labenz ylecun

OpenAI's o1-preview and o1-mini models lead benchmarks in Math, Hard Prompts, and Coding. Qwen 2.5 72B model shows strong performance close to GPT-4o. DeepSeek-V2.5 tops Chinese LLMs, rivaling GPT-4-Turbo-2024-04-09. Microsoft's GRIN MoE achieves good results with 6.6B active parameters. Moshi voice model from Kyutai Labs runs locally on Apple Silicon Macs. Perplexity app introduces voice mode with push-to-talk. LlamaCoder by Together.ai uses Llama 3.1 405B for app generation. Google DeepMind's Veo is a new generative video model for YouTube Shorts. The 2024 ARC-AGI competition increases prize money and plans a university tour. A survey on model merging covers 50+ papers for LLM alignment. The Kolmogorov–Arnold Transformer (KAT) paper proposes replacing MLP layers with KAN layers for better expressiveness. Hugging Face Hub integrates with Google Cloud Vertex AI Model Garden for easier open-source model deployment. Agent.ai is introduced as a professional network for AI agents. "Touching grass is all you need."

Aug 15, 2024

Grok 2! and ChatGPT-4o-latest confuses everybody

gpt-4o grok-2 claude-3.5-sonnet flux-1 stable-diffusion-3 gemini-advanced openai x-ai black-forest-labs google-deepmind benchmarking model-performance tokenization security-vulnerabilities multi-agent-systems research-automation text-to-image conversational-ai model-integration ylecun rohanpaul_ai karpathy

OpenAI quietly released a new GPT-4o model in ChatGPT, distinct from the API version, reclaiming the #1 spot on Lmsys arena benchmarks across multiple categories including math, coding, and instruction-following. Meanwhile, X.ai launched Grok 2, outperforming Claude 3.5 Sonnet and previous GPT-4o versions, with plans for enterprise API release. Grok 2 integrates Black Forest Labs' Flux.1, an open-source text-to-image model surpassing Stable Diffusion 3. Google DeepMind announced Gemini Advanced with enhanced conversational features and Pixel device integration. AI researcher ylecun highlighted LLM limitations in learning and creativity, while rohanpaul_ai discussed an AI Scientist system generating publishable ML research at low cost. karpathy warned of security risks in LLM tokenizers akin to SQL injection.

Aug 09, 2024

Too Cheap To Meter: AI prices cut 50-70% in last 30 days

gpt-4o gpt-4o-mini llama-3-1-405b mistral-large-2 gemini-1.5-flash deepseek-v2 sonnet-3.5 exaone-3.0 minicpm-v-2.6 claude-3.5 gpt-4o-2024-08-06 llamaindex together-ai deepinfra deepseek-ai mistral-ai google-deepmind lg-ai-research llamaindex llamaindex llamaindex price-cuts context-caching instruction-tuning vision benchmarks pytorch attention-mechanisms reinforcement-learning-from-human-feedback compute-optimal-scaling rohanpaul_ai akhaliq mervenoyann sophiamyang chhillee karpathy

Gemini 1.5 Flash has cut prices by approximately 70%, offering a highly competitive free tier of 1 million tokens per minute at $0.075/mtok, intensifying the AI model price war. Other significant price reductions include GPT-4o (~50% cut to $2.50/mtok), GPT-4o mini (70-98.5% cut to $0.15/mtok), Llama 3.1 405b (46% cut to $2.7/mtok), and Mistral Large 2 (62% cut to $3/mtok). Deepseek v2 introduced context caching, reducing input token costs by up to 90% to $0.014/mtok. New model releases include Llama 3.1 405b, Sonnet 3.5, EXAONE-3.0 (7.8B instruction-tuned by LG AI Research), and MiniCPM V 2.6 (vision-language model combining SigLIP 400M and Qwen2-7B). Benchmarks show Mistral Large performing well on ZebraLogic and Claude-3.5 leading LiveBench. FlexAttention, a new PyTorch API, simplifies and optimizes attention mechanisms. Andrej Karpathy analyzed RLHF, highlighting its limitations compared to traditional reinforcement learning. Google DeepMind research on compute-optimal scaling was also summarized.

Jul 13, 2024

We Solved Hallucinations

gpt-2 flashattention-3 lynx meta-ai-fair nvidia princeton colfax patronus-ai databricks mosaic-ai openai compute-hardware gpu-optimization flashattention llm-evaluation hallucination-detection vision benchmarking synthetic-data model-training karpathy tri_dao giffmana vikhyatk dbrxmosaicai

Reddit's URL structure causes link errors in AI-generated summaries, especially with NSFW content affecting models like Claude and GPT-4. The team fixed this glitch while still leveraging LLMs for summarizing Reddit content. GPT-2 training costs have dramatically dropped to ~$672 using H100 GPUs and software improvements like CUDA and FlashAttention. FlashAttention-3 was released, achieving up to 740 TFLOPS on H100 GPUs, with FP8 nearing 1.2 PFLOPS, developed collaboratively by Meta, NVIDIA, Princeton, and Colfax. Hopper GPUs enable major speedups with new hardware features. Synthetic data may not improve vision tasks, as shown in recent research. The Avocado360 benchmark evaluates vision-language models' ability to detect avocados in images. Lynx, a hallucination detection model for LLMs, was introduced for real-world healthcare and fintech applications, trained by Patronus AI on Databricks Mosaic AI using Composer.

Jul 02, 2024

RouteLLM: RIP Martian? (Plus: AINews Structured Summaries update)

gpt-4 gemma-2-27b gemma-2-9b lmsys openai llm-routing cost-efficiency model-performance model-optimization data-augmentation syntax-based-routing mixture-of-experts inference-throughput software-2.0 computer-vision karpathy bindureddy armand-joulin

LMSys introduces RouteLLM, an open-source router framework trained on preference data from Chatbot Arena, achieving cost reductions over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K while maintaining 95% of GPT-4's performance. This approach surpasses previous task-specific routing by using syntax-based Mixture of Experts (MoE) routing and data augmentation, beating commercial solutions by 40%. The update highlights advances in LLM routing, cost-efficiency, and model performance optimization across multiple models rather than single-model or MoE-level improvements. Additionally, the AI Twitter recap notes the Gemma 2 model family as a top open model, the Block Transformer architecture for improved inference throughput, and a proposal for a fully Software 2.0 computer vision system by karpathy.

Jun 13, 2024

Hybrid SSM/Transformers > Pure SSMs/Pure Transformers

mamba-2-hybrid gpt-4 qwen-72b table-llava-7b nvidia lamini-ai sakana-ai luma-labs mixture-of-experts benchmarking fine-tuning multimodality text-to-video model-performance memory-optimization preference-optimization video-understanding multimodal-tables bryan-catanzaro bindureddy ylecun ctnzr corbtt realsharonzhou andrew-n-carr karpathy _akhaliq omarsar0

NVIDIA's Bryan Catanzaro highlights a new paper on Mamba models, showing that mixing Mamba and Transformer blocks outperforms either alone, with optimal attention below 20%. Mixture-of-Agents (MoA) architecture improves LLM generation quality, scoring 65.1% on AlpacaEval 2.0 versus GPT-4 Omni's 57.5%. The LiveBench AI benchmark evaluates reasoning, coding, writing, and data analysis. A hybrid Mamba-2-Hybrid model with 7% attention surpasses a Transformer on MMLU accuracy, jumping from 50% to 53.6%. GPT-4 performs better at temperature=1. Qwen 72B leads open-source models on LiveBench AI. LaminiAI Memory Tuning achieves 95% accuracy on a SQL agent task, improving over instruction fine-tuning. Sakana AI Lab uses evolutionary strategies for preference optimization. Luma Labs Dream Machine demonstrates advanced text-to-video generation. The MMWorld benchmark evaluates multimodal video understanding, and Table-LLaVa 7B competes with GPT-4V on multimodal table tasks.

Jun 11, 2024

Francois Chollet launches $1m ARC Prize

gpt-4 chatgpt openai apple togethercompute benchmarking agi pattern-recognition skill-acquisition privacy on-device-ai mixed-precision-quantization mixture-of-experts multimodality agentic-ai francois-chollet karpathy svpino philschmid clementdelangue sama gdb miramurati kevin-weil sarah-friar

François Chollet critiques current paths to AGI, emphasizing the importance of benchmarks that resist saturation and focus on skill acquisition and open-ended problem solving. The ARC-AGI puzzles exemplify "easy for humans, hard for AI" challenges to measure progress toward AGI. Meanwhile, Apple announces integration of ChatGPT into iOS, iPadOS, and macOS through a partnership with OpenAI, enabling AI-powered features like document summarization and photo analysis with privacy-preserving measures. Discussions highlight Apple's focus on deep AI integration and on-device models optimized with techniques like mixed-precision quantization, though some skepticism remains about their AI capabilities compared to GPT-4. Additionally, Together Compute introduces a Mixture of Agents approach achieving strong performance on AlpacaEval 2.0.

Jun 03, 2024

Mamba-2: State Space Duality

mamba-2 mamba transformer++ llama-3-70b gpt-3 hugging-face state-space-models perplexity training-efficiency data-pruning benchmarking multimodality video-analysis _albertgu tri_dao arankomatsuzaki _akhaliq clementdelangue karpathy

Mamba-2, a new state space model (SSM), outperforms previous models like Mamba and Transformer++ in perplexity and wall-clock time, featuring 8x larger states and 50% faster training. It introduces the concept of state space duality (SSD) connecting SSMs and linear attention. The FineWeb-Edu dataset, a high-quality subset of the 15 trillion token FineWeb dataset, filtered using llama-3-70b for educational quality, enables better and faster LLM learning, potentially reducing tokens needed to surpass GPT-3 performance. Additionally, perplexity-based data pruning using a 125M parameter model improves downstream performance and reduces pretraining steps by up to 1.45x. The Video-MME benchmark evaluates multi-modal LLMs on video analysis across multiple visual domains and video lengths.

May 31, 2024

Contextual Position Encoding (CoPE)

cope gemini-1.5-flash gemini-1.5-pro claude gpt-3 meta-ai-fair google-deepmind anthropic perplexity-ai langchain openai positional-encoding transformers counting copying language-modeling coding external-memory tool-use model-evaluation inference-speed model-benchmarking scaling research-synthesis jason-weston alexandr-wang karpathy arav-srinivas

Meta AI researcher Jason Weston introduced CoPE, a novel positional encoding method for transformers that incorporates context to create learnable gates, enabling improved handling of counting and copying tasks and better performance on language modeling and coding. The approach can potentially be extended with external memory for gate calculation. Google DeepMind released Gemini 1.5 Flash and Pro models optimized for fast inference. Anthropic announced general availability of tool use for Claude, enhancing its ability to orchestrate tools for complex tasks. Alexandr Wang launched SEAL Leaderboards for private, expert evaluations of frontier models. Karpathy reflected on the 4th anniversary of GPT-3, emphasizing scaling and practical improvements. Perplexity AI launched Perplexity Pages to convert research into visually appealing articles, described as an "AI Wikipedia" by Arav Srinivas.

Dec 15, 2023

12/15/2023: Mixtral-Instruct beats Gemini Pro (and matches GPT3.5)

mixtral gemini-pro gpt-3.5 gpt-4.5 gpt-4 chatgpt lmsys openai deepseek cloudflare huggingface performance context-window prompt-engineering privacy local-gpu cloud-gpu code-generation model-comparison model-usage api-errors karpathy

Thanks to a karpathy shoutout, lmsys now has enough data to rank mixtral and gemini pro. The discussion highlights the impressive performance of these state-of-the-art open-source models that can run on laptops. In the openai Discord, users compared AI tools like perplexity and chatgpt's browsing tool, favoring Perplexity for its superior data gathering, pricing, and usage limits. Interest was shown in AI's ability to convert large code files with deepseek coder recommended. Debates on privacy implications for AI advancement and challenges of running LLMs on local and cloud GPUs were prominent. Users reported issues with chatgpt including performance problems, loss of access to custom GPTs, and unauthorized access. Discussions also covered prompt engineering for large context windows and speculations about gpt-4.5 and gpt-4 future developments.