All tags
Person: "svpino"
not much happened today
grok-3 grok-3-mini gpt-4.5 claude-3.7-sonnet quasar-alpha optimus-alpha gpt-4.1 kaleidoscope internvl3 internvit qwen2.5vl transmamba fantasytalking openai alibaba cmu reinforcement-learning reasoning benchmarks vision multilinguality multimodality transformers attention-mechanisms agents code-generation model-performance rasbt sarahookr mervenoyann gneubig svpino mathemagic1an
The AI news recap highlights independent evaluations showing Grok-3 outperforming models like GPT-4.5 and Claude 3.7 Sonnet on reasoning benchmarks, while Grok-3 mini excels in reasoning tasks. Research on reinforcement learning (RL) fine-tuning reveals potential improvements for small reasoning models but also notes instability in reported gains. Benchmark results suggest Quasar Alpha and Optimus Alpha may be versions of GPT-4.1. Vision and multimodal models like Kaleidoscope, supporting 18 languages, and InternVL3, built on InternViT and Qwen2.5VL, demonstrate advances in multilingual vision and reasoning. The fusion model TransMamba combines transformer precision with speed via SSM mechanisms. Alibaba's FantasyTalking generates realistic talking portraits. Agent-focused events at CMU and tools like FilmAgent AI for virtual film production and BrowseComp benchmark for browsing agents were announced. The coding assistant Augment supports multiple IDEs with code analysis and suggestions. Discussions also covered Google’s new agent-to-agent protocol concept.
not much happened to end the year
deepseek-v3 code-llm o1 sonnet-3.5 deepseek smol-ai reinforcement-learning reasoning training-data mixed-precision-training open-source multimodality software-development natural-language-processing interpretability developer-tools real-time-applications search sdk-generation corbtt tom_doerr cognitivecompai alexalbert__ theturingpost svpino bindureddy
Reinforcement Fine-Tuning (RFT) is introduced as a data-efficient method to improve reasoning in LLMs using minimal training data with strategies like First-Correct Solutions (FCS) and Greedily Diverse Solutions (GDS). DeepSeek-V3, a 671B parameter MoE language model trained on 14.8 trillion tokens with FP8 mixed precision training, highlights advances in large-scale models and open-source LLMs. Predictions for AI in 2025 include growth in smaller models, multimodality, and challenges in open-source AI. The impact of AI on software development jobs suggests a need for higher intelligence and specialization as AI automates low-skilled tasks. Enhancements to CodeLLM improve coding assistance with features like in-place editing and streaming responses. Natural Language Reinforcement Learning (NLRL) offers better interpretability and richer feedback for AI planning and critique. AI hiring is growing rapidly with startups seeking strong engineers in ML and systems. New AI-powered tools such as Rivet, Buzee, and Konfig improve real-time applications, search, and SDK generation using technologies like Rust and V8 isolates.
Not much happened today
grok-beta llama-3-1-70b claude-3-5-haiku claude-3-opus llama-3 chatgpt gemini meta-ai-fair scale-ai anthropic perplexity-ai langchainai weights-biases qwen pricing national-security defense open-source agentic-ai retrieval-augmented-generation election-predictions real-time-updates annotation ai-ecosystem memes humor alexandr_wang svpino aravsrinivas bindureddy teortaxestex jessechenglyu junyang-lin cte_junior jerryjliu0
Grok Beta surpasses Llama 3.1 70B in intelligence but is less competitive due to its pricing at $5/1M input tokens and $15/1M output tokens. Defense Llama, developed with Meta AI and Scale AI, targets American national security applications. SWE-Kit, an open-source framework, supports building customizable AI software engineers compatible with Llama 3, ChatGPT, and Claude. LangChainAI and Weights & Biases integrate to improve retrievers and reduce hallucinations in RAG applications using Gemini. Perplexity AI offers enhanced election tracking tools for the 2024 elections, including live state results and support for Claude 3.5 Haiku. AI Talk launched featuring discussions on Chinese AI labs with guests from Qwen. Memes highlight Elon Musk and humorous AI coding mishaps.
OpenAI beats Anthropic to releasing Speculative Decoding
claude-3-sonnet mrt5 openai anthropic nvidia microsoft boston-dynamics meta-ai-fair runway elevenlabs etched osmo physical-intelligence langchain speculative-decoding prompt-lookup cpu-inference multimodality retrieval-augmented-generation neural-networks optimization ai-safety governance model-architecture inference-economics content-generation adcock_brett vikhyatk dair_ai rasbt bindureddy teortaxestex svpino c_valenzuelab davidsholz
Prompt lookup and Speculative Decoding techniques are gaining traction with implementations from Cursor, Fireworks, and teased features from Anthropic. OpenAI has introduced faster response times and file edits with these methods, offering about 50% efficiency improvements. The community is actively exploring AI engineering use cases with these advancements. Recent updates highlight progress from companies like NVIDIA, OpenAI, Anthropic, Microsoft, Boston Dynamics, and Meta. Key technical insights include CPU inference capabilities, multimodal retrieval-augmented generation (RAG), and neural network fundamentals. New AI products include fully AI-generated games and advanced content generation tools. Challenges in AI research labs such as bureaucracy and resource allocation were also discussed, alongside AI safety and governance concerns.
not much happened today
smollm2 llama-3-2 stable-diffusion-3.5 claude-3.5-sonnet gemini openai anthropic google meta-ai-fair suno-ai perplexity-ai on-device-ai model-performance robotics multimodality ai-regulation model-releases natural-language-processing prompt-engineering agentic-ai ai-application model-optimization sam-altman akhaliq arav-srinivas labenz loubnabenallal1 alexalbert fchollet stasbekman svpino rohanpaul_ai hamelhusain
ChatGPT Search was launched by Sam Altman, who called it his favorite feature since ChatGPT's original launch, doubling his usage. Comparisons were made between ChatGPT Search and Perplexity with improvements noted in Perplexity's web navigation. Google introduced a "Grounding" feature in the Gemini API & AI Studio enabling Gemini models to access real-time web information. Despite Gemini's leaderboard performance, developer adoption lags behind OpenAI and Anthropic. SmolLM2, a new small, powerful on-device language model, outperforms Meta's Llama 3.2 1B. A Claude desktop app was released for Mac and Windows. Meta AI announced robotics advancements including Meta Sparsh, Meta Digit 360, and Meta Digit Plexus. Stable Diffusion 3.5 Medium, a 2B parameter model with a permissive license, was released. Insights on AGI development suggest initial inferiority but rapid improvement. Anthropic advocates for early targeted AI regulation. Discussions on ML specialization predict training will concentrate among few companies, while inference becomes commoditized. New AI tools include Suno AI Personas for music creation, PromptQL for natural language querying over data, and Agent S for desktop task automation. Humor was shared about Python environment upgrades.
The AI Search Wars Have Begun — SearchGPT, Gemini Grounding, and more
gpt-4o o1-preview claude-3.5-sonnet universal-2 openai google gemini nyt perplexity-ai glean nvidia langchain langgraph weights-biases cohere weaviate fine-tuning synthetic-data distillation hallucinations benchmarking speech-to-text robotics neural-networks ai-agents sam-altman alexalbert__ _jasonwei svpino drjimfan virattt
ChatGPT launched its search functionality across all platforms using a fine-tuned version of GPT-4o with synthetic data generation and distillation from o1-preview. This feature includes a Chrome extension promoted by Sam Altman but has issues with hallucinations. The launch coincides with Gemini introducing Search Grounding after delays. Notably, The New York Times is not a partner due to a lawsuit against OpenAI. The AI search competition intensifies with consumer and B2B players like Perplexity and Glean. Additionally, Claude 3.5 Sonnet achieved a new benchmark record on SWE-bench Verified, and a new hallucination evaluation benchmark, SimpleQA, was introduced. Other highlights include the Universal-2 speech-to-text model with 660M parameters and HOVER, a neural whole-body controller for humanoid robots trained in NVIDIA Isaac simulation. AI hedge fund teams using LangChain and LangGraph were also showcased. The news is sponsored by the RAG++ course featuring experts from Weights & Biases, Cohere, and Weaviate.
not much happened today
llama-3.1-nemotron-70b golden-gate-claude embed-3 liquid-ai anthropic cohere openai meta-ai-fair nvidia perplexity-ai langchain kestra ostrisai llamaindex feature-steering social-bias multimodality model-optimization workflow-orchestration inference-speed event-driven-workflows knowledge-backed-agents economic-impact ai-national-security trust-dynamics sam-altman lmarena_ai aravsrinivas svpino richardmcngo ajeya_cotra tamaybes danhendrycks jerryjliu0
Liquid AI held a launch event introducing new foundation models. Anthropic shared follow-up research on social bias and feature steering with their "Golden Gate Claude" feature. Cohere released multimodal Embed 3 embeddings models following Aya Expanse. There was misinformation about GPT-5/Orion debunked by Sam Altman. Meta AI FAIR announced Open Materials 2024 with new models and datasets for inorganic materials discovery using the EquiformerV2 architecture. Anthropic AI demonstrated feature steering to balance social bias and model capabilities. NVIDIA's Llama-3.1-Nemotron-70B ranked highly on the Arena leaderboard with style control. Perplexity AI expanded to 100M weekly queries with new finance and reasoning modes. LangChain emphasized real application integration with interactive frame interpolation. Kestra highlighted scalable event-driven workflows with open-source YAML-based orchestration. OpenFLUX optimized inference speed by doubling it through guidance LoRA training. Discussions on AI safety included trust dynamics between humans and AI, economic impacts of AI automation, and the White House AI National Security memo addressing cyber and biological risks. LlamaIndex showcased knowledge-backed agents for enhanced AI applications.
not much happened today
claudette llama-3-1 yi-lightning gpt-4o claude-3.5-sonnet answer-ai tencent notebooklm motherduck perplexity dropbox openai meta-ai-fair yi-ai zyphra-ai anthropic langchain openai synthetic-data fine-tuning sql audio-processing on-device-ai dataset-release transformer llm-reasoning ai-safety code-generation ai-pricing ai-job-market fchollet aravsrinivas svpino swyx
Answer.ai launched fastdata, a synthetic data generation library using
claudette
and Tencent's Billion Persona paper. NotebookLM became customizable, and Motherduck introduced notable LLMs in SQL implementations. Perplexity and Dropbox announced competitors to Glean. OpenAI unveiled audio chat completions priced at 24 cents per minute. Meta AI released Llama 3.1, powering Lenovo AI Now's on-device agent. Yi-Lightning model ranked #6 globally, surpassing GPT-4o. Zyphra AI released the large Zyda-2 dataset with 5 trillion tokens. François Chollet clarified transformer architecture as set-processing, not sequence-processing. Research suggests memorization aids LLM reasoning. Anthropic updated its Responsible Scaling Policy for AI safety. Tools like Perplexity Finance, Open Canvas by LangChain, and AlphaCodium code generation tool were highlighted. Approximately $500 million was raised for AI agent startups, with ongoing discussions on AI's job market impact. Combining prompt caching with the Batches API can yield a 95% discount on Claude 3.5 Sonnet tokens. Not much (in AI) happened this weekend
llama-3.1-8b llama-3.2 chatgpt movie-gen openai meta-ai-fair google-deepmind microsoft x-ai spacex harvard nvidia long-context feature-prediction-loss ai-agents privacy text-to-video text-to-image humanoid-robots gpu-deployment media-foundation-models ai-research-labs sam-altman yann-lecun rasbt bindureddy andrej-karpathy soumithchintala svpino adcock_brett rohanpaul_ai
OpenAI introduced an "edit this area" feature for image generation, praised by Sam Altman. Yann LeCun highlighted a NYU paper improving pixel generation with feature prediction loss using pre-trained visual encoders like DINOv2. Long-context LLMs such as llama-3.1-8b and llama-3.2 variants now support up to 131k tokens, offering alternatives to RAG systems. Bindu Reddy announced AI agents capable of building and deploying code from English instructions, signaling AI's replacement of SQL and potential impact on Python. SpaceX's successful Starship rocket catch was celebrated by Andrej Karpathy and others, with Soumith Chintala praising SpaceX's efficient, low-bureaucracy research approach. Privacy concerns arose from Harvard students' AI glasses, I-XRAY, which can reveal personal information. Meta AI FAIR's Movie Gen model advances media foundation models with high-quality text-to-image and video generation, including synced audio. Humanoid robots like Ameca and Azi now engage in expressive conversations using ChatGPT. xAI rapidly deployed 100K Nvidia H100 GPUs in 19 days, with CEO Jensen Huang commending Elon Musk. Leading AI research labs compared include Meta-FAIR, Google DeepMind, and Microsoft Research. Skepticism about LLM intelligence was voiced by Sam Pino, emphasizing limitations in novel problem-solving despite strong memorization.
not much happened today
aria o1-preview o1-mini gemini-1.5-pro gemini-1.5-flash gemini-1.5 claude-3.5-sonnet rhymes-ai openai anthropic google meta-ai-fair oxylabs multimodality mixture-of-experts long-context retrieval-augmented-generation benchmarking software-engineering llm-evaluation prompt-engineering web-scraping python production-applications mervenoyann osanseviero dbrxmosaicai ylecun ofirpress clefourrier omarsar0 rohanpaul_ai svpino finbarrtimbers _philschmid
Rhymes AI released Aria, a new 25.3B parameter multimodal MoE model supporting text, code, image, and video with a 64k token context window and Apache-2.0 license. OpenAI's o1-preview and o1-mini models show consistent improvement over Anthropic and Google Gemini 1.5 Pro/Flash on long context RAG benchmarks up to 128k tokens, while Google Gemini 1.5 models excel at extreme context lengths up to 2 million tokens. Meta AI expanded rollout to 21 countries with new language support but remains unavailable in the EU. The one-year anniversary of SWE-bench benchmark for software engineering tasks was celebrated, alongside the introduction of SWE-bench Multimodal. New AI tools include OxyCopilot by Oxylabs for web scraping, Taipy for Python-based production apps, and Latitude for prompt engineering. Industry insights highlight changing AI funding dynamics and OpenAI's strategic focus on consumer products like ChatGPT. "all recaps done by Claude 3.5 Sonnet, best of 4 runs."
not much happened today
flux-schnell meta-ai-fair anthropic togethercompute hugging-face audio-generation quantization prompt-caching long-term-memory llm-serving-framework hallucination-detection ai-safety ai-governance geoffrey-hinton john-hopfield demis-hassabis rohanpaul_ai svpino hwchase17 shreyar philschmid mmitchell_ai bindureddy
Geoffrey Hinton and John Hopfield won the Nobel Prize in Physics for foundational work on neural networks linking AI and physics. Meta AI introduced a 13B parameter audio generation model as part of Meta Movie Gen for video-synced audio. Anthropic launched the Message Batches API enabling asynchronous processing of up to 10,000 queries at half the cost. Together Compute released Flux Schnell, a free model for 3 months. New techniques like PrefixQuant quantization and Prompt Caching for low-latency inference were highlighted by rohanpaul_ai. LangGraph added long-term memory support for persistent document storage. Hex-LLM framework was introduced for TPU-based low-cost, high-throughput LLM serving from Hugging Face models. Discussions on AI safety emphasized gender equality in science, and concerns about premature AI regulation by media and Hollywood were raised.
The AI Nobel Prize
claude-3.5-sonnet reka-flash got openai anthropic reka-ai zep artificial-neural-networks nobel-prize knowledge-graphs memory-layers real-time-voice-api vision fine-tuning prompt-caching multimodality function-calling ocr open-source single-sign-on software-testing ai-assisted-coding ai-ethics geoff-hinton john-hopfield philschmid alexalbert mervenoyann clementdelangue svpino bindureddy ylecun rohanpaul_ai
Geoff Hinton and John Hopfield won the Nobel Prize in Physics for their work on Artificial Neural Networks. The award citation spans 14 pages highlighting their contributions. Zep released a new community edition of their low-latency memory layer for AI agents, emphasizing knowledge graphs for memory. At OpenAI's DevDay, new features like real-time voice API, vision model fine-tuning, and prompt caching with a 50% discount on reused tokens were introduced. Anthropic's Claude 3.5 Sonnet was recognized as the best model currently. Reka AI Labs updated their Reka Flash model with enhanced multimodal and function calling capabilities. The GOT (Generic OCR Transformer) achieved 98.79% accuracy on OCR benchmarks. Discussions on open-source AI models highlighted their role in fostering competition and decentralization. Software development insights included the importance of Single Sign-On (SSO), thorough testing, and AI-assisted coding workflows. Ethical and societal topics covered critiques of tax policies and the appointment of France's first Minister of AI.
Liquid Foundation Models: A New Transformers alternative + AINews Pod 2
llama-3-2 gemini-1.5-pro-002 gemini-1.5-flash-002 liquid-ai meta-ai-fair google-deepmind openai reinforcement-learning multimodality model-efficiency foundation-models audio-processing model-deployment open-source ylecun svpino
Liquid.ai emerged from stealth with three subquadratic foundation models demonstrating superior efficiency compared to state space models and Apple’s on-device and server models, backed by a $37M seed round. Meta AI announced Llama 3.2 with multimodal vision-enabled models and lightweight text-only variants for mobile. Google DeepMind introduced production-ready Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002 models with improved pricing and rate limits, alongside AlphaChip, an AI-driven chip design system using reinforcement learning for rapid superhuman layouts. OpenAI enhanced ChatGPT Plus and Teams with Advanced Voice Mode featuring Custom Instructions, Memory, and new nature-inspired voices. California Governor vetoed SB-1047 AI regulation bill, celebrated by AI community figures like ylecun and svpino as a win for open-source AI. Google upgraded NotebookLM with audio overviews supporting YouTube and audio files, turning documents into AI-generated podcasts. "Open source in AI is thriving," noted ylecun, highlighting 1 million models on Github and HuggingFace.
not much happened today
llama-3-2 llama-3 molmo meta-ai-fair google-deepmind hugging-face on-device-ai multimodality chip-design retrieval-augmented-generation rag benchmarking reliability ai-regulation free-speech pytorch-optimization demis-hassabis clementdelangue svpino awnihannun osanseviero omarsar0 sarahookr ylecun
Meta released Llama 3.2, including lightweight 1B and 3B models for on-device AI with capabilities like summarization and retrieval-augmented generation. Molmo, a new multimodal model, was introduced with a large dense captioning dataset. Google DeepMind announced AlphaChip, an AI-driven chip design method improving TPU and CPU designs. Hugging Face surpassed 1 million free public models, highlighting the value of smaller specialized models. Discussions covered challenges in scaling RAG applications, the future of on-device AI running ChatGPT-level models, reliability issues in larger LLMs, and new Elo benchmarking accepted at NeurIPS 2024. AI ethics and regulation topics included free speech responsibilities and California's SB-1047 bill potentially affecting open-source AI. "AlphaChip transformed computer chip design," and "ChatGPT-level AI on mobile devices predicted within a year."
ChatGPT Advanced Voice Mode
o1-preview qwen-2.5 llama-3 claude-3.5 openai anthropic scale-ai togethercompute kyutai-labs voice-synthesis planning multilingual-datasets retrieval-augmented-generation open-source speech-assistants enterprise-ai price-cuts benchmarking model-performance sam-altman omarsar0 bindureddy rohanpaul_ai _philschmid alexandr_wang svpino ylecun _akhaliq
OpenAI rolled out ChatGPT Advanced Voice Mode with 5 new voices and improved accent and language support, available widely in the US. Ahead of rumored updates for Llama 3 and Claude 3.5, Gemini Pro saw a significant price cut aligning with the new intelligence frontier pricing. OpenAI's o1-preview model showed promising planning task performance with 52.8% accuracy on Randomized Mystery Blocksworld. Anthropic is rumored to release a new model, generating community excitement. Qwen 2.5 was released with models up to 32B parameters and support for 128K tokens, matching GPT-4 0613 benchmarks. Research highlights include PlanBench evaluation of o1-preview, OpenAI's release of a multilingual MMMLU dataset covering 14 languages, and RAGLAB framework standardizing Retrieval-Augmented Generation research. New AI tools include PDF2Audio for converting PDFs to audio, an open-source AI starter kit for local model deployment, and Moshi, a speech-based AI assistant from Kyutai. Industry updates feature Scale AI nearing $1B ARR with 4x YoY growth and Together Compute's enterprise platform offering faster inference and cost reductions. Insights from Sam Altman's blog post were also shared.
nothing much happened today
o1 chatgpt-4o llama-3-1-405b openai lmsys scale-ai cognition langchain qdrant rohanpaul_ai reinforcement-learning model-merging embedding-models toxicity-detection image-editing dependency-management automated-code-review visual-search benchmarking denny_zhou svpino alexandr_wang cwolferesearch rohanpaul_ai _akhaliq kylebrussell
OpenAI's o1 model faces skepticism about open-source replication due to its extreme restrictions and unique training advances like RL on CoT. ChatGPT-4o shows significant performance improvements across benchmarks. Llama-3.1-405b fp8 and bf16 versions perform similarly with cost benefits for fp8. A new open-source benchmark "Humanity's Last Exam" offers $500K in prizes to challenge LLMs. Model merging benefits from neural network sparsity and linear mode connectivity. Embedding-based toxic prompt detection achieves high accuracy with low compute. InstantDrag enables fast, optimization-free drag-based image editing. LangChain v0.3 releases with improved dependency management. Automated code review tool CodeRabbit adapts to team coding styles. Visual search advances integrate multimodal data for better product search. Experts predict AI will be default software by 2030.
Everybody shipped small things this holiday weekend
gpt-4o-voice gemini claude jamba-1.5 mistral-nemo-minitron-8b xai google anthropic openai cognition ai21-labs nvidia langchain fine-tuning long-context parameter-efficient-fine-tuning latex-rendering real-time-audio virtual-try-on resource-tags low-code ai-agents workspace-organization model-benchmarking dario-amodei scott-wu fchollet svpino
xAI announced the Colossus 100k H100 cluster capable of training an FP8 GPT-4 class model in 4 days. Google introduced Structured Output for Gemini. Anthropic discussed Claude's performance issues possibly due to API prompt modifications. OpenAI enhanced controls for File Search in their Assistants API. Cognition and Anthropic leaders appeared on podcasts. The viral Kwai-Kolors virtual try-on model and the open-source real-time audio conversational model Mini-Omni (similar to gpt-4o-voice) were released. Tutorials on parameter-efficient fine-tuning with LoRA and QLoRA, long-context embedding challenges, and Claude's LaTeX rendering feature were highlighted. AI21 Labs released Jamba 1.5 models with a 256K context window and faster long-context performance. NVIDIA debuted Mistral-Nemo-Minitron-8B on the Open LLM Leaderboard. LangChain introduced resource tags for workspace organization, and a low-code AI app toolkit was shared by svpino. Legal AI agents and financial agent evaluations using LangSmith were also featured.
Gemini Live
gemini-1.5-pro genie falcon-mamba gemini-1.5 llamaindex google anthropic tii supabase perplexity-ai llamaindex openai hugging-face multimodality benchmarking long-context retrieval-augmented-generation open-source model-releases model-integration model-performance software-engineering linear-algebra hugging-face-hub debugging omarsar0 osanseviero dbrxmosaicai alphasignalai perplexity_ai _jasonwei svpino
Google launched Gemini Live on Android for Gemini Advanced subscribers during the Pixel 9 event, featuring integrations with Google Workspace apps and other Google services. The rollout began on 8/12/2024, with iOS support planned. Anthropic released Genie, an AI software engineering system achieving a 57% improvement on SWE-Bench. TII introduced Falcon Mamba, a 7B attention-free open-access model scalable to long sequences. Benchmarking showed that longer context lengths do not always improve Retrieval-Augmented Generation. Supabase launched an AI-powered Postgres service dubbed the "ChatGPT of databases," fully open source. Perplexity AI partnered with Polymarket to integrate real-time probability predictions into search results. A tutorial demonstrated a multimodal recipe recommender using Qdrant, LlamaIndex, and Gemini. An OpenAI engineer shared success tips emphasizing debugging and hard work. The connection between matrices and graphs in linear algebra was highlighted for insights into nonnegative matrices and strongly connected components. Keras 3.5.0 was released with Hugging Face Hub integration for model saving and loading.
Execuhires: Tempting The Wrath of Khan
gemini-1.5-pro gpt-4o claude-3.5 flux-1 llama-3-1-405b character.ai google adept amazon inflection microsoft stability-ai black-forest-labs schelling google-deepmind openai anthropic meta-ai-fair lmsys langchainai execuhire model-benchmarking multilinguality math coding text-to-image agent-ide open-source-models post-training data-driven-performance noam-shazeer mostafa-mostaque david-friedman rob-rombach alexandr-wang svpino rohanpaul_ai
Character.ai's $2.5b execuhire to Google marks a significant leadership move alongside Adept's $429m execuhire to Amazon and Inflection's $650m execuhire to Microsoft. Despite strong user growth and content momentum, Character.ai's CEO Noam Shazeer returns to Google, signaling shifting vibes in the AI industry. Google DeepMind's Gemini 1.5 Pro tops Chatbot Arena benchmarks, outperforming GPT-4o and Claude-3.5, excelling in multilingual, math, and coding tasks. The launch of Black Forest Labs' FLUX.1 text-to-image model and LangGraph Studio agent IDE highlight ongoing innovation. Llama 3.1 405B is released as the largest open-source model, fostering developer use and competition with closed models. The industry is focusing increasingly on post-training and data as key competitive factors, raising questions about acquisition practices and regulatory scrutiny.
Rombach et al: FLUX.1 [pro|dev|schnell], $31m seed for Black Forest Labs
gemma-2-2b gpt-3.5-turbo-0613 mixtral-8x7b flux-1 stability-ai google-deepmind nvidia text-to-image text-to-video model-benchmarking open-weight-models model-distillation safety-classifiers sparse-autoencoders ai-coding-tools rohanpaul_ai fchollet bindureddy clementdelangue ylecun svpino
Stability AI co-founder Rombach launched FLUX.1, a new text-to-image model with three variants: pro (API only), dev (open-weight, non-commercial), and schnell (Apache 2.0). FLUX.1 outperforms Midjourney and Ideogram based on Black Forest Labs' ELO score and plans to expand into text-to-video. Google DeepMind released Gemma-2 2B, a 2 billion parameter open-source model that outperforms larger models like GPT-3.5-Turbo-0613 and Mixtral-8x7b on Chatbot Arena, optimized with NVIDIA TensorRT-LLM. The release includes safety classifiers (ShieldGemma) and sparse autoencoder analysis (Gemma Scope). Discussions highlight benchmarking discrepancies and US government support for open-weight AI models. Critiques of AI coding tools' productivity gains were also noted.
Problems with MMLU-Pro
mmlu-pro llama-3-8b-q8 gpt4all-3.0 chatgpt claude llama gemini mobilellm runway-gen-3-alpha meta-3d-gen huggingface meta-ai-fair salesforce runway nomic-ai pineapple argil-ai benchmarking prompt-engineering model-evaluation model-performance multimodality automated-dataset-generation video-generation open-source-models ai-assistants text-to-3d deepfake transformers reasoning wenhu-chen danhendrycks clementine ylecun adcock_brett svpino rohanpaul_ai
MMLU-Pro is gaining attention as the successor to MMLU on the Open LLM Leaderboard V2 by HuggingFace, despite community concerns about evaluation discrepancies and prompt sensitivity affecting model performance, notably a 10-point improvement in Llama-3-8b-q8 with simple prompt tweaks. Meta's MobileLLM research explores running sub-billion parameter LLMs on smartphones using shared weights and deeper architectures. Salesforce's APIGen introduces an automated dataset generation system for function-calling tasks outperforming larger models. Runway Gen-3 Alpha launches an AI video generator for paid users creating realistic 10-second clips. Nomic AI's GPT4All 3.0 offers an open-source desktop app supporting thousands of local models. AI assistants with multimodal capabilities and affordable access to multiple LLMs like ChatGPT, Claude, Llama, and Gemini are emerging. Meta 3D Gen advances text-to-3D asset generation, while Argil AI enables deepfake video creation from text threads. Research on transformer grokking and reasoning highlights advances in robust reasoning capabilities.
GraphRAG: The Marriage of Knowledge Graphs and RAG
gemma-2 llama-3-70b claude-3.5-sonnet nemotron-340b qwen2-72b llama-3 microsoft-research anthropic nvidia hugging-face retrieval-augmented-generation knowledge-graphs token-usage inference-time attention-mechanisms instruction-following coding math long-range-reasoning synthetic-data dataset-release fine-tuning context-windows function-calling travis-fischer rasbt alexandr-wang osanseviero rohanpaul_ai hamelhusain svpino aaaazzam omarsar0
Microsoft Research open sourced GraphRAG, a retrieval augmented generation (RAG) technique that extracts knowledge graphs from sources and clusters them for improved LLM answers, though it increases token usage and inference time. Gemma 2 models were released focusing on efficient small LLMs with innovations like sliding window attention and RMS norm, nearly matching the larger Llama 3 70B. Anthropic's Claude 3.5 Sonnet leads in instruction following and coding benchmarks, while Nvidia's Nemotron 340B model was released in June. Qwen2-72B tops the HuggingFace Open LLM leaderboard excelling in math and long-range reasoning. Discussions on RAG highlighted its limitations and improvements in context usage via function calls. A persona-driven synthetic data generation approach introduced 1 billion personas, with a fine-tuned model matching GPT-4 performance on math benchmarks at 7B scale. The 200GB AutoMathText dataset was also noted for math data synthesis.
Is this... OpenQ*?
deepseek-coder-v2 llama-3-8b nemotron-4-340b stable-diffusion-3-medium deepseek_ai anthropic runwayml openai apple nvidia stability-ai luma-labs reward-tampering test-time-search mathematical-reasoning process-supervision fine-tuning on-device-ai video-generation cost-efficiency context-length coding image-understanding multimodality adcock_brett clementdelangue svpino
DeepSeekCoder V2 promises GPT4T-beating performance at a fraction of the cost. Anthropic released new research on reward tampering. Runway launched their Sora response and Gen-3 Alpha video generation model. A series of papers explore "test-time" search techniques improving mathematical reasoning with models like LLaMa-3 8B. Apple announced Apple Intelligence with smarter Siri and image/document understanding, partnered with OpenAI to integrate ChatGPT into iOS 18, and released 20 new CoreML models with LoRA fine-tuning for specialization. NVIDIA released Nemotron-4 340B, an open model matching GPT-4 performance. DeepSeek-Coder-V2 excels in coding and math with 338 programming languages and 128K context length. Stability AI released Stable Diffusion 3 Medium weights. Luma Labs launched Dream Machine for 5-second video generation from text and images.
Francois Chollet launches $1m ARC Prize
gpt-4 chatgpt openai apple togethercompute benchmarking agi pattern-recognition skill-acquisition privacy on-device-ai mixed-precision-quantization mixture-of-experts multimodality agentic-ai francois-chollet karpathy svpino philschmid clementdelangue sama gdb miramurati kevin-weil sarah-friar
François Chollet critiques current paths to AGI, emphasizing the importance of benchmarks that resist saturation and focus on skill acquisition and open-ended problem solving. The ARC-AGI puzzles exemplify "easy for humans, hard for AI" challenges to measure progress toward AGI. Meanwhile, Apple announces integration of ChatGPT into iOS, iPadOS, and macOS through a partnership with OpenAI, enabling AI-powered features like document summarization and photo analysis with privacy-preserving measures. Discussions highlight Apple's focus on deep AI integration and on-device models optimized with techniques like mixed-precision quantization, though some skepticism remains about their AI capabilities compared to GPT-4. Additionally, Together Compute introduces a Mixture of Agents approach achieving strong performance on AlpacaEval 2.0.
Somebody give Andrej some H100s already
gpt-2 openai fineweb meta-ai-fair nvidia tesla cuda fine-tuning training-time gpu-acceleration convolutional-neural-networks real-time-processing ai-safety ai-regulation andrej-karpathy yann-lecun elon-musk francois-chollet svpino mervenoyann
OpenAI's GPT-2 sparked controversy five years ago for being "too dangerous to release." Now, with FineWeb and llm.c, a tiny GPT-2 model can be trained in 90 minutes for $20 using 8xA100 GPUs, with the full 1.6B model estimated to take 1 week and $2.5k. The project is notable for its heavy use of CUDA (75.8%) aiming to simplify the training stack. Meanwhile, a Twitter debate between Yann LeCun and Elon Musk highlighted the importance of convolutional neural networks (CNNs) in real-time image processing for autonomous driving, with LeCun emphasizing scientific research's role in technological progress. LeCun also criticized AI doomsday scenarios, arguing for cautious optimism about AI safety and regulation.
Cursor reaches >1000 tok/s finetuning Llama3-70b for fast file editing
gpt-4 gpt-4o gpt-4-turbo gpt-4o-mini llama bloom stable-diffusion cursor openai anthropic google-deepmind huggingface speculative-decoding code-edits multimodality image-generation streaming tool-use fine-tuning benchmarking mmlu model-performance evaluation synthetic-data context-windows sama abacaj imjaredz erhartford alexalbert svpino maximelabonne _philschmid
Cursor, an AI-native IDE, announced a speculative edits algorithm for code editing that surpasses GPT-4 and GPT-4o in accuracy and latency, achieving speeds of over 1000 tokens/s on a 70b model. OpenAI released GPT-4o with multimodal capabilities including audio, vision, and text, noted to be 2x faster and 50% cheaper than GPT-4 turbo, though with mixed coding performance. Anthropic introduced streaming, forced tool use, and vision features for developers. Google DeepMind unveiled Imagen Video and Gemini 1.5 Flash, a small model with a 1M-context window. HuggingFace is distributing $10M in free GPUs for open-source AI models like Llama, BLOOM, and Stable Diffusion. Evaluation insights highlight challenges with LLMs on novel problems and benchmark saturation, with new benchmarks like MMLU-Pro showing significant drops in top model performance.
Quis promptum ipso promptiet?
llama-3-70b llama-3-120b llama-3 llama-cpp anthropic openai zoominfo neuralink prompt-engineering chain-of-thought rag quantization cuda-graphs gpu-optimization thought-controlled-devices modeling-consciousness conference sama gdb bindureddy svpino rohanpaul_ai alexalbert__ abacaj
Anthropic released upgrades to their Workbench Console, introducing new prompt engineering features like chain-of-thought reasoning and prompt generators that significantly reduce development time, exemplified by their customer Zoominfo. OpenAI teased a "magic" new development coming soon, speculated to be a new LLM replacing GPT-3.5 in the free tier or a search competitor. The open-source community highlighted Llama 3 70B as "game changing" with new quantized weights for Llama 3 120B and CUDA graph support for llama.cpp improving GPU performance. Neuralink demonstrated a thought-controlled mouse, sparking interest in modeling consciousness from brain signals. The ICLR 2024 conference is being held in Asia for the first time, generating excitement.
Kolmogorov-Arnold Networks: MLP killers or just spicy MLPs?
gpt-5 gpt-4 dall-e-3 openai microsoft learnable-activations mlp function-approximation interpretability inductive-bias-injection b-splines model-rearrangement parameter-efficiency ai-generated-image-detection metadata-standards large-model-training max-tegmark ziming-liu bindureddy nptacek zacharynado rohanpaul_ai svpino
Ziming Liu, a grad student of Max Tegmark, published a paper on Kolmogorov-Arnold Networks (KANs), claiming they outperform MLPs in interpretability, inductive bias injection, function approximation accuracy, and scaling, despite being 10x slower to train but 100x more parameter efficient. KANs use learnable activation functions modeled by B-splines on edges rather than fixed activations on nodes. However, it was later shown that KANs can be mathematically rearranged back into MLPs with similar parameter counts, sparking debate on their interpretability and novelty. Meanwhile, on AI Twitter, there is speculation about a potential GPT-5 release with mixed impressions, OpenAI's adoption of the C2PA metadata standard for detecting AI-generated images with high accuracy for DALL-E 3, and Microsoft training a large 500B parameter model called MAI-1, potentially previewed at Build conference, signaling increased competition with OpenAI. "OpenAI's safety testing for GPT-4.5 couldn't finish in time for Google I/O launch" was also noted.
Mixtral 8x22B Instruct sparks efficiency memes
mixtral-8x22b llama-2-7b olmo-7b mistral-ai hugging-face google microsoft intel softbank nvidia multilinguality math code-generation context-window model-performance model-release retrieval-augmented-generation deepfake ai-investment ai-chip hybrid-architecture training-data guillaume-lample osanseviero _philschmid svpino
Mistral released an instruct-tuned version of their Mixtral 8x22B model, notable for using only 39B active parameters during inference, outperforming larger models and supporting 5 languages with 64k context window and math/code capabilities. The model is available on Hugging Face under an Apache 2.0 license for local use. Google plans to invest over $100 billion in AI, with other giants like Microsoft, Intel, and SoftBank also making large investments. The UK criminalized non-consensual deepfake porn, raising enforcement debates. A former Nvidia employee claims Nvidia's AI chip lead is unmatchable this decade. AI companions could become a $1 billion market. AI has surpassed humans on several basic tasks but lags on complex ones. Zyphra introduced Zamba, a novel 7B parameter hybrid model outperforming LLaMA-2 7B and OLMo-7B with less training data, trained on 128 H100 GPUs over 30 days. GroundX API advances retrieval-augmented generation accuracy.