Topic: "model-quantization"
Mary Meeker is so back: BOND Capital AI Trends report
qwen-3-8b anthropic hugging-face deepseek attention-mechanisms inference arithmetic-intensity transformers model-optimization interpretability model-quantization training tri_dao fleetwood___ teortaxestex awnihannun lateinteraction neelnanda5 eliebakouch _akhaliq
Mary Meeker returns with a comprehensive 340-slide report on the state of AI, highlighting accelerating tech cycles, compute growth, and comparisons of ChatGPT's adoption to early Google and other iconic tech products; the report also covers enterprise traction and the valuations of major AI companies. On Twitter, @tri_dao discusses an "ideal" inference architecture featuring attention variants like GTA, GLA, and DeepSeek MLA with high arithmetic intensity (~256 FLOPs per byte), improving both efficiency and model quality. Other highlights include the release of a 4-bit DWQ of the DeepSeek-R1-distilled Qwen3 8B on Hugging Face, Anthropic's open-source interpretability tools for LLMs, and discussions of transformer training and abstractions by various researchers.
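The ~256 arithmetic-intensity figure can be sanity-checked with back-of-envelope decode math. The sketch below uses simplified formulas of our own (about 4·head_dim FLOPs per query head per cached position, bf16 KV cache), so it illustrates the trend rather than reproducing the thread's exact accounting:

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte of KV cache read) for a
# single attention decode step. The formulas are simplified for illustration.

def decode_arithmetic_intensity(n_q_heads, n_kv_heads, head_dim, kv_bytes=2):
    """Rough FLOPs-per-byte per cached position during decode.

    Per cached position we count ~4 * head_dim FLOPs per query head
    (QK^T dot product plus attention-weighted V accumulation), against
    2 * n_kv_heads * head_dim * kv_bytes bytes read (one K and one V
    vector per KV head, e.g. 2 bytes each for bf16).
    """
    flops = 4 * n_q_heads * head_dim
    bytes_read = 2 * n_kv_heads * head_dim * kv_bytes
    return flops / bytes_read

print("MHA (32 q / 32 kv):", decode_arithmetic_intensity(32, 32, 128))   # ~1
print("GQA (32 q /  8 kv):", decode_arithmetic_intensity(32, 8, 128))    # ~4
print("MQA (32 q /  1 kv):", decode_arithmetic_intensity(32, 1, 128))    # ~32
print("128 q heads / 1 kv:", decode_arithmetic_intensity(128, 1, 128))   # ~128
# If one cached latent serves as both K and V (roughly the MLA idea), the
# bytes read halve and the intensity doubles toward the ~256 in the thread:
print("Shared K/V latent (128 q):", 4 * 128 * 128 / (1 * 128 * 2))       # ~256
```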
OpenAI adopts MCP
gemini-2.5-pro gemini-1.5-pro gemini-2.0-flash qwen-2.5-omni-7b deepseek-v3-0324 deepseek-r1 openai google-deepmind alibaba togethercompute model-benchmarking multimodality reasoning scaling-laws model-quantization synthetic-data model-performance context-windows speech-recognition translation audio-processing video-processing swyx
OpenAI announced support for MCP (Model Context Protocol), a significant technical update. Google's Gemini 2.5 Pro leads benchmarks with top scores on MMLU-Pro (86%), GPQA Diamond (83%), and AIME 2024 (88%), featuring a 1 million token context window and multimodal inputs. Alibaba's Qwen 2.5 Omni 7B was released as a fully multimodal, interactive, open-source model with a novel "thinker-talker" architecture supporting voice and video chat. DeepSeek V3-0324 outperforms its predecessor on multiple benchmarks. Research on extracting reasoning features from large language models with sparse autoencoders was highlighted, alongside a study on synthetic-data scaling laws showing performance plateauing near 300B tokens. Discussions also covered the Gemini models' leading output speeds and concerns about over-reliance on benchmarks as a measure of intelligence. Swyx will curate the Data Council AI Engineering Track in April.
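For context on the sparse-autoencoder work, here is a minimal sketch of how such feature dictionaries are typically trained: a one-layer autoencoder over captured residual-stream activations with an L1 penalty to keep features sparse. The dimensions and sparsity coefficient below are placeholders, not values from the paper:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over residual-stream activations."""
    def __init__(self, d_model=4096, d_hidden=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input
        return x_hat, f

sae = SparseAutoencoder()
acts = torch.randn(8, 4096)               # stand-in for captured LLM activations
x_hat, f = sae(acts)
l1_coeff = 1e-3                            # placeholder sparsity weight
loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().mean()
loss.backward()
```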
Mini, Nemo, Turbo, Lite - Smol models go brrr (GPT4o version)
gpt-4o-mini mistral-nemo llama-3 llama-3-400b deepseek-v2 openai nvidia mistral-ai togethercompute deepseek-ai lmsys model-quantization context-windows instruction-following model-performance cost-efficiency multimodality benchmarking open-source model-release sam-altman
GPT-4o-mini launches with a 99% price reduction compared to text-davinci-003, coming in at roughly 3.5% of GPT-4o's price while matching Opus-level benchmarks. It supports 16k output tokens, is faster than previous models, and will soon support text, image, video, and audio inputs and outputs. Mistral Nemo, a 12B-parameter model developed with Nvidia, features a 128k token context window, an FP8 checkpoint, and strong benchmark performance. Together Lite and Turbo offer fp8/int4 quantizations of Llama 3 with up to 4x throughput and significantly reduced costs. DeepSeek V2 is now open-sourced. Looking ahead, at least five unreleased models are expected, and Llama 3 400B details leaked ahead of ICML 2024.
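For readers new to the quantization side of these releases, here is a minimal sketch of symmetric per-channel int4 weight quantization with absmax scaling. It is a generic illustration, not Together's actual kernels or recipe, and it keeps int8 storage rather than packing two int4 values per byte:

```python
import torch

def quantize_int4_symmetric(w: torch.Tensor):
    """Symmetric per-output-channel int4 quantization of a weight matrix.

    Each row gets an absmax scale mapping values into [-7, 7]; dequantization
    is q * scale. Real kernels pack two int4 values per byte; int8 storage is
    used here for clarity.
    """
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)
q, scale = quantize_int4_symmetric(w)
w_hat = dequantize(q, scale)
print("mean abs quantization error:", (w_hat - w).abs().mean().item())
```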
Gemini Nano: 50-90% of Gemini Pro, <100ms inference, on device, in Chrome Canary
gemini-nano gemini-pro claude-3.5-sonnet gpt-4o deepseek-coder-v2 glm-0520 nemotron-4-340b gpt-4-turbo-0409 google gemini huggingface anthropic deepseek zhipu-ai tsinghua nvidia model-quantization prompt-api optimization model-weights benchmarking code-generation math synthetic-data automatic-differentiation retrieval-augmented-generation mitigating-memorization tree-search inference-time-algorithms adcock_brett dair_ai lmsysorg
The latest Chrome Canary now includes a feature flag for Gemini Nano, offering a prompt API and an on-device optimization guide; the Nano 1 and Nano 2 models weigh in at 1.8B and 3.25B parameters respectively and show decent performance relative to Gemini Pro. The base and instruct-tuned model weights have been extracted and posted to HuggingFace. In AI model releases, Anthropic launched Claude 3.5 Sonnet, which outperforms GPT-4o on some benchmarks, is twice as fast as Opus, and is free to try. DeepSeek-Coder-V2 achieves 90.2% on HumanEval and 75.7% on MATH, surpassing GPT-4-Turbo-0409, with models up to 236B parameters and 128K context length. GLM-0520 from Zhipu AI/Tsinghua ranks highly in coding and overall benchmarks. NVIDIA announced Nemotron-4 340B, an open model family for synthetic data generation. Research highlights include TextGrad, a framework for automatic differentiation via textual feedback; PlanRAG, an iterative plan-then-RAG decision-making technique; a paper on the goldfish loss for mitigating memorization in LLMs; and a tree search algorithm for language model agents.
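On the goldfish-loss paper: the core idea is to drop a pseudorandom subset of token positions from the next-token loss so exact training sequences cannot be reproduced verbatim. A minimal sketch of that masking step follows; the drop rate, seeding, and mask construction are placeholders rather than the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def goldfish_cross_entropy(logits, targets, drop_k=4, seed=0):
    """Next-token loss that excludes roughly 1-in-drop_k token positions.

    logits: (batch, seq, vocab); targets: (batch, seq).
    A deterministic pseudorandom mask removes ~1/drop_k of positions from the
    loss, so the model never receives a gradient toward memorizing them.
    """
    b, s, v = logits.shape
    g = torch.Generator().manual_seed(seed)
    keep = torch.rand(b, s, generator=g) > 1.0 / drop_k   # True = contributes to loss
    loss = F.cross_entropy(logits.reshape(-1, v), targets.reshape(-1), reduction="none")
    loss = loss.reshape(b, s)
    return (loss * keep).sum() / keep.sum().clamp(min=1)

logits = torch.randn(2, 16, 32000)
targets = torch.randint(0, 32000, (2, 16))
print(goldfish_cross_entropy(logits, targets))
```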
Music's Dall-E moment
griffin command-r-plus gpt-4-0613 gpt-4-0314 mistral-8x22b codegemma stable-diffusion-1.5 command-r gemini-1.5 google mistral-ai lmsys cohere model-architecture benchmarking open-source model-quantization memory-optimization inference-speed multimodality finetuning performance-optimization audio-processing andrej-karpathy
Google's Griffin architecture outperforms transformers with faster inference and lower memory usage on long contexts. Command R+ climbs to 6th place on the LMSYS Chatbot Arena leaderboard, surpassing GPT-4-0613 and GPT-4-0314. Mistral AI releases an open-source 8x22B model with a 64K context window and around 130B total parameters. Google open-sources CodeGemma models with pre-quantized 4-bit versions for faster downloads. ELLA weights augment Stable Diffusion 1.5 with an LLM for better semantic alignment. Unsloth enables 4x larger context windows and 80% memory reduction for finetuning. Andrej Karpathy releases llm.c, an LLM training implementation in pure C, for potential performance gains. Command R+ runs in realtime on an M2 Max MacBook using iMat q1 quantization. Cohere's Command R model offers low API costs and strong leaderboard performance. Gemini 1.5 impresses with audio capabilities, recognizing speech tone and identifying speakers from audio clips.
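For the pre-quantized 4-bit angle, loading a CodeGemma checkpoint in 4-bit via transformers + bitsandbytes looks roughly like this; the model id and dtype choices are assumptions for illustration, and a CUDA GPU with bitsandbytes installed is required:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit loading; the model id below is an assumption for illustration.
model_id = "google/codegemma-7b-it"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```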
World_sim.exe
gpt-4 gpt-4o grok-1 llama-cpp claude-3-opus claude-3 gpt-5 nvidia nous-research stability-ai hugging-face langchain anthropic openai multimodality foundation-models hardware-optimization model-quantization float4 float6 retrieval-augmented-generation text-to-video prompt-engineering long-form-rag gpu-optimization philosophy-of-ai agi-predictions jensen-huang yann-lecun sam-altman
NVIDIA announced Project GR00T, a foundation model for humanoid robot learning from multimodal instructions, built on their tech stack including Isaac Lab, OSMO, and Jetson Thor. They revealed the DGX Grace-Blackwell GB200 with over 1 exaflop of compute, claimed to be capable of training a 1.8T-parameter GPT-4-scale model in 90 days on 2,000 Blackwells. Jensen Huang confirmed GPT-4 has 1.8 trillion parameters. The new GB200 GPU supports FP4/FP6 precision with ~3 bits per parameter and achieves 40,000 TFLOPS on FP4 with 2x sparsity.
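Training-time claims like this follow from standard FLOPs accounting (roughly 6 · active parameters · tokens for dense training). The estimator below is a generic sketch; the token count, active-parameter count, per-GPU throughput, and utilization are our placeholder assumptions, not figures from the keynote, and the result swings widely with those inputs:

```python
def training_days(active_params, tokens, gpus, flops_per_gpu, utilization):
    """Rough training time in days, assuming FLOPs ~= 6 * active_params * tokens."""
    total_flops = 6 * active_params * tokens
    sustained_flops_per_s = gpus * flops_per_gpu * utilization
    return total_flops / sustained_flops_per_s / 86_400

# Placeholder assumptions: ~280B active parameters for a 1.8T MoE, 13T training
# tokens, ~10 PFLOP/s peak per GB200 at FP8, 35% sustained utilization.
print(round(training_days(280e9, 13e12, 2000, 10e15, 0.35)), "days")
```

Reconciling such an estimate with the keynote's 90-day figure depends entirely on the undisclosed token count, sparsity, and utilization assumptions behind it.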
Open source highlights include the release of Grok-1, a 314B-parameter mixture-of-experts model, and Stability AI's SV3D (Stable Video 3D), an open model that generates orbital video views of a 3D object from a single image. Nous Research collaborated on implementing steering vectors in llama.cpp.
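Steering vectors (activation addition) work by adding a fixed direction to a layer's hidden states at inference time. Here is a minimal sketch of the idea using a PyTorch forward hook; the layer index, scale, and contrastive-prompt construction are arbitrary choices for illustration, not the llama.cpp implementation:

```python
import torch

def make_steering_hook(steering_vector: torch.Tensor, scale: float = 4.0):
    """Forward hook that adds `scale * steering_vector` to a layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage sketch, assuming a typical HF Llama-style module layout:
# direction = acts_positive.mean(0) - acts_negative.mean(0)   # from contrastive prompts
# handle = model.model.layers[15].register_forward_hook(make_steering_hook(direction))
# ... generate with the steered model ...
# handle.remove()
```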
In Retrieval-Augmented Generation (RAG) news, a new 5.5-hour tutorial builds a pipeline using open-source Hugging Face models, and LangChain released a video on query routing and announced an integration with NVIDIA NIM for GPU-optimized LLM inference.
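In the spirit of that tutorial, a minimal RAG pipeline with open-source Hugging Face models can be sketched as below; the model choices, corpus, and prompt template are stand-ins, not the tutorial's code:

```python
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# Tiny stand-in corpus; a real pipeline would chunk and index documents.
docs = [
    "LangChain added an integration with NVIDIA NIM for GPU-optimized inference.",
    "Retrieval Augmented Generation grounds an LLM's answer in retrieved passages.",
    "Steering vectors modify hidden activations at inference time.",
]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_emb = embedder.encode(docs, convert_to_tensor=True)
generator = pipeline("text2text-generation", model="google/flan-t5-base")

def answer(question: str, top_k: int = 2) -> str:
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_emb, top_k=top_k)[0]   # retrieve passages
    context = "\n".join(docs[h["corpus_id"]] for h in hits)
    prompt = f"Answer using the context.\nContext:\n{context}\nQuestion: {question}"
    return generator(prompt, max_new_tokens=64)[0]["generated_text"]

print(answer("What does RAG do?"))
```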
Prominent opinions include Yann LeCun distinguishing language from other cognitive abilities, Sam Altman predicting AGI within roughly six years with a leap from GPT-4 to GPT-5 comparable to that from GPT-3 to GPT-4, and discussions on the philosophical status of LLMs like Claude. There is also advice that most companies should not train models from scratch.