All tags
Topic: "model-quantization"
Qwen3.5-397B-A17B: the smallest, most efficient Open-Opus-class model
qwen3.5-397b-a17b qwen3.5-plus qwen3-max qwen3-vl kimi alibaba openai deepseek z-ai minimax unsloth ollama vllm native-multimodality spatial-intelligence sparse-moe long-context model-quantization model-architecture model-deployment inference-optimization apache-2.0-license pete_steinberger justinlin610
Alibaba released Qwen3.5-397B-A17B, an open-weight model featuring native multimodality, spatial intelligence, and a hybrid linear attention + sparse MoE architecture supporting 201 languages and long context windows up to 256K tokens. The model shows improvements over previous versions like Qwen3-Max and Qwen3-VL, with a sparsity ratio of about 4.3%. Community discussions highlighted the Gated Delta Networks enabling efficient inference despite large model size (~800GB BF16), with successful local runs on Apple Silicon using quantization techniques. The hosted API version, Qwen3.5-Plus, supports 1M context and integrates search and code interpreter features. This release follows other Chinese labs like Z.ai, Minimax, and Kimi in refreshing large models. The model is licensed under Apache-2.0 and is expected to be the last major release before DeepSeek v4. The news also notes Pete Steinberger joining OpenAI.
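The sparsity and memory figures above follow directly from the model name: "A17B" reads as ~17B parameters activated per token out of 397B total. A quick sanity check, assuming that reading and 2 bytes per parameter for BF16:

```python
total_params = 397e9   # Qwen3.5-397B-A17B: 397B total parameters
active_params = 17e9   # ~17B activated per token via sparse MoE routing

# Sparsity ratio: fraction of parameters active per forward pass
sparsity_ratio = active_params / total_params
print(f"sparsity ratio: {sparsity_ratio:.1%}")      # ~4.3%, as the summary states

# Full-precision weight footprint at BF16 (2 bytes per parameter)
bf16_bytes = total_params * 2
print(f"BF16 weights: {bf16_bytes / 1e9:.0f} GB")   # ~794 GB, i.e. the ~800GB cited
```

Both quoted numbers check out, which is why aggressive quantization is needed before the model fits on a single Apple Silicon machine.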
MiniMax-M2.5: SOTA coding, search, toolcalls, $1/hour
minimax-m2.5 glm-5 minimax-ai togethercompute huggingface intel wandb reinforcement-learning agent-based-models model-quantization benchmarking model-efficiency multi-turn-dialogue infrastructure-optimization cost-efficiency on-device-ai
MiniMax-M2.5 is now open source, featuring an "agent-native" reinforcement learning framework called Forge trained across 200k+ RL environments for coding, tool use, and workflows. It boasts strong benchmark scores like 80.2% SWE-Bench Verified and emphasizes cost-efficiency with claims like "$1 per hour at 100 tps" and good on-device performance. The Forge RL system uses multi-level prefix caching and high rollout compute share (~60%) to generate millions of trajectories daily. Independent reviews note improved stability and multi-turn viability but high token usage. The ecosystem rapidly adopted MiniMax-M2.5 with quantized releases including 2-bit GGUF and INT4 formats. Meanwhile, Together markets GLM-5 as a leading open-source model for long-horizon agents with 77.8% SWE-Bench Verified and MoE efficiency using DeepSeek Sparse Attention.
not much happened today
claude-3-7-sonnet gpt-4-1 gemini-3 qwen3-vl-embedding qwen3-vl-reranker glm-4-7 falcon-h1r-7b jamba2 stanford google google-deepmind alibaba z-ai tii ai21-labs huggingface copyright-extraction multimodality multilinguality retrieval-augmented-generation model-architecture mixture-of-experts model-quantization reasoning inference kernel-engineering memory-optimization enterprise-ai sundarpichai justinlin610
Stanford paper reveals Claude 3.7 Sonnet memorized 95.8% of Harry Potter 1, highlighting copyright extraction risks compared to GPT-4.1. Google AI Studio sponsors TailwindCSS amid OSS funding debates. Sundar Pichai announces Gemini 3 features in Gmail, including AI Overviews and natural-language search with user controls. Alibaba Qwen releases Qwen3-VL-Embedding and Qwen3-VL-Reranker, a multimodal, multilingual retrieval stack supporting text, images, and video with quantization and instruction customization, achieving strong benchmark results. Z.ai goes public on HKEX with GLM-4.7 leading the Artificial Analysis Intelligence Index v4.0, showing gains in reasoning, coding, and agentic use, with large-scale MoE architecture and MIT license. Falcon-H1R-7B from TII targets efficient reasoning in smaller models, scoring 16 on the Intelligence Index. AI21 Labs introduces Jamba2, a memory-efficient enterprise model with hybrid SSM-Transformer architecture and Apache 2.0 license, available via SaaS and Hugging Face. vLLM shows throughput improvements in inference and kernel engineering. "Embeddings should be multimodal by default," notes Justin Lin.
OpenAI GPT Image-1.5 claims to beat Nano Banana Pro, #1 across all Arenas, but completely fails Vibe Checks
gpt-image-1.5 nano-banana-pro mimo-v2-flash deepseek-v3.2 openai gemini xiaomi lmsys deepseek openrouter image-generation instruction-following benchmarking model-efficiency long-context multi-token-prediction hybrid-attention model-optimization inference-speed agentic-workflows model-architecture model-quantization fuli_luo eliebakouch
OpenAI released its new image model GPT Image 1.5, featuring precise image editing, better instruction following, improved text and markdown rendering, and up to 4× faster generation. Despite topping multiple leaderboards like LMArena (1277), Design Arena (1344), and AA Arena (1272), user feedback across Twitter, Reddit, and Discord communities is largely negative compared to Gemini's Nano Banana Pro. Xiaomi introduced MiMo-V2-Flash, a 309B MoE model optimized for inference efficiency with a 256K context window, achieving state-of-the-art scores on SWE-Bench. The model uses Hybrid Sliding Window Attention and multi-token prediction, offering significant speedups and efficiency improvements. The timing of OpenAI's launch, amid competition from Gemini and Nano Banana Pro, colored user sentiment and highlights how poorly leaderboard wins can track real-world "vibe checks."
not much happened today
gpt-5.2 opus-4.5 gemini-3-pro gpt-5.1 olmo-3.1-32b qwen3-vl-235b openai allen_ai mistral-ai ollama lmstudio thinkymachines reinforcement-learning model-benchmarking long-context model-quantization model-optimization inference-speed sparsity fine-tuning vision sama scaling01 akhaliq artificialanlys lechmazur acerfur epochairesearch
GPT-5.2 shows mixed performance in public evaluations, excelling in agentic tasks but at a significantly higher cost (~$620/run) compared to Opus 4.5 and GPT-5.1. It performs variably on reasoning and coding benchmarks, with some improvements on long-context tasks. Extended "reasoning effort" settings notably impact results. Aggregators rank Gemini 3 Pro above GPT-5.2 in task persistence. OpenAI released sparse-activation models, sparking debate over sparsity vs. MoE architectures. Allen AI's Olmo 3.1 (32B) advances the scale of open reinforcement learning with a substantial compute investment (~125k H100-hours). Mistral's Devstral-2 and new llama.cpp features (GGUF support, distributed speedups) improve local inference infrastructure. The Tinker platform goes GA with vision input and finetuning support for Qwen3-VL-235B.
not much happened today
kling-2.6 kling-o1 runway-gen-4.5 gemini-3 deepseek-v3.2 ministral-3 evoqwen2.5-vl hermes-4.3 intellect-3 openai anthropic google runway elevenlabs freepik openart deepseek mistral-ai alibaba nous-research video-generation audio-processing multimodality image-generation reasoning model-quantization sparse-attention model-pricing multimodal-models retrieval-augmentation model-training model-release
OpenAI's Code Red response and Anthropic's IPO are major highlights. In AI video and imaging, Kling 2.6 introduces native audio co-generation with coherent lip-sync, partnered with platforms like ElevenLabs and OpenArt. Runway Gen-4.5 enhances lighting fidelity, while Google's Gemini 3 Nano Banana Pro supports advanced image compositing. Open model releases include DeepSeek V3.2 with sparse attention and cost-effective pricing, and Mistral's Ministral 3 multimodal family with strong 14B variants. Retrieval and code models from Alibaba's EvoQwen2.5-VL and Nous Research's Hermes 4.3 show competitive performance with permissive licensing and HF availability. The community arena sees additions like INTELLECT-3 (106B MoE). Noted advancements include "coherent looking & sounding output" and "auto-lighting to match scene mood."
not much happened today
fastvlm mobileclip2 grok-code-fast-1 gpt-5 qwen-3-coder-30b-a3b apple hugging-face x-ai openai groq run-llama lmstudio vision model-quantization code-generation cli-workflows retrieval-augmentation embedding-models local-ai multimodality reach_vb xenovacom pcuenq awnihannun cline veggie_eric nickbaumann_ gdb benankdev loganmarkewich tom_doerr fastmcp ggerganov orionweller antoine_chaffin
Apple released three real-time vision-language models (FastVLM, MobileCLIP2) on Hugging Face with significant speed and size improvements, supporting WebGPU and Core ML. Their MLX framework now supports MXFP4 format, competing with NVFP4 for FP4 quantization. xAI launched grok-code-fast-1, outperforming Claude for code edits, while OpenAI integrated GPT-5 into Xcode 26 and released a new Responses API on Groq hardware. CLI-first agent workflows advanced with tools like SemTools, MLX local runner for Apple Silicon, and llama.vim recommending Qwen 3 Coder 30B A3B. Retrieval research highlights limitations of single-vector embeddings, promoting ColBERT-style late interaction.
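The MXFP4-vs-NVFP4 competition mentioned above is over FP4 weight formats. Under the OCP Microscaling (MX) spec, the element format is E2M1 (1 sign, 2 exponent, 1 mantissa bit), so each 4-bit code maps to one of sixteen values; a minimal decoder sketch, assuming the standard E2M1 encoding (this is an illustration, not MLX's actual implementation):

```python
# E2M1 magnitudes: exponent 0 is subnormal (0, 0.5); exponents 1-3 give 1..6
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_fp4_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (top bit = sign) to its float value."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * FP4_E2M1_MAGNITUDES[code & 0b0111]

# e.g. 0b0111 -> +6.0, 0b1001 -> -0.5
```

MXFP4 then shares one power-of-two scale across a block of 32 such elements, which is where the formats differ from NVFP4 (smaller blocks, FP8 scales).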
not much happened today
grok-2 grok-2.5 vibevoice-1.5b motif-2.6b gpt-5 qwen-code xai-org microsoft motif-technology alibaba huggingface langchain-ai mixture-of-experts model-scaling model-architecture text-to-speech fine-tuning training-data optimization reinforcement-learning agentic-ai tool-use model-training model-release api software-development model-quantization elonmusk clementdelangue rasbt quanquangu akhaliq eliebakouch gdb ericmitchellai ivanfioravanti deanwball giffmana omarsar0 corbtt
xAI released open weights for Grok-2 and Grok-2.5 with a novel MoE residual architecture and μP scaling, sparking community excitement and licensing concerns. Microsoft open-sourced VibeVoice-1.5B, a multi-speaker long-form TTS model with streaming support and a 7B variant forthcoming. Motif Technology published a detailed report on Motif-2.6B, highlighting Differential Attention, PolyNorm, and extensive finetuning, trained on AMD MI250 GPUs. In coding tools, momentum builds around GPT-5-backed workflows, with developers favoring it over Claude Code. Alibaba released Qwen-Code v0.0.8 with deep VS Code integration and MCP CLI enhancements. The MCP ecosystem advances with LiveMCP-101 stress tests, the universal MCP server "Rube," and LangGraph Platform's rollout of revision queueing and ART integration for RL training of agents.
Mary Meeker is so back: BOND Capital AI Trends report
qwen-3-8b anthropic hugging-face deepseek attention-mechanisms inference arithmetic-intensity transformers model-optimization interpretability model-quantization training tri_dao fleetwood___ teortaxestex awnihannun lateinteraction neelnanda5 eliebakouch _akhaliq
Mary Meeker returns with a comprehensive 340-slide report on the state of AI, highlighting accelerating tech cycles, compute growth, and comparisons of ChatGPT to early Google and other iconic tech products. The report also covers enterprise traction and valuation of major AI companies. On Twitter, @tri_dao discusses an "ideal" inference architecture featuring attention variants like GTA, GLA, and DeepSeek MLA with high arithmetic intensity (~256), improving efficiency and model quality. Other highlights include the release of 4-bit DWQ of DSR1 Qwen3 8B on Hugging Face, AnthropicAI's open-source interpretability tools for LLMs, and discussions on transformer training and abstractions by various researchers.
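Arithmetic intensity here is the roofline-model ratio of FLOPs to bytes moved. A back-of-envelope sketch (my simplification, not the exact accounting in the thread) shows why attention variants that share KV state across more query heads push decode-time matmuls from memory-bound (intensity ~1) toward the ~256 regime:

```python
def arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an (m,k) @ (k,n) matmul in 16-bit precision."""
    flops = 2 * m * n * k                                   # one multiply-add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A and B, write C
    return flops / bytes_moved

# Single-token decode: every loaded weight is used once -> memory-bound
print(arithmetic_intensity(4096, 4096, n=1))    # ~1.0

# Amortize each weight/KV load over 256 query rows -> compute-bound territory
print(arithmetic_intensity(4096, 4096, n=256))  # ~228, approaching n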
OpenAI adopts MCP
gemini-2.5-pro gemini-1.5-pro gemini-2.0-flash qwen-2.5-omni-7b deepseek-v3-0324 deepseek-r1 openai google-deepmind alibaba togethercompute model-benchmarking multimodality reasoning scaling-laws model-quantization synthetic-data model-performance context-windows speech-recognition translation audio-processing video-processing swyx
OpenAI announced support for MCP (Model Context Protocol), a significant technical update. Google's Gemini 2.5 Pro leads benchmarks with top scores in MMLU-Pro (86%), GPQA Diamond (83%), and AIME 2024 (88%), featuring a 1 million token context window and multimodal inputs. Alibaba's Qwen 2.5 Omni 7B was released as a fully multimodal, interactive, open-source model with a novel "thinker-talker" architecture supporting voice and video chat. DeepSeek V3-0324 outperforms its predecessor on multiple benchmarks. Research on reasoning features in large language models using sparse autoencoders was highlighted, alongside a study on scaling laws of synthetic data showing performance plateaus near 300B tokens. Discussions also covered the fastest output speeds of Gemini models and concerns about over-reliance on benchmarks for intelligence measurement. Swyx will curate the Data Council AI Engineering Track in April.
Mini, Nemo, Turbo, Lite - Smol models go brrr (GPT4o version)
gpt-4o-mini mistral-nemo llama-3 llama-3-400b deepseek-v2 openai nvidia mistral-ai togethercompute deepseek-ai lmsys model-quantization context-windows instruction-following model-performance cost-efficiency multimodality benchmarking open-source model-release sam-altman
GPT-4o-mini launches with a 99% price reduction compared to text-davinci-003, at 3.5% of GPT-4o's price, while matching Opus-level benchmarks. It supports 16k output tokens, is faster than previous models, and will soon support text, image, video, and audio inputs and outputs. Mistral Nemo, a 12B parameter model developed with Nvidia, features a 128k token context window, an FP8 checkpoint, and strong benchmark performance. Together Lite and Turbo offer fp8/int4 quantizations of Llama 3 with up to 4x throughput and significantly reduced costs. DeepSeek V2 is now open-sourced. Upcoming releases include at least 5 unreleased models, with Llama 4 details leaking ahead of ICML 2024.
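Int4 quantization of the kind behind Together's Lite endpoints maps each weight to one of 16 integer levels plus a shared scale. A minimal symmetric per-tensor sketch of the general technique (an illustration, not Together's actual kernels, which use per-group scales and fused dequantization):

```python
def quantize_int4(weights):
    """Symmetric quantization to the int4 range [-8, 7] with one shared scale."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate float weights from int4 codes."""
    return [v * scale for v in q]

q, scale = quantize_int4([0.12, -0.91, 0.44, 0.03])
approx = dequantize_int4(q, scale)  # each value within scale/2 of the original
```

Storage drops from 16 bits to ~4 bits per weight, which is where the throughput and cost wins come from: memory-bandwidth-bound inference moves 4x fewer bytes per token.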
Gemini Nano: 50-90% of Gemini Pro, <100ms inference, on device, in Chrome Canary
gemini-nano gemini-pro claude-3.5-sonnet gpt-4o deepseek-coder-v2 glm-0520 nemotron-4-340b gpt-4-turbo-0409 google gemini huggingface anthropic deepseek zhipu-ai tsinghua nvidia model-quantization prompt-api optimization model-weights benchmarking code-generation math synthetic-data automatic-differentiation retrieval-augmented-generation mitigating-memorization tree-search inference-time-algorithms adcock_brett dair_ai lmsysorg
The latest Chrome Canary now includes a feature flag for Gemini Nano, offering a prompt API and on-device optimization guide, with models Nano 1 and 2 at 1.8B and 3.25B parameters respectively, showing decent performance relative to Gemini Pro. The base and instruct-tuned model weights have been extracted and posted to HuggingFace. In AI model releases, Anthropic launched Claude 3.5 Sonnet, which outperforms GPT-4o on some benchmarks, is twice as fast as Opus, and is free to try. DeepSeek-Coder-V2 achieves 90.2% on HumanEval and 75.7% on MATH, surpassing GPT-4-Turbo-0409, with models up to 236B parameters and 128K context length. GLM-0520 from Zhipu AI/Tsinghua ranks highly in coding and overall benchmarks. NVIDIA announced Nemotron-4 340B, an open model family for synthetic data generation. Research highlights include TextGrad, a framework for automatic differentiation on textual feedback; PlanRAG, an iterative plan-then-RAG decision-making technique; a paper on goldfish loss to mitigate memorization in LLMs; and a tree search algorithm for language model agents.
Music's Dall-E moment
griffin command-r-plus gpt-4-0613 gpt-4-0314 mistral-8x22b codegemma stable-diffusion-1.5 command-r gemini-1.5 google mistral-ai lmsys cohere model-architecture benchmarking open-source model-quantization memory-optimization inference-speed multimodality finetuning performance-optimization audio-processing andrej-karpathy
Google's Griffin architecture outperforms transformers with faster inference and lower memory usage on long contexts. Command R+ climbs to 6th place on the LMSYS Chatbot Arena leaderboard, surpassing GPT-4-0613 and GPT-4-0314. Mistral AI releases an open-source 8x22B model with a 64K context window and around 130B total parameters. Google open-sources CodeGemma models with pre-quantized 4-bit versions for faster downloads. ELLA weights enhance Stable Diffusion 1.5 with an LLM for semantic alignment. Unsloth enables 4x larger context windows and 80% memory reduction for finetuning. Andrej Karpathy releases llm.c, LLM training implemented in pure C, for potential performance gains. Command R+ runs in realtime on an M2 Max MacBook using iMat q1 quantization. Cohere's Command R model offers low API costs and strong leaderboard performance. Gemini 1.5 impresses with audio capabilities, recognizing speech tone and identifying speakers from audio clips.
World_sim.exe
gpt-4 gpt-4o grok-1 llama-cpp claude-3-opus claude-3 gpt-5 nvidia nous-research stability-ai hugging-face langchain anthropic openai multimodality foundation-models hardware-optimization model-quantization float4 float6 retrieval-augmented-generation text-to-video prompt-engineering long-form-rag gpu-optimization philosophy-of-ai agi-predictions jensen-huang yann-lecun sam-altman
NVIDIA announced Project GR00T, a foundation model for humanoid robot learning using multimodal instructions, built on their tech stack including Isaac Lab, OSMO, and Jetson Thor. They revealed the DGX Grace-Blackwell GB200 with over 1 exaflop of compute, capable of training a 1.8T-parameter GPT-4-scale model in 90 days on 2,000 Blackwell GPUs. Jensen Huang confirmed GPT-4 has 1.8 trillion parameters. The new GB200 supports float4/float6 precision at ~3 bits per parameter and achieves 40,000 TFLOPS on FP4 with 2x sparsity.
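The "~3 bits per parameter" figure implies a large memory win for a 1.8T-parameter model. A quick back-of-envelope using only the numbers quoted above:

```python
params = 1.8e12        # GPT-4 parameter count as stated by Jensen Huang
bits_per_param = 3     # effective precision quoted for float4/6 on GB200

bf16_gb = params * 16 / 8 / 1e9              # 16-bit baseline footprint
fp4ish_gb = params * bits_per_param / 8 / 1e9

print(f"BF16: {bf16_gb:.0f} GB, ~3-bit: {fp4ish_gb:.0f} GB")  # 3600 GB vs 675 GB
```

A ~5.3x reduction in weight storage, which is what makes trillion-parameter inference on a single rack-scale system plausible.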
Open source highlights include the release of Grok-1, a 314B parameter model, and Stability AI's SV3D, an open-source model that generates orbital 3D-view videos from a single image. Nous Research collaborated on implementing Steering Vectors in Llama.CPP.
In Retrieval Augmented Generation (RAG), a new 5.5-hour tutorial builds a pipeline using open-source HF models, and LangChain released a video on query routing and announced integration with NVIDIA NIM for GPU-optimized LLM inference.
Prominent opinions include Yann LeCun distinguishing language from other cognitive abilities, Sam Altman predicting AGI arrival in 6 years with a leap from GPT-4 to GPT-5 comparable to GPT-3 to GPT-4, and discussions on the philosophical status of LLMs like Claude. There is also advice against training models from scratch for most companies.