not much happened today
minimax-m2.1 glm-4.7 gemini-3-pro claude-3-sonnet vl-jepa minimax-ai vllm-project exolabs mlx apple openai open-source mixture-of-experts local-inference quantization inference-quality multimodality non-autoregressive-models video-processing reinforcement-learning self-play agentic-rl parallel-computing model-deployment ylecun awnihannun alexocheema edwardsun0909 johannes_hage
MiniMax M2.1 launches as an open-source agent and coding Mixture-of-Experts (MoE) model with ~10B active / ~230B total parameters, claiming to outperform Gemini 3 Pro and Claude Sonnet 4.5; it supports local inference, including on Apple Silicon (M3 Ultra) with quantization. GLM 4.7 demonstrates local scaling on Mac Studios with 2× 512GB M3 Ultra hardware, highlighting system-level challenges like bandwidth and parallelism. Inference quality is emphasized as a key factor behind output variance across deployments. Yann LeCun's VL-JEPA proposes a non-generative, non-autoregressive multimodal model operating in latent space for efficient real-time video processing with fewer parameters and decoding operations. Advances in agentic reinforcement learning for coding include self-play methods in which agents inject and fix bugs autonomously, enabling self-improvement without human labeling, plus large-scale RL infrastructure built on massive parallel code generation and execution sandboxes.
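The active/total parameter split above is the core of the MoE idea: a router activates only a few experts per token, so per-token compute scales with the active fraction rather than the full model. Below is a minimal, illustrative top-k routing sketch with toy dimensions (the hidden size, expert count, and router are made up for clarity, not MiniMax's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d, E, k = 8, 16, 2                        # hidden size, experts, experts per token
W_gate = rng.standard_normal((d, E))      # router weights
experts = rng.standard_normal((E, d, d))  # one toy weight matrix per expert

def moe_forward(x):
    logits = x @ W_gate                   # route: score every expert
    top = np.argsort(logits)[-k:]         # keep the k highest-scoring experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                  # softmax over the selected experts
    # only k of E expert matrices are ever touched for this token
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

x = rng.standard_normal(d)
y = moe_forward(x)
print(y.shape)                                # (8,)
print(f"active expert fraction: {k/E:.2%}")   # 12.50%
```

With 2 of 16 experts active, only 12.5% of expert parameters run per token; the reported ~10B/~230B split for M2.1 is an analogous ratio at full scale.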
Gemini 3.0 Flash Preview: 1/4 cost of Pro, but ~as smart, retakes Pareto Frontier
gemini-3-flash gemini-3 gpt-5.2 gemini-3-pro google google-deepmind tool-calling multimodality benchmarking reasoning cost-efficiency model-performance context-window agentic-ai model-deployment sundar_pichai jeffdean demishassabis
Google launched Gemini 3 Flash, a pro-grade reasoning model with Flash-class latency, supporting tool calling and multimodal I/O, available via multiple platforms including Google AI Studio and Vertex AI. It offers competitive pricing at $0.50 per 1M input tokens and $3.00 per 1M output tokens, with context windows up to 1M tokens. Benchmarks show Gemini 3 Flash rivals or outperforms larger models like GPT-5.2 and Gemini 3 Pro on agentic, coding, and reasoning tasks, validated on ARC-AGI-2, SWE-bench, and LMArena. Despite tradeoffs such as higher token use and hallucination rates, it remains cost-effective overall. Sundar Pichai, Jeff Dean, and Demis Hassabis publicly celebrated the launch. The model's tool-calling capabilities were demonstrated live with 100 tools.
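At the quoted rates, per-request cost is simple arithmetic. A back-of-envelope sketch (the example token counts are invented for illustration):

```python
# Quoted Gemini 3 Flash rates: $0.50 / 1M input tokens, $3.00 / 1M output tokens.
INPUT_PER_M, OUTPUT_PER_M = 0.50, 3.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the quoted per-million-token rates."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# e.g. a 100k-token prompt with a 5k-token answer:
print(f"${request_cost(100_000, 5_000):.4f}")  # $0.0650
```

Note that output tokens cost 6x input tokens here, so reasoning-heavy responses dominate the bill even when prompts are long.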
not much happened today
gpt-5.2 opus-4.5 gemini-3-pro gpt-5.1 olmo-3.1-32b qwen3-vl-235b openai allen_ai mistral-ai ollama lmstudio thinkymachines reinforcement-learning model-benchmarking long-context model-quantization model-optimization inference-speed sparsity fine-tuning vision sama scaling01 akhaliq artificialanlys lechmazur acerfur epochairesearch
GPT-5.2 shows mixed performance in public evaluations, excelling in agentic tasks but at significantly higher cost (~$620/run) than Opus 4.5 and GPT-5.1. It performs variably on reasoning and coding benchmarks, with some improvements on long-context tasks, and extended "reasoning effort" settings notably change results. Aggregators rank Gemini 3 Pro above GPT-5.2 in task persistence. OpenAI released sparse-activation models, sparking debate over sparsity vs. MoE architectures. Allen AI's Olmo 3.1 (32B) advances open reinforcement-learning scale with substantial compute investment (~125k H100 hours). Mistral's Devstral-2 and llama.cpp improve local inference infrastructure with new features like GGUF support and distributed speedups. The Tinker platform goes GA with vision input and fine-tuning support for Qwen3-VL-235B.
not much happened today
nomos-1 axiomprover devstral-2-small deepseek-v3.2 claude-code cursor-2.2 claude-opus-4.5 gpt-5 claude-sonnet-4.5 gemini-3-pro llama qwen mistral gemma nousresearch thinkymachines mistral-ai deepseek anthropic cursor microsoft langchain-ai openai gemini intel vllm_project danielhanchen math formal-reasoning agentic-systems asynchronous-execution multi-agent-systems observability benchmarking quantization post-training-quantization training-speedup kernel-optimization inference-efficiency
NousResearch's Nomos 1 is a 30B open math model achieving a top Putnam score with only ~3B active parameters, enabling consumer Mac inference. AxiomProver also posts top Putnam results using ThinkyMachines' RL stack. Mistral's Devstral 2 Small outperforms DeepSeek v3.2 in 71% of preferences with better speed and cost. Anthropic's Claude Code introduces asynchronous agent execution. Cursor 2.2 adds deep agent primitives like Debug and Plan Modes. VS Code launches unified agent chat sessions improving multi-agent workflows. LangChain releases "Polly" for agent observability. The Stirrup harness leads OpenAI GDPval benchmarks with Claude Opus 4.5, GPT-5, and Gemini 3 Pro following. Advances in quantization include vLLM integrating Intel's AutoRound PTQ for efficient serving. Unsloth achieves up to 3× training speedups with new kernels across Llama, Qwen, Mistral, and Gemma models. "Compositional reasoning + specialized post-training under constrained active params can rival frontier closed models on formal math."
not much happened today
vllm-0.12.0 gemma3n qwen3-omni qwen3-vl gpt-5.1-codex-max gemini-3-pro runway-gen-4.5 kling-video-2.6 vllm nvidia huggingface langchain-ai together-ai meta-ai-fair sonarsource openrouter runway gemini arena gpu-programming quantization multimodality agent-platforms reinforcement-learning static-analysis reasoning inference-infrastructure model-optimization economics audio video-generation jeremyphoward mervenoyann sydneyrunkle swyx maximelabonne
vLLM 0.12.0 introduces DeepSeek support, GPU Model Runner V2, and quantization improvements with PyTorch 2.9.0 and CUDA 12.9. NVIDIA launches CUDA Tile IR and cuTile Python for advanced GPU tensor operations targeting Blackwell GPUs. Hugging Face releases Transformers v5 RC with an any-to-any multimodal pipeline supporting models like Gemma3n and Qwen3-Omni. Agent platforms see updates from LangChain with content moderation and cost tracking, Together AI and Meta AI collaborate on RL for long-horizon workflows, and SonarSource integrates static analysis into AI codegen. Economic insights from OpenRouter highlight coding as a key AI application, with reasoning models surpassing 50% usage and market bifurcation between premium and open models. Additionally, Kling Video 2.6 debuts native audio capabilities, and Runway Gen-4.5, Qwen3-TTS, and Gemini 3 Pro advance multimodality.
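The post-training quantization mentioned for vLLM and Intel's AutoRound follows a general recipe: pick a scale, round weights onto a low-bit integer grid, and dequantize for use at inference. A minimal symmetric int8 sketch of that idea (illustrative of plain round-to-nearest PTQ only, not AutoRound's actual tuning algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)  # toy weight matrix

scale = np.abs(w).max() / 127            # map the largest magnitude onto the int8 range
q = np.round(w / scale).astype(np.int8)  # quantize: 8-bit integers, 4x smaller than fp32
w_hat = q.astype(np.float32) * scale     # dequantize before (or during) the matmul

# round-to-nearest keeps per-weight error within about half a quantization step
print(f"max abs error: {np.abs(w - w_hat).max():.4f}")
```

Real PTQ schemes improve on this baseline by choosing scales per channel or group and by tuning rounding against calibration data, which is where methods like AutoRound differ from the naive version above.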
DeepSeek V3.2 & 3.2-Speciale: GPT5-High Open Weights, Context Management, Plans for Compute Scaling
deepseek-v3.2 deepseek-v3.2-speciale gpt-5-high sonnet-4.5 gemini-3-pro deepseek_ai lm-arena agentic-ai reinforcement-learning large-context-windows model-benchmarking model-performance multi-agent-systems model-training model-deployment suchenzang teortaxestex
DeepSeek launched the DeepSeek V3.2 family, including Standard, Thinking, and Speciale variants, with up to a 131K context window and competitive benchmarks against GPT-5-High, Sonnet 4.5, and Gemini 3 Pro. The release features a novel Large Scale Agentic Task Synthesis Pipeline focused on agentic behaviors, plus improvements to reinforcement-learning post-training algorithms. The models are available on platforms like LM Arena, with pricing around $0.28/$0.42 per million tokens. Community feedback is mixed: the frontier reasoning capabilities are praised, the chat UI experience critiqued. Susan Zhang and Teortaxes provided commentary on the release.
Black Forest Labs FLUX.2 [pro|flex|dev|klein]: near-Nano Banana quality but Open Weights
flux-2 flux-2-dev claude-opus-4.5 gpt-5.1 gemini-3-pro black-forest-labs anthropic huggingface multi-reference-support variational-autoencoder image-generation open-weights agentic-coding token-efficiency benchmarking prompting model-performance
Black Forest Labs' FLUX.2 release features Multi-Reference Support for up to 10 images with consistency and output up to 4 megapixels, and ships in four form factors: Pro, Flex, Dev (a 32B open-weight model), and Klein (open weights TBA). The new FLUX.2 VAE is a variational autoencoder that balances learnability, quality, and compression. Meanwhile, Anthropic's Claude Opus 4.5 demonstrates strong performance and efficiency, scoring 70 on Artificial Analysis, tying GPT-5.1 high and trailing Gemini 3 Pro (73). Opus 4.5 excels in agentic coding benchmarks and research evaluations, with notable token efficiency and reduced running costs. "Opus 4.5 leads Gemini 3 Pro on SWE-Bench Verified and tops the AICodeKing leaderboard," and it shows strong QA and systematic-review capabilities. Anthropic also released a dense prompting guide for Opus 4.5.
Claude Opus 4.5: 3rd new SOTA coding model in past week, 1/3 the price of Opus
claude-opus-4.5 gemini-3-pro gpt-5.1-codex-max opus-4.1 sonnet-4.5 anthropic amazon google coding agents tool-use token-efficiency benchmarking api model-pricing model-performance effort-control context-compaction programmatic-tool-calling alexalbert__ btibor91 scaling01 klieret
Anthropic launched Claude Opus 4.5, a new flagship model excelling in coding, agents, and tooling, with a 3x price cut versus Opus 4.1 and improved token efficiency (76% fewer output tokens). Opus 4.5 set a new SOTA on SWE-bench Verified at 80.9% accuracy, the first model past the 80% barrier, surpassing Gemini 3 Pro and GPT-5.1-Codex-Max; it also performs strongly on ARC-AGI-2 and BrowseComp-Plus. The update adds advanced API features such as effort control, context compaction, and programmatic tool calling, improving tool accuracy and reducing token usage. Claude Code is now bundled with Claude Desktop, and new integrations like Claude for Chrome and Excel are rolling out.
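The two efficiency claims above compound multiplicatively. A quick sanity check of the combined effect on output-token spend (assuming, for illustration, that both figures apply to the same workload):

```python
# Quoted Opus 4.5 changes vs Opus 4.1: 3x lower price, 76% fewer output tokens.
price_factor = 1 / 3        # 3x price cut
token_factor = 1 - 0.76     # 76% fewer output tokens emitted

# the factors multiply: pay a third as much per token, on a quarter of the tokens
print(f"output spend vs Opus 4.1: {price_factor * token_factor:.1%}")  # 8.0%
```

Under that assumption, output-token spend for an equivalent task drops to roughly 8% of what it was, which is why the token-efficiency claim matters as much as the headline price cut.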
Nano Banana Pro (Gemini Image Pro) solves text-in-images, infographic generation, 2-4k resolution, and Google Search grounding
gemini-3-pro gpt-5 google openai hugging-face togethercompute lmsys image-generation text-rendering model-provenance scientific-research proof-assistance multimodal-integration api-access fine-tuning jeffdean kevinweil demishassabis
Google launched Gemini 3 Pro Image (Nano Banana Pro), a next-generation AI image generation and editing model with integrated Google Search grounding, multi-image composition, and fine-grained visual controls, offering pricing at $0.134 per 2K image and $0.24 per 4K image. It features improved text rendering with error rates dropping from 56% to 8% compared to its predecessor, and includes SynthID watermark checks for provenance. The model is available via Gemini App, API, LM Arena, Hugging Face Spaces, Together AI, and Flow. Meanwhile, OpenAI shared early experiments with GPT-5 accelerating scientific research, including proofs of previously unsolved problems in math, physics, biology, and materials science. "GPT-5 accelerated research tasks in math/physics/biology/materials; in 4, it helped find proofs of previously unsolved problems."
OpenAI fires back: GPT-5.1-Codex-Max (API) and GPT 5.1 Pro (ChatGPT)
gpt-5.1-codex-max gpt-5.1-codex gemini-3-pro claude-3.5-sonnet openai google anthropic langchain-ai coding autonomous-systems benchmarking model-scaling multi-agent-systems model-performance reasoning model-architecture sama
OpenAI released GPT-5.1-Codex-Max, featuring compaction-native training, an "Extra High" reasoning mode, and claims of over 24-hour autonomous operation, showing significant performance gains on benchmarks like METR, CTF, and PaperBench. Google's Gemini 3 Pro demonstrates strong coding and reasoning capabilities, achieving new state-of-the-art results on SWE-bench Verified and WeirdML, with estimated model size between 5-10 trillion parameters. The AI coding agent ecosystem is rapidly evolving with integrations and tooling improvements from multiple companies. Sam Altman highlighted the significant improvements in GPT-5.1-Codex-Max. The news also covers educational offerings like ChatGPT for Teachers and multi-agent workflows involving Gemini 3, GPT-5.1-Codex-Max, and Claude Sonnet 4.5.
Gemini 3 Pro — new GDM frontier model, Gemini 3 Deep Think, and Antigravity IDE
gemini-3-pro gemini-2.5 grok-4.1 sonnet-4.5 gpt-5.1 google google-deepmind multimodality agentic-ai benchmarking context-window model-performance instruction-following model-pricing api model-release reasoning model-evaluation sundarpichai _philschmid oriol_vinyals
Google launched Gemini 3 Pro, a state-of-the-art model with a 1M-token context window, multimodal reasoning, and strong agentic capabilities, priced significantly higher than Gemini 2.5. It leads major benchmarks, surpassing Grok 4.1 and competing closely with Sonnet 4.5 and GPT-5.1, though GPT-5.1 excels in ultralong summarization. Independent evaluations from Artificial Analysis, Vending Bench, ARC-AGI 2, Box, and PelicanBench validate Gemini 3 as a frontier LLM. Google also introduced Antigravity, an agentic IDE powered by Gemini 3 Pro and other models, featuring task orchestration and human-in-the-loop validation. The launch marks Google's strong return to AI with more models expected soon. "Google is very, very back in the business."