Person: "sundarpichai"
not much happened today
gemini-3.1-flash-lite gemini-3 gpt-5.3 gpt-5.4 qwen google-deepmind google openai alibaba multimodality latency throughput context-window model-pricing model-benchmarking model-performance conversational-ai hallucination-reduction api model-rollout leadership-exit jeffdean noamshazeer sundarpichai aidan_mclau justinlin610
Google DeepMind launched Gemini 3.1 Flash-Lite, emphasizing dynamic thinking levels for adjustable compute, with notable metrics like $0.25/M input, $1.50/M output, 1432 Elo on LMArena, and 2.5× faster time-to-first-token than Gemini 2.5 Flash. It supports a 1M context window and high throughput for multimodal inputs including text, images, video, audio, and PDFs. OpenAI rolled out GPT-5.3 Instant to all ChatGPT users, improving conversational naturalness and reducing hallucinations by 26.8% with search. The upcoming GPT-5.4 was teased amid speculation. Alibaba's Qwen faces leadership exits, raising concerns about its future and open-source status. The news highlights advancements in model efficiency, pricing, and multimodality, alongside organizational changes impacting AI development.
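The quoted per-million-token rates translate directly into per-request costs. A minimal sketch, assuming only the $0.25/M input and $1.50/M output figures above (the function name and example token counts are illustrative):

```python
# Hypothetical cost calculator for per-million-token pricing.
# Rates are the Gemini 3.1 Flash-Lite figures quoted above.
INPUT_RATE_PER_M = 0.25   # USD per 1M input tokens
OUTPUT_RATE_PER_M = 1.50  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single request at the rates above."""
    return (input_tokens * INPUT_RATE_PER_M
            + output_tokens * OUTPUT_RATE_PER_M) / 1_000_000

# Example: a 100k-token prompt with a 2k-token response.
print(round(request_cost(100_000, 2_000), 4))  # → 0.028
```

At these rates even long-context calls stay under a few cents, which is the point of the Flash-Lite tier.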
Nano Banana 2 aka Gemini 3.1 Flash Image Preview: the new SOTA Imagegen model
gemini-3.1-flash gpt-5.2 gpt-5.3-codex opus-4.6 claude google google-deepmind microsoft anthropic perplexity-ai image-generation text-rendering 3d-imaging real-time-information agentic-ai persistent-memory multi-agent-systems tooling coding-agents task-delegation sundarpichai demishassabis mustafasuleyman yusuf_i_mehdi borisdayma aravsrinivas
Google and DeepMind launched Nano Banana 2 (aka Gemini 3.1 Flash Image Preview), a leading image generation and editing model integrated across multiple Google products with features like 4K upscaling, multi-subject consistency, and real-time search-conditioned generation. Evaluations rank it #1 in text-to-image tasks with competitive pricing. Additionally, advances in agentic coding are noted with models like GPT-5.2, GPT-5.3 Codex, Opus 4.6, and Gemini 3.1, alongside Microsoft's Copilot Tasks introducing task delegation. Persistent memory features are rolling out in Claude models, though interoperability challenges remain.
Gemini 3.1 Pro: 2x 3.0 on ARC-AGI 2
gemini-3.1-pro gemini-3-deep-think google google-deepmind geminiapp reasoning benchmarking agentic-ai cost-efficiency hallucination code-generation model-release developer-tools sundarpichai demishassabis jeffdean koraykv noamshazeer joshwoodward artificialanlys arena oriolvinyalsml scaling01
Google released Gemini 3.1 Pro, a developer preview integrated across the Gemini app, NotebookLM, Gemini API / AI Studio, and Vertex AI, highlighting a significant reasoning improvement with ARC-AGI-2 = 77.1% and strong coding and agentic-tool benchmarks like SWE-Bench Verified = 80.6%. Independent evaluators such as Artificial Analysis and Arena confirmed top-tier performance and cost efficiency, though community reactions included excitement about practical gains, skepticism about benchmark targeting, and concerns over rollout inconsistencies. The release emphasizes the same core intelligence powering Gemini 3 Deep Think scaled for practical use, with notable mentions from leaders like @sundarpichai, @demishassabis, and @JeffDean.
new Gemini 3 Deep Think, Anthropic $30B @ $380B, GPT-5.3-Codex Spark, MiniMax M2.5
gemini-3-deep-think-v2 arc-agi-2 google-deepmind google geminiapp arcprize benchmarking reasoning test-time-adaptation fluid-intelligence scientific-computing engineering-workflows 3d-modeling cost-analysis demishassabis sundarpichai fchollet jeffdean oriolvinyalsml tulseedoshi
Google DeepMind is rolling out the upgraded Gemini 3 Deep Think V2 reasoning mode to Google AI Ultra subscribers and opening early access to the Vertex AI / Gemini API for select users. Key benchmark achievements include ARC-AGI-2 at 84.6%, Humanity’s Last Exam (HLE) at 48.4% without tools, and a Codeforces Elo of 3455, showcasing Olympiad-level performance in physics and chemistry. The mode emphasizes practical scientific and engineering applications such as error detection in math papers, physical system modeling, semiconductor optimization, and a sketch-to-CAD/STL pipeline for 3D printing. ARC benchmark creator François Chollet highlights the benchmark's role in advancing test-time adaptation and fluid intelligence, projecting human-AI parity around 2030. This rollout is framed as a productized, compute-heavy test-time mode rather than a lab demo, with cost disclosures for ARC tasks provided.
ElevenLabs $500m Series D at $11B, Cerebras $1B Series H at $23B, Vibe Coding -> Agentic Engineering
gemini-3 claude codex google openai github microsoft deepmind agent-frameworks model-deployment benchmarking cost-optimization software-development async-processing gpu-acceleration coding-agents user-adoption game-theory workflow-integration sama sundarpichai reach_vb
Google's Gemini 3 is being integrated widely, including a new Chrome side panel and Nano Banana UX features, with rapid adoption and a 78% reduction in unit serving costs. The Gemini app reached 750M+ MAU in Q4 2025, nearing ChatGPT's user base. Google is also benchmarking AI "soft skills" through games like Poker and Chess in the Kaggle Game Arena. Meanwhile, coding agents are converging in IDEs: VS Code launched Agent Sessions supporting Claude and Codex agents with features like parallel subagents and integrated browsers. GitHub Copilot now allows agent choice between Claude and OpenAI Codex for async backlog clearing. OpenAI reports 1M+ active users for Codex with expanded integration surfaces, though some users request better GPU support. The coding-agent ecosystem is professionalizing with community platforms like OpenClaw and tooling such as ClawHub and CLI updates. "Gemini 3 adoption faster than any other model" and "VS Code as home for coding agents" highlight major industry shifts.
xAI Grok Imagine API - the #1 Video Model, Best Pricing and Latency - and merging with SpaceX
genie-3 nano-banana-pro gemini lingbot-world grok-imagine runway-gen-4.5 hunyuan-3d-3.1-pro google-deepmind x-ai runway fal interactive-simulation real-time-generation promptability character-customization world-models open-source video-generation audio-generation animation-workflows model-as-a-service 3d-generation latency coherence demishassabis sundarpichai
Google DeepMind launched Project Genie (Genie 3 + Nano Banana Pro + Gemini), a prototype for creating interactive, real-time generated worlds from text or image prompts, currently available to Google AI Ultra subscribers in the U.S. (18+) with noted limitations like ~60s generation limits and imperfect physics. In parallel, the open-source LingBot-World offers a real-time interactive world model with <1s latency at 16 FPS and minute-level coherence, emphasizing interactivity and causal consistency. In video generation, xAI Grok Imagine debuted strongly with native audio support, 15s duration, and competitive pricing at $4.20/min including audio, while Runway Gen-4.5 focuses on animation workflows with new features like Motion Sketch and Character Swap. The 3D generation space sees fal adding Hunyuan 3D 3.1 Pro/Rapid to its API offerings, extending model-as-a-service workflows into 3D pipelines.
not much happened today
claude-3-7-sonnet gpt-4-1 gemini-3 qwen3-vl-embedding qwen3-vl-reranker glm-4-7 falcon-h1r-7b jamba2 stanford google google-deepmind alibaba z-ai tii ai21-labs huggingface copyright-extraction multimodality multilinguality retrieval-augmented-generation model-architecture mixture-of-experts model-quantization reasoning inference kernel-engineering memory-optimization enterprise-ai sundarpichai justinlin610
Stanford paper reveals Claude 3.7 Sonnet memorized 95.8% of Harry Potter 1, highlighting copyright extraction risks compared to GPT-4.1. Google AI Studio sponsors TailwindCSS amid OSS funding debates. Google and Sundar Pichai launch Gmail Gemini 3 features including AI Overviews and natural-language search with user controls. Alibaba Qwen releases Qwen3-VL-Embedding and Qwen3-VL-Reranker, a multimodal, multilingual retrieval stack supporting text, images, and video with quantization and instruction customization, achieving strong benchmark results. Z.ai goes public on HKEX with GLM-4.7 leading the Artificial Analysis Intelligence Index v4.0, showing gains in reasoning, coding, and agentic use, with large-scale MoE architecture and MIT license. Falcon-H1R-7B from TII targets efficient reasoning in smaller models, scoring 16 on the Intelligence Index. AI21 Labs introduces Jamba2, a memory-efficient enterprise model with hybrid SSM-Transformer architecture and Apache 2.0 license, available via SaaS and Hugging Face. vLLM shows throughput improvements in inference and kernel engineering. "Embeddings should be multimodal by default," notes Justin Lin.
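A retrieval stack like Qwen3-VL-Embedding/Reranker pairs an embedding model (broad candidate recall) with a reranker (precision ordering of the top-k). The recall half reduces to nearest-neighbor search over vectors. A minimal sketch using toy 4-dimensional vectors in place of real model outputs (all document names and values here are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-d vectors standing in for embeddings of text/image/video documents.
corpus = {
    "doc_text":  [0.9, 0.1, 0.0, 0.1],
    "doc_image": [0.1, 0.8, 0.2, 0.0],
    "doc_video": [0.0, 0.2, 0.9, 0.1],
}
query = [0.85, 0.15, 0.05, 0.1]  # embedding of the user's query

# Rank the corpus by similarity; a reranker would then reorder the top-k.
ranked = sorted(corpus, key=lambda k: cosine(query, corpus[k]), reverse=True)
print(ranked[0])  # → doc_text
```

The multimodal claim is that text, image, and video embeddings land in the same vector space, so this one similarity function serves cross-modal retrieval unchanged.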
Gemini 3 Pro — new GDM frontier model, Gemini 3 Deep Think, and Antigravity IDE
gemini-3-pro gemini-2.5 grok-4.1 sonnet-4.5 gpt-5.1 google google-deepmind multimodality agentic-ai benchmarking context-window model-performance instruction-following model-pricing api model-release reasoning model-evaluation sundarpichai _philschmid oriol_vinyals
Google launched Gemini 3 Pro, a state-of-the-art model with a 1M-token context window, multimodal reasoning, and strong agentic capabilities, priced significantly higher than Gemini 2.5. It leads major benchmarks, surpassing Grok 4.1 and competing closely with Sonnet 4.5 and GPT-5.1, though GPT-5.1 excels in ultralong summarization. Independent evaluations from Artificial Analysis, Vending Bench, ARC-AGI 2, Box, and PelicanBench validate Gemini 3 as a frontier LLM. Google also introduced Antigravity, an agentic IDE powered by Gemini 3 Pro and other models, featuring task orchestration and human-in-the-loop validation. The launch marks Google's strong return to AI with more models expected soon. "Google is very, very back in the business."
Kimi K2 Thinking: 1T-A32B params, SOTA HLE, BrowseComp, TauBench && Soumith leaves Pytorch
kimi-k2-thinking gemini moonshot-ai google apple vllm_project arena baseten yupp_ai mixture-of-experts quantization int4 context-window agentic-ai benchmarking model-deployment inference-acceleration api performance-optimization eliebakouch nrehiew_ andrew_n_carr ofirpress artificialanlys sundarpichai akhaliq
Moonshot AI launched Kimi K2 Thinking, a 1-trillion-parameter mixture-of-experts (MoE) model with 32 billion active parameters, a 256K context window, and native INT4 quantization-aware training. It achieves state-of-the-art results on benchmarks like HLE (44.9%), BrowseComp (60.2%), and agentic tool use with 200-300 sequential tool calls. The model is deployed with vLLM support and OpenAI-compatible APIs, available on platforms like Arena, Baseten, and Yupp. Early user reports note some API instability under launch load. Meanwhile, Google announced the TPU v7 (Ironwood) with a 10× peak performance improvement over TPU v5p, aimed at training and agentic inference for models like Gemini. Apple added support for M5 Neural Accelerators in llama.cpp for inference acceleration.
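Because such deployments expose an OpenAI-compatible chat-completions endpoint, existing clients work unchanged. A minimal standard-library sketch — the base URL and model name are placeholders for a local vLLM server, not confirmed values:

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat-completions request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(base_url: str, model: str, prompt: str, api_key: str = "EMPTY") -> str:
    """POST to an OpenAI-compatible server (e.g. a vLLM deployment)."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage against a hypothetical local server (placeholder host and model id):
# print(chat("http://localhost:8000", "moonshotai/Kimi-K2-Thinking", "Hello"))
```

Keeping the wire format OpenAI-compatible is what lets hosts like Baseten and Yupp onboard the model without client-side changes.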
not much happened today
trillium gemini-2.5-pro gemini-deepthink google huawei epoch-ai deutsche-telekom nvidia anthropic reka-ai weaviate deepmind energy-efficiency datacenters mcp context-engineering instruction-following embedding-models math-reasoning benchmarking code-execution sundarpichai yuchenj_uw teortaxestex epochairesearch scaling01 _avichawla rekaailabs anthropicai douwekiela omarsar0 nityeshaga goodside iscienceluvr lmthang
Google's Project Suncatcher prototypes scalable ML compute systems in orbit using solar energy with Trillium-generation TPUs surviving radiation, aiming for prototype satellites by 2027. China's 50% electricity subsidies for datacenters may offset chip efficiency gaps, with Huawei planning gigawatt-scale SuperPoDs for DeepSeek by 2027. Epoch launched an open data center tracking hub, and Deutsche Telekom and NVIDIA announced a $1.1B Munich facility with 10k GPUs. In agent stacks, MCP (Model Context Protocol) tools gain traction with implementations like LitServe, Claude Desktop, and Reka's MCP server for VS Code. Anthropic emphasizes efficient code execution with MCP. Context engineering shifts focus from prompt writing to model input prioritization, with reports and tools from Weaviate, Anthropic, and practitioners highlighting instruction-following rerankers and embedding approaches. DeepMind's IMO-Bench math reasoning suite shows Gemini DeepThink achieving high scores, with a ProofAutoGrader correlating strongly with human grading. Benchmarks and governance updates include new tasks and eval sharing in lighteval.
Claude Haiku 4.5
claude-3.5-sonnet claude-3-haiku claude-3-haiku-4.5 gpt-5 gpt-4.1 gemma-2.5 gemma o3 anthropic google yale artificial-analysis shanghai-ai-lab model-performance fine-tuning reasoning agent-evaluation memory-optimization model-efficiency open-models cost-efficiency foundation-models agentic-workflows swyx sundarpichai osanseviero clementdelangue deredleritt3r azizishekoofeh vikhyatk mirrokni pdrmnvd akhaliq sayashk gne
Anthropic released Claude Haiku 4.5, a model that is over 2x faster and 3x cheaper than Claude Sonnet 4.5, improving iteration speed and user experience significantly. Pricing comparisons highlight Haiku 4.5's competitive cost against models like GPT-5 and GLM-4.6. Google and Yale introduced the open-weight Cell2Sentence-Scale 27B (Gemma) model, which generated a novel, experimentally validated cancer hypothesis, with open-sourced weights for community use. Early evaluations show GPT-5 and o3 models outperform GPT-4.1 in agentic reasoning tasks, balancing cost and performance. Agent evaluation challenges and memory-based learning advances were also discussed, with contributions from Shanghai AI Lab and others. "Haiku 4.5 materially improves iteration speed and UX," and "Cell2Sentence-Scale yielded validated cancer hypothesis" were key highlights.
not much happened today
kling-2.5-turbo sora-2 gemini-2.5-flash granite-4.0 qwen-3 qwen-image-2509 qwen3-vl-235b openai google ibm alibaba kling_ai synthesia ollama huggingface arena artificialanalysis tinker scaling01 video-generation instruction-following physics-simulation image-generation model-architecture mixture-of-experts context-windows token-efficiency fine-tuning lora cpu-training model-benchmarking api workflow-automation artificialanlys kling_ai altryne teortaxestex fofrai tim_dettmers sundarpichai officiallogank andrew_n_carr googleaidevs clementdelangue wzhao_nlp alibaba_qwen scaling01 ollama
Kling 2.5 Turbo leads in text-to-video and image-to-video generation with competitive pricing. OpenAI Sora 2 shows strong instruction-following but has physics inconsistencies. Google Gemini 2.5 Flash "Nano Banana" image generation is now generally available with multi-image blending and flexible aspect ratios. IBM Granite 4.0 introduces a hybrid Mamba/Transformer architecture with large context windows and strong token efficiency, outperforming some peers on the Intelligence Index. Qwen models receive updates including fine-tuning API support and improved vision capabilities. Tinker offers a flexible fine-tuning API supporting LoRA sharing and CPU-only training loops. The ecosystem also sees updates like Synthesia 3.0 adding video agents.
nano-banana is Gemini‑2.5‑Flash‑Image, beating Flux Kontext by 170 Elo with SOTA Consistency, Editing, and Multi-Image Fusion
gemini-2.5-flash-image-preview hermes-4 nemotron-nano-9b-v2 internvl3.5 gpt-oss qwen3 deepseek-v3.1 google-deepmind nous-research nvidia openai ollama huggingface openrouter image-editing natural-language-processing multi-image-composition character-consistency reasoning hybrid-models context-windows model-steerability pretraining finetuning alignment vision vision-language api model-integration sundarpichai _philschmid lmarena_ai omarsar0 skirano yupp_ai xanderatallah officiallogank mervenoyann
Google DeepMind revealed Gemini-2.5-Flash-Image-Preview, a state-of-the-art image editing model excelling in character consistency, natural-language edits, and multi-image composition, dominating the Image Edit Arena with a ~170-180 Elo lead and over 2.5M votes. It is integrated into multiple platforms including Google AI Studio and third-party services. Nous Research released Hermes 4, an open-weight hybrid reasoning model focused on steerability and STEM benchmarks. NVIDIA launched Nemotron Nano 9B V2, a hybrid Mamba-Transformer with 128k context, top-performing under 10B parameters, and released a 6.6T-token pretraining subset. InternVL3.5 introduced 32 vision-language models based on OpenAI's gpt-oss and Qwen3 backbones. Ollama v0.11.7 added DeepSeek v3.1 support with hybrid thinking and Turbo mode preview.
Genesis: Generative Physics Engine for Robotics (o1-mini version)
o1 o1-preview gpt-4o claude-3.5-sonnet gemini-2.0-pro llama-3-3b llama-3-70b openai google-deepmind meta-ai-fair hugging-face function-calling structured-outputs vision performance-benchmarks sdk webrtc reasoning math code-generation transformer-architecture model-training humanoid-robots search model-efficiency dataset-sharing aidan_mclau sundarpichai adcock_brett
OpenAI launched the o1 model API featuring function calling, structured outputs, vision support, and developer messages, achieving 60% fewer reasoning tokens than its preview. The model excels in math and code with a 0.76 LiveBench Coding score, outperforming Sonnet 3.5. Beta SDKs for Go and Java were also released, along with WebRTC support and 60% lower prices for real-time audio. Google Gemini 2.0 Pro (Gemini Exp 1206) deployment accelerated, showing improved coding, math, and reasoning performance. Meta AI FAIR introduced research on training transformers directly on raw bytes using dynamic entropy-based patching. Commercial humanoid robots were successfully deployed by an industry player. Hugging Face researchers demonstrated that their 3B Llama model can outperform the 70B Llama model on MATH-500 accuracy using search techniques, highlighting efficiency gains with smaller models. Concerns about reproducibility and domain-specific limitations were noted.
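Structured outputs constrain the model to a caller-supplied JSON Schema. A sketch of the request-body shape, following OpenAI's published `response_format` convention — the extraction schema itself is an illustrative example, not from the announcement:

```python
import json

def structured_output_request(model: str, prompt: str,
                              schema_name: str, schema: dict) -> dict:
    """Build a chat-completions body forcing schema-conforming JSON output."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": schema_name, "strict": True,
                            "schema": schema},
        },
    }

# Illustrative schema: extract an event name and date from free text.
event_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "date": {"type": "string"}},
    "required": ["name", "date"],
    "additionalProperties": False,
}

body = structured_output_request("o1", "Extract the event.", "event", event_schema)
print(json.dumps(body["response_format"]["type"]))  # → "json_schema"
```

With `strict: True` the server guarantees the reply parses against the schema, so the calling code can drop its defensive JSON-repair logic.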
OpenAI Sora Turbo and Sora.com
sora-turbo o1 claude-3.5-sonnet claude-3.5 gemini llama-3-3-euryale-v2.3 mistral-large behemoth endurance-v1.1 openai google nvidia hugging-face mistral-ai text-to-video-generation quantum-computing coding-capabilities transformers algorithmic-innovation storytelling roleplay model-parameter-tuning anti-monopoly-investigation sama sundarpichai bindureddy denny_zhou nrehiew_
OpenAI launched Sora Turbo, enabling text-to-video generation for ChatGPT Plus and Pro users with monthly generation limits and regional restrictions in Europe and the UK. Google announced a quantum computing breakthrough with the development of the Willow chip, potentially enabling commercial quantum applications. Discussions on o1 model performance highlighted its lag behind Claude 3.5 Sonnet and Gemini in coding tasks, with calls for algorithmic innovation beyond transformer scaling. The Llama 3.3 Euryale v2.3 model was praised for storytelling and roleplay capabilities, with users suggesting parameter tuning to reduce creative liberties and repetition. Alternatives like Mistral-Large, Behemoth, and Endurance v1.1 were also noted. Additionally, Nvidia faces an anti-monopoly investigation in China. Memes and humor around GPU issues and embargo mishaps were popular on social media.