Person: "andrewyng"

Gemini 2.5 Computer Use preview beats Sonnet 4.5 and OAI CUA

gemini-2.5 gpt-5-pro glm-4.6 codex google-deepmind openai microsoft anthropic zhipu-ai llamaindex mongodb agent-frameworks program-synthesis security multi-agent-systems computer-use-models open-source moe developer-tools workflow-automation api vision reasoning swyx demishassabis philschmid assaf_elovic hwchase17 jerryjliu0 skirano fabianstelzer blackhc andrewyng

Google DeepMind released a new Gemini 2.5 Computer Use model for browser and Android UI control, evaluated by Browserbase. OpenAI showcased GPT-5 Pro, new developer tools including Codex with Slack integration, and agent-building SDKs at Dev Day. Google DeepMind's CodeMender automates security patching for large codebases. Microsoft introduced an open-source Agent Framework for multi-agent enterprise systems. AI community discussions highlight agent orchestration, program synthesis, and UI control advancements. Zhipu's GLM-4.6 update features a large Mixture-of-Experts model with 355B parameters.

Aug 01

Gemini 2.5 Deep Think finally ships

gemini-2.5-deep-think gpt-oss gpt-5 kimi-k2-turbo-preview qwen3-coder-flash glm-4.5 step-3 claude openai anthropic google-deepmind kimi-moonshot alibaba ollama zhipu-ai stepfun parallel-thinking model-releases moe attention-mechanisms multimodal-reasoning model-performance context-windows open-source-models model-leaks creative-ai coding reasoning model-optimization demishassabis philschmid scaling01 teortaxestex teknium1 lmarena_ai andrewyng

OpenAI is rumored to be launching new GPT-OSS and GPT-5 models soon, amid drama over Anthropic revoking access to Claude. Google DeepMind quietly launched Gemini 2.5 Deep Think, a model optimized for parallel thinking that achieved gold-medal level at the IMO and excels in reasoning, coding, and creative tasks. Leaks suggest OpenAI is developing a 120B MoE and a 20B model with advanced attention mechanisms. Chinese AI companies like Kimi Moonshot, Alibaba, and Zhipu AI are releasing faster and more capable open models such as kimi-k2-turbo-preview, Qwen3-Coder-Flash, and GLM-4.5, signaling strong momentum and potential to surpass the U.S. in AI development. One quote, "The final checkpoint was selected just 5 hours before the IMO problems were released," highlights how rapid development cycles have become.

Jun 18

Zuck goes Superintelligence Founder Mode: $100M bonuses + $100M+ salaries + NFDG Buyout?

llama-4 maverick scout minimax-m1 afm-4.5b chatgpt midjourney-v1 meta-ai-fair openai deeplearning-ai essential-ai minimax arcee midjourney long-context multimodality model-release foundation-models dataset-release model-training video-generation enterprise-ai model-architecture moe prompt-optimization sama nat dan ashvaswani clementdelangue amit_sangani andrewyng _akhaliq

Meta AI is reportedly offering 8-9 figure signing bonuses and salaries to top AI talent, confirmed by Sam Altman. They are also targeting key figures like Nat and Dan from the AI Grant fund for strategic hires. Essential AI released the massive 24-trillion-token Essential-Web v1.0 dataset with rich metadata and a 12-category taxonomy. DeepLearning.AI and Meta AI launched a course on Llama 4, featuring new MoE models Maverick (400B) and Scout (109B) with context windows up to 10M tokens. MiniMax open-sourced MiniMax-M1, a long-context LLM with a 1M-token window, and introduced the Hailuo 02 video model. OpenAI rolled out "Record mode" for ChatGPT Pro, Enterprise, and Edu on macOS. Arcee launched the AFM-4.5B foundation model for enterprise. Midjourney released its V1 video model enabling image animation. These developments highlight major advances in model scale, long-context reasoning, multimodality, and enterprise AI applications.

May 29

DeepSeek-R1-0528 - Gemini 2.5 Pro-level model, SOTA Open Weights release

deepseek-r1-0528 gemini-2.5-pro qwen-3-8b qwen-3-235b deepseek-ai anthropic meta-ai-fair nvidia alibaba google-deepmind reinforcement-learning benchmarking model-performance open-weights reasoning quantization post-training model-comparison artificialanlys scaling01 cline reach_vb zizhpan andrewyng teortaxestex teknim1 lateinteraction abacaj cognitivecompai awnihannun

DeepSeek R1-0528 marks a significant upgrade, closing the gap with proprietary models like Gemini 2.5 Pro and surpassing models from Anthropic, Meta, NVIDIA, and Alibaba on benchmarks. This Chinese open-weights model leads in several AI benchmarks, driven by reinforcement learning post-training rather than architecture changes, and uses more reasoning tokens than before (23K tokens per question). The China-US AI race intensifies as Chinese labs accelerate innovation through transparency and an open research culture. Key benchmarks include AIME 2024, LiveCodeBench, and GPQA Diamond.

Feb 06

Gemini 2.0 Flash GA, with new Flash Lite, 2.0 Pro, and Flash Thinking

gemini-2.0-flash gemini-2.0-flash-lite gemini-2.0-pro-experimental gemini-1.5-pro deepseek-r1 gpt-2 llama-3-1 google-deepmind hugging-face anthropic multimodality context-windows cost-efficiency pretraining fine-tuning reinforcement-learning transformer tokenization embeddings mixture-of-experts andrej-karpathy jayalammar maartengr andrewyng nearcyan

Google DeepMind officially launched Gemini 2.0 models including Flash, Flash-Lite, and Pro Experimental, with Gemini 2.0 Flash outperforming Gemini 1.5 Pro while being 12x cheaper and supporting multimodal input and a 1 million token context window. Andrej Karpathy released a 3h31m video deep dive into large language models, covering pretraining, fine-tuning, and reinforcement learning with examples like GPT-2 and Llama 3.1. A free course on Transformer architecture was introduced by Jay Alammar, Maarten Grootendorst, and Andrew Ng, focusing on tokenizers, embeddings, and mixture-of-experts models. DeepSeek-R1 reached 1.2 million downloads on Hugging Face with a detailed 36-page technical report. Anthropic increased rewards to $10K and $20K for its jailbreak challenge, while the BlueRaven extension was updated to hide Twitter metrics for unbiased engagement.

Jan 10

not much happened today

rstar-math o1-preview qwen2.5-plus qwen2.5-coder-32b-instruct phi-4 claude-3.5-sonnet openai anthropic alibaba microsoft cohere langchain weights-biases deepseek rakuten rbc amd johns-hopkins math process-reward-model mcts vision reasoning synthetic-data pretraining rag automation private-deployment multi-step-workflow open-source-dataset text-embeddings image-segmentation chain-of-thought multimodal-reasoning finetuning recursive-self-improvement collaborative-platforms ai-development partnerships cuda triton ai-efficiency ai-assisted-coding reach_vb rasbt akshaykagrawal arankomatsuzaki teortaxestex aidangomez andrewyng

rStar-Math surpasses OpenAI's o1-preview in math reasoning with 90.0% accuracy using a 7B LLM and MCTS with a Process Reward Model. Alibaba launches Qwen Chat featuring the Qwen2.5-Plus and Qwen2.5-Coder-32B-Instruct models, enhancing vision-language and reasoning capabilities. Microsoft releases Phi-4, trained on 40% synthetic data with improved pretraining. Cohere introduces North, a secure AI workspace integrating LLMs, RAG, and automation for private deployments. LangChain showcases a company research agent with multi-step workflows and open-source datasets. Transformers.js demos were released for text embeddings and image segmentation in JavaScript. Research highlights include Meta Meta-CoT for enhanced chain-of-thought reasoning, DeepSeek V3 with recursive self-improvement, and collaborative AI development platforms. Industry partnerships include Rakuten with LangChain, North with RBC supporting 90,000 employees, and Agent Laboratory collaborating with AMD and Johns Hopkins. Technical discussions emphasize CUDA and Triton for AI efficiency, along with Andrew Ng's take on evolving AI-assisted coding stacks.

Nov 15, 2024

Gemini (Experimental-1114) retakes #1 LLM rank with 1344 Elo

claude-3-sonnet gpt-4 gemini-1.5 claude-3.5-sonnet anthropic openai langchain meta-ai-fair benchmarking prompt-engineering rag visuotactile-perception ai-governance theoretical-alignment ethical-alignment jailbreak-robustness model-releases alignment richardmcngo andrewyng philschmid

Anthropic released the 3.5 Sonnet benchmark for jailbreak robustness, emphasizing adaptive defenses. OpenAI enhanced GPT-4 with a new RAG technique for contiguous chunk retrieval. LangChain launched Promptim for prompt optimization. Meta AI introduced NeuralFeels with neural fields for visuotactile perception. Richard Ngo (@RichardMCNgo) resigned from OpenAI, highlighting concerns about AI governance and theoretical alignment. Discussions emphasized the importance of truthful public information and ethical alignment in AI deployment. The latest Gemini update marks a new #1 LLM amid alignment challenges. The AI community continues to focus on benchmarking, prompt engineering, and alignment issues.

Nov 08, 2024

not much happened today

claude-3.5-sonnet opencoder anthropic microsoft sambanova openai langchain llamaindex multi-agent-systems natural-language-interfaces batch-processing harmful-content-detection secret-management retrieval-augmented-generation error-analysis memory-management web-scraping autonomous-agents sophiamyang tom_doerr omarsar0 _akhaliq andrewyng giffmana

This week in AI news, Anthropic launched Claude 3.5 Sonnet, enabling desktop app control via natural language. Microsoft introduced Magentic-One, a multi-agent system built on the AutoGen framework. OpenCoder was unveiled as an AI-powered code cookbook for large language models. SambaNova is sponsoring a hackathon with prizes of up to $5,000 for building real-time AI agents. Sophia Yang (@sophiamyang) announced new Batch and Moderation APIs with 50% lower cost and multi-dimensional harmful text detection. Open-source tools like Infisical for secret management, CrewAI for autonomous agent orchestration, and Crawlee for web scraping were released. Research highlights include SCIPE for error analysis in LLM chains, Context Refinement Agent for improved retrieval-augmented generation, and MemGPT for managing LLM memory. The week also saw a legal win for OpenAI in the RawStory copyright case, affirming that facts used in LLM training are not copyrightable.