Person: "xikun_zhang_"
not much happened today
gpt-oss-120b gpt-oss-20b kimi-k2 deepseek-r1 qwen-3-32b openai huggingface microsoft llamaindex ollama baseten fireworksai cerebras groq together anthropic google uk-aisi sliding-window-attention mixture-of-experts rope context-length mxfp4-format synthetic-data reasoning-core-hypothesis red-teaming benchmarking coding-benchmarks model-performance fine-tuning woj_zaremba sama huybery drjimfan jxmnop scaling01 arunv30 kevinweil xikun_zhang_ jerryjliu0 ollama basetenco reach_vb gneubig shxf0072 _lewtun
OpenAI released its first open-weight models since GPT-2, gpt-oss-120b and gpt-oss-20b, which quickly trended on Hugging Face. Microsoft supports the models via Azure AI Foundry and Windows Foundry Local. Key architectural choices include sliding window attention, mixture of experts (MoE), a RoPE variant, and a 256k context length. The weights ship in the new MXFP4 quantization format, already supported by llama.cpp. One hypothesis holds that gpt-oss was trained on synthetic data to improve safety and performance, lending support to the Reasoning Core Hypothesis. OpenAI announced a $500K red-teaming challenge with partners including Anthropic, Google, and the UK AISI. Performance critiques point to inconsistent benchmark results: gpt-oss-120b scores 41.8% on the Aider Polyglot coding benchmark, trailing competitors like Kimi-K2 and DeepSeek-R1, and some users find the models strong at math and reasoning but lacking in common sense and practical utility.
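For readers who want to see the sliding-window mechanism concretely, here is a minimal sketch of a sliding-window causal attention mask in PyTorch; the window size and sequence length are illustrative and are not gpt-oss's actual configuration.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where position i may attend to position j, i.e. i - window < j <= i."""
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))           # j <= i
    banded = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool),
                        diagonal=-(window - 1))                                    # j >= i - window + 1
    return causal & banded

# Each row has at most `window` True entries, so attention cost stays linear in window size.
print(sliding_window_causal_mask(seq_len=8, window=4).int())
```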
not much happened today
gpt-5 gpt4-0314 qwen3-235b-thinking runway-aleph imagen-4-ultra smollm3 grok-4 openai alibaba runway hugging-face google anthropic pytorch lmarena reinforcement-learning reasoning video-generation image-generation model-optimization open-source model-performance inference-speed integration stability sama clementdelangue xikun_zhang_ teknnium1 chujiezheng
OpenAI has fully rolled out its ChatGPT agent to all Plus, Pro, and Team users and is building hype for the upcoming GPT-5, which reportedly outperforms Grok-4 and can build a cookie clicker game in two minutes. Alibaba's Qwen team released the open-source reasoning model Qwen3-235B-Thinking, achieving an 89% win rate over gpt4-0314 using a new RL algorithm called Group Sequence Policy Optimization (GSPO). Runway introduced Runway Aleph, a state-of-the-art in-context video model for editing and generating video content. Hugging Face highlights the growing momentum of open-source AI, especially from Chinese teams. Other updates include Kling's upgrades for image-to-video generation and Google's Imagen 4 Ultra being recognized as a top text-to-image model. Anthropic integrated Claude with Canva for branded visual designs but faces stability issues. The PyTorch team released optimized checkpoints for SmolLM3 to speed up inference.
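As a rough illustration of the GSPO idea, the sketch below computes a sequence-level (length-normalized) importance ratio with PPO-style clipping and group-relative advantages; the shapes, epsilon, and normalization details are assumptions for exposition, not the Qwen team's implementation.

```python
import torch

def gspo_loss(logp_new, logp_old, rewards, mask, eps=0.2):
    """logp_new, logp_old: [group, seq] per-token log-probs under the new/old policy;
    mask: [group, seq] 0/1 floats marking response tokens; rewards: [group] scalars."""
    lengths = mask.sum(dim=1)
    # Sequence-level importance ratio: geometric mean of per-token ratios (length-normalized).
    seq_ratio = torch.exp(((logp_new - logp_old) * mask).sum(dim=1) / lengths)
    # Group-relative advantage (as in GRPO): standardize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # PPO-style clipped objective, applied at the sequence level rather than per token.
    clipped = torch.clamp(seq_ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(seq_ratio * adv, clipped * adv).mean()
```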
3x in 3 months: Cursor @ $28b, Cognition + Windsurf @ $10b
qwen3-coder chatgpt-agent claude-code mini cursor cognition windsurf alibaba openai anthropic perplexity agentic-ai fundraising software-engineering ai-coding agentic-economy model-integration community-feedback performance-benchmarking bindureddy xikun_zhang_ aravsrinivas gergelyorosz jeremyphoward
Cursor is reportedly fundraising at a $28 billion valuation with $1 billion ARR, while the combined Cognition+Windsurf entity is fundraising at a $10 billion valuation after acquiring the Windsurf remainco for $300 million. Competition between AI coding agents intensifies as Cursor focuses on Async SWE Agents and Cognition+Windsurf acquires an agentic IDE. Alibaba's Qwen3-Coder gains widespread adoption for coding tasks and integration into tools like Claude Code and LM Studio. OpenAI rolls out ChatGPT Agent to all Plus, Pro, and Team users, sparking discussion of an "agentic economy" and the importance of AI literacy. Anthropic's Claude Code is praised as a premier development tool with active community feedback. Perplexity's Comet browser assistant receives positive reviews and new feature showcases. The debate continues over whether AI coding tools will replace developers, with critiques highlighting the ongoing human effort required. A new minimalist software engineering agent, mini, achieves 65% on SWE-bench with just 100 lines of code.
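To show why a usable SWE agent can fit in so few lines, here is a toy sketch of the core loop: the model proposes one shell command per turn and observes its output. The `ask_model` callable is a hypothetical stand-in for any chat-completion API; this is not the actual mini code.

```python
import subprocess

def run_agent(task: str, ask_model, max_steps: int = 10) -> str:
    """Minimal command-observe loop: prompt, execute, append output, repeat."""
    history = f"Task: {task}\nRespond with exactly one shell command, or DONE when finished."
    for _ in range(max_steps):
        command = ask_model(history).strip()
        if command == "DONE":
            break
        result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=60)
        history += f"\n$ {command}\n{result.stdout}{result.stderr}"
    return history
```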
ChatGPT Agent: new o* model + unified Deep Research browser + Operator computer use + Code Interpreter terminal
o3 o4 gptnext openai reinforcement-learning benchmarking model-performance model-risk long-context model-deployment fine-tuning sama gdb kevinweil xikun_zhang_ keren_gu boazbaraktcs
OpenAI launched the ChatGPT Agent, an advanced AI system capable of browsing the web, coding, analyzing data, and creating reports, marking a significant step towards human-like computer use. The agent, distinct from and superior to o3, is considered the first public exposure of what was internally called o4, now merged into GPTNext. It was trained with end-to-end reinforcement learning, can operate for extended periods (tested up to 2 hours), and is classified as "High" risk for biological misuse, so the corresponding safeguards are active. Early benchmarks show mixed results, excelling on tests like WebArena and BrowseComp but underperforming on others like PaperBench. Key figures involved include Sam Altman, Greg Brockman, and Kevin Weil, with technical insights from xikun_zhang_ and risk commentary from KerenGu and boazbaraktcs. The launch sparked speculation that this was GPT-5, which OpenAI confirmed it is not.
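The "unified" design can be pictured as a single loop that routes model actions to different tools (a browser, a terminal/code interpreter). The sketch below is purely illustrative: the tool names, the `ask_model` callable, and the JSON action format are assumptions, not OpenAI's implementation.

```python
import json
import subprocess

def browse(url: str) -> str:
    return f"[stub] fetched {url}"  # placeholder for a real browser tool

def terminal(command: str) -> str:
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=60)
    return result.stdout + result.stderr

TOOLS = {"browser": browse, "terminal": terminal}

def unified_agent(task: str, ask_model, max_steps: int = 8) -> str:
    """One loop, many tools: the model picks a tool and argument each turn."""
    transcript = f'Task: {task}\nReply with JSON {{"tool": ..., "arg": ...}} or {{"final": ...}}.'
    for _ in range(max_steps):
        action = json.loads(ask_model(transcript))
        if "final" in action:
            return action["final"]
        observation = TOOLS[action["tool"]](action["arg"])
        transcript += f"\n{json.dumps(action)}\n-> {observation}"
    return transcript
```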