Company: "alibaba"
not much happened today
deepseek-r1-0528 o3 gemini-2.5-pro claude-opus-4 deepseek_ai openai gemini meta-ai-fair anthropic x-ai ollama hugging-face alibaba bytedance xiaomi reasoning reinforcement-learning benchmarking quantization local-inference model-evaluation open-weights transparency post-training agentic-benchmarks long-context hallucination-detection teortaxestex wenfeng danielhanchen awnihannun reach_vb abacaj
DeepSeek R1-0528 release brings major improvements in reasoning, hallucination reduction, JSON output, and function calling, matching or surpassing closed models like OpenAI o3 and Gemini 2.5 Pro on benchmarks such as the Artificial Analysis Intelligence Index, LiveBench, and GPQA Diamond. The release makes DeepSeek the #2 AI lab globally by measured intelligence while remaining open weights, ahead of Meta AI, Anthropic, and xAI. Open weights and technical transparency have fueled rapid adoption across platforms like Ollama and Hugging Face. Chinese AI labs including DeepSeek, Alibaba, ByteDance, and Xiaomi now match or surpass US labs in model releases and intelligence, driven by open-weights strategies. Reinforcement learning post-training is critical for intelligence gains, mirroring trends seen at OpenAI. Optimized quantization techniques (1-bit, 4-bit) and local inference enable efficient experimentation on consumer hardware. New benchmarks like LisanBench test knowledge, planning, memory, and long-context reasoning, with OpenAI o3 and Claude Opus 4 leading. Discussions highlight concerns about benchmark contamination and overemphasis on RL-tuned gains.
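The quantization point is the practical hook: a 4-bit checkpoint cuts weight memory roughly 4x versus fp16, which is what puts R1-class distills within reach of consumer GPUs. A minimal sketch of 4-bit loading with Hugging Face transformers and bitsandbytes; the model id and generation settings are illustrative assumptions, not taken from the newsletter:

```python
# Minimal 4-bit local-inference sketch (assumes a CUDA GPU and the
# `transformers`, `accelerate`, and `bitsandbytes` packages installed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"  # illustrative; any causal LM repo works

quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4, the usual QLoRA-style setting
    bnb_4bit_compute_dtype=torch.bfloat16, # dequantized matmuls run in bf16
)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant, device_map="auto"
)

inputs = tok("Prove that the sum of two even numbers is even.", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=256)[0], skip_special_tokens=True))
```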
DeepSeek-R1-0528 - Gemini 2.5 Pro-level model, SOTA Open Weights release
deepseek-r1-0528 gemini-2.5-pro qwen-3-8b qwen-3-235b deepseek-ai anthropic meta-ai-fair nvidia alibaba google-deepmind reinforcement-learning benchmarking model-performance open-weights reasoning quantization post-training model-comparison artificialanlys scaling01 cline reach_vb zizhpan andrewyng teortaxestex teknim1 lateinteraction abacaj cognitivecompai awnihannun
DeepSeek R1-0528 marks a significant upgrade, closing the gap with proprietary models like Gemini 2.5 Pro and surpassing flagship models from Anthropic, Meta, NVIDIA, and Alibaba on several AI benchmarks. The Chinese open-weights model's gains are driven by reinforcement learning post-training rather than architecture changes, and it shows sharply increased reasoning token usage (23K tokens per question). The China-US AI race intensifies as Chinese labs accelerate innovation through transparency and an open research culture. Key benchmarks include AIME 2024, LiveCodeBench, and GPQA Diamond.
Granola launches team notes, while Notion launches meeting transcription
gpt-4.1 gpt-4o-mini gpt-4.1-mini claude-opus claude-sonnet claude-o3 qwen3 seed1.5-vl llama-4 am-thinking-v1 openai anthropic alibaba meta-ai-fair huggingface granola coding instruction-following benchmarking model-releases reasoning image-generation collaborative-software model-performance kevinweil scaling01 steph_palazzolo andersonbcdefg reach_vb yuchenj_uw qtnx_ _akhaliq risingsayak
GPT-4.1 is now available in ChatGPT for Plus, Pro, and Team users, focusing on coding and instruction following, with GPT-4.1 mini replacing GPT-4o mini. Anthropic is releasing new Claude models including Claude Opus and Claude Sonnet, though criticism of hallucinations in o3 was also noted. Alibaba shared the Qwen3 Technical Report, while ByteDance's Seed1.5-VL posted strong benchmark results. Meta FAIR announced new models and datasets but faced criticism on Llama 4. AM-Thinking-v1 launched on Hugging Face as a 32B-scale reasoning model. Granola raised $43M in Series B and launched Granola 2.0 with a Notion-like UI. The AI ecosystem shows rapid iteration and cloning of ideas, emphasizing execution and distribution.
not much happened today
gemini-2.5-flash gemini-2.0-flash mistral-medium-3 llama-4-maverick claude-3.7-sonnet qwen3 pangu-ultra-moe deepseek-r1 o4-mini x-reasoner google-deepmind mistral-ai alibaba huawei openai microsoft deepseek model-performance reasoning cost-analysis reinforcement-learning chain-of-thought multilinguality code-search model-training vision model-integration giffmana artificialanlys teortaxestex akhaliq john__allard
Gemini 2.5 Flash shows a 12 point increase in the Artificial Analysis Intelligence Index but costs 150x more than Gemini 2.0 Flash due to 9x more expensive output tokens and 17x higher token usage during reasoning. Mistral Medium 3 competes with Llama 4 Maverick, Gemini 2.0 Flash, and Claude 3.7 Sonnet with better coding and math reasoning at a significantly lower price. Alibaba's Qwen3 family supports reasoning and multilingual tasks across 119 languages and includes a Web Dev tool for app building. Huawei's Pangu Ultra MoE matches DeepSeek R1 performance on Ascend NPUs, with new compute and upcoming V4 training. OpenAI's o4-mini now supports Reinforcement Fine-Tuning (RFT) using chain-of-thought reasoning. Microsoft's X-REASONER enables generalizable reasoning across modalities post-trained on general-domain text. Deep research integration with GitHub repos in ChatGPT enhances codebase search and reporting. The AI Engineer World's Fair offers an Early Bird discount for upcoming tickets.
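The 150x figure is just the two reported multipliers compounding, since cost = price per token × tokens emitted; a one-line sanity check:

```python
# Why 9x pricier output tokens plus 17x more tokens ≈ 150x cost: the factors multiply.
price_ratio = 9    # output-token price, Gemini 2.5 Flash vs 2.0 Flash (as reported)
usage_ratio = 17   # tokens emitted per query once reasoning is enabled (as reported)
print(price_ratio * usage_ratio)  # 153, i.e. roughly the reported 150x
```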
Gemini 2.5 Pro Preview 05-06 (I/O edition) - the SOTA vision+coding model
gemini-2.5-pro claude-3.7-sonnet llama-nemotron qwen3 google-deepmind nvidia alibaba hugging-face multimodality coding reasoning model-release speech-recognition recommender-systems benchmarking demishassabis _philschmid lmarena_ai scaling01 fchollet
Gemini 2.5 Pro has been updated with enhanced multimodal image-to-code capabilities and dominates the WebDev Arena Leaderboard, surpassing Claude 3.7 Sonnet in coding and other tasks. Nvidia released the Llama-Nemotron model family on Hugging Face, noted for efficient reasoning and inference. Alibaba's Qwen3 models range from 0.6B to 235B parameters, including dense and MoE variants. KerasRS was released by François Chollet as a new recommender system library compatible with JAX, PyTorch, and TensorFlow, optimized for TPUs. These updates highlight advancements in coding, reasoning, and speech recognition models.
Cursor @ $9b, OpenAI Buys Windsurf @ $3b
llama-nemotron-ultra llama-nemotron-super llama-nemotron-nano qwen3-235b-a22b prover-v2 phi-4-reasoning ernie-4.5-turbo ernie-x1-turbo suno-v4.5 gen-4-references o1-mini openai cursor nvidia alibaba deepseek microsoft baidu suno runway keras reasoning inference-efficiency open-license moe-models math-reasoning theorem-proving model-performance music-generation image-generation recommender-systems tpu-optimization _akhaliq adcock_brett lmarena_ai fchollet
OpenAI is reportedly close to closing a deal with Windsurf, coinciding with Cursor's $900M funding round at a $9B valuation. Nvidia launched the Llama-Nemotron series featuring models from 8B to 253B parameters, praised for reasoning and inference efficiency. Alibaba released the Qwen3 family with MoE and dense models up to 235B parameters, ranking highly in coding and math benchmarks. DeepSeek introduced Prover-V2, an open-source AI for math reasoning with an 88.9% pass rate on MiniF2F-test. Microsoft released reasoning-focused Phi-4 models, outperforming OpenAI's o1-mini. Baidu debuted turbo versions of ERNIE 4.5 and X1 for faster, cheaper inference. Suno v4.5 added advanced AI music generation features, while Runway Gen-4 References enable placing characters into scenes with high consistency. KerasRS, a new recommender system library optimized for TPUs, was released by François Chollet.
not much happened today
qwen3-14b qwen3-32b qwen3-235b phi-4-reasoning o3-mini command-a gemini-2.5-pro o4-mini olm-o2-1b o3 alibaba together-ai scaling01 microsoft deepseek cohere google epoch-ai-research inception-labs openai allenai quantization fine-tuning reinforcement-learning benchmarking video-generation diffusion-models model-performance model-evaluation model-release text-generation cline _philschmid iscienceluvr alexalbert__ _lewtun teortaxestex sarahookr reach_vb
The Qwen team released quantized versions of Qwen3 models, including the 14B, 32B, and 235B variants, with promising coding capabilities in Qwen3-235B. Microsoft launched Phi-4-reasoning, a 14B parameter model distilled from OpenAI's o3-mini, emphasizing supervised fine-tuning and reinforcement learning and outperforming larger models on some benchmarks. Cohere's Command A leads SQL performance on Bird Bench. Google introduced the TRAJAN eval for video generation temporal consistency and updated the Gemini OpenAI compatibility layer. Inception Labs launched a diffusion LLM API claiming 5x speed improvements over autoregressive models. Community rankings show OpenAI's o3 model debuting strongly in web app-building tasks. Other releases include AllenAI's OLMo2 1B and additional Phi 4 variants. "Qwen3-235B shows promise for coding" and "Phi-4-reasoning tech report emphasizes SFT gains" highlight key advancements.
not much happened today
phi-4 phi-4-mini-reasoning qwen3-235b qwen3-moe-235b qwen3-moe-30b qwen3-dense-32b qwen3-dense-14b qwen3-dense-8b qwen3-dense-4b qwen3-dense-0.6b qwen2.5-omni-3b deepseek-prover-v2 llama llama-guard-4 prompt-guard-2 mimo-7b microsoft anthropic cursor alibaba togethercompute deepseek meta-ai-fair xiaomi openrouterai cohere reasoning model-fine-tuning model-evaluation benchmarking model-popularity open-source math model-scaling model-filtering jailbreak-prevention cline reach_vb vipulved akhaliq omarsar0 zhs05232838 huajian_xin mervenoyann karpathy random_walker sarahookr blancheminerva clefourrier
Microsoft released Phi-4-reasoning, a finetuned 14B reasoning model slightly behind QwQ but limited by data transparency and token efficiency issues. Anthropic introduced remote MCP server support and a 45-minute Research mode in Claude. Cursor published a model popularity list. Alibaba launched Qwen3-235B and other Qwen3 variants, highlighting budget-friendly coding and reasoning capabilities, with availability on the Together AI API. Microsoft also released Phi-4-Mini-Reasoning with benchmark performance on AIME 2025 and OmniMath. DeepSeek announced DeepSeek-Prover V2 with state-of-the-art math problem solving, scaling to 671B parameters. Meta AI's Llama models hit 1.2 billion downloads, with new Llama Guard 4 and Prompt Guard 2 for input/output filtering and jailbreak prevention. Xiaomi released the open-source reasoning model MiMo-7B trained on 25 trillion tokens. Discussions on AI model evaluation highlighted issues with the LMArena leaderboard, data access biases favoring proprietary models, and challenges in maintaining fair benchmarking, with suggestions for alternatives like OpenRouterAI rankings. "LMArena slop and biased" and "61.3% of all data going to proprietary model providers" were noted concerns.
ChatGPT responds to GlazeGate + LMArena responds to Cohere
qwen3-235b-a22b qwen3 qwen3-moe llama-4 openai cohere lm-arena deepmind x-ai meta-ai-fair alibaba vllm llamaindex model-releases model-benchmarking performance-evaluation open-source multilinguality model-integration fine-tuning model-optimization joannejang arankomatsuzaki karpathy sarahookr reach_vb
OpenAI faced backlash after a controversial ChatGPT update, leading to an official retraction admitting they "focused too much on short-term feedback." Researchers from Cohere published a paper criticizing LMArena for unfair practices favoring incumbents like OpenAI, DeepMind, X.ai, and Meta AI Fair. The Qwen3 family by Alibaba was released, featuring models up to 235B MoE, supporting 119 languages and trained on 36 trillion tokens, with integration into vLLM and support in tools like llama.cpp. Meta announced the second round of Llama Impact Grants to promote open-source AI innovation. Discussions on AI Twitter highlighted concerns about leaderboard overfitting and fairness in model benchmarking, with notable commentary from karpathy and others.
LlamaCon: Meta AI gets into the Llama API platform business
llama-4 qwen3 qwen3-235b-a22b qwen3-30b-a3b qwen3-4b qwen2-5-72b-instruct o3-mini meta-ai-fair cerebras groq alibaba vllm ollama llamaindex hugging-face llama-cpp model-release fine-tuning reinforcement-learning moe multilingual-models model-optimization model-deployment coding benchmarking apache-license reach_vb huybery teortaxestex awnihannun thezachmueller
Meta celebrated progress in the Llama ecosystem at LlamaCon, launching an AI Developer platform with finetuning and fast inference powered by Cerebras and Groq hardware, though it remains waitlisted. Meanwhile, Alibaba released the Qwen3 family of large language models, including two MoE models and six dense models ranging from 0.6B to 235B parameters, with the flagship Qwen3-235B-A22B achieving competitive benchmark results and supporting 119 languages and dialects. The Qwen3 models are optimized for coding and agentic capabilities, are Apache 2.0 licensed, and have broad deployment support including local usage with tools like vLLM, Ollama, and llama.cpp. Community feedback highlights Qwen3's scalable performance and superiority over models like OpenAI's o3-mini.
Qwen 3: 0.6B to 235B MoE full+base models that beat R1 and o1
qwen-3 qwen3-235b-a22b qwen3-30b-a3b deepseek-r1 o1 o3-mini grok-3 gemini-2.5-pro alibaba google-deepmind deepseek mistral-ai mixture-of-experts reinforcement-learning benchmarking model-release model-architecture long-context multi-agent-systems inference dataset-release awnihannun prince_canuma actuallyisaak oriolvinyalsml iscienceluvr reach_vb teortaxestex omarsar0
Qwen 3 has been released by Alibaba featuring a range of models including two MoE variants, Qwen3-235B-A22B and Qwen3-30B-A3B, which demonstrate competitive performance against top models like DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. The models introduce a hybrid thinking mode toggled via "enable_thinking=True", with soft switching between thinking and non-thinking behavior for inference-time scaling. The release is notable for its Apache 2.0 license and broad inference platform support including MCP. Dataset improvements and multi-stage RL post-training contribute to the performance gains. Meanwhile, Gemini 2.5 Pro from Google DeepMind shows strong coding and long-context reasoning capabilities, and DeepSeek R2 is anticipated soon. Twitter discussions highlight Qwen3's fine-grained MoE architecture, large context window, and multi-agent system applications.
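For the thinking toggle, Qwen's published usage examples pass the flag through the chat template; a sketch along those lines (model choice and generation settings are assumptions, and the exact flag plumbing may differ between releases):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-30B-A3B"  # illustrative; any Qwen3 checkpoint follows the same pattern
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many primes are there below 100?"}]
text = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set False for fast, non-reasoning responses
)

inputs = tok(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
print(tok.decode(out[0], skip_special_tokens=True))
```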
Cognition's DeepWiki, a free encyclopedia of all GitHub repos
o4-mini perception-encoder qwen-2.5-vl dia-1.6b grok-3 gemini-2.5-pro claude-3.7 gpt-4.1 cognition meta-ai-fair alibaba hugging-face openai perplexity-ai vllm vision text-to-speech reinforcement-learning ocr model-releases model-integration open-source frameworks chatbots model-selector silas-alberti mervenoyann reach_vb aravsrinivas vikparuchuri lioronai
Silas Alberti of Cognition announced DeepWiki, a free encyclopedia of all GitHub repos providing Wikipedia-like descriptions and Devin-backed chatbots for public repos. Meta released Perception Encoders (PE) under an Apache 2.0 license, outperforming InternVL3 and Qwen2.5VL on vision tasks. Alibaba launched the Qwen Chat App for iOS and Android. Hugging Face integrated the Dia 1.6B SoTA text-to-speech model via FAL. OpenAI expanded deep research usage with a lightweight version powered by the o4-mini model, now available to free users. Perplexity AI updated their model selector with Grok 3 Beta, o4-mini, and support for models like Gemini 2.5 Pro, Claude 3.7, and GPT-4.1. The vLLM project introduced the OpenRLHF framework for reinforcement learning with human feedback. The Surya OCR alpha model supports 90+ languages and LaTeX. The MegaParse open-source library was introduced for LLM-ready data formats.
not much happened today
nemotron-h nvidia-eagle-2.5 gpt-4o qwen2.5-vl-72b gemini-2.5-flash gemini-2.0-pro gemini-exp-1206 gemma-3 qwen2.5-32b deepseek-r1-zero-32b uni3c seedream-3.0 adobe-dragon kimina-prover qwen2.5-72b bitnet-b1.58-2b4t nvidia deepseek hugging-face alibaba bytedance adobe transformers model-optimization multimodality long-context reinforcement-learning torch-compile image-generation diffusion-models distributional-rewards model-efficiency model-training native-quantization sampling-techniques philschmid arankomatsuzaki osanseviero iscienceluvr akhaliq
Nemotron-H model family introduces hybrid Mamba-Transformer models with up to 3x faster inference and variants including 8B, 56B, and a compressed 47B model. Nvidia Eagle 2.5 is a frontier VLM for long-context multimodal learning, matching GPT-4o and Qwen2.5-VL-72B on long-video understanding. Gemini 2.5 Flash shows improved dynamic thinking and cost-performance, outperforming previous Gemini versions. Gemma 3 now supports torch.compile for about 60% faster inference on consumer GPUs. SRPO using Qwen2.5-32B surpasses DeepSeek-R1-Zero-32B on benchmarks with reinforcement learning only. Alibaba's Uni3C unifies 3D-enhanced camera and human motion controls for video generation. Seedream 3.0 by ByteDance is a bilingual image generation model with high-resolution outputs up to 2K. Adobe DRAGON optimizes diffusion generative models with distributional rewards. Kimina-Prover Preview is an LLM trained with reinforcement learning from Qwen2.5-72B, achieving 80.7% pass@8192 on miniF2F. BitNet b1.58 2B4T is a native 1-bit LLM with 2B parameters trained on 4 trillion tokens, matching full-precision LLM performance with better efficiency. Antidistillation sampling counters unwanted model distillation by modifying reasoning traces from frontier models.
QwQ-32B claims to match DeepSeek R1-671B
qwen-2.5-plus qwq-32b deepseek-r1 gpt-4.5 gpt-3 davinci alibaba openai deepseek-ai reinforcement-learning math code-execution instruction-following alignment reasoning model-release model-benchmarking scaling performance inference-costs aidan_mclau sama scaling01 juberti polynoamial reach_vb
Alibaba Qwen released their QwQ-32B model, a 32 billion parameter reasoning model using a novel two-stage reinforcement learning approach: first scaling RL for math and coding tasks with accuracy verifiers and code execution servers, then applying RL for general capabilities like instruction following and alignment. Meanwhile, OpenAI rolled out GPT-4.5 to Plus users, with mixed feedback on coding performance and noted inference cost improvements. The QwQ model aims to compete with larger MoE models like DeepSeek-R1. "GPT-4.5 is unusable for coding" was a notable user critique, while others praised its reasoning improvements due to scaling pretraining.
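The summary names stage one's ingredients (accuracy verifiers, code execution servers) without showing them; a hedged sketch of what such outcome-based rewards could look like, where `extract_final_answer` and `run_in_sandbox` are hypothetical helpers rather than anything Qwen has released:

```python
def math_reward(completion: str, reference: str, extract_final_answer) -> float:
    """Outcome reward from an accuracy verifier: 1.0 if the final answer matches."""
    return 1.0 if extract_final_answer(completion) == reference else 0.0

def code_reward(completion: str, test_cases: list, run_in_sandbox) -> float:
    """Outcome reward from an execution server: fraction of unit tests passed."""
    passed = [run_in_sandbox(completion, case) for case in test_cases]  # each True/False
    return sum(passed) / len(test_cases)
```

Stage two would swap these verifiable rewards for a general reward model covering instruction following and alignment, per the two-stage description above.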
not much happened today
grok-3 grok-3-mini gpt-4.5 claude-3.7-sonnet quasar-alpha optimus-alpha gpt-4.1 kaleidoscope internvl3 internvit qwen2.5vl transmamba fantasytalking openai alibaba cmu reinforcement-learning reasoning benchmarks vision multilinguality multimodality transformers attention-mechanisms agents code-generation model-performance rasbt sarahookr mervenoyann gneubig svpino mathemagic1an
The AI news recap highlights independent evaluations showing Grok-3 outperforming models like GPT-4.5 and Claude 3.7 Sonnet on reasoning benchmarks, while Grok-3 mini excels in reasoning tasks. Research on reinforcement learning (RL) fine-tuning reveals potential improvements for small reasoning models but also notes instability in reported gains. Benchmark results suggest Quasar Alpha and Optimus Alpha may be versions of GPT-4.1. Vision and multimodal models like Kaleidoscope, supporting 18 languages, and InternVL3, built on InternViT and Qwen2.5VL, demonstrate advances in multilingual vision and reasoning. The fusion model TransMamba combines transformer precision with speed via SSM mechanisms. Alibaba's FantasyTalking generates realistic talking portraits. Agent-focused events at CMU and tools like FilmAgent AI for virtual film production and BrowseComp benchmark for browsing agents were announced. The coding assistant Augment supports multiple IDEs with code analysis and suggestions. Discussions also covered Google’s new agent-to-agent protocol concept.
not much happened today
gpt-2 r1 gemma-3 gemmacoder3-12b qwen2.5-omni openai deepseek berkeley alibaba togethercompute nvidia azure runway langchain bmw amazon open-source function-calling benchmarking code-reasoning multimodality inference-speed image-generation voice-generation animation robotics realtime-transcription webrtc sama clémentdelangue lioronai scaling01 cognitivecompai osanseviero jack_w_rae ben_burtenshaw theturingpost vipulved kevinweil tomlikesrobots adcock_brett juberti
OpenAI plans to release its first open-weight language model since GPT-2 in the coming months, signaling a move towards more open AI development. DeepSeek launched its open-source R1 model earlier this year, challenging perceptions of China's AI progress. Gemma 3 has achieved function calling capabilities and ranks on the Berkeley Function-Calling Leaderboard, while GemmaCoder3-12b improves code reasoning performance on LiveCodeBench. Alibaba_Qwen's Qwen2.5-Omni introduces a novel Thinker-Talker system and TMRoPE for multimodal input understanding. The TogetherCompute team achieved 140 TPS on a 671B parameter model, outperforming Azure and DeepSeek API on Nvidia GPUs. OpenAI also expanded ChatGPT features with image generation for all free users and a new voice release. Runway Gen-4 enhances animation for miniature dioramas, and LangChain launched a chat-based generative UI agent. Commercial deployment of Figure 03 humanoid robots at BMW highlights advances in autonomy and manufacturing scaling. New tools include OpenAI's realtime transcription API with WebRTC support and Amazon's Nova Act AI browser agent.
OpenAI adopts MCP
gemini-2.5-pro gemini-1.5-pro gemini-2.0-flash qwen-2.5-omni-7b deepseek-v3-0324 deepseek-r1 openai google-deepmind alibaba togethercompute model-benchmarking multimodality reasoning scaling-laws model-quantization synthetic-data model-performance context-windows speech-recognition translation audio-processing video-processing swyx
OpenAI announced support for MCP (Model Context Protocol), Anthropic's open standard for connecting models to external tools and data. Google's Gemini 2.5 Pro leads benchmarks with top scores in MMLU-Pro (86%), GPQA Diamond (83%), and AIME 2024 (88%), featuring a 1 million token context window and multimodal inputs. Alibaba's Qwen 2.5 Omni 7B was released as a fully multimodal, interactive, open-source model with a novel "thinker-talker" architecture supporting voice and video chat. DeepSeek V3-0324 outperforms its predecessor on multiple benchmarks. Research on reasoning features in large language models using sparse autoencoders was highlighted, alongside a study on scaling laws of synthetic data showing performance plateaus near 300B tokens. Discussions also covered the fastest output speeds of Gemini models and concerns about over-reliance on benchmarks for intelligence measurement. Swyx will curate the Data Council AI Engineering Track in April.
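For a sense of what MCP support means in practice, here is a minimal tool-server sketch using the Python `mcp` SDK's FastMCP helper; the server name, tool, and data are invented for illustration:

```python
# pip install mcp -- minimal Model Context Protocol server sketch
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-weather")  # server name shown to connecting clients

@mcp.tool()
def temperature_c(city: str) -> float:
    """Return a (fake) current temperature for `city`."""
    return {"paris": 18.0, "tokyo": 22.5}.get(city.lower(), 20.0)

if __name__ == "__main__":
    mcp.run()  # speaks MCP over stdio by default, so any MCP client can attach
```

Any MCP-capable client (Claude Desktop, and now OpenAI's tooling per the announcement) can discover and call `temperature_c` without bespoke integration code.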
Halfmoon is Reve Image: a new SOTA Image Model from ex-Adobe/Stability trio
deepseek-v3-0324 qwen-2.5-vl-32b-instruct recraft artificial-analysis stability-ai adobe deepseek alibaba text-to-image prompt-understanding model-composition visual-generation language-understanding model-performance complex-prompting iterative-generation christian-cantrell taesung-park michael-gharbi
Reve, a new composite AI model from former Adobe and Stability alums Christian Cantrell, Taesung Park, and Michaël Gharbi, has emerged as the top-rated image generation model, surpassing previous state-of-the-art models like Recraft and Ideogram in text rendering and typography. The team emphasizes "enhancing visual generative models with logic" and "understanding user intent with advanced language capabilities" to iteratively amend visuals based on natural language input. Additionally, DeepSeek-V3-0324 and Alibaba's Qwen2.5-VL-32B-Instruct models were released with notable performance improvements, including better vision task benchmarks and mathematical reasoning.
not much happened today
gemini-2.0-flash-thinking command-a qwq-32b gemma-3-27b gemma-3 shieldgemma-2 llama-3-70b deepseek-r1 o1-mini deepseek-v3 google-deepmind cohere meta-ai-fair alibaba hugging-face model-updates model-performance benchmarking reinforcement-learning transformers normalization-layers image-generation vision memory-efficiency context-windows fine-tuning yann-lecun
Google DeepMind announced updates to Gemini 2.0, including an upgraded Flash Thinking model with stronger reasoning and native image generation capabilities. Cohere launched Command A, a 111B parameter dense model with a 256K context window and competitive pricing, available on Hugging Face. Meta AI proposed Dynamic Tanh (DyT) as a replacement for normalization layers in Transformers, supported by Yann LeCun. Alibaba released QwQ-32B, a 32.5B parameter model excelling in math and coding, fine-tuned with reinforcement learning and freely available under Apache 2.0 license. Google DeepMind also released Gemma 3 models ranging from 1B to 27B parameters with a 128K token context window and over 140 language support, plus ShieldGemma 2, an image safety checker. Benchmarking shows Gemma 3 27B has strong vision and memory efficiency but is outperformed by larger models like Llama 3.3 70B and DeepSeek V3 671B. The Hugging Face LLM leaderboard history was shared by @_lewtun.
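DyT is compact enough to state in code: each normalization layer is replaced by an elementwise tanh with a learnable slope plus the usual scale and shift, DyT(x) = γ·tanh(αx) + β. A sketch of that module (the α initialization of 0.5 follows the paper's default; treat the rest as illustrative):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: drop-in LayerNorm replacement, DyT(x) = gamma * tanh(alpha * x) + beta."""
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learnable scalar slope
        self.gamma = nn.Parameter(torch.ones(dim))               # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))               # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```

Unlike LayerNorm, no per-token statistics are computed, which is the source of the claimed simplicity and speed.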
The new OpenAI Agents Platform
reka-flash-3 o1-mini claude-3-7-sonnet llama-3-3-70b sonic-2 qwen-chat olympiccoder openai reka-ai hugging-face deepseek togethercompute alibaba ai-agents api model-releases fine-tuning reinforcement-learning model-training model-inference multimodality voice-synthesis gpu-clusters model-distillation performance-optimization open-source sama reach_vb
OpenAI introduced a comprehensive suite of new tools for AI agents, including the Responses API, Web Search Tool, Computer Use Tool, File Search Tool, and an open-source Agents SDK with integrated observability tools, marking a significant step towards the "Year of Agents." Meanwhile, Reka AI open-sourced Reka Flash 3, a 21B parameter reasoning model that outperforms o1-mini and powers their Nexus platform, with weights available on Hugging Face. The OlympicCoder series surpassed Claude 3.7 Sonnet and much larger models on competitive coding benchmarks. DeepSeek built a 32K GPU cluster capable of training V3-level models in under a week and is exploring AI distillation. Hugging Face announced Cerebras inference support, achieving over 2,000 tokens/s on Llama 3.3 70B, 70x faster than leading GPUs. Reka's Sonic-2 voice AI model delivers 40ms latency via the Together API. Alibaba's Qwen Chat enhanced its multimodal interface with video understanding up to 500MB, voice-to-text, guest mode, and expanded file uploads. Sama praised OpenAI's new API as "one of the most well-designed and useful APIs ever."
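A minimal sketch of the open-source Agents SDK paired with the hosted Web Search Tool mentioned above; the agent name, instructions, and prompt are invented, and an `OPENAI_API_KEY` in the environment is assumed:

```python
# pip install openai-agents -- sketch of the Agents SDK with the Web Search Tool
from agents import Agent, Runner, WebSearchTool

agent = Agent(
    name="news-scout",
    instructions="Find one notable AI news item from today and summarize it in two sentences.",
    tools=[WebSearchTool()],  # hosted tool from the Responses API
)

result = Runner.run_sync(agent, "What happened in AI today?")
print(result.final_output)
```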
not much happened today
jamba-1.6 mistral-ocr qwq-32b o1 o3-mini instella llama-3-2-3b gemma-2-2b qwen-2-5-3b babel-9b babel-83b gpt-4o claude-3-7-sonnet ai21-labs mistral-ai alibaba openai amd anthropic hugging-face multimodality ocr multilinguality structured-output on-prem-deployment reasoning benchmarking api open-source model-training gpu-optimization prompt-engineering function-calling
AI21 Labs launched Jamba 1.6, touted as the best open model for private enterprise deployment, outperforming Cohere, Mistral, and Llama on benchmarks like Arena Hard. Mistral AI released a state-of-the-art multimodal OCR model with multilingual and structured output capabilities, available for on-prem deployment. Alibaba Qwen introduced QwQ-32B, an open-weight reasoning model with 32B parameters and cost-effective usage, showing competitive benchmark scores. OpenAI released o1 and o3-mini models with advanced API features including streaming and function calling. AMD unveiled Instella, open-source 3B parameter language models trained on AMD Instinct MI300X GPUs, competing with Llama-3.2-3B and others. Alibaba also released Babel, open multilingual LLMs performing comparably to GPT-4o. Anthropic launched Claude 3.7 Sonnet, enhancing reasoning and prompt engineering capabilities.
not much happened today
aya-vision-8b aya-vision-32b llama-3-2-90b-vision molmo-72b phi-4-mini phi-4-multimodal cogview4 wan-2-1 weights-and-biases coreweave cohereforai microsoft alibaba google llamaindex weaviate multilinguality vision multimodality image-generation video-generation model-releases benchmarking funding agentic-ai model-performance mervenoyann reach_vb jayalammar sarahookr aidangomez nickfrosst dair_ai akhaliq bobvanluijt jerryjliu0
Weights and Biases announced a $1.7 billion acquisition by CoreWeave ahead of CoreWeave's IPO. CohereForAI released the Aya Vision models (8B and 32B parameters) supporting 23 languages, outperforming larger models like Llama-3.2 90B Vision and Molmo 72B. Microsoft introduced Phi-4-Mini (3.8B parameters) and Phi-4-Multimodal models, excelling in math, coding, and multimodal benchmarks. CogView4, a 6B parameter text-to-image model with 2048x2048 resolution and Apache 2.0 license, was released. Alibaba launched Wan 2.1, an open-source video generation model with 720p output and 16 fps generation. Google announced new AI features for Pixel devices including Scam Detection and Gemini integrations. LlamaCloud reached General Availability and raised $19M Series A funding, serving over 100 Fortune 500 companies. Weaviate launched the Query Agent, the first of three Weaviate Agents.
AI Engineer Summit Day 1
grok-3 o3-mini deepseek-r1 qwen-2.5-vl openai anthropic xai togethercompute alibaba sakana-ai benchmarking model-performance cuda model-training open-source debugging inference-speed batch-size reinforcement-learning aidan_mclau giffmana nrehiew_ teortaxestex epochairesearch andrew_n_carr borismpower yuhu_ai_
The AIE Summit in NYC highlighted key talks including Grace Isford's Trends Keynote, Neo4j/Pfizer's presentation, and OpenAI's first definition of Agents. Speakers announced $930 million in funding. On AI Twitter, discussions focused on Grok-3 and o3-mini models, with debates on performance and benchmarking, including Grok-3's record compute scale of 4e26 to 5e26 FLOP. The o3-mini model uncovered a critical CUDA kernel bug in Sakana AI's code. DeepSeek-R1 was promoted as an open-source alternative with notable training batch sizes. Additionally, Alibaba announced the Qwen 2.5-VL model release.
small news items
gpt-4.5 gpt-5 deepseek-r1-distilled-qwen-1.5b o1-preview modernbert-0.3b qwen-0.5b o3 openai ollama mistral perplexity cerebras alibaba groq bytedance math benchmarking fine-tuning model-performance reinforcement-learning model-architecture partnerships funding jeremyphoward arankomatsuzaki sama nrehiew_ danhendrycks akhaliq
OpenAI announced plans for GPT-4.5 (Orion) and GPT-5, with GPT-5 integrating the o3 model and offering unlimited chat access in the free tier. DeepSeek R1 Distilled Qwen 1.5B outperforms OpenAI's o1-preview on math benchmarks, while ModernBERT 0.3b surpasses Qwen 0.5b at MMLU without fine-tuning. Mistral and Perplexity adopt Cerebras hardware for 10x performance gains. OpenAI's o3 model won a gold medal at the 2024 International Olympiad in Informatics. Partnerships include Qwen with Groq. Significant RLHF activity is noted in Nigeria and the global south, and Bytedance is expected to rise in AI prominence soon. "GPT5 is all you need."
not much happened today
oute-tts-0.3-1b oute-tts-0.3-500m olm-1b qwen-2.5-0.5b hover gpt-4o deepseek-v3 harvey meta-ai-fair stability-ai alibaba deepseek hugging-face text-to-speech zero-shot-learning multilinguality emotion-control motor-control reinforcement-learning local-ai distributed-inference pipeline-parallelism mathematical-reasoning process-reward-models legal-ai education-ai ai-security humor reach_vb drjimfan vikhyatk mervenoyann aiatmeta iscienceluvr alibaba_qwen awnihannun ajeya_cotra emollick qtnx_ designerx
Harvey secured a new $300M funding round. OuteTTS 0.3 1B & 500M text-to-speech models were released featuring zero-shot voice cloning, multilingual support (en, jp, ko, zh, fr, de), and emotion control, powered by OLMo-1B and Qwen 2.5 0.5B. The HOVER model, a 1.5M-parameter neural net for agile motor control, was introduced, leveraging human motion capture datasets and massively parallel reinforcement learning. kokoro.js enables running AI models locally in browsers with minimal dependencies. Meta AI awarded $200K LLM evaluation grants for projects on regional language understanding, complex reasoning, and interactive programming environments. Stability AI's Twitter account was hacked, prompting security warnings. Alibaba Qwen improved Process Reward Models (PRMs) for better mathematical reasoning using a consensus filtering mechanism. DeepSeek V3 uses pipeline parallelism to enhance distributed inference and long-context generation efficiency. Discussions on AI policy in legal frameworks and AI's role in democratizing education were highlighted. Lighthearted AI-related humor was also shared.
not much happened today
rstar-math o1-preview qwen2.5-plus qwen2.5-coder-32b-instruct phi-4 claude-3.5-sonnet openai anthropic alibaba microsoft cohere langchain weights-biases deepseek rakuten rbc amd johns-hopkins math process-reward-model mcts vision reasoning synthetic-data pretraining rag automation private-deployment multi-step-workflow open-source-dataset text-embeddings image-segmentation chain-of-thought multimodal-reasoning finetuning recursive-self-improvement collaborative-platforms ai-development partnerships cuda triton ai-efficiency ai-assisted-coding reach_vb rasbt akshaykagrawal arankomatsuzaki teortaxestex aidangomez andrewyng
rStar-Math surpasses OpenAI's o1-preview in math reasoning with 90.0% accuracy using a 7B LLM and MCTS with a Process Reward Model. Alibaba launches Qwen Chat featuring Qwen2.5-Plus and Qwen2.5-Coder-32B-Instruct models enhancing vision-language and reasoning. Microsoft releases Phi-4, trained on 40% synthetic data with improved pretraining. Cohere introduces North, a secure AI workspace integrating LLMs, RAG, and automation for private deployments. LangChain showcases a company research agent with multi-step workflows and open-source datasets. Transformers.js demos released for text embeddings and image segmentation in JavaScript. Research highlights include Meta-CoT for enhanced chain-of-thought reasoning, DeepSeek V3 with recursive self-improvement, and collaborative AI development platforms. Industry partnerships include Rakuten with LangChain, North with RBC supporting 90,000 employees, and Agent Laboratory collaborating with AMD and Johns Hopkins. Technical discussions emphasize CUDA and Triton for AI efficiency and evolving AI-assisted coding stacks by Andrew Ng.
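rStar-Math's core loop pairs step-level candidate generation with a Process Reward Model that scores partial solutions; a stripped-down greedy variant (real rStar-Math runs full MCTS) where `sample_step` and `prm_score` are hypothetical stand-ins:

```python
def prm_guided_solve(question, sample_step, prm_score, width=8, max_steps=12):
    """Greedy process-reward-guided decoding: a simplified stand-in for full MCTS."""
    trace = []  # reasoning steps accepted so far
    for _ in range(max_steps):
        candidates = [sample_step(question, trace) for _ in range(width)]
        # Keep the candidate step whose extended partial solution the PRM rates highest.
        best = max(candidates, key=lambda step: prm_score(question, trace + [step]))
        trace.append(best)
        if best.strip().startswith("Final answer"):  # hypothetical stop convention
            break
    return trace
```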
not much happened today
qwen-o1 qvq claude-3.5-sonnet gpt-4o o3 o3-mini alibaba openai mit idsia llamaindex ollama vision benchmarking llm-calibration intentionality alignment-faking deliberative-alignment artificial-life gdpr-compliance contract-review-agent app-creation synthetic-data post-transformers smol-models agents bret-taylor
The Qwen team launched QVQ, a vision-enabled version of QwQ, their experimental o1-style reasoning model, benchmarking comparably to Claude 3.5 Sonnet. Discussions include Bret Taylor's insights on autonomous software development distinct from the Copilot era. The Latent Space LIVE! talks cover highlights of 2024 AI startups, vision, open models, post-transformers, synthetic data, smol models, and agents. Twitter recaps by Claude 3.5 Sonnet highlight proposals for benchmarks measuring LLM calibration and falsehood confidence, with QVQ outperforming GPT-4o and Claude Sonnet 3.5. AI alignment debates focus on intentionality and critiques of alignment faking in models like Claude. Updates from OpenAI include new o3 and o3-mini models and a deliberative alignment strategy. The ASAL project is a collaboration between MIT, OpenAI, and Swiss AI Lab IDSIA to automate artificial life discovery. Personal stories reveal frustrations with USCIS green card denials despite high qualifications. New tools like GeminiCoder enable rapid app creation, and a contract review agent using Reflex and Llama Index checks GDPR compliance. Holiday greetings and memes were also shared.
not much happened to end the week
gemini deepseek-r1 o1 chatgpt gpt-4 claude-3.5-sonnet o1-preview o1-mini gpt4o qwq-32b google-deepmind deeplearningai amazon tesla x-ai alibaba ollama multimodality benchmarking quantization reinforcement-learning ai-safety translation reasoning interpretability model-comparison humor yoshua-bengio kevinweil ylecun
AI News for 11/29/2024-11/30/2024 covers key updates including the Gemini multimodal model advancing in musical structure understanding, a new quantized SWE-Bench for benchmarking at 1.3 bits per task, and the launch of the DeepSeek-R1 model focusing on transparent reasoning as an alternative to o1. The establishment of the 1st International Network of AI Safety Institutes highlights global collaboration on AI safety. Industry updates feature Amazon's Olympus AI model, Tesla's Optimus, and experiments with ChatGPT as a universal translator. Community reflections emphasize the impact of large language models on daily life and medical AI applications. Discussions include scaling sparse autoencoders to gpt-4 and the need for transparency in reasoning LLMs. The report also notes humor around ChatGPT's French nickname.
Common Corpus: 2T Open Tokens with Provenance
qwen-2.5-coder claude-3.5-sonnet janusflow-1.3b ocronos-vintage pleais huggingface langchainai deepseek alibaba anthropic provenance ocr multilingual-datasets prompt-engineering multimodality image-generation code-generation quantization model-scaling inference-efficiency tim-dettmers tom-doerr omarsar0 swyx madiator reach_vb
Pleais via Huggingface released Common Corpus, the largest fully open multilingual dataset with over 2 trillion tokens including detailed provenance information. They also introduced OCRonos-Vintage, a 124M-parameter OCR correction model that efficiently fixes digitization errors on CPU and GPU, unlocking knowledge from PDFs. On AI tools, LangChainAI launched Prompt Canvas for collaborative prompt engineering, while DeepSeek released JanusFlow 1.3B, a unified multimodal LLM integrating autoregressive and rectified flow models for enhanced image understanding and generation. Alibaba Cloud announced Qwen2.5-Coder, a code-focused LLM with advanced coding capabilities, and Claude 3.5 Sonnet was highlighted for superior code generation. Discussions on quantization challenges and scaling laws for precision by Tim Dettmers and others emphasized the impact of low-precision training on model scalability and inference efficiency. "Scaling Laws for Precision" paper insights and alternative efficiency methods were also noted.
BitNet was a lie?
qwen-2.5-coder-32b-instruct gpt-4o llama-3 sambanova alibaba hugging-face quantization scaling-laws model-efficiency fine-tuning model-performance code-generation open-source unit-testing ci-cd tanishq-kumar tim-dettmers
Scaling laws have been extended to cover quantization by a group led by Chris Ré, which analyzed over 465 pretraining runs and found that precision benefits plateau around FP6. Lead author Tanishq Kumar highlights that longer training and more data increase sensitivity to quantization, explaining challenges with models like Llama-3. Tim Dettmers, author of QLoRA, warns that the era of efficiency gains from low-precision quantization is ending, signaling a shift from scaling to optimizing existing resources. Additionally, Alibaba announced Qwen 2.5-Coder-32B-Instruct, which matches or surpasses GPT-4o on coding benchmarks, and open-source initiatives like DeepEval for LLM testing are gaining traction.
a calm before the storm
o1 o1-mini qwen2.5 gpt-4 llama-2-70b llama-7b anthropic openai alibaba microsoft blackrock groq aramco disney eth-zurich pudu-robotics slack long-context kv-cache-quantization diffusion-models reinforcement-learning robotics ai-integration multilinguality model-benchmarking model-performance model-optimization adcock_brett philschmid rohanpaul_ai jvnixon kateclarktweets sama
Anthropic is raising funds at a valuation up to $40 billion ahead of anticipated major releases. OpenAI launched new reasoning models o1 and o1-mini, with increased rate limits and a multilingual MMLU benchmark. Alibaba released the open-source Qwen2.5 model supporting 29+ languages, showing competitive performance to GPT-4 at lower cost. Microsoft and Blackrock plan to invest $30 billion in AI data centers, with Groq partnering with Aramco to build the world's largest AI inference center. Robotics advances include Disney Research and ETH Zurich's diffusion-based motion generation for robots and Pudu Robotics' semi-humanoid robot. Slack and Microsoft introduced AI-powered agents integrated into their platforms. Research highlights include long-context scaling for Llama-2-70B using Dual Chunk Attention and KV cache quantization enabling 1 million token context on Llama-7B models.
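The 1-million-token claim is easiest to see as a memory calculation, since KV cache size scales linearly with both context length and bits per value. A back-of-envelope using Llama-7B-class shapes (the layer/head numbers are assumptions from the public config, not from this summary):

```python
# KV cache = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_value
layers, kv_heads, head_dim = 32, 32, 128   # Llama-7B-class shapes (assumed)
seq_len = 1_000_000

def kv_gib(bits_per_value: float) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * (bits_per_value / 8) / 2**30

print(f"fp16:  {kv_gib(16):.0f} GiB")  # ~488 GiB -- far beyond any single GPU
print(f"2-bit: {kv_gib(2):.0f} GiB")   # ~61 GiB -- fits on an 80GB accelerator
```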
o1 destroys Lmsys Arena, Qwen 2.5, Kyutai Moshi release
o1-preview o1-mini qwen-2.5 qwen-plus llama-3-1 deepseek-v2.5 openai anthropic google alibaba deepseek kyutai weights-biases mistral-ai chain-of-thought multimodality model-benchmarking model-performance streaming-neural-architecture llm-observability experiment-tracking rate-limiting sama guillaumelample
OpenAI's o1-preview model has achieved a milestone by fully matching top daily AI news stories without human intervention, consistently outperforming models from Anthropic, Google, and Meta (Llama 3) in vibe check evaluations. OpenAI models dominate the top 4 slots on LMsys benchmarks, with rate limits increasing to 500-1000 requests per minute. In open source, Alibaba's Qwen 2.5 suite surpasses Llama 3.1 at the 70B scale and updates its closed Qwen-Plus models to outperform DeepSeek V2.5 while still lagging behind leading American models. Kyutai Moshi released its open-weights realtime voice model featuring a unique streaming neural architecture with an "inner monologue." Weights & Biases introduced Weave, an LLM observability toolkit that enhances experiment tracking and evaluation, turning prompting into a more scientific process. The news also highlights upcoming events like the WandB LLM-as-judge hackathon in San Francisco. "o1-preview consistently beats out our vibe check evals" and "OpenAI models are gradually raising rate limits by the day."
not much happened today
llama-3-1 claude-3-5-sonnet llama-3-1-405b ltm-2-mini qwen2-vl gpt-4o-mini meta-ai-fair hugging-face magic-ai-labs lmsys alibaba openai long-context style-control multimodality ai-safety model-evaluation web-crawling pdf-processing ai-hype-cycles call-center-automation sam-altman ajeya-cotra fchollet rohanpaul_ai philschmid
Meta announced significant adoption of LLaMA 3.1 with nearly 350 million downloads on Hugging Face. Magic AI Labs introduced LTM-2-Mini, a long context model with a 100 million token context window, and a new evaluation method called HashHop. LMSys added style control to their Chatbot Arena leaderboard, improving rankings for models like Claude 3.5 Sonnet and LLaMA 3.1 405B. Alibaba released Qwen2-VL, a multimodal LLM under Apache 2.0 license, competitive with GPT-4o mini. OpenAI CEO Sam Altman announced collaboration with the US AI Safety Institute for pre-release model testing. Discussions on AI safety and potential AI takeover risks were highlighted by Ajeya Cotra. Tools like firecrawl for web crawling and challenges in PDF processing were noted. AI hype cycles and market trends were discussed by François Chollet, and potential AI disruption in call centers was shared by Rohan Paul.
CogVideoX: Zhipu's Open Source Sora
cogvideox llama-3-1 llama-3-405b moondream phi-3.5 llama-rank zhipu-ai alibaba meta-ai-fair google hugging-face nvidia togethercompute salesforce video-generation serverless-computing vision document-vqa text-vqa mixture-of-experts retrieval-augmented-generation long-context model-routing webgpu background-removal long-form-generation superposition-prompting rohanpaul_ai philschmid vikhyatk algo_diver jayalammar davidsholz
Zhipu AI, an Alibaba-backed company and China's 3rd largest AI lab, released the open 5B video generation model CogVideoX, which can run without GPUs via their ChatGLM web and desktop apps. Meta AI announced trust & safety research and CyberSecEval 3 alongside the release of Llama 3.1, with Llama 3 405B now available serverless on Google Cloud Vertex AI and the Hugging Face x NVIDIA NIM API. Updates include Moondream, an open vision-language model improving DocVQA and TextVQA tasks, and the lightweight MoE chat model Phi-3.5 with 16x3.8B parameters. Together Compute introduced the Rerank API featuring Salesforce's LlamaRank model for document and code ranking. Research highlights include superposition prompting for RAG without fine-tuning, the AgentWrite pipeline for long-form content generation over 20,000 words, and a comparison showing long-context methods outperform RAG at higher cost. Tools include Not Diamond, an AI model router, AI command line interfaces, and an open-source WebGPU background removal tool. "You don't even need GPUs to run it," referring to CogVideoX.
a quiet weekend
sam-2 qwen2-math gpt-4 claude-3.5 figure deepmind boston-dynamics alibaba llamaindex robotics object-segmentation real-time-processing disease-prediction speech-recognition cli-tools model-performance adcock_brett rasbt hamel-husain rohanpaul_ai
Figure unveiled Figure 02, claimed as the most advanced humanoid robot, operating autonomously at BMW's Plant Spartanburg. DeepMind developed a table tennis robot achieving 100% wins against beginners and 55% against intermediates. Boston Dynamics showcased the dexterity of its fully-electric Atlas robot performing pushups and burpees. An autonomous dental robot performed the world's first dental procedure on a human, reducing a 2-hour process to 15 minutes using a 3D volumetric scanner. SAM 2 was introduced as an open model for real-time object segmentation without custom adaptation. Alibaba released Qwen2-Math, outperforming GPT-4 and Claude 3.5 in math capabilities. A new Listening-While-Speaking Language Model (LSLM) enables simultaneous listening and speaking in real-time. Researchers developed a disease prediction AI with 95% accuracy for diseases like coronary artery disease, type 2 diabetes, and breast cancer. Tools like LlamaParse CLI and MLX Whisper package enhance PDF parsing and speech recognition, with the latter running 40X faster than realtime on M1 Max. The news highlights significant advancements in robotics, AI models, and practical AI tools.
Llama 3.1 Leaks: big bumps to 8B, minor bumps to 70b, and SOTA OSS 405b model
llama-3-1-405b llama-3-8b llama-3-70b llama-3-1-8b gpt-4o gpt-4o-mini claude-3-5 qwen-2 meta-ai-fair openai alibaba multilinguality code-generation context-windows model-training synthetic-data benchmarking reasoning fine-tuning model-performance dataset-release swyx philschmid jjitsev lewtun teknium1 adcock_brett
Llama 3.1 leaks reveal a 405B dense model with 128k context length, trained on 39.3M GPU hours using H100-80GB GPUs, and fine-tuned with over 25M synthetic examples. The model shows significant benchmark improvements, especially for the 8B and 70B variants, with some evals suggesting the 70B outperforms GPT-4o. GPT-4o Mini launched as a cost-efficient variant with strong performance but some reasoning weaknesses. Synthetic datasets like NuminaMath enable models such as Alibaba Qwen 2 to surpass GPT-4o and Claude 3.5 in math competitions. Discussions include reasoning task benchmarks and dataset building for improved reasoning.
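The 39.3M GPU-hour figure converts to wall-clock time once you assume a cluster size; the 16,384-H100 count below is an assumption based on Meta's reported training setup, not from the leak itself:

```python
gpu_hours = 39.3e6
cluster_gpus = 16_384          # assumed H100 count for the 405B run
hours = gpu_hours / cluster_gpus
print(f"{hours:.0f} hours ≈ {hours / 24:.0f} days of continuous training")  # ~2399 h ≈ 100 days
```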
Gemma 2: The Open Model for Everyone
gemma-2 qwen-72b mixtral-8x22b-instruct claude-3.5-sonnet google-deepmind alibaba mistral-ai anthropic knowledge-distillation attention-mechanisms multilingual-models multimodality model-training model-optimization memory-optimization fine-tuning kathleen-kenealy daniel-han
Gemma 2, a 27B parameter model from Google DeepMind, was released with innovations like 1:1 local-global attention alternation and logit soft-capping, leveraging knowledge distillation to train smaller models on over 50× the compute-optimal token quantity. The model supports multilingual and multimodal capabilities, with fine-tuning success on over 200 Indic language variants. The Open LLM Leaderboard highlights Alibaba's Qwen 72B as the top model, with Mistral AI's Mixtral-8x22B-Instruct also ranking highly. Anthropic launched Claude 3.5 Sonnet, improving intelligence at mid-tier cost and speed. Research on eliminating matrix multiplication in LLMs promises significant memory savings without performance loss. Kathleen Kenealy and Daniel Han provided insights on Gemma 2's tokenizer and attention scaling respectively.
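Logit soft-capping is worth spelling out because it is a one-liner: logits are squashed through tanh so they can never exceed a fixed cap, which stabilizes training without hard clipping. A sketch using the cap values reported for Gemma 2 (50.0 for attention logits, 30.0 for final logits); the code itself is illustrative:

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    """Bound logits smoothly to (-cap, cap): cap * tanh(logits / cap)."""
    return cap * torch.tanh(logits / cap)

x = torch.tensor([-100.0, -5.0, 0.0, 5.0, 100.0])
print(soft_cap(x, 50.0))  # attention-logit cap: extremes saturate near ±50
print(soft_cap(x, 30.0))  # final-logit cap applied before the output softmax
```

Near zero the function is approximately the identity, so typical logits pass through almost unchanged while outliers are bounded.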
HippoRAG: First, do know(ledge) Graph
qwen-2 gpt-4 hipporag alibaba openai knowledge-graphs personalized-pagerank multi-hop-retrieval chain-of-thought implicit-reasoning sparse-autoencoders model-interpretability model-efficiency model-architecture fine-tuning reinforcement-learning rohanpaul_ai omarsar0 nabla_theta huybery
Alibaba released new open-source Qwen2 models ranging from 0.5B to 72B parameters, achieving SOTA results on benchmarks like MMLU and HumanEval. Researchers introduced Sparse Autoencoders to interpret GPT-4 neural activity, improving feature representation. The HippoRAG paper proposes a hippocampus-inspired retrieval augmentation method using knowledge graphs and Personalized PageRank for efficient multi-hop reasoning. New techniques like Stepwise Internalization enable implicit chain-of-thought reasoning in LLMs, enhancing accuracy and speed. The Buffer of Thoughts (BoT) method improves reasoning efficiency with significant cost reduction. A novel scalable MatMul-free LLM architecture competitive with SOTA Transformers at billion-parameter scale was also presented. "Single-Step, Multi-Hop retrieval" is highlighted as a key advancement in retrieval speed and cost.
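HippoRAG's retrieval step reduces to Personalized PageRank over an entity graph, seeded at entities extracted from the query, so multi-hop neighbors surface in a single pass. A toy sketch with networkx; the graph contents and query entities are invented:

```python
import networkx as nx

# Toy knowledge graph: nodes are entities, edges are extracted relations.
G = nx.Graph()
G.add_edges_from([
    ("alan_turing", "bletchley_park"),
    ("bletchley_park", "enigma"),
    ("alan_turing", "turing_machine"),
    ("enigma", "wwii"),
])

# Seed PPR at the entities an LLM extracted from the query
# ("Where did Turing work on Enigma?").
seeds = {"alan_turing": 0.5, "enigma": 0.5}
scores = nx.pagerank(G, alpha=0.85, personalization=seeds)

# Passages attached to the highest-scoring entities are returned in one shot --
# the "single-step, multi-hop" retrieval the summary highlights.
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```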
Qwen 2 beats Llama 3 (and we don't know how)
qwen-2 llama-3 llama-3-70b gpt-4 nllb alibaba groq meta-ai-fair multilinguality benchmarking inference-speed sparse-autoencoders scaling-laws post-training instruction-following rejection-sampling execution-feedback model-release multilingual-models model-training philschmid huybery jonathanross321 awnihannun gdb nabla_theta ylecun
Alibaba released Qwen 2 models under the Apache 2.0 license, claiming to outperform Llama 3 among open models, with multilingual support in 29 languages and strong benchmark scores like MMLU 82.3 and HumanEval 86.0. Groq demonstrated ultra-fast inference on Llama-3 70B at 40,792 tokens/s, processing 4 Wikipedia articles in 200ms. Research on sparse autoencoders (SAEs) for interpreting GPT-4 neural activity showed new training methods, metrics, and scaling laws. Meta AI announced the No Language Left Behind (NLLB) model capable of high-quality translations between 200 languages, including low-resource ones. "Our post-training phase is designed with the principle of scalable training with minimal human annotation," highlighting techniques like rejection sampling for math and execution feedback for coding.
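The quoted post-training principle is concrete in the math case: sample many candidate solutions and keep only those a verifier accepts, then fine-tune on the survivors. A hedged sketch of that rejection-sampling filter, with `generate` and `extract_answer` as hypothetical callables:

```python
def rejection_sample(prompt: str, reference: str, generate, extract_answer, n: int = 16):
    """Collect SFT pairs by keeping only sampled solutions whose final answer checks out."""
    kept = []
    for _ in range(n):
        candidate = generate(prompt)                 # one sampled chain-of-thought solution
        if extract_answer(candidate) == reference:   # automatic accuracy check, no annotator
            kept.append({"prompt": prompt, "completion": candidate})
    return kept
```

The coding analogue replaces the answer check with execution feedback: run the generated program against unit tests and keep solutions that pass.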
Not much happened today
gpt-4o gemini-1.5-pro gemini-1.5-flash imagen-3 veo reka-core qwen-1.5-110b openai google-deepmind anthropic rekailabs alibaba salesforce multimodality long-context model-releases reinforcement-learning model-benchmarking text-to-image video-generation ai-assistants ilya-sutskever jakub-pachocki mike-krieger sama
Ilya Sutskever steps down as Chief Scientist at OpenAI after nearly a decade, with Jakub Pachocki named as his successor. Google DeepMind announces Gemini 1.5 Pro and Gemini 1.5 Flash models featuring 2 million token context and improved multimodal capabilities, alongside demos of Project Astra AI assistant, Imagen 3 text-to-image model, and Veo generative video model. GPT-4o tops the VHELM leaderboard and outperforms competitors on LMSYS Chatbot Arena. Reka Core multimodal model with 128K context and Alibaba's Qwen1.5-110B open-source model are released. Salesforce shares an online RLHF recipe.