All tags
Model: "qwen3"
not much happened today
kernelllm-8b gpt-4o deepseek-v3 mistral-medium-3 qwen3 blip3-o xgen-small anisora stable-audio-open-small alphaevolve meta-ai-fair mistral-ai qwen deepseek salesforce bilibili stability-ai google benchmarking model-performance multilinguality hardware-optimization multimodality image-generation video-generation text-to-audio model-parallelism chain-of-thought instruction-following reasoning mitigation-strategies reach_vb lmarena_ai theadimeline adcock_brett jxmnop dair_ai omarsar0
Meta released KernelLLM 8B, outperforming GPT-4o and DeepSeek V3 on KernelBench-Triton Level 1. Mistral Medium 3 debuted strongly in multiple benchmarks. Qwen3 models introduced a unified framework with multilingual support. DeepSeek-V3 features hardware-aware co-design. BLIP3-o family released for multimodal tasks using diffusion transformers. Salesforce launched xGen-Small models excelling in long-context and math benchmarks. Bilibili released AniSORA for anime video generation. Stability AI open-sourced Stable Audio Open Small optimized for Arm devices. Google’s AlphaEvolve coding agent improved Strassen's algorithm for the first time since 1969. Research shows chain-of-thought reasoning can harm instruction-following ability, with mitigation strategies like classifier-selective reasoning being most effective, but reasoning techniques show high variance and limited generalization. "Chain-of-thought (CoT) reasoning can harm a model’s ability to follow instructions" and "Mitigation strategies such as few-shot in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning can counteract reasoning-induced failures".
Granola launches team notes, while Notion launches meeting transcription
gpt-4.1 gpt-4o-mini gpt-4.1-mini claude-opus claude-sonnet claude-o3 qwen3 seed1.5-vl llama-4 am-thinking-v1 openai anthropic alibaba meta-ai-fair huggingface granola coding instruction-following benchmarking model-releases reasoning image-generation collaborative-software model-performance kevinweil scaling01 steph_palazzolo andersonbcdefg reach_vb yuchenj_uw qtnx_ _akhaliq risingsayak
GPT-4.1 is now available in ChatGPT for Plus, Pro, and Team users, focusing on coding and instruction following, with GPT 4.1 mini replacing GPT 4o mini. Anthropic is releasing new Claude models including Claude Opus and Claude Sonnet, though some criticism about hallucinations in Claude O3 was noted. Alibaba shared the Qwen3 Technical Report with strong benchmark results from Seed1.5-VL. Meta FAIR announced new models and datasets but faced criticism on Llama 4. AM-Thinking-v1 launched on Hugging Face as a 32B scale reasoning model. Granola raised $43M in Series B and launched Granola 2.0 with a Notion-like UI. The AI ecosystem shows rapid iteration and cloning of ideas, emphasizing execution and distribution.
not much happened today
gemini-2.5-flash gemini-2.0-flash mistral-medium-3 llama-4-maverick claude-3.7-sonnet qwen3 pangu-ultra-moe deepseek-r1 o4-mini x-reasoner google-deepmind mistral-ai alibaba huawei openai microsoft deepseek model-performance reasoning cost-analysis reinforcement-learning chain-of-thought multilinguality code-search model-training vision model-integration giffmana artificialanlys teortaxestex akhaliq john__allard
Gemini 2.5 Flash shows a 12 point increase in the Artificial Analysis Intelligence Index but costs 150x more than Gemini 2.0 Flash due to 9x more expensive output tokens and 17x higher token usage during reasoning. Mistral Medium 3 competes with Llama 4 Maverick, Gemini 2.0 Flash, and Claude 3.7 Sonnet with better coding and math reasoning at a significantly lower price. Alibaba's Qwen3 family supports reasoning and multilingual tasks across 119 languages and includes a Web Dev tool for app building. Huawei's Pangu Ultra MoE matches DeepSeek R1 performance on Ascend NPUs, with new compute and upcoming V4 training. OpenAI's o4-mini now supports Reinforcement Fine-Tuning (RFT) using chain-of-thought reasoning. Microsoft's X-REASONER enables generalizable reasoning across modalities post-trained on general-domain text. Deep research integration with GitHub repos in ChatGPT enhances codebase search and reporting. The AI Engineer World's Fair offers an Early Bird discount for upcoming tickets.
Gemini 2.5 Pro Preview 05-06 (I/O edition) - the SOTA vision+coding model
gemini-2.5-pro claude-3.7-sonnet llama-nemotron qwen3 google-deepmind nvidia alibaba hugging-face multimodality coding reasoning model-release speech-recognition recommender-systems benchmarking demishassabis _philschmid lmarena_ai scaling01 fchollet
Gemini 2.5 Pro has been updated with enhanced multimodal image-to-code capabilities and dominates the WebDev Arena Leaderboard, surpassing Claude 3.7 Sonnet in coding and other tasks. Nvidia released the Llama-Nemotron model family on Hugging Face, noted for efficient reasoning and inference. Alibaba's Qwen3 models range from 0.6B to 235B parameters, including dense and MoE variants. KerasRS was released by Fran ois Chollet as a new recommender system library compatible with JAX, PyTorch, and TensorFlow, optimized for TPUs. These updates highlight advancements in coding, reasoning, and speech recognition models.
ChatGPT responds to GlazeGate + LMArena responds to Cohere
qwen3-235b-a22b qwen3 qwen3-moe llama-4 openai cohere lm-arena deepmind x-ai meta-ai-fair alibaba vllm llamaindex model-releases model-benchmarking performance-evaluation open-source multilinguality model-integration fine-tuning model-optimization joannejang arankomatsuzaki karpathy sarahookr reach_vb
OpenAI faced backlash after a controversial ChatGPT update, leading to an official retraction admitting they "focused too much on short-term feedback." Researchers from Cohere published a paper criticizing LMArena for unfair practices favoring incumbents like OpenAI, DeepMind, X.ai, and Meta AI Fair. The Qwen3 family by Alibaba was released, featuring models up to 235B MoE, supporting 119 languages and trained on 36 trillion tokens, with integration into vLLM and support in tools like llama.cpp. Meta announced the second round of Llama Impact Grants to promote open-source AI innovation. Discussions on AI Twitter highlighted concerns about leaderboard overfitting and fairness in model benchmarking, with notable commentary from karpathy and others.
LlamaCon: Meta AI gets into the Llama API platform business
llama-4 qwen3 qwen3-235b-a22b qwen3-30b-a3b qwen3-4b qwen2-5-72b-instruct o3-mini meta-ai-fair cerebras groq alibaba vllm ollama llamaindex hugging-face llama-cpp model-release fine-tuning reinforcement-learning moe multilingual-models model-optimization model-deployment coding benchmarking apache-license reach_vb huybery teortaxestex awnihannun thezachmueller
Meta celebrated progress in the Llama ecosystem at LlamaCon, launching an AI Developer platform with finetuning and fast inference powered by Cerebras and Groq hardware, though it remains waitlisted. Meanwhile, Alibaba released the Qwen3 family of large language models, including two MoE models and six dense models ranging from 0.6B to 235B parameters, with the flagship Qwen3-235B-A22B achieving competitive benchmark results and supporting 119 languages and dialects. The Qwen3 models are optimized for coding and agentic capabilities, are Apache 2.0 licensed, and have broad deployment support including local usage with tools like vLLM, Ollama, and llama.cpp. Community feedback highlights Qwen3's scalable performance and superiority over models like OpenAI's o3-mini.