Topic: "dataset-release"
Gemini 2.5 Pro (06-05) launched at AI Engineer World's Fair
gemini-2.5-pro qwen3-embedding-8b openthinker3-7b google qwen lighton morph-labs openai nvidia benchmarking reasoning coding math embedding-models late-interaction dataset-release model-performance model-architecture ai-conferences greg_brockman jensen_huang christian_szegedy swyx
On the second day of AIE, Google's Gemini 2.5 Pro reclaimed the top spot on the LMArena leaderboard with a score of 1470 and a +24 Elo increase, showing improvements in coding, reasoning, and math. Qwen3 released state-of-the-art embedding and reranking models, with Qwen3-Embedding-8B topping the MTEB multilingual leaderboard. OpenThinker3-7B emerged as the top open reasoning model trained on the OpenThoughts3-1.2M dataset, outperforming previous models by 33%. LightOn introduced FastPlaid, achieving up to a 554% speedup for late-interaction models. Morph Labs hired Christian Szegedy as Chief Scientist to lead Verified Superintelligence development. The AI Engineer World's Fair featured a fireside chat with Greg Brockman and NVIDIA CEO Jensen Huang, highlighting the return of basic research and engineering best practices.
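As a rough reading of what a +24 Elo gap means in head-to-head terms, the sketch below applies the standard Elo logistic formula. This is an assumption for illustration: LMArena's exact scoring method may differ, and the function name `expected_win_rate` is hypothetical.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Expected win probability for the higher-rated model.

    Uses the standard Elo logistic curve with a 400-point scale.
    Assumption: the leaderboard's scores behave like classic Elo ratings.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


if __name__ == "__main__":
    # A 24-point gap corresponds to only a slight edge in pairwise votes.
    print(f"{expected_win_rate(24):.3f}")  # prints 0.534
```

Under that assumption, a 24-point lead translates to winning roughly 53% of pairwise comparisons, so small Elo movements at the top of the leaderboard reflect narrow preference margins.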
not much happened today
open-code-reasoning-32b open-code-reasoning-14b open-code-reasoning-7b mistral-medium-3 llama-4-maverick gemini-2.5-pro gemini-2.5-flash claude-3.7-sonnet absolute-zero-reasoner x-reasoner fastvlm parakeet-asr openai nvidia mistral-ai google apple huggingface reinforcement-learning fine-tuning code-generation reasoning vision on-device-ai model-performance dataset-release model-optimization reach_vb artificialanlys scaling01 iscienceluvr arankomatsuzaki awnihannun risingsayak
OpenAI launched both Reinforcement Finetuning and Deep Research on GitHub repos, drawing comparisons to Cognition's DeepWiki. Nvidia open-sourced Open Code Reasoning models (32B, 14B, 7B) with Apache 2.0 license, showing 30% better token efficiency and compatibility with llama.cpp, vLLM, transformers, and TGI. Independent evaluations highlight Mistral Medium 3 rivaling Llama 4 Maverick, Gemini 2.0 Flash, and Claude 3.7 Sonnet in coding and math reasoning, priced significantly lower but no longer open-source. Google's Gemini 2.5 Pro is noted as their most intelligent model with improved coding from simple prompts, while Gemini 2.5 Flash incurs a 150x cost increase over Gemini 2.0 Flash due to higher token usage and cost. The Absolute Zero Reasoner (AZR) achieves SOTA performance in coding and math reasoning via reinforced self-play without external data. Vision-language model X-REASONER is post-trained on general-domain text for reasoning. Apple ML research released FastVLM with on-device iPhone demo. HiDream LoRA trainer supports QLoRA fine-tuning under memory constraints. Nvidia's Parakeet ASR model tops Hugging Face ASR leaderboard with MLX implementation. New datasets SwallowCode and SwallowMath boost LLM performance in math and code. Overall, a quiet day with significant model releases and performance insights.
Qwen 3: 0.6B to 235B MoE full+base models that beat R1 and o1
qwen-3 qwen3-235b-a22b qwen3-30b-a3b deepseek-r1 o1 o3-mini grok-3 gemini-2.5-pro alibaba google-deepmind deepseek mistral-ai mixture-of-experts reinforcement-learning benchmarking model-release model-architecture long-context multi-agent-systems inference dataset-release awnihannun prince_canuma actuallyisaak oriolvinyalsml iscienceluvr reach_vb teortaxestex omarsar0
Qwen 3 has been released by Alibaba featuring a range of models including two MoE variants, Qwen3-235B-A22B and Qwen3-30B-A3B, which demonstrate competitive performance against top models like DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. The models introduce an "enable_thinking=True" mode with advanced soft switching for inference scaling. The release is notable for its Apache 2.0 license and broad inference platform support including MCP. The dataset improvements and multi-stage RL post-training contribute to performance gains. Meanwhile, Gemini 2.5 Pro from Google DeepMind shows strong coding and long-context reasoning capabilities, and DeepSeek R2 is anticipated soon. Twitter discussions highlight Qwen3's finegrained MoE architecture, large context window, and multi-agent system applications.
lots of little things happened this week
llama-3-3-nemotron-super-49b-v1 claude anthropic nvidia sakana-ai meta-ai-fair reinforcement-learning reasoning benchmarks multi-turn-collaboration instruction-following dataset-release model-evaluation percy-liang
Anthropic introduced a novel 'think' tool enhancing instruction adherence and multi-step problem solving in agents, with combined reasoning and tool use demonstrated by Claude. NVIDIA's Llama-3.3-Nemotron-Super-49B-v1 ranked #14 on LMArena, noted for strong math reasoning and a 15M post-training dataset. Sakana AI launched a Sudoku-based reasoning benchmark to advance AI problem-solving capabilities. Meta AI released SWEET-RL, a reinforcement learning algorithm improving long-horizon multi-turn tasks by 6%, and introduced CollaborativeAgentBench, a benchmark for collaborative LLM agents working with humans on programming and design tasks. Percy Liang relaunched the HELM benchmark with 5 challenging datasets evaluating 22 top language models.
not much happened today
gemini-2.0-flash-thinking-experimental-1-21 zonos openr1-math-220k huginn-3.5b deepseek-r1 o1 claude google zyphraai hugging-face anthropic deepseek openai vision multilingual-models text-to-speech voice-cloning math reasoning latent-reasoning chain-of-thought dataset-release fine-tuning model-training model-performance context-windows benchmarking jeremyphoward andrej-karpathy tom-goldstein reach_vb iscienceluvr
Google released Gemini 2.0 Flash Thinking Experimental 1-21, a vision-language reasoning model with a 1 million-token context window and improved accuracy on science, math, and multimedia benchmarks, surpassing DeepSeek-R1 but trailing OpenAI's o1. ZyphraAI launched Zonos, a multilingual Text-to-Speech model with instant voice cloning and controls for speaking rate, pitch, and emotions, running at ~2x real-time speed on RTX 4090. Hugging Face released OpenR1-Math-220k, a large-scale math reasoning dataset with 220K problems and 800K reasoning traces generated on 512 H100 GPUs. Tom Goldstein introduced Huginn-3.5B, an open-source latent reasoning model trained on 800B tokens that outperforms larger models on reasoning tasks like GSM8K. Discussions by Jeremy Howard and iScienceLuvr highlight advances in implicit latent reasoning and debate the future of human-readable reasoning traces. Anthropic launched the Anthropic Economic Index to analyze AI's economic impact using millions of Claude conversations.
not much happened today
claudette llama-3-1 yi-lightning gpt-4o claude-3.5-sonnet answer-ai tencent notebooklm motherduck perplexity dropbox openai meta-ai-fair yi-ai zyphra-ai anthropic langchain openai synthetic-data fine-tuning sql audio-processing on-device-ai dataset-release transformer llm-reasoning ai-safety code-generation ai-pricing ai-job-market fchollet aravsrinivas svpino swyx
Answer.ai launched fastdata, a synthetic data generation library using claudette and Tencent's Billion Persona paper. NotebookLM became customizable, and Motherduck introduced notable LLMs in SQL implementations. Perplexity and Dropbox announced competitors to Glean. OpenAI unveiled audio chat completions priced at 24 cents per minute. Meta AI released Llama 3.1, powering Lenovo AI Now's on-device agent. Yi-Lightning model ranked #6 globally, surpassing GPT-4o. Zyphra AI released the large Zyda-2 dataset with 5 trillion tokens. François Chollet clarified transformer architecture as set-processing, not sequence-processing. Research suggests memorization aids LLM reasoning. Anthropic updated its Responsible Scaling Policy for AI safety. Tools like Perplexity Finance, Open Canvas by LangChain, and AlphaCodium code generation tool were highlighted. Approximately $500 million was raised for AI agent startups, with ongoing discussions on AI's job market impact. Combining prompt caching with the Batches API can yield a 95% discount on Claude 3.5 Sonnet tokens.
Llama 3.1 Leaks: big bumps to 8B, minor bumps to 70b, and SOTA OSS 405b model
llama-3-1-405b llama-3-8b llama-3-70b llama-3-1-8b gpt-4o gpt-4o-mini claude-3-5 qwen-2 meta-ai-fair openai alibaba multilinguality code-generation context-windows model-training synthetic-data benchmarking reasoning fine-tuning model-performance dataset-release swyx philschmid jjitsev lewtun teknium1 adcock_brett
Llama 3.1 leaks reveal a 405B dense model with 128k context length, trained on 39.3M GPU hours using H100-80GB GPUs, and fine-tuned with over 25M synthetic examples. The model shows significant benchmark improvements, especially for the 8B and 70B variants, with some evals suggesting the 70B outperforms GPT-4o. GPT-4o Mini launched as a cost-efficient variant with strong performance but some reasoning weaknesses. Synthetic datasets like NuminaMath enable models such as Alibaba Qwen 2 to surpass GPT-4o and Claude 3.5 in math competitions. Discussions include reasoning task benchmarks and dataset building for improved reasoning.
GraphRAG: The Marriage of Knowledge Graphs and RAG
gemma-2 llama-3-70b claude-3.5-sonnet nemotron-340b qwen2-72b llama-3 microsoft-research anthropic nvidia hugging-face retrieval-augmented-generation knowledge-graphs token-usage inference-time attention-mechanisms instruction-following coding math long-range-reasoning synthetic-data dataset-release fine-tuning context-windows function-calling travis-fischer rasbt alexandr-wang osanseviero rohanpaul_ai hamelhusain svpino aaaazzam omarsar0
Microsoft Research open sourced GraphRAG, a retrieval augmented generation (RAG) technique that extracts knowledge graphs from sources and clusters them for improved LLM answers, though it increases token usage and inference time. Gemma 2 models were released focusing on efficient small LLMs with innovations like sliding window attention and RMS norm, nearly matching the larger Llama 3 70B. Anthropic's Claude 3.5 Sonnet leads in instruction following and coding benchmarks, while Nvidia's Nemotron 340B model was released in June. Qwen2-72B tops the HuggingFace Open LLM leaderboard excelling in math and long-range reasoning. Discussions on RAG highlighted its limitations and improvements in context usage via function calls. A persona-driven synthetic data generation approach introduced 1 billion personas, with a fine-tuned model matching GPT-4 performance on math benchmarks at 7B scale. The 200GB AutoMathText dataset was also noted for math data synthesis.
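For readers unfamiliar with the RMS norm mentioned in the Gemma 2 summary, here is a minimal sketch of the generic RMSNorm operation in plain Python. It illustrates the general technique only and is not Gemma 2's actual implementation; the names `rms_norm`, `gain`, and `eps` are hypothetical.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale a vector by the reciprocal of its root-mean-square.

    Unlike LayerNorm, no mean is subtracted; only the magnitude is normalized.
    `gain` is a per-element learned scale (supplied here as a plain list).
    """
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * g for v, g in zip(x, gain)]


if __name__ == "__main__":
    # With unit gain, the output has a root-mean-square of about 1.
    print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```

Skipping the mean subtraction makes the operation slightly cheaper than LayerNorm, which is one reason it is commonly used in efficiency-focused models.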
The Last Hurrah of Stable Diffusion?
llama-3-8b llama-3 qwen-2 gpt-4 gpt-4o stability-ai togethercompute model-architecture fine-tuning benchmarks dataset-release model-evaluation reasoning model-training retrieval-augmented-generation multimodality emad-mostaque rohanpaul_ai fchollet mikeknoop micahgoldblum teknium1 rasbt percyliang
Stability AI launched Stable Diffusion 3 Medium with models ranging from 450M to 8B parameters, featuring the MMDiT architecture and T5 text encoder for image text rendering. The community has shown mixed reactions following the departure of key researchers like Emad Mostaque. On AI models, Llama 3 8B Instruct shows strong evaluation correlation with GPT-4, while Qwen 2 Instruct surpasses Llama 3 on MMLU benchmarks. The Mixture of Agents (MoA) framework outperforms GPT-4o on AlpacaEval 2.0. Techniques like Spectrum and QLoRA enable efficient fine-tuning with less VRAM. Research on grokking reveals transformers can transition from memorization to generalization through extended training. Benchmark initiatives include the $1M ARC Prize Challenge for AGI progress and LiveBench, a live LLM benchmark to prevent dataset contamination. The Character Codex Dataset offers open data on over 15,000 characters for RAG and synthetic data. The MLX 0.2 tool enhances LLM experience on Apple Silicon Macs with improved UI and faster retrieval-augmented generation.
Stable Diffusion 3 — Rombach & Esser did it again!
stable-diffusion-3 claude-3 orca dolphincoder-starcoder2-15b stability-ai anthropic microsoft latitude perplexity-ai llamaindex tripo-ai diffusion-models multimodality benchmarking human-evaluation text-generation image-generation 3d-modeling fine-tuning roleplay coding dataset-release soumith-chintala bill-peebles swyx kevinafischer jeremyphoward akhaliq karinanguyen_ aravsrinivas
Over 2500 new community members joined following Soumith Chintala's shoutout, highlighting growing interest in SOTA LLM-based summarization. The major highlight is the detailed paper release of Stable Diffusion 3 (SD3), showcasing advanced text-in-image control and complex prompt handling, with the model outperforming other SOTA image generation models in human-evaluated benchmarks. The SD3 model is based on an enhanced Diffusion Transformer architecture called MMDiT. Meanwhile, Anthropic released Claude 3 models, noted for human-like responses and emotional depth, scoring 79.88% on HumanEval but costing over twice as much as GPT-4. Microsoft launched new Orca-based models and datasets, and Latitude released DolphinCoder-StarCoder2-15b with strong coding capabilities. Integration of image models by Perplexity AI and 3D CAD generation by PolySpectra powered by LlamaIndex were also highlighted. "SD3's win rate beats all other SOTA image gen models (except perhaps Ideogram)" and "Claude 3 models are very good at generating d3 visualizations from text descriptions."
Dia de las Secuelas (StarCoder, The Stack, Dune, SemiAnalysis)
starcoder-2 starcoder2-15b hugging-face bigcode code-generation model-training dataset-release model-performance dylan-patel
HuggingFace/BigCode has released StarCoder v2, including the StarCoder2-15B model trained on over 600 programming languages using The Stack v2 dataset. This release marks a state-of-the-art achievement for models of this size, with opt-out requests excluded from training data. A detailed technical report is available, highlighting the model's capabilities and training methodology. Additionally, a live event featuring Dylan Patel discussing GPU economics is announced for San Francisco.
Trust in GPTs at all time low
llama-3 mistral-medium llava-1.6 miquella-120b-gguf tinymodels miqumaid harmony-4x7b-bf16 smaug-34b-v0.1 openai hugging-face mistral-ai nous-research bittensor context-management fine-tuning model-merging quantization gpu-servers visual-reasoning ocr dataset-release incentive-structures nick-dobos manojbh teknium arthurmensch
Discord communities were analyzed with 21 guilds, 312 channels, and 8530 messages reviewed, saving an estimated 628 minutes of reading time. Discussions highlighted challenges with GPTs and the GPT store, including critiques of the knowledge files capability and context management issues. The CUDA MODE Discord was introduced for CUDA coding support. Key conversations in the TheBloke Discord covered Xeon GPU server cost-effectiveness, Llama3 and Mistral Medium model comparisons, LLaVA-1.6's visual reasoning and OCR capabilities, and the leaked Miqu 70B model. Technical topics included fine-tuning TinyLlama and MiquMaid+Euryale models, and model merging with examples like Harmony-4x7B-bf16 and Smaug-34B-v0.1. The Nous Research AI Discord discussed style influence in LLMs, quantization issues, Bittensor incentives for AI model improvements, and the identification of MIQU as Mistral Medium. The release of the Open Hermes 2.5 dataset on Hugging Face was also announced. "Discussions pointed towards the need for better context management in GPTs, contrasting with OpenAI's no-code approach."
1/12/2024: Anthropic coins Sleeper Agents
nous-mixtral 120b anthropic openai nous-research hugging-face reinforcement-learning fine-tuning backdoors model-security adversarial-training chain-of-thought model-merging dataset-release security-vs-convenience leo-gao andrej-karpathy
Anthropic released a new paper exploring the persistence of deceptive alignment and backdoors in models through stages of training including supervised fine-tuning and reinforcement learning safety training. The study found that safety training and adversarial training did not eliminate backdoors, which can cause models to write insecure code or exhibit hidden behaviors triggered by specific prompts. Notable AI figures like Leo Gao and Andrej Karpathy praised the work, highlighting its implications for future model security and the risks of sleeper agent LLMs. Additionally, the Nous Research AI Discord community discussed topics such as the trade-off between security and convenience, the Hulk Dataset 0.1 for LLM fine-tuning, curiosity about a 120B model and Nous Mixtral, debates on LLM leaderboard legitimacy, and the rise of Frankenmerge techniques for model merging and capacity enhancement.