Topic: "open-source-models"
not much happened today
codex claude-4-opus claude-4-sonnet gemini-2.5-pro gemini-2.5 qwen-2.5-vl qwen-3 playdiffusion openai anthropic google perplexity-ai bing playai suno hugging-face langchain-ai qwen mlx assemblyai llamacloud fine-tuning model-benchmarking text-to-video agentic-ai retrieval-augmented-generation open-source-models speech-editing audio-processing text-to-speech ultra-low-latency multimodality public-notebooks sama gdb kevinweil lmarena_ai epochairesearch reach_vb wightmanr deeplearningai mervenoyann awnihannun jordirib1 aravsrinivas omarsar0 lioronai jerryjliu0 nerdai tonywu_71 _akhaliq clementdelangue _mfelfel
OpenAI rolled out Codex to ChatGPT Plus users with internet access and fine-grained controls, and improved memory features for free users. Anthropic's Claude 4 Opus and Sonnet models lead coding benchmarks, while Google's Gemini 2.5 Pro and Flash models gain recognition with new audio capabilities. Qwen 2.5-VL and Qwen 3 quantizations are noted for versatility and support. Bing Video Creator launched globally, enabling text-to-video generation, and Perplexity Labs sees increased demand for travel search. New agentic AI tools and RAG innovations include LlamaCloud and FedRAG. Open-source releases include Holo-1 for web navigation and PlayAI's PlayDiffusion for speech editing. Audio and multimodal advances feature Suno's music editing upgrades, Google's native TTS in 24+ languages, and AssemblyAI's Universal Streaming for ultra-low-latency speech-to-text. Google NotebookLM now supports public notebooks. "Codex's internet access brings tradeoffs, with explicit warnings about risk" and "Gemini 2.5 Pro is cited as a daily driver by users" were notable quotes.
not much happened today
gemini-2.0-flash imagen-3 mistral-small-3.1 mistral-3 gpt-4o-mini claude-3.5-haiku olm0-32b qwen-2.5 shieldgemma-2 julian fasttransform nvidia google mistral-ai allen-ai anthropic langchainai perplexity-ai kalshi stripe qodoai multimodality image-generation context-windows model-pricing open-source-models image-classification frameworks python-libraries partnerships jeremyphoward karpathy abacaj mervenoyann
Several AI updates were highlighted on Day 1 of NVIDIA GTC. Google's Gemini 2.0 Flash introduces image input/output but is not recommended for text-to-image tasks, where Imagen 3 is preferred. Mistral AI released Mistral Small 3.1 with a 128k-token context window and competitive pricing. Allen AI launched OLMo-32B, an open LLM outperforming GPT-4o mini and Qwen 2.5. ShieldGemma 2 was introduced for image safety classification. LangChainAI announced multiple updates, including Julian powered by LangGraph and integration with AnthropicAI's MCP. Jeremy Howard released fasttransform, a Python library for data transformations. Perplexity AI partnered with Kalshi for NCAA March Madness predictions.
s1: Simple test-time scaling (and Kyutai Hibiki)
qwen-2.5-32b gemini-2.0-flash smollm2 granite-vision-3.1-2b google-deepmind qwen gemini hugging-face ibm deepseek reasoning fine-tuning scaling-laws open-source-models data-centric-training vision multilingual-models language-model-reasoning niklas-muennighoff
"Wait" is all you need introduces a novel reasoning model finetuned from Qwen 2.5 32B using just 1000 questions with reasoning traces distilled from Gemini 2.0 Flash Thinking, enabling controllable test-time compute by appending "Wait" to extend reasoning. Lead author Niklas Muennighoff, known for work on Bloom, StarCoder, and BIG-bench, highlights this method's efficiency and its reproduction of the famous o1 scaling chart. Additionally, Kyutai Moshi's Hibiki project demonstrates impressive offline French-English live translation on iPhone. Recent AI model releases include DeepSeek R1 and R3 open source models, potentially marking a major open-source milestone, Hugging Face's SmolLM2 emphasizing data-centric training for small LMs, and IBM's Granite-Vision-3.1-2B, a small vision-language model with strong performance. Key research papers spotlight LIMO for minimal demonstration reasoning achieving high accuracy on AIME and MATH benchmarks, and Token-Assisted Reasoning mixing latent and text tokens to improve language model reasoning.
not much happened today
llama-3-2 llama-3 gemma-2 phi-3-5-mini claude-3-haiku gpt-4o-mini molmo gemini-1.5 gemini meta-ai-fair openai allenai google-deepmind multimodality model-optimization benchmarks ai-safety model-distillation pruning adapter-layers open-source-models performance context-windows mira-murati demis-hassabis ylecun sama
Meta AI released the Llama 3.2 models, including 1B and 3B text-only variants and 11B and 90B vision variants, with 128K-token context length and adapter layers for image-text integration. These models outperform competitors like Gemma 2 and Phi 3.5-mini, and are supported on major platforms including AWS, Azure, and Google Cloud. OpenAI CTO Mira Murati announced her departure. Allen AI released Molmo, an open-source multimodal model family outperforming proprietary systems. Google improved Gemini 1.5 with Flash and Pro models. Meta showcased Project Orion AR glasses and hinted at a Quest 3S priced at $300. Discussions covered new benchmarks for multimodal models, model optimization, and AI safety and alignment.
Cerebras Inference: Faster, Better, AND Cheaper
llama-3.1-8b llama-3.1-70b gemini-1.5-flash gemini-1.5-pro cogvideox-5b mamba-2 rene-1.3b llama-3.1 gemini-1.5 claude groq cerebras cursor google-deepmind anthropic inference-speed wafer-scale-chips prompt-caching model-merging benchmarking open-source-models code-editing model-optimization jeremyphoward sam-altman nat-friedman daniel-gross swyx
Groq led early 2024 with superfast LLM inference speeds, achieving ~450 tokens/sec for Mixtral 8x7B and 240 tokens/sec for Llama 2 70B. Cursor introduced a specialized code edit model hitting 1000 tokens/sec. Now, Cerebras claims the fastest inference with its wafer-scale chips, running Llama 3.1 8B at 1800 tokens/sec and Llama 3.1 70B at 450 tokens/sec at full precision, with competitive pricing and a generous free tier. Google's Gemini 1.5 models showed significant benchmark improvements, especially Gemini-1.5-Flash and Gemini-1.5-Pro. New open-source models like CogVideoX-5B and Mamba-2 (Rene 1.3B) were released, optimized for consumer hardware. Anthropic's Claude now supports prompt caching, improving speed and cost efficiency. "Cerebras Inference runs Llama3.1 20x faster than GPU solutions at 1/5 the price."
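To put the quoted throughput numbers in perspective, the sketch below converts tokens/sec into the time needed to stream a 1000-token response. The throughput values are the ones cited in the summary above; the 1000-token response length and the helper name `seconds_to_generate` are illustrative assumptions, and the calculation covers decode time only (no time-to-first-token or network latency).

```python
# Throughput figures (tokens/sec) as quoted in the summary above.
THROUGHPUT = {
    "Groq, Llama 2 70B": 240,
    "Groq, Mixtral 8x7B": 450,
    "Cerebras, Llama 3.1 70B": 450,
    "Cursor code edit model": 1000,
    "Cerebras, Llama 3.1 8B": 1800,
}


def seconds_to_generate(num_tokens: int, tokens_per_sec: float) -> float:
    """Decode time only; ignores time-to-first-token and network latency."""
    return num_tokens / tokens_per_sec


if __name__ == "__main__":
    for name, tps in THROUGHPUT.items():
        secs = seconds_to_generate(1000, tps)
        print(f"{name}: {secs:.2f} s per 1000 tokens")
```

At these rates a 1000-token answer takes roughly four seconds at 240 tokens/sec and well under a second at 1800 tokens/sec.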
The DSPy Roadmap
dspy litel-lm gemini chatgpt-4o grok-2 hermes-3 databricks mit google openai x-ai nous-research astribot apple sakana-ai model-optimization fine-tuning optimizers interactive-optimization robotics autonomous-systems voice image-generation open-source-models scientific-research streaming caching omar-khattab giffmana
Omar Khattab announced he is joining Databricks before starting his MIT professorship and outlined the roadmap for DSPy 2.5 and 3.0+. The roadmap focuses on improving core components (LMs, signatures, optimizers, and assertions), including adopting LiteLLM to reduce code and enhance caching and streaming. It also includes developing more accurate, cost-effective optimizers, building tutorials, and enabling interactive optimization tracking. On AI Twitter, Google launched Gemini Live, a mobile conversational voice assistant with 10 voices, alongside Pixel Buds Pro 2 with a custom Tensor A1 chip. OpenAI updated ChatGPT-4o, reclaiming the top spot on LMSYS Arena. xAI released Grok-2 in beta, achieving SOTA in image generation with FLUX.1. Nous Research released open-source Hermes 3 models in 8B, 70B, and 405B sizes, with the 405B model achieving SOTA. Robotics updates include Astribot's humanoid robot and Apple's tabletop robot with Siri voice commands. Sakana AI introduced "The AI Scientist," an autonomous AI research system.
Execuhires: Tempting The Wrath of Khan
gemini-1.5-pro gpt-4o claude-3.5 flux-1 llama-3-1-405b character.ai google adept amazon inflection microsoft stability-ai black-forest-labs schelling google-deepmind openai anthropic meta-ai-fair lmsys langchainai execuhire model-benchmarking multilinguality math coding text-to-image agent-ide open-source-models post-training data-driven-performance noam-shazeer mostafa-mostaque david-friedman rob-rombach alexandr-wang svpino rohanpaul_ai
Character.ai's $2.5B execuhire to Google marks a significant leadership move, alongside Adept's $429M execuhire to Amazon and Inflection's $650M execuhire to Microsoft. Despite strong user growth and content momentum, Character.ai's CEO Noam Shazeer returns to Google, signaling shifting vibes in the AI industry. Google DeepMind's Gemini 1.5 Pro tops Chatbot Arena benchmarks, outperforming GPT-4o and Claude-3.5, excelling in multilingual, math, and coding tasks. The launches of Black Forest Labs' FLUX.1 text-to-image model and the LangGraph Studio agent IDE highlight ongoing innovation. Llama 3.1 405B is released as the largest open-source model, fostering developer use and competition with closed models. The industry is focusing increasingly on post-training and data as key competitive factors, and the execuhire deals raise questions about acquisition practices and regulatory scrutiny.
DataComp-LM: the best open-data 7B model/benchmark/dataset
mistral-nemo-12b gpt-4o-mini deepseek-v2-0628 mistral-7b llama-3 gemma-2 qwen-2 datacomp hugging-face openai nvidia mistral-ai deepseek dataset-design scaling-laws model-benchmarking model-performance fine-tuning multilinguality function-calling context-windows open-source-models model-optimization cost-efficiency benchmarking sam-altman guillaume-lample philschmid miramurati
DataComp team released a competitive 7B open data language model trained on only 2.5T tokens from the massive DCLM-POOL dataset of 240 trillion tokens, showing superior scaling trends compared to FineWeb. OpenAI launched GPT-4o mini, a cost-effective model with 82% MMLU and performance near GPT-4-Turbo, aimed at developers for broad applications. NVIDIA and Mistral jointly released the Mistral NeMo 12B model featuring a 128k token context window, FP8 checkpoint, multilingual support, and Apache 2.0 licensing. DeepSeek announced DeepSeek-V2-0628 as the top open-source model on the LMSYS Chatbot Arena leaderboard with strong rankings in coding, math, and hard prompts. This news highlights advances in dataset design, model efficiency, and open-source contributions in the AI community.
Problems with MMLU-Pro
mmlu-pro llama-3-8b-q8 gpt4all-3.0 chatgpt claude llama gemini mobilellm runway-gen-3-alpha meta-3d-gen huggingface meta-ai-fair salesforce runway nomic-ai pineapple argil-ai benchmarking prompt-engineering model-evaluation model-performance multimodality automated-dataset-generation video-generation open-source-models ai-assistants text-to-3d deepfake transformers reasoning wenhu-chen danhendrycks clementine ylecun adcock_brett svpino rohanpaul_ai
MMLU-Pro is gaining attention as the successor to MMLU on the Open LLM Leaderboard V2 by HuggingFace, despite community concerns about evaluation discrepancies and prompt sensitivity affecting model performance, notably a 10-point improvement in Llama-3-8b-q8 with simple prompt tweaks. Meta's MobileLLM research explores running sub-billion parameter LLMs on smartphones using shared weights and deeper architectures. Salesforce's APIGen introduces an automated dataset generation system for function-calling tasks outperforming larger models. Runway Gen-3 Alpha launches an AI video generator for paid users creating realistic 10-second clips. Nomic AI's GPT4All 3.0 offers an open-source desktop app supporting thousands of local models. AI assistants with multimodal capabilities and affordable access to multiple LLMs like ChatGPT, Claude, Llama, and Gemini are emerging. Meta 3D Gen advances text-to-3D asset generation, while Argil AI enables deepfake video creation from text threads. Research on transformer grokking and reasoning highlights advances in robust reasoning capabilities.
Mozilla's AI Second Act
llama-3 claude-3-opus gemini-1.5 deepseek-coder-v2 gpt-4 mozilla llamaindex anthropic etched-ai sohu deepseek openai vector-search inference-speed hardware-benchmarks context-windows open-source-models coding reasoning model-benchmarking gpu-inference agentic-ai justine-tunney stephen-hood tim-dettmers bindureddy
Mozilla showcased detailed live demos of llamafile and announced sqlite-vec for vector search integration at the AIE World's Fair. LlamaIndex launched llama-agents. Anthropic introduced new UI features and Projects for Claude with a 200K context window. Etched AI revealed Sohu, a specialized inference chip claiming 500k tokens/sec and 15 agent trajectories/sec, though the benchmark claims are questioned. Tim Dettmers shared theoretical GPU inference limits of ~300k tokens/sec for 8xB200 NVLink on 70B Llama. DeepSeek Coder V2 outperforms Gemini and GPT-4 variants in coding and reasoning. The PyTorch documentary launched to little attention.
Nemotron-4-340B: NVIDIA's new large open models, built on syndata, great for syndata
nemotron-4-340b mixtral llama-3 gemini-1.5 gpt-4o mamba-2-hybrid-8b samba-3.8b-instruct dolphin-2.9.3 faro-yi-9b-dpo nvidia hugging-face mistral-ai llamaindex cohere gemini mistral synthetic-data model-alignment reward-models fine-tuning long-context model-scaling inference-speed mixture-of-agents open-source-models model-training instruction-following context-windows philipp-schmid bryan-catanzaro oleksii-kuchaiev rohanpaul_ai cognitivecompai _philschmid 01ai_yi
NVIDIA has scaled up its Nemotron-4 model from 15B to a massive 340B dense model, trained on 9T tokens, achieving performance comparable to GPT-4. The model alignment process uses over 98% synthetic data, with only about 20K human-annotated samples for fine-tuning and reward model training. The synthetic data generation pipeline is open-sourced, including synthetic prompts and preference data generation. The base and instruct versions outperform Mixtral and Llama 3, while the reward model ranks better than Gemini 1.5, Cohere, and GPT-4o. Other notable models include Mamba-2-Hybrid 8B, which is up to 8x faster than Transformers and excels on long-context tasks, Samba-3.8B-instruct for infinite context length with linear complexity, Dolphin-2.9.3 tiny models optimized for low-resource devices, and Faro Yi 9B DPO with a 200K context window running efficiently on 16GB VRAM. The Mixture-of-Agents technique boosts open-source LLMs beyond GPT-4 Omni on AlpacaEval 2.0.
Ten Commandments for Deploying Fine-Tuned Models
claude-3-opus claude-3 gpt-4o anthropic google openai fine-tuning prompt-engineering model-evaluation feature-alteration benchmarking model-performance open-source-models kyle-corbitt bindureddy alexalbert__
Gemini-in-Google-Slides is highlighted as a useful tool for summarizing presentations. Kyle Corbitt's talk on deploying fine-tuned models in production emphasizes avoiding fine-tuning unless necessary, focusing on prompting, data quality, appropriate model choice, and thorough evaluation. Anthropic showcased feature alteration in Claude AI, demonstrating control over model behavior and increased understanding of large language models. Open-source models are approaching the performance of closed-source models like GPT-4o on benchmarks like MMLU for simple tasks, though advanced models remain necessary for complex automation.
Zero to GPT in 1 Year
gpt-4-turbo claude-3-opus mixtral-8x22b zephyr-141b medical-mt5 openai anthropic mistral-ai langchain hugging-face fine-tuning multilinguality tool-integration transformers model-evaluation open-source-models multimodal-llms natural-language-processing ocr model-training vik-paruchuri sam-altman greg-brockman miranda-murati abacaj mbusigin akhaliq clementdelangue
GPT-4 Turbo reclaimed the top leaderboard spot with significant improvements in coding, multilingual, and English-only tasks, and is now rolled out in paid ChatGPT. Despite this, Claude Opus remains superior in creativity and intelligence. Mistral AI released the powerful open-source Mixtral-8x22B, and Hugging Face followed with Zephyr 141B, a fine-tune of it; both are suited for fine-tuning. LangChain enhanced tool integration across models, and Hugging Face introduced Transformers.js for running transformers in browsers. The medical-domain Medical mT5 was shared as an open-source multilingual text-to-text model. The community also highlighted research on LLMs as regressors and shared practical advice on OCR/PDF data modeling from Vik Paruchuri's journey.
DBRX: Best open model (just not most efficient)
dbrx grok mixtral llama-2 mpt-7b gpt-4 databricks hugging-face mistral-ai mosaicml openai mixture-of-experts model-efficiency tokenization model-training code-generation model-architecture open-source-models benchmarking fine-tuning
Databricks Mosaic has released a new open-source model called DBRX that outperforms Grok, Mixtral, and Llama 2 on evaluations while being about 2x more efficient than Llama 2 and Grok. The model was trained on 12 trillion tokens using 3,000 H100 GPUs over 2 months, with an estimated compute cost of $10 million. It uses OpenAI's 100k tiktoken tokenizer and shows strong zero-shot code generation performance, even beating GPT-4 on the HumanEval benchmark. DBRX also upstreamed work to the open-source MegaBlocks project. Despite its scale and efficiency, DBRX's performance on MMLU is only slightly better than Mixtral's, raising questions about its scaling efficiency. The focus of DBRX is on enabling users to train models efficiently, with MoE training being about 2x more FLOP-efficient than dense models, achieving similar quality with nearly 4x less compute than previous MPT models. This release is part of the ongoing competition for open-source AI leadership, including models like Dolly, MPT, and Mistral. "If it activates 36B params, the model's perf should be equivalent to a 72B dense model or even 80B," says Qwen's tech lead.
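The training figures above (3,000 H100 GPUs, 2 months, roughly $10 million) imply a GPU-hour budget and a per-GPU-hour price that can be checked with simple arithmetic. The sketch below assumes "2 months" means 60 days of continuous use on every GPU; that assumption and the helper names are illustrative and do not come from Databricks.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours, assuming every GPU runs 24 hours a day."""
    return num_gpus * days * 24


def implied_cost_per_gpu_hour(total_cost_usd: float, num_gpus: int, days: float) -> float:
    """Price per GPU-hour implied by a total compute cost."""
    return total_cost_usd / gpu_hours(num_gpus, days)


if __name__ == "__main__":
    # Figures from the summary; 60 days is an assumed reading of "2 months".
    hours = gpu_hours(3_000, 60)
    rate = implied_cost_per_gpu_hour(10_000_000, 3_000, 60)
    print(f"{hours:,.0f} GPU-hours, about ${rate:.2f} per GPU-hour")
```

Under these assumptions the run comes to about 4.3 million GPU-hours, or a little over $2 per H100-hour.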
12/28/2023: Smol Talk updates
tinyllama-1.1b mixtral tinygpt-v nous-research tyrannosaurus latex benchmarking knowledge-graphs model-finetuning tokenization decentralized-computation philosophy-of-ai multimodality vision open-source-models gary-marcus
Nous Research AI Discord discussions covered topics such as AI placement charts, ChatGPT's trouble producing LaTeX math formatting compatible with Obsidian, and performance metrics of the TinyLlama 1.1B model on various benchmarks. Users shared resources including the math-centric corpus MathPile, knowledge graph building methods, and open-source large language model repositories. Technical discussions included the feasibility of decentralized computation for models like Mixtral, philosophical debates on AI sentience, and strategies for model finetuning and token counting. The community also discussed the Obsidian model, vision model training, and the release of the multimodal TinyGPT-V model by Tyrannosaurus. "ChatGPT not generating Latex math format compatible with Obsidian" and "optimistic about human-level AI within our lifetime" were notable quotes.