All tags
Topic: "performance"
OpenAI o3, o4-mini, and Codex CLI
o3 o4-mini gemini-2.5-pro claude-3-sonnet chatgpt openai reinforcement-learning performance vision tool-use open-source coding-agents model-benchmarking multimodality scaling inference sama aidan_mclau markchen90 gdb aidan_clark_ kevinweil swyx polynoamial scaling01
OpenAI launched the o3 and o4-mini models, emphasizing improvements in reinforcement-learning scaling and overall efficiency, making o4-mini cheaper and better across prioritized metrics. These models showcase enhanced vision and tool use capabilities, though API access for these features is pending. The release includes Codex CLI, an open-source coding agent that integrates with these models to convert natural language into working code. Accessibility extends to ChatGPT Plus, Pro, and Team users, with o3 being notably more expensive than Gemini 2.5 Pro. Performance benchmarks highlight the intelligence gains from scaling inference, with comparisons against models like Sonnet and Gemini. The launch has been well received despite some less favorable evaluation results.
QwQ-32B claims to match DeepSeek R1-671B
qwen-2.5-plus qwq-32b deepseek-r1 gpt-4.5 gpt-3 davinci alibaba openai deepseek-ai reinforcement-learning math code-execution instruction-following alignment reasoning model-release model-benchmarking scaling performance inference-costs aidan_mclau sama scaling01 juberti polynoamial reach_vb
Alibaba Qwen released their QwQ-32B model, a 32 billion parameter reasoning model using a novel two-stage reinforcement learning approach: first scaling RL for math and coding tasks with accuracy verifiers and code execution servers, then applying RL for general capabilities like instruction following and alignment. Meanwhile, OpenAI rolled out GPT-4.5 to Plus users, with mixed feedback on coding performance and noted inference cost improvements. The QwQ model aims to compete with larger MoE models like DeepSeek-R1. "GPT-4.5 is unusable for coding" was a notable user critique, while others praised its reasoning improvements due to scaling pretraining.
SOTA Video Gen: Veo 2 and Kling 2 are GA for developers
veo-2 gemini gpt-4.1 gpt-4o gpt-4.5-preview gpt-4.1-mini gpt-4.1-nano google openai video-generation api coding instruction-following context-window performance benchmarks model-deprecation kevinweil stevenheidel aidan_clark_
Google's Veo 2 video generation model is now available in the Gemini API with a cost of 35 cents per second of generated video, marking a significant step in accessible video generation. Meanwhile, China's Kling 2 model launched with pricing around $2 for a 10-second clip and a minimum subscription of $700 per month for 3 months, generating excitement despite some skill challenges. OpenAI announced the GPT-4.1 family release, including GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, highlighting improvements in coding, instruction following, and a 1 million token context window. The GPT-4.1 models are 26% cheaper than GPT-4o and will replace the GPT-4.5 Preview API version by July 14. Performance benchmarks show GPT-4.1 achieving 54-55% on SWE-bench verified and a 60% improvement over GPT-4o in some internal tests, though some critiques note it underperforms compared to other models like OpenRouter and DeepSeekV3 in coding tasks. The release is API-only, with a prompting guide provided for developers.
not much happened today
gpt-4.1 o3 o4-mini grok-3 grok-3-mini o1 tpuv7 gb200 openai x-ai google nvidia samsung memory model-release hardware-accelerators fp8 hbm inference ai-conferences agent-collaboration robotics model-comparison performance power-consumption sama
OpenAI teased a Memory update in ChatGPT with limited technical details. Evidence suggests upcoming releases of o3 and o4-mini models, alongside a press leak about GPT-4.1. X.ai launched the Grok 3 and Grok 3 mini APIs, confirmed as o1 level models. Discussions compared Google's TPUv7 with Nvidia's GB200, highlighting TPUv7's specs like 4,614 TFLOP/s FP8 performance, 192 GB HBM, and 1.2 Tbps ICI bandwidth. TPUv7 may have pivoted from training to inference chip use. Key AI events include Google Cloud Next 2025 and Samsung's Gemini-powered Ballie robot. The community is invited to participate in the AI Engineer World's Fair 2025 and the 2025 State of AI Engineering survey.
not much happened today
grok-3 deepseek-r1 siglip-2 o3-mini-high r1-1776 llamba-1b llamba-3b llamba-8b llama-3 alphamaze audiobox-aesthetics xai nvidia google-deepmind anthropic openai bytedance ollama meta-ai-fair benchmarking model-releases performance reasoning multimodality semantic-understanding ocr multilinguality model-distillation recurrent-neural-networks visual-reasoning audio-processing scaling01 iscienceluvr philschmid arankomatsuzaki reach_vb mervenoyann wightmanr lmarena_ai ollama akhaliq
Grok-3, a new family of LLMs from xAI using 200,000 Nvidia H100 GPUs for advanced reasoning, outperforms models from Google, Anthropic, and OpenAI on math, science, and coding benchmarks. DeepSeek-R1 from ByteDance Research achieves top accuracy on the challenging SuperGPQA dataset. SigLIP 2 from GoogleDeepMind improves semantic understanding and OCR with flexible resolutions and multilingual capabilities, available on HuggingFace. OpenAI's o3-mini-high ranks #1 in coding and math prompts. Perplexity's R1 1776, a post-trained version of DeepSeek R1, is available on Ollama. The Llamba family distills Llama-3.x into efficient recurrent models with higher throughput. AlphaMaze combines DeepSeek R1 with GRPO for visual reasoning on ARC-AGI puzzles. Audiobox Aesthetics from Meta AI offers unified quality assessment for audio. The community notes that Grok 3's compute increase yields only modest performance gains.
not much happened today
prime gpt-4o qwen-32b olmo openai qwen cerebras-systems langchain vercel swaggo gin echo reasoning chain-of-thought math coding optimization performance image-processing software-development agent-frameworks version-control security robotics hardware-optimization medical-ai financial-ai architecture akhaliq jason-wei vikhyatk awnihannun arohan tom-doerr hendrikbgr jerryjliu0 adcock-brett shuchaobi stasbekman reach-vb virattt andrew-n-carr
Olmo 2 released a detailed tech report showcasing full pre, mid, and post-training details for a frontier fully open model. PRIME, an open-source reasoning solution, achieved 26.7% pass@1, surpassing GPT-4o in benchmarks. Performance improvements include Qwen 32B (4-bit) generating at >40 tokens/sec on an M4 Max and libvips being 25x faster than Pillow for image resizing. New tools like Swaggo/swag for Swagger 2.0 documentation, Jujutsu (jj) Git-compatible VCS, and Portspoof security tool were introduced. Robotics advances include a weapon detection system with a meters-wide field of view and faster frame rates. Hardware benchmarks compared H100 and MI300x accelerators. Applications span medical error detection using PRIME and a financial AI agent integrating LangChainAI and Vercel AI SDK. Architectural insights suggest the need for breakthroughs similar to SSMs or RNNs.
Pixtral Large (124B) beats Llama 3.2 90B with updated Mistral Large 24.11
pixtral-large mistral-large-24.11 llama-3-2 qwen2.5-7b-instruct-abliterated-v2-gguf qwen2.5-32b-q3_k_m vllm llama-cpp exllamav2 tabbyapi mistral-ai sambanova nvidia multimodality vision model-updates chatbots inference gpu-optimization quantization performance concurrency kv-cache arthur-mensch
Mistral has updated its Pixtral Large vision encoder to 1B parameters and released an update to the 123B parameter Mistral Large 24.11 model, though the update lacks major new features. Pixtral Large outperforms Llama 3.2 90B on multimodal benchmarks despite having a smaller vision adapter. Mistral's Le Chat chatbot received comprehensive feature updates, reflecting a company focus on product and research balance as noted by Arthur Mensch. SambaNova sponsors inference with their RDUs offering faster AI model processing than GPUs. On Reddit, vLLM shows strong concurrency performance on an RTX 3090 GPU, with quantization challenges noted in FP8 kv-cache but better results using llama.cpp with Q8 kv-cache. Users discuss performance trade-offs between vLLM, exllamav2, and TabbyAPI for different model sizes and batching strategies.
not much happened today
llama-3-2 llama-3 gemma-2 phi-3-5-mini claude-3-haiku gpt-4o-mini molmo gemini-1.5 gemini meta-ai-fair openai allenai google-deepmind multimodality model-optimization benchmarks ai-safety model-distillation pruning adapter-layers open-source-models performance context-windows mira-murati demis-hassabis ylecun sama
Meta AI released Llama 3.2 models including 1B, 3B text-only and 11B, 90B vision variants with 128K token context length and adapter layers for image-text integration. These models outperform competitors like Gemma 2 and Phi 3.5-mini, and are supported on major platforms including AWS, Azure, and Google Cloud. OpenAI CTO Mira Murati announced her departure. Allen AI released Molmo, an open-source multimodal model family outperforming proprietary systems. Google improved Gemini 1.5 with Flash and Pro models. Meta showcased Project Orion AR glasses and hinted at a Quest 3S priced at $300. Discussions covered new benchmarks for multimodal models, model optimization, and AI safety and alignment.
OpenAI's PR Campaign?
alphafold-3 xlstm gpt-4 openai microsoft google-deepmind memory-management model-spec scaling multimodality performance transformers dynamic-memory model-architecture demis-hassabis sama joanne-jang omarsar0 arankomatsuzaki drjimfan
OpenAI faces user data deletion backlash over its new partnership with StackOverflow amid GDPR complaints and US newspaper lawsuits, while addressing election year concerns with efforts like the Media Manager tool for content opt-in/out by 2025 and source link attribution. Microsoft develops a top-secret airgapped GPT-4 AI service for US intelligence agencies. OpenAI releases the Model Spec outlining responsible AI content generation policies, including NSFW content handling and profanity use, emphasizing clear distinctions between bugs and design decisions. Google DeepMind announces AlphaFold 3, a state-of-the-art model predicting molecular structures with high accuracy, showcasing cross-domain AI techniques. New research on xLSTM proposes scaling LSTMs to billions of parameters, competing with transformers in performance and scaling. Microsoft introduces vAttention, a dynamic memory management method for efficient large language model serving without PagedAttention.
Mistral Large disappoints
mistral-large mistral-small mixtral-8x7b gpt-4-turbo dreamgen-opus-v1 mistral-ai openai hugging-face benchmarking model-merging fine-tuning reinforcement-learning model-training tokenization model-optimization ai-assisted-decompilation performance cost-efficiency deception roleplay deep-speed dpo timotheeee1 cogbuji plasmator jsarnecki maldevide spottyluck mrjackspade
Mistral announced Mistral Large, a new language model achieving 81.2% accuracy on MMLU, trailing GPT-4 Turbo by about 5 percentage points on benchmarks. The community reception has been mixed, with skepticism about open sourcing and claims that Mistral Small outperforms the open Mixtral 8x7B. Discussions in the TheBloke Discord highlighted performance and cost-efficiency comparisons between Mistral Large and GPT-4 Turbo, technical challenges with DeepSpeed and DPOTrainer for training, advances in AI deception for roleplay characters using DreamGen Opus V1, and complexities in model merging using linear interpolation and PEFT methods. Enthusiasm for AI-assisted decompilation was also expressed, emphasizing the use of open-source projects for training data.
1/1/2024: How to start with Open Source AI
gpt-4-turbo dall-e-3 chatgpt openai microsoft perplexity-ai prompt-engineering ai-reasoning custom-gpt performance python knowledge-integration swyx
OpenAI Discord discussions revealed mixed sentiments about Bing's AI versus ChatGPT and Perplexity AI, and debated Microsoft Copilot's integration with Office 365. Users discussed DALL-E 3 access within ChatGPT Plus, ChatGPT's performance issues, and ways to train a GPT model using book content via OpenAI API or custom GPTs. Anticipation for GPT-4 turbo in Microsoft Copilot was noted alongside conversations on AI reasoning, prompt engineering, and overcoming Custom GPT glitches. Advice for AI beginners included starting with Python and using YAML or Markdown for knowledge integration. The future of AI with multiple specialized GPTs and Microsoft Copilot's role was also explored.
12/22/2023: Anyscale's Benchmark Criticisms
gpt-4 gpt-3.5 bard anyscale openai microsoft benchmarking performance api prompt-engineering bug-tracking model-comparison productivity programming-languages storytelling
Anyscale launched their LLMPerf leaderboard to benchmark large language model inference performance, but it faced criticism for lacking detailed metrics like cost per token and throughput, and for comparing public LLM endpoints without accounting for batching and load. In OpenAI Discord discussions, users reported issues with Bard and preferred Microsoft Copilot for storytelling, noting fewer hallucinations. There was debate on the value of upgrading from GPT-3.5 to GPT-4, with many finding paid AI models worthwhile for coding productivity. Bugs and performance issues with OpenAI APIs were also highlighted, including slow responses and message limits. Future AI developments like GPT-6 and concerns about OpenAI's transparency and profitability were discussed. Prompt engineering for image generation was another active topic, emphasizing clear positive prompts and the desire for negative prompts.
12/19/2023: Everybody Loves OpenRouter
gpt-4 gpt-3.5 mixtral-8x7b-instruct dolphin-2.0-mistral-7b gemini openai mistral-ai google hugging-face performance memory-management api prompt-engineering local-language-models translation censorship video-generation
OpenRouter offers an easy OpenAI-compatible proxy for Mixtral-8x7b-instruct. Discord discussions highlight GPT-4 performance and usability issues compared to GPT-3.5, including memory management and accessibility problems. Users debate local language models versus OpenAI API usage, with mentions of Dolphin 2.0 Mistral 7B and Google's video generation project. Prompt engineering and custom instructions for GPT models are also key topics. Concerns about censorship on models like Gemini and translation tool preferences such as DeepL were discussed.
12/15/2023: Mixtral-Instruct beats Gemini Pro (and matches GPT3.5)
mixtral gemini-pro gpt-3.5 gpt-4.5 gpt-4 chatgpt lmsys openai deepseek cloudflare huggingface performance context-window prompt-engineering privacy local-gpu cloud-gpu code-generation model-comparison model-usage api-errors karpathy
Thanks to a karpathy shoutout, lmsys now has enough data to rank mixtral and gemini pro. The discussion highlights the impressive performance of these state-of-the-art open-source models that can run on laptops. In the openai Discord, users compared AI tools like perplexity and chatgpt's browsing tool, favoring Perplexity for its superior data gathering, pricing, and usage limits. Interest was shown in AI's ability to convert large code files with deepseek coder recommended. Debates on privacy implications for AI advancement and challenges of running LLMs on local and cloud GPUs were prominent. Users reported issues with chatgpt including performance problems, loss of access to custom GPTs, and unauthorized access. Discussions also covered prompt engineering for large context windows and speculations about gpt-4.5 and gpt-4 future developments.
12/14/2023: $1e7 for Superalignment
gemini bard gpt-4 gpt-4.5 llama-2 openai llamaindex perplexity-ai prompt-engineering api custom-gpt json bug-fixes chatbots performance tts code-generation image-recognition jan-leike patrick-collison
Jan Leike is launching a new grant initiative inspired by Patrick Collison's Fast Grants to support AI research. OpenAI introduced a new developers Twitter handle @OpenAIDevs for community updates. Discussions on OpenAI's Gemini and Bard chatbots highlight their ability to read each other's instructions and offer unique coding solutions. Users reported various issues with GPT-4, including performance problems, customization difficulties, and a resolved bug in image recognition. There are ongoing conversations about prompt engineering challenges and new JSON mode support in Convo-lang for API use. Concerns about misuse of chatbots for illegal activities and alternatives like Llama2 models and the Perplexity chatbot were also discussed.