Person: "willdepue"
gpt-image-1 - ChatGPT's imagegen model, confusingly NOT 4o, now available in the API
gpt-image-1 o3 o4-mini gpt-4.1 eagle-2.5-8b gpt-4o qwen2.5-vl-72b openai nvidia hugging-face x-ai image-generation content-moderation benchmarking long-context multimodality model-performance supercomputing virology video-understanding model-releases kevinweil lmarena_ai _philschmid willdepue arankomatsuzaki epochairesearch danhendrycks reach_vb mervenoyann _akhaliq
OpenAI officially launched the gpt-image-1 API for image generation and editing, supporting features like alpha-channel transparency and a "low" content-moderation setting. OpenAI's o3 and o4-mini lead benchmarks for style control, math, coding, and hard prompts, with o3 ranking #1 in several categories. A new benchmark, Vending-Bench, reveals high performance variance among LLMs on extended tasks. GPT-4.1 ranks in the top 5 for hard prompts and math. Nvidia's Eagle 2.5-8B matches GPT-4o and Qwen2.5-VL-72B in long-video understanding. AI supercomputer performance doubles every 9 months, with xAI's Colossus costing an estimated $7 billion and the US accounting for 75% of global performance. The Virology Capabilities Test shows OpenAI's o3 outperforming 94% of expert virologists. Nvidia also released the Describe Anything Model (DAM), a multimodal LLM for detailed image and video captioning, now available on Hugging Face.
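As a rough sketch of what the new options look like in practice, the snippet below assembles the request parameters mentioned above (`background="transparent"` for alpha-channel output, `moderation="low"` for the relaxed policy). `build_image_request` is our own illustrative helper, not an SDK function; the parameter names follow OpenAI's launch notes, so verify them against the current API reference before use.

```python
# Hypothetical helper: collect keyword arguments for a gpt-image-1 call.
# The dict would be passed to the official SDK's client.images.generate().
def build_image_request(prompt, transparent=False, moderation="auto"):
    """Assemble keyword arguments for an image-generation request."""
    kwargs = {"model": "gpt-image-1", "prompt": prompt, "moderation": moderation}
    if transparent:
        # Alpha-channel support: ask for a transparent background.
        kwargs["background"] = "transparent"
    return kwargs

# With the official openai SDK this would be used roughly as:
#   client = OpenAI()  # requires OPENAI_API_KEY
#   img = client.images.generate(**build_image_request(
#       "a sticker of a fox", transparent=True, moderation="low"))
print(build_image_request("a sticker of a fox", transparent=True, moderation="low"))
```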
Anthropic's $61.5B Series E
gpt-4.5 claude-3.7-sonnet deepseek-r1 anthropic openai deepseek lmsys perplexity-ai deutsche-telekom model-performance benchmarking style-control coding multi-turn funding partnerships workflow lmarena_ai teortaxestex casper_hansen_ omarsar0 aidan_mclau willdepue vikhyatk teknim1 reach_vb _aidan_clark_ cto_junior aravsrinivas
Anthropic raised a $3.5 billion Series E funding round at a $61.5 billion valuation, signaling strong financial backing for the Claude AI model. GPT-4.5 achieved #1 rank across all categories on the LMArena leaderboard, excelling in multi-turn conversations, coding, math, creative writing, and style control. DeepSeek R1 tied with GPT-4.5 for top performance on hard prompts with style control. Discussions highlighted comparisons between GPT-4.5 and Claude 3.7 Sonnet in coding and workflow applications. The importance of the LMSYS benchmark was emphasized, though some questioned the relevance of benchmarks versus user acquisition. Additionally, Perplexity AI partnered with Deutsche Telekom to integrate the Perplexity Assistant into a new AI phone.
Not much happened today
gemini-1.5-flash gemini-pro mixtral mamba-2 phi-3-medium phi-3-small gpt-3.5-turbo-0613 llama-3-8b llama-2-70b mistral-finetune twelve-labs livekit groq openai nea nvidia lmsys mistral-ai model-performance prompt-engineering data-curation ai-safety model-benchmarking model-optimization training sequence-models state-space-models daniel-kokotajlo rohanpaul_ai _arohan_ tri_dao _albertgu _philschmid sarahcat21 hamelhusain jachiam0 willdepue teknium1
Twelve Labs raised $50M in Series A funding co-led by NEA and NVIDIA's NVentures to advance multimodal AI. LiveKit secured $22M in funding. Groq announced running at 800k tokens/second. OpenAI saw a resignation from Daniel Kokotajlo. Twitter users highlighted Gemini 1.5 Flash for high performance at low cost, and Gemini Pro ranked #2 in Japanese language tasks. Mixtral models can run up to 8x faster on NVIDIA RTX GPUs using TensorRT-LLM. The Mamba-2 architecture introduces state-space duality, enabling larger states and faster training, and outperforms previous models. Phi-3 Medium (14B) and Phi-3 Small (7B) benchmark near GPT-3.5-Turbo-0613 and Llama 3 8B. Prompt engineering is emphasized as key to unlocking LLM capabilities. Data quality is critical for model performance, with masterclasses on data curation upcoming. Discussions on AI safety include a letter from frontier AI lab employees advocating whistleblower protections, and debates on aligning AI to user intent versus the broader interests of humanity.