All tags
Person: "alexalbert__"
not much happened today
deepseek-r1-0528 pali-gemma-2 gemma-3 shieldgemma-2 txgemma gemma-3-qat gemma-3n-preview medgemma dolphingemma signgemma claude-4 opus-4 claude-sonnet-4 codestral-embed bagel qwen nemotron-cortexa gemini-2.5-pro deepseek-ai huggingface gemma claude bytedance qwen nemotron sakana-ai-labs benchmarking model-releases multimodality code-generation model-performance long-context reinforcement-learning model-optimization open-source yuchenj_uw _akhaliq clementdelangue osanseviero alexalbert__ guillaumelample theturingpost lmarena_ai epochairesearch scaling01 nrehiew_ ctnzr
DeepSeek R1 v2 model released with availability on Hugging Face and inference partners. The Gemma model family continues prolific development including PaliGemma 2, Gemma 3, and others. Claude 4 and its variants like Opus 4 and Claude Sonnet 4 show top benchmark performance, including new SOTA on ARC-AGI-2 and WebDev Arena. Codestral Embed introduces a 3072-dimensional code embedder. BAGEL, an open-source multimodal model by ByteDance, supports reading, reasoning, drawing, and editing with long mixed contexts. Benchmarking highlights include Nemotron-CORTEXA topping SWEBench and Gemini 2.5 Pro performing on VideoGameBench. Discussions on random rewards effectiveness focus on Qwen models. "Opus 4 NEW SOTA ON ARC-AGI-2. It's happening - I was right" and "Claude 4 launch has dev moving at a different pace" reflect excitement in the community.
not much happened today
claude-4 claude-4-opus claude-4-sonnet gemini-2.5-pro gemma-3n imagen-4-ultra anthropic google-deepmind openai codebase-understanding coding agentic-performance multimodality text-to-speech video-generation model-integration benchmarking memory-optimization cline amanrsanger ryanpgreenblatt johnschulman2 alexalbert__ nearcyan mickeyxfriedman jeremyphoward gneubig teortaxesTex scaling01 artificialanlys philschmid
Anthropic's Claude 4 models (Opus 4, Sonnet 4) demonstrate strong coding abilities, with Sonnet 4 achieving 72.7% on SWE-bench and Opus 4 at 72.5%. Claude Sonnet 4 excels in codebase understanding and is considered SOTA on large codebases. Criticism arose over Anthropic's handling of ASL-3 security requirements. Demand for Claude 4 is high, with integration into IDEs and support from Cherry Studio and FastHTML. Google DeepMind introduced Gemini 2.5 Pro Deep Think and Gemma 3n, a mobile multimodal model reducing RAM usage by nearly 3x. Google's Imagen 4 Ultra ranks third in the Artificial Analysis Image Arena, available on Vertex AI Studio. Google also promoted Google Beam, an AI video model for immersive 3D experiences, and new text-to-speech models with multi-speaker support. The GAIA benchmark shows Claude 4 Opus and Sonnet leading in agentic performance.
not much happened today
qwen3-14b qwen3-32b qwen3-235b phi-4-reasoning o3-mini command-a gemini-2.5-pro o4-mini olm-o2-1b o3 alibaba together-ai scaling01 microsoft deepseek cohere google epoch-ai-research inception-labs openai allenai quantization fine-tuning reinforcement-learning benchmarking video-generation diffusion-models model-performance model-evaluation model-release text-generation cline _philschmid iscienceluvr alexalbert__ _lewtun teortaxestex sarahookr reach_vb
Qwen model family released quantized versions of Qwen3 models including 14B, 32B, and 235B parameters, with promising coding capabilities in Qwen3-235B. Microsoft launched Phi-4-reasoning, a 14B parameter model distilled from OpenAI's o3-mini, emphasizing supervised fine-tuning and reinforcement learning, outperforming larger models in some benchmarks. Cohere's Command A leads SQL performance on Bird Bench. Google introduced the TRAJAN eval for video generation temporal consistency and updated the Gemini OpenAI compatibility layer. Inception Labs launched a diffusion LLM API claiming 5x speed improvements over autoregressive models. Community rankings show OpenAI's o3 model debuting strongly in web app-building tasks. Other releases include AllenAI's OLMo2 1B and additional Phi 4 variants. "Qwen3-235B shows promise for coding" and "Phi-4-reasoning tech report emphasizes SFT gains" highlight key advancements.
lots of small launches
gpt-4o claude-3.7-sonnet claude-3.7 claude-3.5-sonnet deepseek-r1 deepseek-v3 grok-3 openai anthropic amazon cloudflare perplexity-ai deepseek-ai togethercompute elevenlabs elicitorg inceptionailabs mistral-ai voice model-releases cuda gpu-optimization inference open-source api model-performance token-efficiency context-windows cuda jit-compilation lmarena_ai alexalbert__ aravsrinivas reach_vb
GPT-4o Advanced Voice Preview is now available for free ChatGPT users with enhanced daily limits for Plus and Pro users. Claude 3.7 Sonnet has achieved the top rank in WebDev Arena with improved token efficiency. DeepSeek-R1 with 671B parameters benefits from the Together Inference platform optimizing NVIDIA Blackwell GPU usage, alongside the open-source DeepGEMM CUDA library delivering up to 2.7x speedups on Hopper GPUs. Perplexity launched a new Voice Mode and a Deep Research API. The upcoming Grok 3 API will support a 1M token context window. Several companies including Elicit, Amazon, Anthropic, Cloudflare, FLORA, Elevenlabs, and Inception Labs announced new funding rounds, product launches, and model releases.
not much happened to end the year
deepseek-v3 code-llm o1 sonnet-3.5 deepseek smol-ai reinforcement-learning reasoning training-data mixed-precision-training open-source multimodality software-development natural-language-processing interpretability developer-tools real-time-applications search sdk-generation corbtt tom_doerr cognitivecompai alexalbert__ theturingpost svpino bindureddy
Reinforcement Fine-Tuning (RFT) is introduced as a data-efficient method to improve reasoning in LLMs using minimal training data with strategies like First-Correct Solutions (FCS) and Greedily Diverse Solutions (GDS). DeepSeek-V3, a 671B parameter MoE language model trained on 14.8 trillion tokens with FP8 mixed precision training, highlights advances in large-scale models and open-source LLMs. Predictions for AI in 2025 include growth in smaller models, multimodality, and challenges in open-source AI. The impact of AI on software development jobs suggests a need for higher intelligence and specialization as AI automates low-skilled tasks. Enhancements to CodeLLM improve coding assistance with features like in-place editing and streaming responses. Natural Language Reinforcement Learning (NLRL) offers better interpretability and richer feedback for AI planning and critique. AI hiring is growing rapidly with startups seeking strong engineers in ML and systems. New AI-powered tools such as Rivet, Buzee, and Konfig improve real-time applications, search, and SDK generation using technologies like Rust and V8 isolates.
The AI Search Wars Have Begun — SearchGPT, Gemini Grounding, and more
gpt-4o o1-preview claude-3.5-sonnet universal-2 openai google gemini nyt perplexity-ai glean nvidia langchain langgraph weights-biases cohere weaviate fine-tuning synthetic-data distillation hallucinations benchmarking speech-to-text robotics neural-networks ai-agents sam-altman alexalbert__ _jasonwei svpino drjimfan virattt
ChatGPT launched its search functionality across all platforms using a fine-tuned version of GPT-4o with synthetic data generation and distillation from o1-preview. This feature includes a Chrome extension promoted by Sam Altman but has issues with hallucinations. The launch coincides with Gemini introducing Search Grounding after delays. Notably, The New York Times is not a partner due to a lawsuit against OpenAI. The AI search competition intensifies with consumer and B2B players like Perplexity and Glean. Additionally, Claude 3.5 Sonnet achieved a new benchmark record on SWE-bench Verified, and a new hallucination evaluation benchmark, SimpleQA, was introduced. Other highlights include the Universal-2 speech-to-text model with 660M parameters and HOVER, a neural whole-body controller for humanoid robots trained in NVIDIA Isaac simulation. AI hedge fund teams using LangChain and LangGraph were also showcased. The news is sponsored by the RAG++ course featuring experts from Weights & Biases, Cohere, and Weaviate.
SciCode: HumanEval gets a STEM PhD upgrade
gpt-4 claude-3.5-sonnet llama-3-7b llama-3 dolphin-2.9.3-yi-1.5-34b-32k-gguf anthropic hugging-face nvidia benchmarks coding model-training gpu-optimization model-performance synthetic-data compiler-optimization zero-shot-learning yi-tay rohanpaul_ai alexalbert__ tri_dao abacaj
PhD-level benchmarks highlight the difficulty of coding scientific problems for LLMs, with GPT-4 and Claude 3.5 Sonnet scoring under 5% on the new SciCode benchmark. Anthropic doubled the max output token limit for Claude 3.5 Sonnet to 8192 tokens. The Q-GaLore method enables training LLaMA-7B on a single 16GB GPU. The Mosaic compiler now generates efficient code for NVIDIA H100 GPUs. The Dolphin 2.9.3-Yi-1.5-34B-32k-GGUF model on Hugging Face has over 111k downloads. Llama 3 shows strong performance, achieving 90% zero-shot accuracy on the MATH dataset. Discussions continue on the limitations and forms of synthetic data for model training.
Ten Commandments for Deploying Fine-Tuned Models
claude-3-opus claude-3 gpt-4o anthropic google openai fine-tuning prompt-engineering model-evaluation feature-alteration benchmarking model-performance open-source-models kyle-corbitt bindureddy alexalbert__
Gemini-in-Google-Slides is highlighted as a useful tool for summarizing presentations. Kyle Corbitt's talk on deploying fine-tuned models in production emphasizes avoiding fine-tuning unless necessary, focusing on prompting, data quality, appropriate model choice, and thorough evaluation. Anthropic showcased feature alteration in Claude AI, demonstrating control over model behavior and increased understanding of large language models. Open-source models like GPT-4o are approaching closed-source performance on benchmarks like MMLU for simple tasks, though advanced models remain necessary for complex automation.
Chameleon: Meta's (unreleased) GPT4o-like Omnimodal Model
chameleon gpt-4o gemini-1.5-flash claude-3 meta-ai-fair openai google-deepmind anthropic reddit multimodality early-fusion benchmarking model-training tokenization streaming tool-use vision coding hallucination-detection model-performance armen-aghajanyan sama alexandr-wang abacaj alexalbert__
Meta AI FAIR introduced Chameleon, a new multimodal model family with 7B and 34B parameter versions trained on 10T tokens of interleaved text and image data enabling "early fusion" multimodality that can natively output any modality. While reasoning benchmarks are modest, its "omnimodality" approach competes well with pre-GPT4o multimodal models. OpenAI launched GPT-4o, a model excelling in benchmarks like MMLU and coding tasks, with strong multimodal capabilities but some regression in ELO scores and hallucination issues. Google DeepMind announced Gemini 1.5 Flash, a small model with 1M context window and flash performance, highlighting convergence trends between OpenAI and Google models. Anthropic updated Claude 3 with streaming support, forced tool use, and vision tool integration for multimodal knowledge extraction. OpenAI also partnered with Reddit, raising industry attention.
Quis promptum ipso promptiet?
llama-3-70b llama-3-120b llama-3 llama-cpp anthropic openai zoominfo neuralink prompt-engineering chain-of-thought rag quantization cuda-graphs gpu-optimization thought-controlled-devices modeling-consciousness conference sama gdb bindureddy svpino rohanpaul_ai alexalbert__ abacaj
Anthropic released upgrades to their Workbench Console, introducing new prompt engineering features like chain-of-thought reasoning and prompt generators that significantly reduce development time, exemplified by their customer Zoominfo. OpenAI teased a "magic" new development coming soon, speculated to be a new LLM replacing GPT-3.5 in the free tier or a search competitor. The open-source community highlighted Llama 3 70B as "game changing" with new quantized weights for Llama 3 120B and CUDA graph support for llama.cpp improving GPU performance. Neuralink demonstrated a thought-controlled mouse, sparking interest in modeling consciousness from brain signals. The ICLR 2024 conference is being held in Asia for the first time, generating excitement.