Topic: "attention-mechanisms"

eagle-3.1 unigram-tokenizer qwen-3.5 deepseek-v4-pro mimo deep-agents-v0.6 397b-parameter-model eaglecorp vllm_project perplexity_ai alibaba lightseek nvidia mooncake flashattention deepseek xiaomi langchain baseten trajectory clay harvey decagon mercor rogo rlm inference-optimization long-context speculative-decoding tokenization attention-mechanisms kv-cache cache-hierarchy agent-engineering model-harness-memory-fit continual-learning quantization autoscaling memory-centric-agents evaluation-automation kimmonismus _luofuli vtrivedy10

Inference optimization is increasingly architectural: EAGLE 3.1, built in collaboration with vLLM and TorchSpec, improves speculative decoding and long-context handling. Perplexity open-sourced a rebuilt Unigram tokenizer that cuts CPU use by 5–6× and reaches 63 µs at 514 tokens. Qwen3.5 hits 580 tokens/s via joint efforts from Alibaba, LightSeek, NVIDIA, Mooncake, and FlashAttention-4 contributors. API price cuts from Chinese labs are sustainable because they stem from structural KV-cache and attention improvements, exemplified by DeepSeek V4-Pro and Xiaomi MiMo significantly reducing caching costs. Agent engineering is shifting focus from model quality to model-harness-memory fit, with LangChain releasing Deep Agents v0.6 and tools like LangSmith Engine automating evaluation loops. Trajectory launched a continual learning platform with $15M in funding and partners like Clay and Harvey, supporting large models including a 397B-parameter model deployed on autoscaled H100 infrastructure. Open-source memory-centric agents and minimal training harnesses also gained attention.
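To make the speculative decoding idea concrete, here is a minimal sketch of greedy draft-and-verify decoding. It is not EAGLE 3.1's algorithm or any vLLM API; `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical names, and the "models" are toy functions that map a context to the next token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy "model": context -> next token


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One draft-and-verify round; returns the tokens to append."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target model checks the proposals in order. Real systems score
    #    all k positions in one batched forward pass; this loop is for clarity.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # keep the target's token and stop
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target adds one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Decode n_new tokens; output equals plain greedy decoding of target."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n_new]


# Hypothetical toy models: the target counts upward modulo 10; the draft
# agrees except right after a 4, where it guesses wrong.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

In this greedy variant the output is identical to decoding with the target alone. The speedup comes only from how many draft tokens are accepted per target pass.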

May 12

not much happened today

gemini-3.1-pro gpt-5.5 opus-4.7-xhigh agent-moderncolbert google-deepmind lighton nous-research research-benchmarks math medical-benchmarks agentic-systems program-synthesis retrieval-augmentation training-optimization superoptimization scaling-laws training-efficiency gpu-optimization attention-mechanisms soohak polynoamial torchcompiled leloykun che_shr_cat jjitsev omarsar0

Research-level reasoning benchmarks are advancing with 439 new math problems from 64 mathematicians and expanded medical benchmarks in Medmarks v1.0 covering 30 benchmarks and 61 models. Google DeepMind's AI Co-Mathematician achieves 48% on FrontierMath Tier 4, while Gemini 3.1 Pro improves physics benchmark scores significantly. GPT-5.5 high/xhigh outperforms Opus 4.7 xhigh on program synthesis tasks. Retrieval benchmarks favor smaller models like LightOn's Agent-ModernColBERT with 149M parameters. Training optimization advances include SOAP/Muon-style updates reducing training steps, and a Lean4-to-TileLang superoptimizer achieving 1.8× speedup on A100 GPUs. Scaling laws are reconsidered with arguments for measuring in bytes rather than tokens. New training-time efficiency methods like Lighthouse Attention enable subquadratic training wrappers removable before deployment.

May 01

not much happened today

grok-4.3 deepseek-v4-pro kimi-k2.6 mimo-v2.5-pro gemini-3.1-pro claude-opus-4.7 gpt-5.5 deepskvit xai deepseek artificial-analysis andon-labs benchmarking cost-efficiency agentic-ai token-efficiency attention-mechanisms inference-speed multimodality spatial-reasoning model-architecture model-performance scaling01 teortaxestex omarsar0

xAI released Grok 4.3, improving cost/performance with a 53 Intelligence Index score, 4 points higher than Grok 4.20, and significant gains on GDPval-AA and τ²-Bench Telecom. However, accuracy tradeoffs raised reliability concerns. Community opinions are mixed, with some praising token-efficiency and others noting regressions and pricing concerns. DeepSeek V4 Pro emerges as a leading open-weight coding/agent model, comparable to Codex and Claude Code, featuring a 1M context window and efficient attention mechanisms. Benchmarking shows open-weight models like Kimi K2.6, MiMo V2.5 Pro, and DeepSeek V4 Pro closing the gap with closed models such as Gemini 3.1 Pro Preview, Claude Opus 4.7, and GPT-5.5. DeepSeek's multimodal efforts focus on explicit spatial grounding with a novel "point while thinking" approach using DeepSeek-ViT and CSA compression.

Apr 27

not much happened today

gpt-5.5 gpt-5.4 opus-4.7 mimo-v2.5-pro mimo-v2.5 kimi-k2.6 codex copilot openai microsoft google amazon github xiaomi openai-devs vllm_project kimi-moonshot model-distribution cloud-computing benchmarking usage-based-billing model-orchestration open-source large-context-models agent-scaling coding model-training fp8 attention-mechanisms multi-agent-systems sama scaling01 kimmonismus ajassy simonw htihle arena gdb hangsiin eliebakouch _luofuli teortaxestex

OpenAI loosens its Azure exclusivity, allowing distribution across Google TPU, AWS Trainium, and Bedrock with commitments through 2032 and revenue share through 2030. GPT-5.5 shows improved benchmarks but is not uniformly dominant, ranking variably across coding, document, math, and vision tasks. GitHub's Copilot shifts to usage-based billing starting June 1, reflecting increased runtime costs. OpenAI open-sourced Symphony, an orchestration layer for issue tracking and Codex agents. Xiaomi released MiMo-V2.5 and MiMo-V2.5-Pro, large-context models with up to 1M-token context, trained on trillions of tokens and emphasizing complex agent and omni-modal capabilities. Kimi K2.6 leads OpenRouter's leaderboard, noted for coding and long-horizon agent capabilities with large-scale sub-agent coordination.

Mar 17

not much happened today

gpt-5.4-mini gpt-5.4-nano gpt-5.4 codex openai langchain stripe ramp coinbase nous-research hermes-agent coding multimodality subagents context-window model-performance pricing behavior-tuning secure-execution plugin-architecture attention-mechanisms agent-infrastructure hwchase17 michpokrass

OpenAI released GPT-5.4 mini and GPT-5.4 nano, their most capable small models optimized for coding, multimodal understanding, and subagents, featuring a 400k context window and over 2x speed compared to GPT-5 mini. The mini model approaches larger GPT-5.4 performance while using only 30% of Codex quota, becoming the default for many coding workflows. Pricing concerns and truthfulness tradeoffs were noted, with mixed third-party evaluations on reasoning and resistance to false premises. OpenAI also addressed behavior tuning issues in a recent update. Meanwhile, agent infrastructure is evolving with secure code execution and orchestration tools like LangChain's LangSmith Sandboxes and Open SWE, inspired by internal systems at Stripe, Ramp, and Coinbase. Subagents and secure execution are now key product features, with releases like Hermes Agent v0.3.0 showcasing plugin architectures, live Chrome control, and voice mode. Research on attention mechanisms, including Attention Residuals and vertical attention, is gaining traction.

Mar 16

not much happened today

kimi-linear-48b codex gpt-5.4 claude-code moonshot openai assemblyai langchain attention-mechanisms model-architecture inference-speed agent-feedback agent-skills multi-agent-systems knowledge-transfer cli-tools coding-agents model-deployment kimi_moonshot elonmusk yuchenj_uw nathancgy4 eliebakouch tokenbender behrouz_ali cloneofsimo fidjissimo sama gdb andrewyng itsafiz simplifyinai

Moonshot's Attention Residuals paper introduced an input-dependent attention mechanism over prior layers with a 1.25x compute advantage and less than 2% inference latency overhead, validated on Kimi Linear 48B total / 3B active. The paper sparked debate on novelty versus prior art like DeepCrossAttention and Google’s earlier work, highlighting tensions in idea novelty, citation quality, and frontier-scale validation. OpenAI's Codex showed strong momentum with over 2M weekly active users, nearly 4x growth YTD, and GPT-5.4 hitting 5T tokens/day and a $1B annualized run-rate. Codex added subagents supporting multi-agent coding workflows. Infrastructure for coding agents matured with tools like Context Hub / chub supporting agent feedback loops, AssemblyAI's skill for Claude Code and Codex, and automated skill extraction from GitHub repos yielding 40% knowledge-transfer gains. LangChain launched LangGraph CLI and open-sourced Deep Agents, recreating top coding agent workflows with planning, filesystem ops, shell access, and sub-agents.

Mar 05

GPT 5.4: SOTA Knowledge Work -and- Coding -and- CUA Model, OpenAI is so very back

gpt-5.4 gpt-5.4-pro openai cursor_ai perplexity_ai arena native-computer-use long-context efficiency steering benchmarking gpu-kernels attention-mechanisms algorithmic-optimization pipeline-optimization sama reach_vb scaling01 danshipper yuchenj_uw

OpenAI launched GPT-5.4 and GPT-5.4 Pro with unified mainline and Codex models, featuring native computer use, up to ~1M token context, and efficiency improvements including a new Codex /fast mode. Benchmarks showed strong results like OSWorld-Verified 75.0% surpassing human baseline and GDPval 83% against industry pros. User feedback highlighted coding utility but raised concerns about pricing and overthinking. Integration with devtools like Cursor, Perplexity, and Arena was announced. In systems research, FlashAttention-4 (FA4) was introduced with near-matmul speed attention on Blackwell GPUs, featuring innovations like polynomial exp emulation and online softmax. "Steering mid-response" and "fewer tokens, faster speed" were emphasized as UX and efficiency improvements.
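The online softmax mentioned above can be sketched in a few lines. This is the general streaming technique in plain Python, not FA4's kernel, and it does not show the polynomial exp emulation; the function names and the scalar-valued setup are assumptions for illustration. The idea is to process scores block by block while keeping a running maximum, a running denominator, and a running weighted sum, rescaling the accumulators whenever the maximum grows.

```python
import math
from typing import Sequence


def naive_softmax_average(scores: Sequence[float], values: Sequence[float]) -> float:
    """Reference: softmax(scores) dotted with values, using all scores at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def online_softmax_average(scores: Sequence[float], values: Sequence[float],
                           block: int = 2) -> float:
    """Same result, but scores are visited one block at a time."""
    running_max = -math.inf
    denom = 0.0  # running sum of exp(score - running_max)
    acc = 0.0    # running sum of exp(score - running_max) * value
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale what was accumulated under the old maximum.
        # On the first block this is exp(-inf) == 0.0, which is harmless.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom
```

Because only the three running quantities are kept, the full score row never has to be materialized, which is what lets fused attention kernels work tile by tile.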

Nov 10, 2025

not much happened today

kimi-k2-thinking kimi-k3 gelato-30b-a3b omnilingual-wav2vec-2.0 moonshot-ai meta-ai-fair togethercompute qwen attention-mechanisms quantization fine-tuning model-optimization agentic-ai speech-recognition multilingual-models gui-manipulation image-editing dataset-release yuchenj_uw scaling01 code_star omarsar0 kimi_moonshot anas_awadalla akhaliq minchoi

Moonshot AI's Kimi K2 Thinking AMA revealed a hybrid attention stack using KDA + NoPE MLA outperforming full MLA + RoPE, with the Muon optimizer scaling to ~1T parameters and native INT4 QAT for cost-efficient inference. K2 Thinking ranks highly on LisanBench and LM Arena Text leaderboards, offering low-cost INT4 serving and strong performance in Math, Coding, and Creative Writing. It supports heavy agentic tool use with up to 300 tool requests per run and recommends using the official API for reliable long-trace inference. Meta AI released the Omnilingual ASR suite covering 1600+ languages including 500 underserved, plus a 7B wav2vec 2.0 model and ASR corpus. Additionally, the Gelato-30B-A3B model for computer grounding in GUI manipulation agents outperforms larger VLMs, targeting immediate agent gains. Qwen's image-edit LoRAs and light-restoration app were also highlighted.
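For readers unfamiliar with INT4 quantization-aware training, the following is a minimal sketch of the "fake quantization" step that QAT typically applies to weights in the forward pass. The summary does not describe Moonshot's actual recipe, so the symmetric per-group scaling and the name `fake_quant_int4` are assumptions; the gradient handling used in training (for example a straight-through estimator) is not shown.

```python
from typing import List, Tuple

INT4_MIN, INT4_MAX = -8, 7  # signed 4-bit integer range


def fake_quant_int4(weights: List[float]) -> Tuple[List[int], List[float], float]:
    """Quantize one weight group to INT4 codes and dequantize it again."""
    max_abs = max(abs(w) for w in weights)
    # Symmetric scale so the largest magnitude maps to code 7 (assumption).
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    codes = [max(INT4_MIN, min(INT4_MAX, round(w / scale))) for w in weights]
    dequant = [c * scale for c in codes]
    return codes, dequant, scale


codes, dequant, scale = fake_quant_int4([0.7, -0.35, 0.1, 0.0])
# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - d) for w, d in zip([0.7, -0.35, 0.1, 0.0], dequant))
```

Training against the dequantized values lets the model adapt to the rounding error before the weights are actually stored in 4 bits.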

Oct 30, 2025

not much happened today

kimi-linear kimi-delta-attention minimax-m2 looped-llms aardvark-gpt-5 moonshot-ai minimax bytedance princeton mila openai cursor cognition hkust long-context attention-mechanisms agentic-ai tool-use adaptive-compute coding-agents performance-optimization memory-optimization reinforcement-learning model-architecture kimi_moonshot scaling01 uniartisan omarsar0 aicodeking songlinyang4 iscienceluvr nrehiew_ gdb embeddedsec auchenberg simonw

Moonshot AI released Kimi Linear (KDA) with day-0 infrastructure and strong long-context metrics, achieving up to 75% KV cache reduction and 6x decoding throughput. MiniMax M2 pivoted to full attention for multi-hop reasoning, maintaining strong agentic coding performance with 200k context and ~100 TPS. ByteDance, Princeton, and Mila introduced Looped LLMs showing efficiency gains comparable to larger transformers. OpenAI's Aardvark (GPT-5) entered private beta as an agentic security researcher for scalable vulnerability discovery. Cursor launched faster cloud coding agents, though transparency concerns arose regarding base-model provenance. Cognition released a public beta for a desktop/mobile tool-use agent named Devin. The community discussed advanced attention mechanisms and adaptive compute techniques.

Aug 01, 2025

Gemini 2.5 Deep Think finally ships

gemini-2.5-deep-think gpt-oss gpt-5 kimi-k2-turbo-preview qwen3-coder-flash glm-4.5 step-3 claude openai anthropic google-deepmind kimi-moonshot alibaba ollama zhipu-ai stepfun parallel-thinking model-releases moe attention-mechanisms multimodal-reasoning model-performance context-windows open-source-models model-leaks creative-ai coding reasoning model-optimization demishassabis philschmid scaling01 teortaxestex teknium1 lmarena_ai andrewyng

OpenAI is rumored to soon launch new GPT-OSS and GPT-5 models amid drama with Anthropic revoking access to Claude. Google DeepMind quietly launched Gemini 2.5 Deep Think, a model optimized for parallel thinking that achieved gold-medal level at the IMO and excels in reasoning, coding, and creative tasks. Leaks suggest OpenAI is developing a 120B MoE and a 20B model with advanced attention mechanisms. Chinese AI companies like Kimi Moonshot, Alibaba, and Zhipu AI are releasing faster and more capable open models such as kimi-k2-turbo-preview, Qwen3-Coder-Flash, and GLM-4.5, signaling strong momentum and potential to surpass the U.S. in AI development. "The final checkpoint was selected just 5 hours before the IMO problems were released," highlighting rapid development cycles.

Jul 07, 2025

not much happened today

grok-4 jamba ernie-4.5 claude-4-sonnet claude-4 kontext-dev ai21-labs hugging-face baidu perplexity-ai deepmind anthropic reinforcement-learning fine-tuning energy-based-transformers ssm-transformer context-windows length-generalization recurrent-neural-networks attention-mechanisms 2-simplicial-attention biomedical-ai instruction-following open-weight-models python-package-management _philschmid corbtt jxmnop sedielem _akhaliq slashml alexiglad clementdelangue _albertgu tri_dao theaitimeline deep-learning-ai

Over the holiday weekend, key AI developments include the upcoming release of Grok 4, Perplexity teasing new projects, and community reactions to Cursor and Dia. Research highlights feature a paper on Reinforcement Learning (RL) improving generalization and reasoning across domains, contrasting with Supervised Fine-Tuning's forgetting issues. Energy-Based Transformers (EBTs) are proposed as a promising alternative to traditional transformers. AI21 Labs updated its Jamba model family with enhanced grounding and instruction following, maintaining a 256K context window. Baidu open-sourced its massive 424 billion parameter Ernie 4.5 model, while Kontext-dev became the top trending model on Hugging Face. Advances in length generalization for recurrent models and the introduction of 2-simplicial attention were noted. In biomedical AI, Biomni, powered by Claude 4 Sonnet, demonstrated superior accuracy and rare disease diagnosis capabilities. Additionally, the Python package manager uv received praise for improving Python installation workflows.

Jun 16, 2025

Chinese Models Launch - MiniMax-M1, Hailuo 2 "Kangaroo", Moonshot Kimi-Dev-72B

minimax-m1 hailuo-02 kimi-dev-72b deepseek-r1 ale-agent minimax-ai moonshot-ai deepseek bytedance anthropic langchain columbia-university sakana-ai openai microsoft multi-agent-systems attention-mechanisms coding optimization prompt-injection model-performance video-generation model-training task-automation jerryjliu0 hwchase17 omarsar0 gallabytes lateinteraction karpathy

MiniMax AI launched MiniMax-M1, a 456-billion-parameter open-weights LLM with a 1-million-token input and 80k-token output, using efficient "lightning attention" and a GRPO variant called CISPO. MiniMax AI also announced Hailuo 02 (0616), a video model similar to ByteDance's Seedance. Moonshot AI released Kimi-Dev-72B, a coding model outperforming DeepSeek R1 on SWE-bench Verified. Discussions on multi-agent system design from Anthropic and LangChain highlighted improvements in task completion and challenges like prompt injection attacks, as demonstrated by Karpathy and Columbia University research. Sakana AI introduced ALE-Agent, a coding agent that ranked 21st in the AtCoder Heuristic Competition solving NP-hard optimization problems. There is unverified news about an acquisition involving OpenAI, Microsoft, and Windsurf.

May 31, 2025

Mary Meeker is so back: BOND Capital AI Trends report

qwen-3-8b anthropic hugging-face deepseek attention-mechanisms inference arithmetic-intensity transformers model-optimization interpretability model-quantization training tri_dao fleetwood___ teortaxestex awnihannun lateinteraction neelnanda5 eliebakouch _akhaliq

Mary Meeker returns with a comprehensive 340-slide report on the state of AI, highlighting accelerating tech cycles, compute growth, and comparisons of ChatGPT to early Google and other iconic tech products. The report also covers enterprise traction and valuation of major AI companies. On Twitter, @tri_dao discusses an "ideal" inference architecture featuring attention variants like GTA, GLA, and DeepSeek MLA with high arithmetic intensity (~256), improving efficiency and model quality. Other highlights include the release of 4-bit DWQ of DSR1 Qwen3 8B on Hugging Face, AnthropicAI's open-source interpretability tools for LLMs, and discussions on transformer training and abstractions by various researchers.

Apr 11, 2025

not much happened today

grok-3 grok-3-mini gpt-4.5 claude-3.7-sonnet quasar-alpha optimus-alpha gpt-4.1 kaleidoscope internvl3 internvit qwen2.5vl transmamba fantasytalking openai alibaba cmu reinforcement-learning reasoning benchmarks vision multilinguality multimodality transformers attention-mechanisms agents code-generation model-performance rasbt sarahookr mervenoyann gneubig svpino mathemagic1an

The AI news recap highlights independent evaluations showing Grok-3 outperforming models like GPT-4.5 and Claude 3.7 Sonnet on reasoning benchmarks, while Grok-3 mini excels in reasoning tasks. Research on reinforcement learning (RL) fine-tuning reveals potential improvements for small reasoning models but also notes instability in reported gains. Benchmark results suggest Quasar Alpha and Optimus Alpha may be versions of GPT-4.1. Vision and multimodal models like Kaleidoscope, supporting 18 languages, and InternVL3, built on InternViT and Qwen2.5VL, demonstrate advances in multilingual vision and reasoning. The fusion model TransMamba combines transformer precision with speed via SSM mechanisms. Alibaba's FantasyTalking generates realistic talking portraits. Agent-focused events at CMU and tools like FilmAgent AI for virtual film production and BrowseComp benchmark for browsing agents were announced. The coding assistant Augment supports multiple IDEs with code analysis and suggestions. Discussions also covered Google’s new agent-to-agent protocol concept.

Apr 08, 2025

Llama 4's Controversial Weekend Release

llama-4 llama-3 llama-3-2 meta mixture-of-experts early-fusion attention-mechanisms fp8-training training-data benchmarking model-performance model-release multimodality open-models ahmad_al_dahle ylecun reach_vb yuchenj_uw

Meta released Llama 4, featuring two new medium-size MoE open models and a promised 2-trillion-parameter "Behemoth" model, aiming to be the largest open model ever. The release included advanced training techniques like Chameleon-like early fusion with MetaCLIP, interleaved chunked attention without RoPE, native FP8 training, and training on up to 40 trillion tokens. Despite the hype, the release faced criticism for lack of transparency compared to Llama 3, implementation issues, and poor performance on some benchmarks. Meta leadership, including Ahmad Al Dahle, denied allegations of training on test sets. The smallest model, Scout, at 109B parameters is too large for consumer GPUs, and the claimed 10-million-token context is disputed. The community response has been mixed, with some praising the openness and others pointing out discrepancies and quality concerns.

Apr 05, 2025

not much happened today

o3 o4-mini gpt-5 sonnet-3.7 gemma-3 qwen-2.5-vl gemini-2.5-pro gemma-7b llama-3-1-405b openai deepseek anthropic google meta-ai-fair inference-scaling reward-modeling coding-models ocr model-preview rate-limiting model-pricing architectural-advantage benchmarking long-form-reasoning attention-mechanisms mixture-of-experts gpu-throughput sama akhaliq nearcyan fchollet reach_vb philschmid teortaxestex epochairesearch omarsar0

OpenAI announced that o3 and o4-mini models will be released soon, with GPT-5 expected in a few months, delayed for quality improvements and capacity planning. DeepSeek introduced Self-Principled Critique Tuning (SPCT) to enhance inference-time scalability for generalist reward models. Anthropic's Sonnet 3.7 remains a top coding model. Google's Gemma 3 is available on KerasHub, and Qwen 2.5 VL powers a new Apache 2.0 licensed OCR model. Gemini 2.5 Pro entered public preview with increased rate limits and pricing announced, becoming a preferred model for many tasks except image generation. Meta's architectural advantage and the FrontierMath benchmark challenge AI's long-form reasoning and worldview development. Research reveals LLMs focus attention on the first token as an "attention sink," preserving representation diversity, demonstrated in Gemma 7B and Llama 3.1 models. MegaScale-Infer offers efficient serving of large-scale Mixture-of-Experts models with up to 1.90x higher per-GPU throughput.

Mar 10, 2025

not much happened today

gpt-4.5 claude-3.7-sonnet deepseek-r1 smolagents-codeagent gpt-4o llama-3-8b tinyr1-32b-preview r1-searcher forgetting-transformer nanomoe openai deepseek hugging-face mixture-of-experts reinforcement-learning kv-cache-compression agentic-ai model-distillation attention-mechanisms model-compression minimax model-pretraining andrej-karpathy cwolferesearch aymericroucher teortaxestex jonathanross321 akhaliq

The AI news recap highlights several key developments: nanoMoE, a PyTorch implementation of a mid-sized Mixture-of-Experts (MoE) model inspired by Andrej Karpathy's nanoGPT, enables pretraining on commodity hardware within a week. An agentic leaderboard ranks LLMs powering smolagents CodeAgent, with GPT-4.5 leading, followed by Claude-3.7-Sonnet. Discussions around DeepSeek-R1 emphasize AI model commoditization, with DeepSeek dubbed the "OpenAI of China." Q-Filters offer a training-free method for KV cache compression in autoregressive models, achieving 32x compression with minimal perplexity loss. The PokéChamp minimax language agent, powered by GPT-4o and Llama-3-8b, demonstrates strong performance in Pokémon battles. Other notable models include TinyR1-32B-Preview with Branch-Merge Distillation, R1-Searcher incentivizing search capability via reinforcement learning, and the Forgetting Transformer using a Forget Gate in softmax attention. These advancements reflect ongoing innovation in model architectures, compression, reinforcement learning, and agentic AI.

Dec 27, 2024

DeepSeek v3: 671B finegrained MoE trained for $5.5m USD of compute on 15T tokens

deepseek-v3 gpt-4o claude-3.5-sonnet llama-3 deepseek-ai hugging-face openai anthropic mixture-of-experts model-training model-optimization reinforcement-learning chain-of-thought multi-token-prediction synthetic-data model-distillation fine-tuning attention-mechanisms gpu-optimization nrehiew_ denny_zhou

DeepSeek-V3 has launched with 671B MoE parameters, trained on 14.8T tokens, and outperforms GPT-4o and Claude 3.5 Sonnet in benchmarks. It was trained with only 2.788M H800 GPU-hours, significantly less than Llama-3's 30.8M GPU-hours, showcasing major compute efficiency and cost reduction. The model is open-source and deployed via Hugging Face with API support. Innovations include native FP8 mixed precision training, Multi-Head Latent Attention scaling, distillation from synthetic reasoning data, pruning and healing for MoEs with up to 256 experts, and a new multi-token prediction objective enabling lookahead token planning. Research highlights also cover the OREO method and Natural Language Reinforcement Learning (NLRL) for multi-step reasoning and agent control.
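The GPU-hour figures above can be turned into the headline dollar estimate with simple arithmetic. The sketch below assumes a rental price of $2 per H800 GPU-hour, which is not stated in this summary and is labeled as an assumption in the code; the variable and function names are illustrative.

```python
H800_GPU_HOURS = 2.788e6          # DeepSeek-V3 training, from the summary
LLAMA3_GPU_HOURS = 30.8e6         # Llama-3 comparison figure, from the summary
ASSUMED_USD_PER_GPU_HOUR = 2.0    # assumption: rental price, not in the summary


def training_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Compute cost as GPU-hours times an hourly rental price."""
    return gpu_hours * usd_per_gpu_hour


deepseek_cost = training_cost_usd(H800_GPU_HOURS, ASSUMED_USD_PER_GPU_HOUR)
gpu_hour_ratio = LLAMA3_GPU_HOURS / H800_GPU_HOURS

print(f"Estimated compute cost: ${deepseek_cost / 1e6:.2f}M")
print(f"Llama-3 used about {gpu_hour_ratio:.1f}x the GPU-hours")
```

Under that price assumption the estimate lands near the $5.5M in the headline, and the GPU-hour gap to Llama-3 is roughly an order of magnitude.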

Oct 24, 2024

not much happened today

claude-3.5-sonnet claude-3.5-haiku o1-preview mochi-1 stable-diffusion-3.5 embed-3 kerashub differential-transformer anthropic openai cohere microsoft computer-use coding-performance video-generation fine-tuning multimodality transformers attention-mechanisms model-optimization alexalbert fchollet rasbt

Anthropic released upgraded Claude 3.5 Sonnet and Claude 3.5 Haiku models featuring a new computer use capability that allows interaction with computer interfaces via screenshots and actions like mouse movement and typing. The Claude 3.5 Sonnet achieved state-of-the-art coding performance on SWE-bench Verified with a 49% score, surpassing OpenAI's o1-preview. Anthropic focuses on teaching general computer skills rather than task-specific tools, with expected rapid improvements. Other releases include Mochi 1, an open-source video generation model, Stable Diffusion 3.5 with Large and Medium variants, and Embed 3 by Cohere, a multimodal embedding model for text and image search. KerasHub was launched by François Chollet, unifying KerasNLP and KerasCV with 37 pretrained models. Microsoft introduced the Differential Transformer to reduce attention noise via differential attention maps, and research on transformer attention layers was shared by Rasbt.
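As a rough illustration of "differential attention maps", the sketch below computes attention weights for a single query as the difference of two softmax maps. It is a simplified reading of the idea, not Microsoft's implementation: the scaling factor `lam` is treated here as a fixed scalar, the scores are given directly rather than derived from query and key projections, and all names are hypothetical.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def differential_weights(scores_a: List[float], scores_b: List[float],
                         lam: float) -> List[float]:
    """Subtract a second attention map, scaled by lam, from the first."""
    map_a = softmax(scores_a)
    map_b = softmax(scores_b)
    return [a - lam * b for a, b in zip(map_a, map_b)]


# Map A attends to token 0 plus some spread-out noise; map B is pure noise.
weights = differential_weights([4.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0], lam=0.5)
```

Weight that both maps place on irrelevant tokens is reduced by the subtraction (and can go negative), while the token that only the first map emphasizes keeps most of its weight.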

Aug 09, 2024

Too Cheap To Meter: AI prices cut 50-70% in last 30 days

gpt-4o gpt-4o-mini llama-3-1-405b mistral-large-2 gemini-1.5-flash deepseek-v2 sonnet-3.5 exaone-3.0 minicpm-v-2.6 claude-3.5 gpt-4o-2024-08-06 llamaindex together-ai deepinfra deepseek-ai mistral-ai google-deepmind lg-ai-research price-cuts context-caching instruction-tuning vision benchmarks pytorch attention-mechanisms reinforcement-learning-from-human-feedback compute-optimal-scaling rohanpaul_ai akhaliq mervenoyann sophiamyang chhillee karpathy

Gemini 1.5 Flash has cut prices by approximately 70%, offering a highly competitive free tier of 1 million tokens per minute at $0.075/mtok, intensifying the AI model price war. Other significant price reductions include GPT-4o (~50% cut to $2.50/mtok), GPT-4o mini (70-98.5% cut to $0.15/mtok), Llama 3.1 405B (46% cut to $2.7/mtok), and Mistral Large 2 (62% cut to $3/mtok). DeepSeek V2 introduced context caching, reducing input token costs by up to 90% to $0.014/mtok. New model releases include Llama 3.1 405B, Sonnet 3.5, EXAONE-3.0 (7.8B instruction-tuned by LG AI Research), and MiniCPM V 2.6 (vision-language model combining SigLIP 400M and Qwen2-7B). Benchmarks show Mistral Large performing well on ZebraLogic and Claude-3.5 leading LiveBench. FlexAttention, a new PyTorch API, simplifies and optimizes attention mechanisms. Andrej Karpathy analyzed RLHF, highlighting its limitations compared to traditional reinforcement learning. Google DeepMind research on compute-optimal scaling was also summarized.
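The context-caching numbers above imply an uncached input price and a blended price that depends on how often requests hit the cache. The sketch below works that out; the cache hit rates are made-up inputs for illustration, and the function name is hypothetical.

```python
CACHED_USD_PER_MTOK = 0.014   # cached input price, from the summary
CACHE_DISCOUNT = 0.90         # "up to 90%" reduction, from the summary

# Implied uncached input price: cached = uncached * (1 - discount).
uncached_usd_per_mtok = CACHED_USD_PER_MTOK / (1 - CACHE_DISCOUNT)


def blended_input_price(hit_rate: float) -> float:
    """Average input price per million tokens for a given cache hit rate."""
    return hit_rate * CACHED_USD_PER_MTOK + (1 - hit_rate) * uncached_usd_per_mtok


for hit_rate in (0.0, 0.5, 0.9):  # hypothetical hit rates
    print(f"hit rate {hit_rate:.0%}: ${blended_input_price(hit_rate):.4f}/mtok")
```

The savings scale linearly with hit rate, so workloads with long shared prefixes (system prompts, repeated documents) benefit most.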

Jul 12, 2024

FlashAttention 3, PaliGemma, OpenAI's 5 Levels to Superintelligence

flashattention-3 paligemma-3b gemma-2b numinamath-7b deepseekmath-7b codellama-34b wizardcoder-python-34b-v1.0 chatgpt-3.5 openai together-ai google hugging-face deepseek code-llama attention-mechanisms fp8-training vision prefix-lm superintelligence fine-tuning chain-of-thought tool-integrated-reasoning self-consistency-decoding python coding-capabilities elo-ratings ilya-sutskever lucas-giffman

FlashAttention-3 introduces fast and accurate attention optimized for H100 GPUs, advancing native FP8 training. PaliGemma, a versatile 3B Vision-Language Model (VLM) combining a SigLIP-So400m ViT encoder with the Gemma-2B language model, emphasizes a prefix-LM architecture for improved image-query interaction. OpenAI reveals a framework on levels of superintelligence, signaling progress toward Level 2 and highlighting internal safety disagreements. On Reddit, NuminaMath 7B, fine-tuned from DeepSeekMath-7B, wins the AI Math Olympiad by solving 29 problems using iterative supervised fine-tuning and tool-integrated reasoning. Open-source LLMs like CodeLlama-34b and WizardCoder-Python-34B-V1.0 are closing the coding performance gap with closed models such as ChatGPT-3.5.

Jul 03, 2024

GraphRAG: The Marriage of Knowledge Graphs and RAG

gemma-2 llama-3-70b claude-3.5-sonnet nemotron-340b qwen2-72b llama-3 microsoft-research anthropic nvidia hugging-face retrieval-augmented-generation knowledge-graphs token-usage inference-time attention-mechanisms instruction-following coding math long-range-reasoning synthetic-data dataset-release fine-tuning context-windows function-calling travis-fischer rasbt alexandr-wang osanseviero rohanpaul_ai hamelhusain svpino aaaazzam omarsar0

Microsoft Research open-sourced GraphRAG, a retrieval-augmented generation (RAG) technique that extracts knowledge graphs from sources and clusters them for improved LLM answers, though it increases token usage and inference time. Gemma 2 models were released focusing on efficient small LLMs with innovations like sliding window attention and RMS norm, nearly matching the larger Llama 3 70B. Anthropic's Claude 3.5 Sonnet leads in instruction following and coding benchmarks, while Nvidia's Nemotron 340B model was released in June. Qwen2-72B tops the Hugging Face Open LLM Leaderboard, excelling in math and long-range reasoning. Discussions on RAG highlighted its limitations and improvements in context usage via function calls. A persona-driven synthetic data generation approach introduced 1 billion personas, with a fine-tuned model matching GPT-4 performance on math benchmarks at 7B scale. The 200GB AutoMathText dataset was also noted for math data synthesis.
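Sliding window attention and RMS norm are both small enough to sketch directly. The code below is a generic illustration, not Gemma 2's implementation; the window size, the `eps` default, and the function names are assumptions.

```python
import math
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[q][k] is True when query q may attend to key k.

    Causal (k <= q) and limited to the most recent `window` positions.
    """
    return [[0 <= q - k < window for k in range(seq_len)]
            for q in range(seq_len)]


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


mask = sliding_window_mask(seq_len=5, window=2)
normed = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
```

With a fixed window, each query attends to at most `window` keys, so attention cost and cache size per layer stop growing with sequence length.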

Jun 28, 2024

Gemma 2: The Open Model for Everyone

gemma-2 qwen-72b mixtral-8x22b-instruct claude-3.5-sonnet google-deepmind alibaba mistral-ai anthropic knowledge-distillation attention-mechanisms multilingual-models multimodality model-training model-optimization memory-optimization fine-tuning kathleen-kenealy daniel-han

Gemma 2, a 27B-parameter model from Google DeepMind, was released with innovations like 1:1 local-global attention alternation and logit soft-capping, leveraging knowledge distillation to train smaller models on over 50× the compute-optimal token quantity. The model supports multilingual and multimodal capabilities, with fine-tuning success on over 200 Indic language variants. The Open LLM Leaderboard highlights Alibaba's Qwen 72B as the top model, with Mistral AI's Mixtral-8x22B-Instruct also ranking highly. Anthropic launched Claude 3.5 Sonnet, improving intelligence at mid-tier cost and speed. Research on eliminating matrix multiplication in LLMs promises significant memory savings without performance loss. Kathleen Kenealy and Daniel Han provided insights on Gemma 2's tokenizer and attention scaling respectively.
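Logit soft-capping is commonly written as a scaled tanh, which keeps logits inside a fixed range while leaving small values almost unchanged. The sketch below shows that form; the cap value of 50.0 is an assumption chosen for illustration, and `soft_cap` is a hypothetical name.

```python
import math


def soft_cap(logit: float, cap: float) -> float:
    """Smoothly squash a logit into the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)


ASSUMED_CAP = 50.0  # assumption: illustrative cap value

small = soft_cap(0.1, ASSUMED_CAP)      # nearly unchanged
large = soft_cap(1000.0, ASSUMED_CAP)   # saturates at the cap
negative = soft_cap(-1000.0, ASSUMED_CAP)
```

Unlike hard clipping, the tanh form stays differentiable everywhere, so gradients still flow for large logits, just with reduced magnitude.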

Jun 22, 2024

Shazeer et al (2024): you are overpaying for inference >13x

claude-3.5-sonnet claude-3-opus character.ai anthropic memory-efficiency kv-cache attention-mechanisms stateful-caching int8-precision transformer-architecture scaling overfitting architecture noam-shazeer kevin-a-fischer sebastien-bubeck _aidan_clark_ andrej-karpathy

Noam Shazeer explains how Character.ai serves LLM inference at roughly 20% of Google Search's request volume while reducing serving costs by a factor of 33 compared to late 2022, with leading commercial APIs costing at least 13.5X more. Key memory-efficiency techniques include using MQA instead of GQA to reduce KV cache size by 8X, hybrid attention horizons, cross-layer KV-sharing, stateful caching with a 95% cache hit rate, and native int8 precision with custom kernels. Anthropic released Claude 3.5 Sonnet, which outperforms Claude 3 Opus at twice the speed and one-fifth the cost, passing 64% of internal pull request tests and introducing new features like Artifacts for real-time doc and code generation. Discussions on LLM architecture highlight the dominance of transformers, challenges in scaling and overfitting, and the importance of architecture work for progress.

Dec 24, 2023

12/23/2023: NeurIPS Best Papers of 2023

gpt-4 palm2 hermes-2.5 mistral-7b nous-research hugging-face apple context-length malware-security video-content music-content linear-layers api-access large-language-models embedding vector-databases model-merging model-interpretability striped-hyena-architecture quantization rmsnorm attention-mechanisms

The Latent Space Pod released a 3-hour recap of the best NeurIPS 2023 papers. The Nous Research AI Discord community discussed optimizing AI performance with shorter context lengths, malware security concerns linked to Hugging Face, and shared insights on video and music content. Technical discussions included the DYAD research paper proposing a faster alternative to linear layers, Apple's ML Ferret machine learning tool, and accessing PaLM 2 via API. The community also explored large language models, focusing on specialized models, data scaling, embedding/vector databases, model merging, and interpretability, with mentions of Hermes 2.5, GPT-4, and Mistral. Additionally, there were conversations on the StripedHyena architecture, quantization challenges, and fixes related to RMSNorm and the "Attention Is All You Need" paper.

Dec 08, 2023

12/8/2023 - Mamba v Mistral v Hyena

mistral-8x7b-moe mamba-3b stripedhyena-7b claude-2.1 gemini gpt-4 dialogrpt-human-vs-machine cybertron-7b-v2-gguf falcon-180b mistral-ai togethercompute stanford anthropic google hugging-face mixture-of-experts attention-mechanisms prompt-engineering alignment image-training model-deployment gpu-requirements cpu-performance model-inference long-context model-evaluation open-source chatbots andrej-karpathy tri-dao maxwellandrews raddka

Three new AI models are highlighted: Mistral's 8x7B MoE model (Mixtral), Mamba models up to 3B by Together, and StripedHyena 7B, a competitive subquadratic attention model from Stanford's Hazy Research. Discussions on Anthropic's Claude 2.1 focus on its prompting technique and alignment challenges. Google's Gemini is noted as potentially superior to GPT-4. The community also explores Dreambooth for image training and shares resources like the DialogRPT-human-vs-machine model on Hugging Face. Deployment challenges for large language models, including CPU performance and GPU requirements, are discussed with references to Falcon 180B and transformer batching techniques. User engagement includes meme sharing and humor.