Topic: "gpu-scaling"
not much happened today.
Tags: gpt-5.2-codex, glm-4.7, openai, cursor, github, cerebras, modal, artificial-analysis, vllm, long-running-tasks, autonomous-agents, code-generation, inference-speed, latency, batch-inference, gpu-scaling, model-evaluation, agent-systems, operational-scaling, swyx, kevinweil, pierceboggan, mntruell, scaling01
OpenAI launched the GPT-5.2-Codex API, touted as its strongest coding model for long-running tasks and cybersecurity. Cursor integrated GPT-5.2-Codex and let it run autonomously for a week building a browser, producing over 3 million lines of Rust code, while GitHub incorporated the model into its code tools, easing enterprise adoption. Discussions highlighted the importance of review loops in agent systems and debated evaluation metrics for coding models. OpenAI partnered with Cerebras to improve inference speed and latency; Cerebras is also serving GLM-4.7 at 1,445 tokens/sec with low latency, and provider benchmarking reveals tradeoffs in throughput, latency, and context window size. Modal shared operational lessons from self-hosted inference fleets of 20k GPUs, focusing on batch inference optimization with vLLM and the FlashInfer backend. Overall the focus was on inference infrastructure, long-horizon autonomous agents, and coding model evaluation.
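For a concrete sense of the building block Modal is scaling, here is a minimal offline batch-inference sketch with vLLM; the model name and the FlashInfer attention-backend environment variable are illustrative assumptions, not details from the post.

```python
# Minimal sketch of offline batch inference with vLLM, loosely in the spirit
# of the fleet-scaling write-up above. The model name is a placeholder and the
# FlashInfer backend selection is an assumption; verify the env var against
# your installed vLLM version.
import os

# Select the FlashInfer attention backend before the engine is constructed.
os.environ.setdefault("VLLM_ATTENTION_BACKEND", "FLASHINFER")

from vllm import LLM, SamplingParams


def run_batch(prompts: list[str]) -> list[str]:
    """Generate completions for a whole batch of prompts in one call."""
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
    params = SamplingParams(temperature=0.0, max_tokens=256)
    # vLLM handles continuous batching internally, which is what makes large
    # offline jobs throughput-efficient compared to one request at a time.
    outputs = llm.generate(prompts, params)
    return [o.outputs[0].text for o in outputs]


if __name__ == "__main__":
    print(run_batch(["Summarize ring attention in one sentence."]))
```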
Summer of Code AI: $1.6b raised, 1 usable product
Tags: ltm-2, llama-3-1-405b, gemini-advanced, cognition, poolside, codeium, magic, google-deepmind, nvidia, google-cloud, long-context, model-efficiency, custom-hardware, cuda, training-stack, gpu-scaling, neural-world-models, diffusion-models, quantization, nat-friedman, ben-chess, rohan-paul
Code + AI is emphasized as a key modality in AI engineering, with productivity and verifiability benefits. Recent major funding rounds include Cognition AI raising $175M, Poolside raising $400M, Codeium raising $150M, and Magic raising $320M. Magic announced its LTM-2 model with a 100-million-token context window, claiming its sequence-dimension algorithm is roughly 1000x cheaper than Llama 3.1 405B's attention at that context length, with drastically lower memory requirements. Magic's stack is built from scratch with custom CUDA and no open-source foundations; the company partnered with Google Cloud and runs on NVIDIA H100 and GB200 GPUs, aiming to scale to tens of thousands of GPUs. Google DeepMind revealed updates to Gemini Advanced with customizable expert "Gems." Neural game engines like GameNGen can run DOOM inside a diffusion model trained on 0.9B frames. The issue also references LLM quantization research shared by Rohan Paul.
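To make the memory claim concrete, here is a back-of-envelope KV-cache estimate for a dense transformer at a 100M-token context. The Llama 3.1 405B-like parameters (126 layers, 8 KV heads via grouped-query attention, head dim 128, fp16) are assumptions for illustration, not figures from the newsletter.

```python
# Rough KV-cache size for a Llama-3.1-405B-like model at a 100M-token context.
# All architecture numbers below are assumptions used only for illustration.

LAYERS = 126
KV_HEADS = 8          # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # fp16
CONTEXT_TOKENS = 100_000_000

# Each token stores one K and one V vector per layer (hence the factor of 2).
bytes_per_token = LAYERS * KV_HEADS * HEAD_DIM * 2 * BYTES_PER_VALUE
total_bytes = bytes_per_token * CONTEXT_TOKENS

print(f"KV cache per token: {bytes_per_token / 1e6:.2f} MB")
print(f"KV cache at 100M tokens: {total_bytes / 1e12:.1f} TB")
print(f"~{total_bytes / 80e9:.0f} H100s (80 GB each) just to hold the cache")
```

Under these assumptions the cache alone is on the order of 50 TB, i.e. hundreds of H100s per 100M-token context, which is why a sequence-dimension algorithm that avoids a per-token KV cache changes the economics so sharply.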
Ring Attention for >1M Context
Tags: gemini-pro, gemma-7b, gemma-2b, deepseek-coder-6.7b-instruct, llama-cpp, google, cuda-mode, nvidia, polymind, deepseek, ollama, runpod, lmstudio, long-context, ringattention, pytorch, cuda, llm-guessing-game, chatbots, retrieval-augmented-generation, vram-optimization, fine-tuning, dynamic-prompt-optimization, ml-workflows, gpu-scaling, model-updates, liu, zaharia, abbeel
Google Gemini Pro has sparked renewed interest in long-context capabilities. The CUDA MODE Discord is actively implementing the RingAttention paper by Liu, Zaharia, and Abbeel, including extensions from the World Model with RingAttention paper, with PyTorch and CUDA implementations available. TheBloke Discord covered LLM guessing-game evaluation, chatbot UX comparisons between Nvidia's Chat with RTX and Polymind, challenges integrating retrieval-augmented generation (RAG), VRAM optimization, fine-tuning for character roleplay with DPO, and model choices such as deepseek-coder-6.7B-instruct. There was also discussion of ML workflows on Mac Studio, with a preference for llama.cpp over ollama, and of scaling inference cost-effectively with GPUs like the 4090 on Runpod. LM Studio users must manually update to version 0.2.16, which adds support for Gemma models and bug fixes, especially for macOS. Gemma 7B has had performance issues, while Gemma 2B received positive feedback.
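As a rough illustration of what the CUDA MODE group is building, the sketch below simulates the blockwise computation behind RingAttention in a single PyTorch process: each query block "owns" one shard of the sequence and visits the K/V blocks in ring order, combining partial results with an online softmax. The real algorithm places each block on its own device and overlaps K/V transfer with compute; device communication and causal masking are omitted here, and all shapes are arbitrary.

```python
# Single-process simulation of RingAttention-style blockwise attention.
# Not the distributed implementation discussed on the Discord: the "ring" is
# just an iteration order over K/V blocks, and causal masking is skipped.
import torch


def ring_attention_sim(q, k, v, n_blocks=4):
    """q, k, v: (seq, dim) tensors. Returns the (seq, dim) attention output."""
    d = q.shape[-1]
    q_blocks = q.chunk(n_blocks)                      # each "device" owns one Q block
    kv_blocks = list(zip(k.chunk(n_blocks), v.chunk(n_blocks)))
    outputs = []
    for i, qi in enumerate(q_blocks):
        # Online-softmax accumulators, as in blockwise / flash attention.
        m = torch.full((qi.shape[0],), float("-inf"))
        l = torch.zeros(qi.shape[0])
        acc = torch.zeros_like(qi)
        # Visit K/V blocks in ring order, starting from this device's own block.
        for step in range(n_blocks):
            kj, vj = kv_blocks[(i + step) % n_blocks]
            s = qi @ kj.T / d ** 0.5
            m_new = torch.maximum(m, s.max(dim=-1).values)
            p = torch.exp(s - m_new[:, None])
            scale = torch.exp(m - m_new)
            l = l * scale + p.sum(dim=-1)
            acc = acc * scale[:, None] + p @ vj
            m = m_new
        outputs.append(acc / l[:, None])
    return torch.cat(outputs)


if __name__ == "__main__":
    torch.manual_seed(0)
    q, k, v = (torch.randn(64, 32) for _ in range(3))
    ref = torch.softmax(q @ k.T / 32 ** 0.5, dim=-1) @ v
    print(torch.allclose(ring_attention_sim(q, k, v), ref, atol=1e-5))
```

The check at the end confirms the blockwise accumulation matches full attention; the memory win comes from never materializing the full attention matrix, and the communication win in the real system comes from overlapping the ring transfers with per-block compute.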