Topic: "gpu-scaling"
Summer of Code AI: $1.6b raised, 1 usable product
ltm-2 llama-3-1-405b gemini-advanced cognition poolside codeium magic google-deepmind nvidia google-cloud long-context model-efficiency custom-hardware cuda training-stack gpu-scaling neural-world-models diffusion-models quantization nat-friedman ben-chess rohan-paul
Code + AI is emphasized as a key modality in AI engineering because of its productivity and verifiability benefits. Recent major funding rounds include Cognition AI ($175M), Poolside ($400M), Codeium ($150M), and Magic ($320M). Magic announced its LTM-2 model with a 100-million-token context window, claiming a sequence-dimension algorithm about 1000x cheaper than that of Llama 3.1 405B, along with drastically lower memory requirements. Magic's stack is built from scratch with custom CUDA and no open-source foundations; the company has partnered with Google Cloud, is powered by NVIDIA H100 and GB200 GPUs, and aims to scale to tens of thousands of GPUs. Google DeepMind revealed updates to Gemini Advanced with customizable expert "Gems." Neural game engines like GameNGen can run DOOM in a diffusion model trained on 0.9B frames. The issue also references LLM quantization research by Rohan Paul.
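To give a sense of why memory becomes the bottleneck at a 100M-token context, here is a back-of-the-envelope sketch of the key/value cache a conventional transformer would need. The architecture numbers (126 layers, 8 KV heads, head dimension 128) are assumptions taken from the publicly released Llama 3.1 405B configuration, and the 2-byte cache precision and 80 GB per H100 are assumptions as well. This is not Magic's own calculation and says nothing about how LTM-2 works internally.

```python
# Rough KV-cache sizing for a standard transformer at very long context.
# All defaults are assumptions based on the published Llama 3.1 405B config.

def kv_cache_bytes(num_tokens, num_layers=126, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` of context."""
    # Factor of 2 covers one key vector and one value vector per head per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token

context_tokens = 100_000_000
total_bytes = kv_cache_bytes(context_tokens)

# Assumed 80 GB of HBM per H100; model weights and activations are ignored.
h100_bytes = 80 * 10**9
h100_count = -(-total_bytes // h100_bytes)  # ceiling division

print(f"per token:   {kv_cache_bytes(1)} bytes")       # 516096 bytes
print(f"100M tokens: {total_bytes / 1e12:.1f} TB")     # 51.6 TB
print(f"H100s for the cache alone: {h100_count}")      # 646
```

Under these assumptions the cache for a single 100M-token context would occupy several hundred GPUs before any weights are loaded, which is the kind of cost a cheaper sequence-dimension algorithm would need to avoid.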
Ring Attention for >1M Context
gemini-pro gemma-7b gemma-2b deepseek-coder-6.7b-instruct llama-cpp google cuda-mode nvidia polymind deepseek ollama runpod lmstudio long-context ringattention pytorch cuda llm-guessing-game chatbots retrieval-augmented-generation vram-optimization fine-tuning dynamic-prompt-optimization ml-workflows gpu-scaling model-updates liu zaharia abbeel
Google Gemini Pro has sparked renewed interest in long-context capabilities. The CUDA MODE Discord is actively working on implementing the RingAttention paper by Liu, Zaharia, and Abbeel, including extensions from the World Model RingAttention paper, with PyTorch and CUDA implementations available. TheBloke Discord discussed LLM guessing-game evaluation, chatbot UX comparisons between NVIDIA's Chat with RTX and Polymind, challenges in integrating retrieval-augmented generation (RAG), VRAM optimization, fine-tuning for character roleplay using Direct Preference Optimization (DPO), and model choices such as deepseek-coder-6.7B-instruct. There was also discussion of ML workflows on Mac Studio, with a preference for llama.cpp over ollama, and of scaling inference cost-effectively using GPUs like the 4090 on Runpod. LM Studio users must update manually to version 0.2.16, which adds support for Gemma models and includes bug fixes, especially for macOS. The Gemma 7B model has had performance issues, while Gemma 2B received positive feedback.
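The core idea behind ring-style attention can be sketched in a few lines: each device keeps its own block of queries, key/value blocks are passed around a ring, and partial results are merged with a running (online) softmax so no device ever holds the full sequence. The sketch below is a single-process simulation in pure Python under simplifying assumptions (no causal mask, no overlap of communication and compute, hypothetical function names); it is not the paper's or the CUDA MODE implementation.

```python
import math
import random

def full_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v over the whole sequence."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out

def ring_attention(q_blocks, k_blocks, v_blocks):
    """Simulated ring: device i owns q_blocks[i]; K/V blocks rotate each step."""
    n = len(q_blocks)
    d = len(q_blocks[0][0])
    dv = len(v_blocks[0][0])
    # Per query row: running max score, running normaliser, running weighted sum.
    state = [[(-math.inf, 0.0, [0.0] * dv) for _ in qb] for qb in q_blocks]
    for step in range(n):
        for dev in range(n):
            src = (dev - step) % n  # K/V block this device holds at this step
            kb, vb = k_blocks[src], v_blocks[src]
            for r, qi in enumerate(q_blocks[dev]):
                m, z, acc = state[dev][r]
                scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                          for kj in kb]
                m_new = max(m, max(scores))
                scale = math.exp(m - m_new)  # rescale earlier partial sums
                w = [math.exp(s - m_new) for s in scores]
                z = z * scale + sum(w)
                acc = [a * scale + sum(wj * vj[c] for wj, vj in zip(w, vb))
                       for c, a in enumerate(acc)]
                state[dev][r] = (m_new, z, acc)
    return [[[a / z for a in acc] for (_, z, acc) in dev_state]
            for dev_state in state]

def split(rows, n):
    """Split a sequence of rows into n equal contiguous blocks."""
    size = len(rows) // n
    return [rows[i * size:(i + 1) * size] for i in range(n)]

random.seed(0)
seq_len, dim, devices = 8, 4, 4
q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
k = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
v = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]

ring_out = [row for block in
            ring_attention(split(q, devices), split(k, devices), split(v, devices))
            for row in block]
ref_out = full_attention(q, k, v)
max_err = max(abs(a - b) for ra, rb in zip(ring_out, ref_out)
              for a, b in zip(ra, rb))
print("matches full attention:", max_err < 1e-9)
```

Because each device only ever stores one K/V block at a time, per-device memory scales with the block size rather than the full sequence length, which is what makes contexts beyond 1M tokens plausible when enough devices are available.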