All tags
Company: "polymind"
Ring Attention for >1M Context
gemini-pro gemma-7b gemma-2b deepseek-coder-6.7b-instruct llama-cpp google cuda-mode nvidia polymind deepseek ollama runpod lmstudio long-context ringattention pytorch cuda llm-guessing-game chatbots retrieval-augmented-generation vram-optimization fine-tuning dynamic-prompt-optimization ml-workflows gpu-scaling model-updates liu zaharia abbeel
Google Gemini Pro has sparked renewed interest in long context capabilities. The CUDA MODE Discord is actively working on implementing the RingAttention paper by Liu, Zaharia, and Abbeel, including extensions from the World Model RingAttention paper, with available PyTorch and CUDA implementations. TheBloke Discord discussed various topics including LLM guessing game evaluation, chatbot UX comparisons between Nvidia's Chat with RTX and Polymind, challenges in retrieval-augmented generation (RAG) integration, VRAM optimization, fine-tuning for character roleplay using Dynamic Prompt Optimization (DPO), and model choices like deepseek-coder-6.7B-instruct. There was also discussion on ML workflows on Mac Studio, with preferences for llama.cpp over ollama, and scaling inference cost-effectively using GPUs like the 4090 on Runpod. LM Studio users face manual update requirements for version 0.2.16, which includes support for Gemma models and bug fixes, especially for MacOS. The Gemma 7B model has had performance issues, while Gemma 2B received positive feedback.