
FSDP+QLoRA: the Answer to 70B-scale AI for desktop-class GPUs

**Jeremy Howard** and collaborators released a new tool combining **FSDP**, **QLoRA**, and **HQQ** to enable training **70B-parameter** models on affordable consumer GPUs like the **RTX 4090**, which has only **24GB** of VRAM, overcoming memory constraints that previously demanded data-center GPU systems costing over $150k. The approach shards a quantized model across multiple GPUs and adds techniques like gradient checkpointing and CPU offloading to make training efficient on desktop-class hardware. The blog post details the challenges and solutions of integrating these methods, highlighting a hardware cost reduction from ~$150k to under $2.5k for training large language models. Additionally, Twitter recaps mention **Inflection AI**'s **Inflection-2.5** model rivaling **GPT-4** on benchmarks with less training compute, **Grok** improving speed by 3x, and **Yann LeCun** discussing multi-step reasoning training for LLMs.


Jeremy Howard et al. are back with a new tool for overcoming the memory constraints of 70B-scale training (pretraining or finetuning alike) on desktop-class GPUs costing under $2.5k, a job that usually requires around $150k of hardware (4× H100s). These gaming GPUs max out at 24GB per RTX 4090 card, but a 70B-parameter LLM takes >140GB at 16-bit precision, just for the weights.

Here’s the key point: the gaming GPUs have similar performance to the data center GPUs that cost over 10x more! It would be great if we could use these 10x cheaper (but nearly as fast) cards to train large language models, but we can’t, because they have much less memory. The best currently available data center cards have 80GB RAM, whilst gaming cards max out at 24GB RAM. Since only the largest models produce the best results, creating the best models has been largely inaccessible to most people.

QLoRA Limitations

The blog post also gives a full account of QLoRA, Hugging Face's support for it, and the limitations the team ran into:

QLoRA didn’t quite slay the problem we set out to solve, to train a 70b model on 24GB cards, but it got closer than anything before. When quantized to 4 bits (which is 0.5 bytes), the 70b model takes 70/2 = 35 GB, which is larger than the 24GB gaming GPUs we want to use.

They also discuss the memory needed for training itself, including activations and batch sizing, all of which pushes the requirement well beyond a single 24GB card.
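To make the arithmetic above concrete, here is a back-of-the-envelope sketch (ours, not the blog post's) of the memory needed just to hold the weights at different precisions:

```python
# Back-of-the-envelope weight-memory math for a 70B-parameter model.
# Illustrative only: real training also needs activations, gradients,
# optimizer state, and temporary buffers on top of these figures.

PARAMS = 70e9  # 70 billion parameters

def weight_gb(bits_per_param: float) -> float:
    """GB needed just to hold the weights at the given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"16-bit: {weight_gb(16):.0f} GB")  # ~140 GB, far beyond one 24GB card
print(f" 8-bit: {weight_gb(8):.0f} GB")   # ~70 GB
print(f" 4-bit: {weight_gb(4):.0f} GB")   # ~35 GB, still more than 24 GB
```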

FSDP - Fully Sharded Data Parallel

FSDP solves the memory limitation by sharding a model's parameters across multiple GPUs, but at H100 scale the hardware itself is the problem: a 4× H100 system with 320GB of combined VRAM would cost around $150k.
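For orientation, here is a minimal sketch of how FSDP wraps a model in PyTorch; the tiny stand-in model and hyperparameters are our own, not the blog post's setup. It assumes a multi-GPU launch, e.g. torchrun --nproc_per_node=2 fsdp_sketch.py:

```python
# Minimal FSDP sketch (assumes a multi-GPU launch via torchrun).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

# Tiny stand-in for a large transformer.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()

# FSDP shards parameters, gradients, and optimizer state across ranks,
# so each GPU only ever holds a slice of the full model state.
model = FSDP(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 4096, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
```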

FSDP + QLoRA + HQQ

"We figured that if we could use QLoRA to reduce the size of a model by around 400% (so a 70b model would fit into 35GB RAM), and then we used FSDP to shard that across two or more 24GB consumer cards, that would leave enough RAM left over to train a model."

Two RTX 4090s would cost under $2.5k.

FSDP didn't work out of the box with QLoRA's quantization, so the team worked out how to get around assumptions baked into the FSDP, PEFT, and LoRA libraries/algorithms. They also used gradient checkpointing, CPU offloading, FlashAttention 2, and HQQ, each of which caused further integration issues. The blog post has many more fascinating details for those who want to dive in.
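As a rough illustration of two of those memory-saving knobs, gradient checkpointing and CPU offloading, here is a sketch built on public PyTorch APIs; it is our own simplification, not Answer.AI's actual training code:

```python
# Sketch of gradient checkpointing + CPU offloading under FSDP
# (public PyTorch APIs; not Answer.AI's training script).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import CPUOffload
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.checkpoint import checkpoint

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

class Block(torch.nn.Module):
    def __init__(self, dim: int = 4096):
        super().__init__()
        self.ff = torch.nn.Linear(dim, dim)

    def forward(self, x):
        # Gradient checkpointing: recompute this block's activations
        # during backward instead of storing them (trades compute for memory).
        return checkpoint(self.ff, x, use_reentrant=False)

model = torch.nn.Sequential(*[Block() for _ in range(4)]).cuda()

# CPU offloading: FSDP parks sharded parameters in host RAM and copies
# them to the GPU only while each shard is actually in use.
model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))
```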

The overall takeaway is clear: desktop-class GPUs can now train models that previously required six-figure data-center hardware.




PART X: AI Twitter Recap

all recaps done by Claude 3 Opus, best of 2 runs


Launches & Announcements

AI Capabilities & Benchmarks

AI Industry Analysis & Speculation

Engineering & ML Techniques

Memes & Humor


PART 0: Summary of Summaries of Summaries

Claude 3 Sonnet (14B?)

  1. Advancements in Memory-Efficient LLM Training:

    • Gradient Low-Rank Projection (GaLore) enables training the Llama 7B LLM on a single RTX 4090 GPU, reducing memory requirements for optimizer states by over 82% [Tweet]. This breakthrough could revolutionize LLM training accessibility.
    • A collaboration involving FSDP and QLoRA allows training 70B models on consumer GPUs like RTX 3090s [Blog Post], further democratizing large model development.
    • Discussions around combining GaLore with low-bit quantization techniques like HQQ and bitsandbytes [GitHub Repo] for potential compounded memory savings during fine-tuning.
  2. Cutting-Edge Language Model Releases and Comparisons:

    • Inflection AI claims their Inflection-2.5 model matches GPT-4 benchmarks while using less than half the compute for training [Tweet], though the claim wasn't highlighted in their official blog post.
    • Anticipation builds for OpenAI's next major model release, as competitors like Claude 3 seem to be outperforming OpenAI's current models according to some users.
    • Discussions around the performance of models like Sonnet, Opus, and Mixtral, with Sonnet praised for its impressive price-performance ratio at costs as low as $0.03 for 5k context and 1200 response length.
  3. Innovative AI Applications and Tools:

    • Doodle Wars is a multiplayer game where players compete in doodling skills evaluated by a neural network [Doodle Wars], showcasing the gamification potential of AI.
    • LangChain and Gradio were used to build a restaurant name and menu generator app [Demo], exemplifying the creative use of language models.
    • The release of over 6.6 million state and federal court decisions with datasets and embeddings [Tweet] enables legal precedent exploration powered by AI.
    • Prompt Mixer is a new desktop tool for building, testing, and iterating AI prompts with version tracking [Prompt Mixer], aiming to streamline prompt engineering workflows.
  4. Advancements in Efficient Attention Mechanisms:

    • Discussions around the mechanics of RelayAttention and its differences from ring/flash attention, with a GitHub repo showcasing vLLM with RelayAttention [GitHub].
    • Implementations of Flash Attention using CUDA are shared, like a minimal version in ~100 lines of CUDA code [GitHub].
    • The CuTe DSL from NVIDIA's FlashAttention repository is being studied to optimize tensor core utilization [GitHub].
    • Benchmarking and discussions around the performance impact of techniques like thread coarsening and vectorized operations in CUDA kernels (a minimal PyTorch-level attention sketch follows this list).
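For readers who want the memory-efficient attention behavior without diving into the CUDA repos above, here is a minimal sketch using PyTorch's built-in scaled_dot_product_attention, which dispatches to a FlashAttention-style fused kernel when shapes and dtypes allow; the tensor sizes are our own illustrative choices:

```python
# Minimal FlashAttention-style usage via PyTorch SDPA (not the linked repos).
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) in fp16 on GPU: shapes and dtype that
# are eligible for the fused kernel, which never materializes the full
# (seq_len x seq_len) attention matrix.
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```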

Claude 3 Opus (8x220B?)

ChatGPT (GPT4T)


PART 1: High level Discord summaries

Nous Research AI Discord Summary


LM Studio Discord Summary

LM Studio Hits Version 0.2.16: LM Studio's latest version is 0.2.16, resolving earlier terminal run errors and addressing GLIBC and CLBlast library issues. Compatibility discussions highlight challenges with the Gemma 7B GGUF and StarCoder2 models. For support with GGUF models, refer to Learning More About GGUF.

AMD's AI Hardware Optimism: AMD CEO Lisa Su's personal involvement in addressing Tiny Corp's GPU firmware concerns signals potential improvements for AI applications. AMD's article on running LLMs with AMD Ryzen™ and Radeon™ could help users run models locally, without internet dependence.

Rethinking Power Supply Units (PSUs) for AI: Discussions suggest a minimum 750W PSU for powering an RTX 3090, with the Razer Core X eGPU enclosure as an alternative. Debates on efficient hardware setups for language models weigh VRAM, power efficiency, and cost-effectiveness.

Integrating and Selecting GPUs in LM Studio: There's a call for a feature that lets users select a specific GPU in LM Studio, following incidents where the software defaulted to integrated graphics, causing performance issues with demanding AI models.

Evolving Open Interpreter Usage and Model Sharing: Conversations in #open-interpreter cover setting a custom system message via interpreter.system_message = "Your message" in Open Interpreter Python scripts (a short sketch follows below). Shared links to models such as LHK_DPO_v1 on Hugging Face (LHK_DPO_v1_GGUF) spotlight the community's exchange of AI insights, and concerns about the context-size limits of the FusionNet_7Bx2_MoE_14B model were raised on the forum here.
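As a quick illustration of the system-message pattern above, here is a hedged sketch of Open Interpreter's Python API (the import path follows recent open-interpreter releases and may vary by version; the message text is our own):

```python
# Sketch of customizing Open Interpreter's system message
# (import path per recent open-interpreter versions; may vary).
from interpreter import interpreter

# Replace (or append to) the default system prompt before chatting.
interpreter.system_message = "You are cautious; ask before running any code."
interpreter.chat("List the five largest files in the current directory.")
```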

Beta Release Buzz in LM Studio: Anticipation is building in the #beta-releases-chat for an imminent new release, with community members teasing the release and sharing humorous banter about the update's arrival.


LlamaIndex Discord Summary


Perplexity AI Discord Summary


Eleuther Discord Summary


OpenAI Discord Summary


HuggingFace Discord Summary


LAION Discord Summary


OpenRouter (Alex Atallah) Discord Summary


CUDA MODE Discord Summary


Latent Space Discord Summary

GaLore Lights Up GPU Potential: User @tiagoefreitas shared insights from @AnimaAnandkumar that Gradient Low-Rank Projection (GaLore) enables the Llama 7B LLM to be trained on a single RTX 4090 GPU, a potential step change in memory efficiency for both pre-training and fine-tuning, possibly enhanced further by 1-bit quantization; a hypothetical usage sketch follows below.
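For concreteness, here is a hypothetical sketch following the galore-torch package's README (the optimizer name, parameter-group keys, and values are assumptions drawn from that repo, not from the tweet above):

```python
# Hypothetical GaLore sketch, following the galore-torch README
# (https://github.com/jiaweizzhao/GaLore); details may differ by version.
import torch
from galore_torch import GaLoreAdamW

model = torch.nn.Linear(4096, 4096)  # stand-in for one LLM weight matrix

# GaLore projects gradients into a low-rank subspace, so Adam's first and
# second moments live in the small projected space rather than at full
# parameter size, which is where the optimizer-state savings come from.
param_groups = [{
    "params": model.parameters(),
    "rank": 128,             # rank of the gradient projection
    "update_proj_gap": 200,  # refresh the projection every N steps
    "scale": 0.25,           # scaling factor for projected updates
    "proj_type": "std",
}]
optimizer = GaLoreAdamW(param_groups, lr=1e-4)
```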

Inflection-2.5 Under the Microscope: Despite significant performance claims that Inflection-2.5 rivals GPT-4 with lower compute, @swyxio highlighted a gap in Inflection's official communication, observing the absence of this claim from their blog post detailing Inflection-2.5.

Democratizing AI Training with FSDP/QLoRA: @jeremyphoward's tweet about FSDP/QLoRA was shared by @fanahova, signaling a collaboration that enables training large models on home GPUs, while @fx2y pointed to support for quantization techniques like HQQ and bitsandbytes, shared via a GitHub repo link.

Yann LeCun Expounds on AI’s Horizons: Discussions steered towards Yann LeCun's Lex Fridman podcast episode, where he shared his visions for Meta AI, the limitations of current LLMs, and prospects for Contrastive Learning's future.

Data Privacy Concerns in Personal AI: @swyxio related their experience with Life Story, a personal biographer AI, prompting @tiagoefreitas to encourage development of local-hosted applications for better data security.

Inside the Depths of GPT: @ivanleomk and @1123457263638683770 led a session on the GPT-2 paper, highlighting materials that explain the concepts and implementation, alongside a discussion that included a clarification of "causal attention" and an introduction to an LLM Visualization tool.


LangChain AI Discord Summary


DiscoResearch Discord Summary


LLM Perf Enthusiasts AI Discord Summary


Datasette - LLM (@SimonW) Discord Summary


PART 2: Detailed by-Channel summaries and links

Nous Research AI ▷ #off-topic (17 messages🔥):

Links mentioned:


Nous Research AI ▷ #interesting-links (34 messages🔥):

Links mentioned:


Nous Research AI ▷ #announcements (1 messages):

Links mentioned:

NousResearch/Genstruct-7B · Hugging Face: no description found


Nous Research AI ▷ #general (289 messages🔥🔥):

Links mentioned:


Nous Research AI ▷ #ask-about-llms (83 messages🔥🔥):

Links mentioned:


LM Studio ▷ #💬-general (148 messages🔥🔥):

Links mentioned:


LM Studio ▷ #🤖-models-discussion-chat (68 messages🔥🔥):

Links mentioned:


LM Studio ▷ #🧠-feedback (1 messages):

heyitsyorkie: Stop using <#1113937247520170084> for help posts. Use <#1111440136287297637>


LM Studio ▷ #🎛-hardware-discussion (66 messages🔥🔥):

Links mentioned:


LM Studio ▷ #🧪-beta-releases-chat (4 messages):


LM Studio ▷ #amd-rocm-tech-preview (22 messages🔥):

Links mentioned:


LM Studio ▷ #crew-ai (3 messages):


LM Studio ▷ #open-interpreter (87 messages🔥🔥):

Links mentioned:


LlamaIndex ▷ #announcements (1 messages):

Links mentioned:

LlamaIndex user survey: Take this survey powered by surveymonkey.com.


LlamaIndex ▷ #blog (4 messages):

Links mentioned:

LlamaIndex user survey: Take this survey powered by surveymonkey.com.


LlamaIndex ▷ #general (339 messages🔥🔥):

Links mentioned:


Perplexity AI ▷ #general (269 messages🔥🔥):

Links mentioned:

Inflection-2.5: meet the world's best personal AI: We are an AI studio creating a personal AI for everyone. Our first AI is called Pi, for personal intelligence, a supportive and empathetic conversational AI.


Perplexity AI ▷ #sharing (6 messages):


Perplexity AI ▷ #pplx-api (14 messages🔥):


Eleuther ▷ #announcements (1 messages):


Eleuther ▷ #general (39 messages🔥):

Links mentioned:


Eleuther ▷ #research (107 messages🔥🔥):

Links mentioned:


Eleuther ▷ #lm-thunderdome (24 messages🔥):

Links mentioned:


Eleuther ▷ #multimodal-general (1 messages):


Eleuther ▷ #gpt-neox-dev (102 messages🔥🔥):

Links mentioned:


OpenAI ▷ #ai-discussions (94 messages🔥🔥):

Links mentioned:

GitHub - Kiddu77/Train_Anything: A repo to get you cracking with Neural Nets.


OpenAI ▷ #gpt-4-discussions (38 messages🔥):


OpenAI ▷ #prompt-engineering (54 messages🔥):


OpenAI ▷ #api-discussions (54 messages🔥):


HuggingFace ▷ #announcements (1 messages):

Links mentioned:


HuggingFace ▷ #general (142 messages🔥🔥):

Links mentioned:


HuggingFace ▷ #today-im-learning (4 messages):


HuggingFace ▷ #cool-finds (5 messages):

Links mentioned:


HuggingFace ▷ #i-made-this (13 messages🔥):

Links mentioned:


HuggingFace ▷ #reading-group (45 messages🔥):

Links mentioned:


HuggingFace ▷ #diffusion-discussions (1 messages):

Links mentioned:

ByteDance/SDXL-Lightning · finetune: no description found


HuggingFace ▷ #computer-vision (9 messages🔥):


HuggingFace ▷ #NLP (18 messages🔥):

Links mentioned:

Golden Retriever Dog GIF - Golden Retriever Dog Puppy


HuggingFace ▷ #diffusion-discussions (1 messages):

Links mentioned:

ByteDance/SDXL-Lightning · finetune: no description found


LAION ▷ #general (113 messages🔥🔥):

Links mentioned:


LAION ▷ #research (83 messages🔥🔥):

Links mentioned:


OpenRouter (Alex Atallah) ▷ #announcements (3 messages):

Links mentioned:


OpenRouter (Alex Atallah) ▷ #general (109 messages🔥🔥):


CUDA MODE ▷ #general (6 messages):

Links mentioned:


CUDA MODE ▷ #triton (1 messages):

marksaroufim: do you see where the link to join that meetup is?


CUDA MODE ▷ #cuda (78 messages🔥🔥):

Links mentioned:

cutlass/media/docs/cute at main · NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines.


CUDA MODE ▷ #torch (4 messages):


CUDA MODE ▷ #algorithms (5 messages):

Links mentioned:


CUDA MODE ▷ #beginner (4 messages):


CUDA MODE ▷ #ring-attention (12 messages🔥):


CUDA MODE ▷ #off-topic (1 messages):


Latent Space ▷ #ai-general-chat (31 messages🔥):

Links mentioned:


Latent Space ▷ #ai-announcements (3 messages):

Links mentioned:

Join the Latent Space (née /dev/invest) Discord Server!: Check out the Latent Space (née /dev/invest) community on Discord - hang out with 3061 other members and enjoy free voice and text chat.


Latent Space ▷ #llm-paper-club-west (30 messages🔥):

Links mentioned:


LangChain AI ▷ #general (21 messages🔥):

Links mentioned:


LangChain AI ▷ #langchain-templates (9 messages🔥):


LangChain AI ▷ #share-your-work (2 messages):

Links mentioned:


LangChain AI ▷ #tutorials (1 messages):

pradeep1148: https://www.youtube.com/watch?v=PtP8R8VjTGc


DiscoResearch ▷ #general (3 messages):

Links mentioned:

Evo: Long-context modeling from molecular to genome scale: no description found


DiscoResearch ▷ #embedding_dev (11 messages🔥):

Links mentioned:


DiscoResearch ▷ #discolm_german (3 messages):


LLM Perf Enthusiasts AI ▷ #claude (6 messages):


Datasette - LLM (@SimonW) ▷ #ai (5 messages):

Links mentioned:

Making my bookshelves clickable | James' Coffee Blog: no description found