Topic: "gpu-training"
The Ultra-Scale Playbook: Training LLMs on GPU Clusters
deepseek-native-sparse-attention r1-1776 paligemma-2-mix muse baichuan-m1-14b stripedhyena-2 huggingface deepseek perplexity-ai google-deepmind microsoft baichuan stripedhyena gpu-training scaling multimodality vision model-training foundation-models medical-llm genome-modeling robotic-manipulation interactive-content eliebakouch nouamanetazi lvwerra thom-wolf proftomyeh alex-wang aravsrinivas _akhaliq _philschmid mervenoyann reach_vb arankomatsuzaki maximelabonne
Huggingface released "The Ultra-Scale Playbook: Training LLMs on GPU Clusters," an interactive blogpost based on 4000 scaling experiments on up to 512 GPUs, providing detailed insights into modern GPU training strategies. DeepSeek introduced the Native Sparse Attention (NSA) model, gaining significant community attention, while Perplexity AI launched R1-1776, an uncensored and unbiased version of DeepSeek's R1 model. Google DeepMind unveiled PaliGemma 2 Mix, a multi-task vision-language model available in 3B, 10B, and 28B sizes. Microsoft introduced Muse, a generative AI model trained on the game Bleeding Edge, and presented Magma, a foundation model for multimodal AI agents excelling in UI navigation and robotic manipulation. Baichuan-M1-14B was announced as a state-of-the-art medical LLM trained on 20T tokens, and a fully open-source 40B genome modeling model using StripedHyena 2 architecture was also released. "Making your own gaming experience is coming sooner than you'd think," noted in relation to Muse.
12/13/2023: SOLAR-10.7B upstages Mistral-7B?
solar-10.7b llama-2 mistral-7b phi-2 gpt-4 gemini upstage nous-research openai mistral-ai microsoft depth-up-scaling pretraining synthetic-data gpu-training api-usage model-integration agi asi chat-models vision model-performance fine-tuning
Upstage released the SOLAR-10.7B model, which applies a novel Depth Up-Scaling technique to the llama-2 architecture, initializes from mistral-7b weights, and then undergoes continued pre-training; the Nous community finds it promising but not exceptional. Additionally, weights for the phi-2 base model were released, trained on 1.4 trillion tokens including synthetic texts created by GPT-3 and filtered by GPT-4, using 96 A100 GPUs over 14 days. On OpenAI's Discord, users discussed challenges with various GPT models, including incoherent outputs, API usage limitations, and issues with the GPT-4 Vision API. Conversations also covered understanding AGI and ASI, concerns about OpenAI's partnership with Axel Springer, and pricing changes for GPT Plus. Discussions also touched on the Gemini chat model integrated into Bard and comparisons of its performance with GPT-4.
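For readers unfamiliar with Depth Up-Scaling, the sketch below shows the layer-stacking step under stated assumptions: a 32-layer Llama/Mistral-style base loaded via transformers, with 8 overlapping layers dropped from each copy. The checkpoint name and module paths are assumptions for illustration; this is not Upstage's actual code.

```python
# Rough sketch of the Depth Up-Scaling (DUS) layer-stacking step: duplicate a
# 32-layer decoder stack, drop the last 8 layers from one copy and the first 8
# from the other, and concatenate the two 24-layer halves into a 48-layer model.
# Checkpoint name and module paths are assumptions for a Llama/Mistral-style
# model in transformers; this is an illustration, not Upstage's actual code.
import copy
import torch.nn as nn
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
layers = base.model.layers          # nn.ModuleList of 32 decoder layers
m = 8                               # overlapping layers removed from each copy

top = [copy.deepcopy(layer) for layer in layers[: len(layers) - m]]  # layers 0..23
bottom = [copy.deepcopy(layer) for layer in layers[m:]]              # layers 8..31

# Deep copies keep the two halves from sharing parameters in the overlapping
# region (layers 8..23 appear in both halves).
base.model.layers = nn.ModuleList(top + bottom)
base.config.num_hidden_layers = len(base.model.layers)   # now 48
# In a real implementation, per-layer bookkeeping (e.g. attention layer_idx for
# KV caching) would also need to be updated before continued pre-training.
```

Stacking is what pushes the parameter count to roughly 10.7B; the continued pre-training mentioned above is then needed so the duplicated blocks diverge and recover performance.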