Topic: "quantization"
AI Engineer World's Fair Talks Day 1
gemini-2.5 gemma claude-code mistral cursor anthropic openai aie google-deepmind meta-ai-fair agent-based-architecture open-source model-memorization scaling-laws quantization mixture-of-experts language-model-memorization model-generalization langgraph model-architecture
Mistral launched Mistral Code, and Cursor released version 1.0. Anthropic improved its Claude Code plans, while OpenAI announced expanded connections for ChatGPT. The day was dominated by AIE keynotes and tracks including GraphRAG, RecSys, and Tiny Teams. On Reddit, Google open-sourced the DeepSearch stack for building AI agents with Gemini 2.5 and LangGraph, enabling flexible agent architectures and integration with local LLMs like Gemma. A new Meta paper analyzed language model memorization, showing GPT-style transformers store about 3.5–4 bits/parameter and exploring the transition from memorization to generalization, with implications for Mixture-of-Experts models and quantization effects.
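As a rough illustration of what 3.5–4 bits per parameter implies, the sketch below converts a parameter count into an approximate memorization budget in megabytes; the model sizes and the 3.6 bits/parameter midpoint are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope capacity implied by ~3.5-4 bits of memorization per parameter.
# Illustrative arithmetic only; model sizes below are examples, not from the Meta paper.
def memorization_capacity_mb(n_params: float, bits_per_param: float = 3.6) -> float:
    """Approximate raw memorization capacity in megabytes."""
    total_bits = n_params * bits_per_param
    return total_bits / 8 / 1e6  # bits -> bytes -> megabytes

for n in (1e9, 8e9, 70e9):
    print(f"{n / 1e9:.0f}B params ≈ {memorization_capacity_mb(n):,.0f} MB of raw capacity")
```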
not much happened today
deepseek-r1-0528 o3 gemini-2.5-pro claude-opus-4 deepseek_ai openai gemini meta-ai-fair anthropic x-ai ollama hugging-face alibaba bytedance xiaomi reasoning reinforcement-learning benchmarking quantization local-inference model-evaluation open-weights transparency post-training agentic-benchmarks long-context hallucination-detection teortaxestex wenfeng danielhanchen awnihannun reach_vb abacaj
DeepSeek R1-0528 release brings major improvements in reasoning, hallucination reduction, JSON output, and function calling, matching or surpassing closed models like OpenAI o3 and Gemini 2.5 Pro on benchmarks such as Artificial Analysis Intelligence Index, LiveBench, and GPQA Diamond. The model ranks #2 globally in open weights intelligence, surpassing Meta AI, Anthropic, and xAI. Open weights and technical transparency have fueled rapid adoption across platforms like Ollama and Hugging Face. Chinese AI labs including DeepSeek, Alibaba, ByteDance, and Xiaomi now match or surpass US labs in model releases and intelligence, driven by open weights strategies. Reinforcement learning post-training is critical for intelligence gains, mirroring trends seen at OpenAI. Optimized quantization techniques (1-bit, 4-bit) and local inference enable efficient experimentation on consumer hardware. New benchmarks like LisanBench test knowledge, planning, memory, and long-context reasoning, with OpenAI o3 and Claude Opus 4 leading. Discussions highlight concerns about benchmark contamination and overemphasis on RL-tuned gains.
DeepSeek-R1-0528 - Gemini 2.5 Pro-level model, SOTA Open Weights release
deepseek-r1-0528 gemini-2.5-pro qwen-3-8b qwen-3-235b deepseek-ai anthropic meta-ai-fair nvidia alibaba google-deepmind reinforcement-learning benchmarking model-performance open-weights reasoning quantization post-training model-comparison artificialanlys scaling01 cline reach_vb zizhpan andrewyng teortaxestex teknim1 lateinteraction abacaj cognitivecompai awnihannun
DeepSeek R1-0528 marks a significant upgrade, closing the gap with proprietary models like Gemini 2.5 Pro and surpassing models from Anthropic, Meta, NVIDIA, and Alibaba on several benchmarks. The Chinese open-weights model leads a number of AI benchmarks, with gains driven by reinforcement learning post-training rather than architecture changes, and it shows sharply increased reasoning-token usage (about 23K tokens per question). The China-US AI race intensifies as Chinese labs accelerate innovation through transparency and an open research culture. Key benchmarks include AIME 2024, LiveCodeBench, and GPQA Diamond.
Prime Intellect's INTELLECT-2 and PRIME-RL advance distributed reinforcement learning
intellect-2 dreamo qwen gemini-2.5-pro dynamic-byte-latent-transformer gen-4-references mistral-medium-3 le-chat-enterprise primeintellect bytedance qwen gemma meta-ai-fair runwayml mistral-ai google distributed-training reinforcement-learning gpu-clusters model-optimization quantization multimodality agentic-ai video-understanding fine-tuning _akhaliq reach_vb osanseviero aiatmeta c_valenzuelab lmarena_ai adcock_brett
Prime Intellect released INTELLECT-2, a decentralized GPU training and RL framework with a vision for distributed AI training overcoming colocation limits. ByteDance launched DreamO, a unified image customization model on Hugging Face. Qwen released models optimized for GPTQ, GGUF, and AWQ quantization. Gemma surpassed 150 million downloads on Hugging Face. Meta released weights for the Dynamic Byte Latent Transformer and the Collaborative Reasoner framework to improve language model efficiency and reasoning. RunwayML introduced Gen-4 References, a near-realtime model requiring no fine-tuning. Mistral AI released Mistral Medium 3, a strong multimodal model, and Le Chat Enterprise, an agentic AI assistant for business. Google updated Gemini 2.5 Pro Preview with video understanding and UI improvements. "Airbnb for spare GPUs from all over the world" highlights the ongoing challenges and potential of distributed GPU training.
not much happened today
qwen3-14b qwen3-32b qwen3-235b phi-4-reasoning o3-mini command-a gemini-2.5-pro o4-mini olm-o2-1b o3 alibaba together-ai scaling01 microsoft deepseek cohere google epoch-ai-research inception-labs openai allenai quantization fine-tuning reinforcement-learning benchmarking video-generation diffusion-models model-performance model-evaluation model-release text-generation cline _philschmid iscienceluvr alexalbert__ _lewtun teortaxestex sarahookr reach_vb
Alibaba's Qwen team released quantized versions of the Qwen3 family, including the 14B, 32B, and 235B parameter models, with Qwen3-235B showing promising coding capabilities. Microsoft launched Phi-4-reasoning, a 14B parameter model distilled from OpenAI's o3-mini, emphasizing supervised fine-tuning and reinforcement learning and outperforming larger models on some benchmarks. Cohere's Command A leads SQL performance on Bird Bench. Google introduced the TRAJAN eval for video generation temporal consistency and updated the Gemini OpenAI compatibility layer. Inception Labs launched a diffusion LLM API claiming 5x speed improvements over autoregressive models. Community rankings show OpenAI's o3 model debuting strongly in web app-building tasks. Other releases include AllenAI's OLMo2 1B and additional Phi-4 variants. "Qwen3-235B shows promise for coding" and "Phi-4-reasoning tech report emphasizes SFT gains" highlight key advancements.
not much happened today
deepseek-r1 gemma-3 gemma-3-27b openai nvidia deepseek hugging-face fp8 model-efficiency hardware-requirements quantization benchmarking model-deployment open-source sam-altman
DeepSeek R1 demonstrates significant efficiency using FP8 precision, outperforming Gemma 3 27B in benchmarks with a Chatbot Arena Elo Score of 1363 vs. 1338, requiring substantial hardware like 32 H100 GPUs and 2,560GB VRAM. OpenAI labels DeepSeek as "state-controlled" and calls for bans on "PRC-produced" models, sparking community backlash accusing OpenAI and Sam Altman of anti-competitive behavior. Discussions emphasize DeepSeek's openness and affordability compared to OpenAI, with users highlighting its local and Hugging Face deployment options. Meanwhile, Gemma 3 receives mixed community feedback on creativity and worldbuilding.
Gemma 3 beats DeepSeek V3 in Elo, 2.0 Flash beats GPT4o with Native Image Gen
gemma-3 gemini-1.5-pro gemini-2 o1-preview o3-mini-high deepseek-v3 claude-3.7-sonnet qwen-2.5-max google-deepmind openai multimodality multilinguality context-window quantization image-generation model-benchmarking model-performance vision reach_vb _philschmid danielhanchen lmarena_ai osanseviero
Google DeepMind launched the Gemma 3 family of models featuring a 128k context window, multimodal input (image and video), and multilingual support for 140+ languages. The Gemma 3-27B model ranks among the top open models on LMArena benchmarks, outperforming several competitors and matching Gemini-1.5-Pro on benchmarks. Additionally, Gemini 2 introduced Flash Native Image Generation with advanced image editing capabilities, a feature teased by OpenAI but not launched. The updates highlight significant advances in context length, multimodality, and model efficiency via quantization.
Project Stargate: $500b datacenter (1.7% of US GDP) and Gemini 2 Flash Thinking 2
gemini-2.0-flash deepseek-r1 qwen-32b openai softbank oracle arm microsoft nvidia huggingface deepseek-ai long-context quantization code-interpretation model-distillation open-source agi-research model-performance memory-optimization noam-shazeer liang-wenfeng
Project Stargate, a US "AI Manhattan project" led by OpenAI and SoftBank and supported by Oracle, Arm, Microsoft, and NVIDIA, was announced at a scale invoking comparisons to the original Manhattan Project, which cost roughly $35B inflation-adjusted. Despite Microsoft's reduced role as exclusive compute partner, the project is serious but not immediately practical. Meanwhile, Noam Shazeer revealed a second major update to Gemini 2.0 Flash Thinking, enabling a 1M-token long context that is usable immediately. Additionally, AI Studio introduced a new code interpreter feature. On Reddit, the DeepSeek R1 distillation into Qwen 32B was released for free on HuggingChat, sparking discussions on self-hosting, performance issues, and quantization techniques. DeepSeek's CEO Liang Wenfeng highlighted their focus on fundamental AGI research, efficient MLA architecture, and commitment to open-source development despite export restrictions, positioning DeepSeek as a potential alternative to closed-source AI trends.
not much happened to end the week
gemini deepseek-r1 o1 chatgpt gpt-4 claude-3.5-sonnet o1-preview o1-mini gpt4o qwq-32b google-deepmind deeplearningai amazon tesla x-ai alibaba ollama multimodality benchmarking quantization reinforcement-learning ai-safety translation reasoning interpretability model-comparison humor yoshua-bengio kevinweil ylecun
AI News for 11/29/2024-11/30/2024 covers key updates including the Gemini multimodal model advancing in musical structure understanding, a new quantized SWE-Bench for benchmarking at 1.3 bits per task, and the launch of the DeepSeek-R1 model focusing on transparent reasoning as an alternative to o1. The establishment of the 1st International Network of AI Safety Institutes highlights global collaboration on AI safety. Industry updates feature Amazon's Olympus AI model, Tesla's Optimus, and experiments with ChatGPT as a universal translator. Community reflections emphasize the impact of large language models on daily life and medical AI applications. Discussions include scaling sparse autoencoders to gpt-4 and the need for transparency in reasoning LLMs. The report also notes humor around ChatGPT's French nickname.
OLMo 2 - new SOTA Fully Open LLM
llama-3-1-8b olmo-2 qwen2-5-72b-instruct smolvlm tulu-3 ai2 huggingface intel reinforcement-learning quantization learning-rate-annealing ocr fine-tuning model-training vision
AI2 has updated OLMo-2 to roughly Llama 3.1 8B equivalent, training with 5T tokens and using learning rate annealing and new high-quality data (Dolmino). They credit Tülu 3 and its "Reinforcement Learning with Verifiable Rewards" approach. On Reddit, Qwen2.5-72B instruct model shows near lossless performance with AutoRound 4-bit quantization, available on HuggingFace in 4-bit and 2-bit versions, with discussions on MMLU benchmark and quantization-aware training. HuggingFace released SmolVLM, a 2B parameter vision-language model running efficiently on consumer GPUs, supporting fine-tuning on Google Colab and demonstrating strong OCR capabilities with adjustable resolution and quantization options.
Pixtral Large (124B) beats Llama 3.2 90B with updated Mistral Large 24.11
pixtral-large mistral-large-24.11 llama-3-2 qwen2.5-7b-instruct-abliterated-v2-gguf qwen2.5-32b-q3_k_m vllm llama-cpp exllamav2 tabbyapi mistral-ai sambanova nvidia multimodality vision model-updates chatbots inference gpu-optimization quantization performance concurrency kv-cache arthur-mensch
Mistral released Pixtral Large, a 124B multimodal model pairing a 1B-parameter vision encoder with an updated 123B-parameter Mistral Large 24.11, though the Large update itself lacks major new features. Pixtral Large outperforms Llama 3.2 90B on multimodal benchmarks despite having a smaller vision adapter. Mistral's Le Chat chatbot received comprehensive feature updates, reflecting a company focus on balancing product and research as noted by Arthur Mensch. SambaNova sponsors inference with their RDUs, offering faster AI model processing than GPUs. On Reddit, vLLM shows strong concurrency performance on an RTX 3090 GPU, with quantization challenges noted for the FP8 kv-cache but better results using llama.cpp with a Q8 kv-cache. Users discuss performance trade-offs between vLLM, exllamav2, and TabbyAPI for different model sizes and batching strategies.
Common Corpus: 2T Open Tokens with Provenance
qwen-2.5-coder claude-3.5-sonnet janusflow-1.3b ocronos-vintage pleais huggingface langchainai deepseek alibaba anthropic provenance ocr multilingual-datasets prompt-engineering multimodality image-generation code-generation quantization model-scaling inference-efficiency tim-dettmers tom-doerr omarsar0 swyx madiator reach_vb
Pleais via Huggingface released Common Corpus, the largest fully open multilingual dataset with over 2 trillion tokens including detailed provenance information. They also introduced OCRonos-Vintage, a 124M-parameter OCR correction model that efficiently fixes digitization errors on CPU and GPU, unlocking knowledge from PDFs. On AI tools, LangChainAI launched Prompt Canvas for collaborative prompt engineering, while DeepSeek released JanusFlow 1.3B, a unified multimodal LLM integrating autoregressive and rectified flow models for enhanced image understanding and generation. Alibaba Cloud announced Qwen2.5-Coder, a code-focused LLM with advanced coding capabilities, and Claude 3.5 Sonnet was highlighted for superior code generation. Discussions on quantization challenges and scaling laws for precision by Tim Dettmers and others emphasized the impact of low-precision training on model scalability and inference efficiency. "Scaling Laws for Precision" paper insights and alternative efficiency methods were also noted.
BitNet was a lie?
qwen-2.5-coder-32b-instruct gpt-4o llama-3 sambanova alibaba hugging-face quantization scaling-laws model-efficiency fine-tuning model-performance code-generation open-source unit-testing ci-cd tanishq-kumar tim-dettmers
Scaling laws for quantization have been mapped out by a group led by Chris Ré, analyzing over 465 pretraining runs and finding that benefits plateau around FP6 precision. Lead author Tanishq Kumar highlights that longer training and more data increase sensitivity to quantization, explaining challenges with models like Llama-3. Tim Dettmers, author of QLoRA, warns that the era of efficiency gains from low-precision quantization is ending, signaling a shift from scaling to optimizing existing resources. Additionally, Alibaba announced Qwen 2.5-Coder-32B-Instruct, which matches or surpasses GPT-4o on coding benchmarks, and open-source initiatives like DeepEval for LLM testing are gaining traction.
not much happened today
flux-schnell meta-ai-fair anthropic togethercompute hugging-face audio-generation quantization prompt-caching long-term-memory llm-serving-framework hallucination-detection ai-safety ai-governance geoffrey-hinton john-hopfield demis-hassabis rohanpaul_ai svpino hwchase17 shreyar philschmid mmitchell_ai bindureddy
Geoffrey Hinton and John Hopfield won the Nobel Prize in Physics for foundational work on neural networks linking AI and physics. Meta AI introduced a 13B parameter audio generation model as part of Meta Movie Gen for video-synced audio. Anthropic launched the Message Batches API enabling asynchronous processing of up to 10,000 queries at half the cost. Together Compute released Flux Schnell, a free model for 3 months. New techniques like PrefixQuant quantization and Prompt Caching for low-latency inference were highlighted by rohanpaul_ai. LangGraph added long-term memory support for persistent document storage. Hex-LLM framework was introduced for TPU-based low-cost, high-throughput LLM serving from Hugging Face models. Discussions on AI safety emphasized gender equality in science, and concerns about premature AI regulation by media and Hollywood were raised.
Not much technical happened today
whisper-v3-turbo llama-3 llamaindex openai poolside liquidai perplexity-ai meta-ai-fair cohere fujitsu mixture-of-experts context-windows model-optimization fine-tuning quantization model-training alignment synthetic-data model-architecture agentic-ai nick-turley arav-srinivas francois-fleuret finbarr-timbers lewtun francois-chollet jerry-j-liu mmitchell-ai jxnlco
OpenAI announced raising $6.6B in new funding at a $157B valuation, with ChatGPT reaching 250M weekly active users. Poolside raised $500M to advance AGI development. LiquidAI introduced three new MoE models (1B, 3B, 40B) with a 32k context window and efficient token handling. OpenAI released Whisper V3 Turbo, an open-source multilingual model with significant speed improvements. Meta AI FAIR is hiring research interns focusing on LLM reasoning, alignment, synthetic data, and novel architectures. Cohere partnered with Fujitsu to launch Takane, a custom Japanese model. Technical discussions included challenges in LoRA fine-tuning, float8 quantization in Keras, and new tools like create-llama for agent templates. Industry commentary raised concerns about AI development priorities and highlighted freelancing opportunities in AI.
Llama 3.2: On-device 1B/3B, and Multimodal 11B/90B (with AI2 Molmo kicker)
llama-3-2 llama-3-1 claude-3-haiku gpt-4o-mini molmo-72b molmo-7b gemma-2 phi-3-5 llama-3-2-vision llama-3-2-3b llama-3-2-20b meta-ai-fair ai2 qualcomm mediatek arm ollama together-ai fireworks-ai weights-biases cohere weaviate multimodality vision context-windows quantization model-release tokenization model-performance model-optimization rag model-training instruction-following mira-murati daniel-han
Meta released Llama 3.2 with new multimodal versions including 3B and 20B vision adapters on a frozen Llama 3.1, showing competitive performance against Claude Haiku and GPT-4o-mini. AI2 launched multimodal Molmo 72B and 7B models outperforming Llama 3.2 in vision tasks. Meta also introduced new 128k-context 1B and 3B models competing with Gemma 2 and Phi 3.5, with collaborations hinted with Qualcomm, Mediatek, and Arm for on-device AI; the 1B and 3B models were trained on roughly 9 trillion tokens. Partner launches include Ollama, Together AI offering free 11B model access, and Fireworks AI. Additionally, a new RAG++ course from Weights & Biases, Cohere, and Weaviate offers systematic evaluation and deployment guidance for retrieval-augmented generation systems based on extensive production experience.
Reflection 70B, by Matt from IT Department
llama-3.1-70b llama-3 claude-3.5-sonnet hyperwrite glaive fine-tuning chain-of-thought instruction-following synthetic-data quantization model-evaluation prompt-engineering matt-shumer sahil-chaudhary
The Reflection Tuning technique has been used by a two-person team from Hyperwrite and Glaive to finetune llama-3.1-70b, showing strong performance improvements with minimal synthetic data. The approach builds on the concept of adding "thinking" and "reflection" steps to model outputs, related to the Chain of Thought method. Despite some criticisms like contamination concerns, worse coding performance, and reliance on system prompts, the model has received positive reception and comparisons to claude-3.5-sonnet. The work highlights efficient instruction tuning and synthetic data generation for large models.
Summer of Code AI: $1.6b raised, 1 usable product
ltm-2 llama-3-1-405b gemini-advanced cognition poolside codeium magic google-deepmind nvidia google-cloud long-context model-efficiency custom-hardware cuda training-stack gpu-scaling neural-world-models diffusion-models quantization nat-friedman ben-chess rohan-paul
Code + AI is emphasized as a key modality in AI engineering, highlighting productivity and verifiability benefits. Recent major funding rounds include Cognition AI raising $175M, Poolside raising $400M, Codeium raising $150M, and Magic raising $320M. Magic announced their LTM-2 model with a 100 million token context window, claiming its sequence-dimension algorithm is roughly 1000x cheaper than Llama 3.1 405B's attention at that context length, with drastically lower memory requirements. Magic's stack is built from scratch with custom CUDA and no open-source foundations, is partnered with Google Cloud, and is powered by NVIDIA H100 and GB200 GPUs, aiming to scale to tens of thousands of GPUs. Google DeepMind revealed updates to Gemini Advanced with customizable expert "Gems." Neural game engines like GameNGen can run DOOM in a diffusion model trained on 0.9B frames. The content also references LLM quantization research by Rohan Paul.
Gemma 2 2B + Scope + Shield
gemma-2b gemma-2-9b gemma-2-27b llama-3-1-405b sam-2 gpt-3.5 vicuna alpacaeval g-eval google-deepmind anthropic meta-ai-fair openai perplexity-ai nvidia lmsys knowledge-distillation leaderboards model-interpretability finetuning harm-detection video-segmentation voice publishers-program robotics-data-scaling quantization llm-evaluation prompt-engineering
Gemma 2B, a 2 billion parameter model trained on 2 trillion tokens and distilled from a larger unnamed LLM, has been released by Google DeepMind and shows strong leaderboard performance despite weaknesses in math. The Gemma series, including 9B and 27B models, has gained popularity since its June release. The team also released 400 SAEs for interpretability, inspired by Anthropic's research. A finetuned classifier called ShieldGemma outperforms Meta's LlamaGuard in harm detection. Meanwhile, Meta AI announced Llama-3.1-405B reaching #3 on the Overall Arena leaderboard, and released SAM 2, a video and image segmentation model with significant speed improvements. OpenAI is rolling out an advanced Voice Mode to Plus users. Perplexity AI launched a Publishers Program with major media partners and a status page. NVIDIA introduced Project GR00T for scaling robot data using Apple Vision Pro and generative simulation. Interest in quantization for compressing LLMs is growing, and LLM-as-a-Judge implementations from Vicuna, AlpacaEval, and G-Eval highlight the effectiveness of simple prompts and domain-specific evaluation.
not much happened today
sam-2 gemini-1.5-pro chatgpt midjourney-v6.1 meta-ai-fair google-deepmind scale-ai apple canva hugging-face object-segmentation quantization web-development-framework adversarial-robustness on-device-ai open-source robotics voice vision jeremyphoward demis-hassabis ylecun maartengrootendorst jimfan
Meta released SAM 2, a unified model for real-time object segmentation with a new dataset 4.5x larger and 53x more annotated than previous ones. FastHTML, a new Python web framework by Jeremy Howard, enables easy creation and deployment of interactive web apps. Scale AI launched the SEAL Leaderboard on adversarial robustness, topped by Gemini 1.5 Pro from Google DeepMind. Apple published a technical report on their Intelligence Foundation Language Models for on-device and server use. Yann LeCun emphasized the importance of open source AI in an article co-authored with Martin Casado and Ion Stoica. Maarten Grootendorst's "Visual Guide to Quantization" on efficient LLM inference went viral. ChatGPT started rolling out advanced voice and vision-enabled modes to select users. Leonardo AI was acquired by Canva. Jim Fan shared insights on Project Groot augmenting human demonstration data for robotics. Midjourney v6.1 was released.
There's Ilya!
chameleon-7b chameleon-34b deepseek-coder-v2 gpt-4-turbo claude-3-opus voco-llama safe-superintelligence-inc openai anthropic meta deepseek google-deepmind parallel-decoding code-generation quantization training-dynamics vision benchmarks datasets image-captioning reasoning memory-optimization ilya-sutskever jan-leike ylecun akhaliq philschmid rohanpaul_ai mervenoyann fchollet
Ilya Sutskever has co-founded Safe Superintelligence Inc shortly after leaving OpenAI, while Jan Leike moved to Anthropic. Meta released new models including Chameleon 7B and 34B with mixed-modal input and unified token space quantization. DeepSeek-Coder-V2 shows code capabilities comparable to GPT-4 Turbo, supporting 338 programming languages and 128K context length. Consistency Large Language Models (CLLMs) enable parallel decoding generating multiple tokens per step. Grokked Transformers demonstrate reasoning through training dynamics affecting memory formation and generalization. VoCo-LLaMA compresses vision tokens with LLMs improving video temporal correlation understanding. The BigCodeBench benchmark evaluates LLMs on 1,140 coding tasks across 139 Python libraries, topped by DeepSeek-Coder-V2 and Claude 3 Opus. PixelProse is a large 16M image-caption dataset with reduced toxicity.
Talaria: Apple's new MLOps Superweapon
gemma mixtral phi dbrx apple google mistral-ai microsoft mosaic quantization on-device-ai adapter-models model-optimization model-latency lossless-quantization low-bit-palletization token-generation model-benchmarking human-evaluation craig-federighi andrej-karpathy
Apple Intelligence introduces a small (~3B parameter) on-device model and a larger server model running on Apple Silicon with Private Cloud Compute, aiming to surpass Google Gemma, Mistral Mixtral, Microsoft Phi, and Mosaic DBRX. The on-device model features a novel lossless quantization strategy using mixed 2-bit and 4-bit LoRA adapters averaging 3.5 bits-per-weight, enabling dynamic adapter hot-swapping and efficient memory management. Apple credits the Talaria tool for optimizing quantization and model latency, achieving time-to-first-token latency of about 0.6 ms per prompt token and a generation rate of 30 tokens per second on iPhone 15 Pro. Apple focuses on an "adapter for everything" strategy with initial deployment on SiriKit and App Intents. Performance benchmarks rely on human graders, emphasizing consumer-level adequacy over academic dominance. The Apple ML blog also mentions an Xcode code-focused model and a diffusion model for Genmoji.
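For intuition on the reported 3.5 bits-per-weight average, the snippet below solves for the split between 2-bit and 4-bit weight groups that would produce a given average; the resulting 25% figure is an inference from the average, not a number Apple has published.

```python
# Solve for the mix of 2-bit and 4-bit weight groups yielding a target average
# bits-per-weight. Illustrative only; Apple has not published the exact split.
def low_bit_fraction(avg_bpw: float, low_bits: int = 2, high_bits: int = 4) -> float:
    """Fraction f of weights at low_bits so that low_bits*f + high_bits*(1-f) == avg_bpw."""
    return (high_bits - avg_bpw) / (high_bits - low_bits)

print(low_bit_fraction(3.5))  # 0.25 -> roughly a quarter of weight groups at 2-bit
```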
GPT-4o: the new SOTA-EVERYTHING Frontier model (GPT4T version)
gpt-4o gpt-3.5 llama-3 openai hugging-face nous-research eleutherai hazyresearch real-time-reasoning coding-capabilities fine-tuning knowledge-distillation hardware-optimization quantization multimodality mixture-of-experts efficient-attention model-scaling depth-upscaling transformer-architecture gpu-optimization prompt-engineering
OpenAI launched GPT-4o, a frontier model supporting real-time reasoning across audio, vision, and text, now free for all ChatGPT users with enhanced coding capabilities and upcoming advanced voice and video features. Discussions cover open-source LLMs like Llama 3, fine-tuning techniques including knowledge distillation for GPT-3.5, and hardware optimization strategies such as quantization. Emerging architectures include multimodal integrations with ChatGPT voice and Open Interpreter API, Mixture of Experts models combining autoregressive and diffusion approaches, and novel designs like the YOCO architecture and ThunderKittens DSL for efficient GPU use. Research advances in efficient attention methods like Conv-Basis using FFT and model scaling techniques such as depth upscaling were also highlighted.
Quis promptum ipso promptiet?
llama-3-70b llama-3-120b llama-3 llama-cpp anthropic openai zoominfo neuralink prompt-engineering chain-of-thought rag quantization cuda-graphs gpu-optimization thought-controlled-devices modeling-consciousness conference sama gdb bindureddy svpino rohanpaul_ai alexalbert__ abacaj
Anthropic released upgrades to their Workbench Console, introducing new prompt engineering features like chain-of-thought reasoning and prompt generators that significantly reduce development time, exemplified by their customer Zoominfo. OpenAI teased a "magic" new development coming soon, speculated to be a new LLM replacing GPT-3.5 in the free tier or a search competitor. The open-source community highlighted Llama 3 70B as "game changing" with new quantized weights for Llama 3 120B and CUDA graph support for llama.cpp improving GPU performance. Neuralink demonstrated a thought-controlled mouse, sparking interest in modeling consciousness from brain signals. The ICLR 2024 conference is being held in Asia for the first time, generating excitement.
A quiet weekend
llama-3 dolphin-2.9 pixart-sigma llama-3-70b microsoft coca-cola uber lmsys nous-research mistral-ai ar-interfaces transformers algorithmic-tasks turing-test graph-algorithms embeddings generative-ai model-optimization llm-inference quantization model-deployment yann-lecun
Yann LeCun predicts a shift to AR interfaces with AI assistants in 10-15 years, moving away from smartphones. The Dolphin-2.9 model based on Llama-3 was released, addressing earlier quality issues. PixArt Sigma, a 0.6B parameter model, achieves Stable Diffusion 3.0 level performance with complete prompt adherence and local usability. Research shows transformers can use meaningless filler tokens for algorithmic tasks with dense supervision. AI-generated restaurant reviews can pass the Turing test, fooling humans and AI detectors. Uber uses graph algorithms and learned embeddings for ETA prediction. Coca-Cola and Microsoft announced a 5-year AI partnership to accelerate cloud and generative AI initiatives. The Llama-3 70B model can run on a single 4GB GPU using AirLLM optimization without quantization but is slow. Mistral.rs is introduced as a fast LLM inference platform with quantization and OpenAI API compatibility. Only 5% of LLM applications make it from prototype to production due to challenges, especially in enterprise settings. EXL2 and GGUF quantization methods for Llama models show similar perplexity at a given model size, with Llama-3 degrading more under quantization than Llama-2 relative to full precision.
Apple's OpenELM beats OLMo with 50% of its dataset, using DeLighT
openelm llama-3 llama-3-8b-instruct llama-3-70b apple meta-ai-fair google layer-wise-scaling context-length quantization ai-alignment open-source ai-regulation eric-schmidt sebastian-raschka
Apple advances its AI presence with the release of OpenELM, its first relatively open large language model available in sizes from 270M to 3B parameters, featuring a novel layer-wise scaling architecture inspired by the DeLighT paper. Meanwhile, Meta's LLaMA 3 family pushes context length boundaries with models supporting over 160K tokens and an 8B-Instruct model with 262K context length released on Hugging Face, alongside performance improvements in quantized versions. A new paper on AI alignment highlights KTO as the best-performing method, with sensitivity to training data volume noted. In AI ethics and regulation, former Google CEO Eric Schmidt warns about the risks of open-source AI empowering bad actors and geopolitical rivals, while a U.S. proposal aims to enforce "Know Your Customer" rules to end anonymous cloud usage.
Snowflake Arctic: Fully Open 10B+128x4B Dense-MoE Hybrid LLM
snowflake-arctic phi-3 llama-3-70b llama-3 stable-diffusion-3 sd3-turbo gpt-3.5-turbo snowflake databricks deepseek deepspeed nvidia stable-diffusion adobe apple llamaindex lmsys openai mixture-of-experts curriculum-learning model-release image-generation video-upscaling quantization inference-speed benchmarking model-comparison open-source on-device-ai
Snowflake Arctic is a notable new foundation language model released under Apache 2.0, claiming superiority over Databricks in data warehouse AI applications and adopting a mixture-of-experts architecture inspired by DeepSeekMOE and DeepSpeedMOE. The model employs a 3-stage curriculum training strategy similar to the recent Phi-3 paper. In AI image and video generation, Nvidia introduced the Align Your Steps technique improving image quality at low step counts, while Stable Diffusion 3 and SD3 Turbo models were compared for prompt understanding and image quality. Adobe launched an AI video upscaling project enhancing blurry videos to HD, though with some high-resolution artifacts. Apple released open-source on-device language models with code and training logs, diverging from typical weight-only releases. The Llama-3-70b model ties for first place on the LMSYS leaderboard for English queries, and Phi-3 (4B params) outperforms GPT-3.5 Turbo in the banana logic benchmark. Fast inference and quantization of Llama 3 models were demonstrated on MacBook devices.
Perplexity, the newest AI unicorn
llama-3-8b llama-3-70b llama-3 llava-llama-3-8b-v1_1 phi-3 gpt-3.5 perplexity-ai meta-ai-fair hugging-face groq context-length fine-tuning quantization instruction-following model-comparison multimodality benchmarking memory-optimization model-performance daniel-gross aravind-srinivas
Perplexity doubles its valuation shortly after its Series B with a Series B-1 funding round. Significant developments around Llama 3 include context length extension to 16K tokens, new multimodal LLaVA models outperforming Llama 2, and fine-tuning improvements like QDoRA surpassing QLoRA. The Llama-3-70B model is praised for instruction following and performance across quantization formats. Phi-3 models by Microsoft released in multiple sizes show competitive benchmark results, with the 14B model achieving 78% on MMLU and the 3.8B model nearing GPT-3.5 performance.
FineWeb: 15T Tokens, 12 years of CommonCrawl (deduped and filtered, you're welcome)
llama-3-70b llama-3 wizardlm-2-8x22b claude-opus mistral-8x7b gpt-4 huggingface meta-ai-fair dbrx reka-ai mistral-ai lmsys openai datasets benchmarking quantization zero-shot-learning reasoning code-error-detection token-generation security
2024 has seen a significant increase in dataset sizes for training large language models, with Redpajama 2 offering up to 30T tokens, DBRX at 12T tokens, Reka Core/Flash/Edge with 5T tokens, and Llama 3 trained on 15T tokens. Huggingface released an open dataset containing 15T tokens from 12 years of filtered CommonCrawl data, enabling training of models like Llama 3 if compute resources are available. On Reddit, WizardLM-2-8x22b outperformed other open LLMs including Llama-3-70b-instruct in reasoning and math benchmarks. Claude Opus demonstrated strong zero-shot code error spotting, surpassing Llama 3. Benchmarks revealed limitations in the LMSYS chatbot leaderboard due to instruction-tuned models gaming the system, and a new RAG benchmark showed Llama 3 70B underperforming compared to GPT-4, while Mistral 8x7B remained strong. Efficient quantized versions of Llama 3 models are available on Huggingface, with users reporting token generation limits around 9600 tokens on a 3090 GPU. Safety concerns include a UK sex offender being banned from using AI tools and GPT-4 demonstrating an 87% success rate at exploiting real vulnerabilities.
Anime pfp anon eclipses $10k A::B prompting challenge
command-r-plus-104b stable-diffusion-1.5 openai ollama huggingface quantization model-optimization streaming prompt-engineering self-prompting image-composition character-lora-training model-size open-source-licenses memes humor victor-taelin futuristfrog
Victor Taelin issued a $10k challenge to GPT models, initially achieving only 10% success with state-of-the-art models, but community efforts surpassed 90% success within 48 hours, highlighting GPT capabilities and common skill gaps. In Reddit AI communities, Command R Plus (104B) is running quantized on M2 Max hardware via Ollama and llama.cpp forks, with GGUF quantizations released on Huggingface. Streaming text-to-video generation is now available through the st2v GitHub repo. WD Tagger v3 was released for mass auto-captioning datasets with a WebUI. Lesser-known prompting techniques like self-tagging and generational frameworks produced thought-provoking outputs in OpenAI discussions, including experiments with self-evolving system prompts. Stable Diffusion users discussed image composition importance for training character LoRAs and best checkpoints for video game character generation. Discussions also covered scarcity of 5B parameter models and open(ish) licenses for open source AI. Memes included jokes about ChatGPT and Gemini training data differences.
ReALM: Reference Resolution As Language Modeling
flan-t5 gpt-4 apple openai hugging-face stability-ai reference-resolution finetuning quantization retrieval-augmented-generation open-source coding-agents podcast-generation image-generation ai-industry-trends takuto-takizawa
Apple is advancing in AI with a new approach called ReALM: Reference Resolution As Language Modeling, which improves understanding of ambiguous references using three contexts and finetunes a smaller FLAN-T5 model that outperforms GPT-4 on this task. In Reddit AI news, an open-source coding agent SWE-agent achieves 12.29% on the SWE-bench benchmark, and RAGFlow introduces a customizable retrieval-augmented generation engine. A new quantization method, QuaRot, enables efficient 4-bit inference. AI applications include a t-shirt design generator, podgenai for GPT-4 based podcast generation, and an open-source model from HuggingFace that runs without a GPU. Industry discussions focus on the impact of large language models on the AI field and efforts to decentralize AI development. Takuto Takizawa joins Stability AI Japan as Head of Sales & Partnerships.
Evals-based AI Engineering
jamba bamboo qwen-1.5-moe grok-1.5 llama2-7b openai mistral-ai x-ai llamaindex evaluation fine-tuning prompt-engineering voice-cloning quantization model-optimization code-generation context-windows hamel-husain alec-radford
Hamel Husain emphasizes the importance of comprehensive evals in AI product development, highlighting evaluation, debugging, and behavior change as key iterative steps. OpenAI released a voice engine demo showcasing advanced voice cloning from small samples, raising safety concerns. Reddit discussions introduced new models like Jamba (hybrid Transformer-SSM with MoE), Bamboo (7B LLM with high sparsity based on Mistral), Qwen1.5-MoE (efficient parameter activation), and Grok 1.5 (128k context length, surpassing GPT-4 in code generation). Advances in quantization include 1-bit Llama2-7B models outperforming full precision and the QLLM quantization toolbox supporting GPTQ/AWQ/HQQ methods.
Welcome /r/LocalLlama!
cerebrum-8x7b mixtral-7b gpt-3.5-turbo gemini-pro moistral-11b-v1 claude-opus qwen-vl-chat sakana openinterpreter reddit aether-research mistral-ai nvidia lmdeploy model-merging benchmarking quantization performance-optimization deployment vision fine-tuning training-data synthetic-data rag gui
Sakana released a paper on evolutionary model merging. OpenInterpreter launched their O1 devkit. Discussions highlight Claude Haiku's underrated performance with 10-shot examples. Coinciding with Reddit's IPO, AINews introduces Reddit summaries starting with /r/LocalLlama, with subreddits like r/machinelearning and r/openai coming next. Aether Research released Cerebrum 8x7b based on Mixtral, matching GPT-3.5 Turbo and Gemini Pro on reasoning tasks and setting a new open-source reasoning SOTA. Moistral 11B v1, a finetuned model from the Cream-Phi-2 creators, was released. A creative writing benchmark uses Claude Opus as judge. Hobbyists explore 1.58-bit BitNet ternary quantization and 1-bit LLM training. Nvidia's Blackwell (B200) chip supports FP4 precision quantization. LMDeploy v0.2.6+ enables efficient vision-language model deployment with models like Qwen-VL-Chat. Users seek GUIs for LLM APIs with plugin and RAG support. Pipelines for synthetic training data generation and fine-tuning language models for chat are discussed.
FSDP+QLoRA: the Answer to 70b-scale AI for desktop class GPUs
qlora fsdp inflection-2.5 gpt-4 answer.ai hugging-face meta-ai-fair nvidia inflectionai model-training quantization memory-optimization gradient-checkpointing cpu-offloading fine-tuning model-sharding reinforcement-learning chain-of-thought benchmarking jeremy_howard tim_dettmers yann_lecun
Jeremy Howard and collaborators released a new tool combining FSDP, QLoRA, and HQQ to enable training 70b-parameter models on affordable consumer GPUs like RTX 4090s with only 24GB of VRAM, overcoming traditional memory constraints that previously required data center GPUs costing over $150k. The approach shards quantized models across multiple GPUs and uses techniques like gradient checkpointing and CPU offloading to achieve efficient training on desktop-class hardware. The blogpost details the challenges and solutions of integrating these methods, highlighting a significant cost reduction from $150k to under $2.5k for training large language models. Additionally, Twitter recaps mention Inflection AI's Inflection-2.5 model rivaling GPT-4 in benchmarks with less compute, and Grok improving speed by 3x. Yann LeCun discusses multi-step reasoning training for LLMs.
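A minimal sketch of the QLoRA half of this recipe, assuming Hugging Face transformers, peft, and bitsandbytes: load the base model in 4-bit NF4, enable gradient checkpointing, and attach LoRA adapters. The multi-GPU FSDP sharding that Answer.AI added requires extra integration not shown here, and the model id and LoRA hyperparameters are illustrative.

```python
# QLoRA-style setup: 4-bit NF4 base model plus trainable LoRA adapters.
# Sketches only the QLoRA portion of the Answer.AI FSDP+QLoRA recipe; the
# FSDP sharding across GPUs uses additional custom wiring not shown here.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",          # illustrative base model
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()      # trade compute for memory, as in the post

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the LoRA weights are trained
```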
The Era of 1-bit LLMs
bitnet-b1.58 hugging-face quantization model-optimization energy-efficiency fine-tuning robotics multimodality ai-security ethics humor swyx levelsio gdb npew _akhaliq osanseviero mmitchell_ai deliprao nearcyan clementdelangue
The Era of 1-bit LLMs research, including the BitNet b1.58 model, introduces a ternary parameter approach that matches full-precision Transformer LLMs in performance while drastically reducing energy costs by 38x. This innovation promises new scaling laws and hardware designs optimized for 1-bit LLMs. Discussions on AI Twitter highlight advances in AGI societal impact, robotics with multimodal models, fine-tuning techniques like ResLoRA, and AI security efforts at Hugging Face. Ethical considerations in generative AI and humor within the AI community are also prominent topics.
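A minimal PyTorch sketch of the absmean ternary weight quantization described in the BitNet b1.58 paper: scale by the mean absolute weight, then round and clip to {-1, 0, +1}. Training details and the custom low-bit kernels that deliver the reported energy savings are out of scope here.

```python
# "Absmean" ternary quantization as described in BitNet b1.58:
# gamma = mean(|W|); W_q = round(W / gamma) clipped to {-1, 0, +1}.
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    gamma = w.abs().mean().clamp(min=eps)      # per-tensor scale
    w_q = (w / gamma).round().clamp_(-1, 1)    # ternary values {-1, 0, +1}
    return w_q, gamma                          # dequantize as w_q * gamma

w = torch.randn(4, 8)
w_q, gamma = absmean_ternary(w)
print(w_q)
```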
Welcome Interconnects and OpenRouter
mistral-large miqu mixtral gpt-4 mistral-7b mistral-ai openai perplexity-ai llamaindex qwen langchain model-comparison model-optimization quantization role-playing story-writing code-clarity ai-assisted-decompilation asynchronous-processing quantum-computing encoder-based-diffusion open-source hardware-experimentation rag-systems nathan-lambert alex-atallah
Analysis of 22 Discord guilds, 349 channels, and 12,885 messages revealed active discussions on model comparisons and optimizations involving Mistral AI, Miqu, and GGUF quantized models. Highlights include comparing Mistral Large with GPT-4, focusing on cost-effectiveness and performance, and exploring quantization techniques like GPTQ and QLoRA to reduce VRAM usage. Advanced applications such as role-playing, story-writing, code clarity, and AI-assisted decompilation were emphasized, alongside development of tools like an asynchronous summarization script for Mistral 7b. The intersection of quantum computing and AI was discussed, including DARPA-funded projects and encoder-based diffusion techniques for image processing. Community efforts featured new Spanish LLM announcements, hardware experimentation, and open-source initiatives, with platforms like Perplexity AI and LlamaIndex noted for innovation and integration. Speculation about Mistral AI's open-source commitment and tools like R2R for rapid RAG deployment highlighted the collaborative spirit.
Google AI: Win some (Gemma, 1.5 Pro), Lose some (Image gen)
gemma-2b gemma-7b gemma gemini-pro-1.5 llama-2 llama-3 mistral google hugging-face nvidia benchmarking license-policies image-generation video-understanding long-context dataset-editing model-integration gpu-hardware bug-fixes quantization
Google's Gemma open models (2B and 7B parameters) outperform Llama 2 and Mistral in benchmarks but face criticism for an unusual license and poor image generation quality, which Google partially acknowledges. The upcoming Gemini Pro 1.5 model features a 1 million token context window, excelling in video understanding and needle-in-haystack tasks. Discord communities like TheBloke and LM Studio discuss mixed reception of Gemma models, anticipation for the Llama 3 release, challenges in dataset editing, and hardware considerations such as NVIDIA GeForce RTX 3090 and RTX 4090 GPUs. LM Studio users report issues with version 0.2.15 Beta and ongoing integration of Gemma models, with resources shared on Hugging Face.
Karpathy emerges from stealth?
mistral-7b mixtral-8x7b zephyr-7b gpt-4 llama-2 intel mistral-ai audiogen thebloke tokenization quantization model-optimization fine-tuning model-merging computational-efficiency memory-optimization retrieval-augmented-generation multi-model-learning meta-reasoning dataset-sharing open-source ethical-ai community-collaboration andrej-karpathy
Andrej Karpathy released a comprehensive 2-hour tutorial on tokenization, detailing techniques up to GPT-4's tokenizer and noting the complexity of Llama 2 tokenization with SentencePiece. Discussions in AI Discord communities covered model optimization and efficiency, focusing on quantization of models like Mistral 7B and Zephyr-7B to reduce memory usage for consumer GPUs, including Intel's new weight-only quantization algorithm. Efforts to improve computational efficiency included selective augmentation reducing costs by 57.76% and memory token usage versus kNN for Transformers. Challenges in hardware compatibility and software issues were shared, alongside fine-tuning techniques such as LoRA and model merging. Innovative applications of LLMs in retrieval-augmented generation (RAG), multi-model learning, and meta-reasoning were explored. The community emphasized dataset sharing, open-source releases like SDXL VAE encoded datasets and Audiogen AI codecs, and ethical AI use with censorship and guardrails. Collaboration and resource sharing remain strong in these AI communities.
Companies liable for AI hallucination is Good Actually for AI Engineers
mistral-next large-world-model sora babilong air-canada huggingface mistral-ai quantization retrieval-augmented-generation fine-tuning cuda-optimization video-generation ai-ethics dataset-management open-source community-driven-development andrej-karpathy
Air Canada faced a legal ruling requiring it to honor refund policies communicated by its AI chatbot, setting a precedent for corporate liability in AI engineering accuracy. The tribunal ordered a refund of $650.88 CAD plus damages after the chatbot misled a customer about bereavement travel refunds. Meanwhile, AI community discussions highlighted innovations in quantization techniques for GPU inference, Retrieval-Augmented Generation (RAG) and fine-tuning of LLMs, and CUDA optimizations for PyTorch models. New prototype models like Mistral-Next and the Large World Model (LWM) were introduced, showcasing advances in handling large text contexts and video generation with models like Sora. Ethical and legal implications of AI autonomy were debated alongside challenges in dataset management. Community-driven projects such as the open-source TypeScript agent framework bazed-af emphasize collaborative AI development. Additionally, benchmarks like BABILong for up to 10M context evaluation and tools from karpathy were noted.
AI gets Memory
miqumaid-v2-70b mixtral-8x7b-qlora mistral-7b phi-2 medalpaca aya openai langchain thebloke cohere unsloth-ai mistral-ai microsoft rag memory-modeling context-windows open-source finetuning sequential-fine-tuning direct-preference-optimization rlhf ppo javascript-python-integration hardware-optimization gpu-overclocking quantization model-training large-context multilinguality joanne-jang
AI Discords analysis covered 20 guilds, 312 channels, and 6901 messages. The report highlights the divergence of RAG style operations for context and memory, with implementations like MemGPT rolling out in ChatGPT and LangChain. The TheBloke Discord discussed open-source large language models such as the Large World Model with contexts up to 1 million tokens, and the Cohere aya model supporting 101 languages. Roleplay-focused models like MiquMaid-v2-70B were noted for performance improvements with enhanced hardware. Finetuning techniques like Sequential Fine-Tuning (SFT) and Direct Preference Optimization (DPO) were explained, with tools like Unsloth AI's apply_chat_template preferred over Alpaca. Integration of JavaScript and Python via JSPyBridge in the SillyTavern project was also discussed. Training challenges with Mixtral 8x7b qlora versus Mistral 7b were noted. The LM Studio Discord focused on hardware limitations affecting large model loading, medical LLMs like medAlpaca, and hardware discussions around GPU upgrades and overclocking. Anticipation for IQ3_XSS 1.5 bit quantization support in LM Studio was expressed.
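For context on the chat-template workflow mentioned above, here is the generic Hugging Face equivalent using tokenizer.apply_chat_template; Unsloth's own helper may differ in signature, and the model id is illustrative.

```python
# Chat-template formatting via Hugging Face's tokenizer.apply_chat_template,
# shown as the generic equivalent of the Unsloth helper mentioned above
# (Unsloth's own API may differ). The model id is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [
    {"role": "user", "content": "Summarize Direct Preference Optimization in one line."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # model-specific chat markup instead of a hand-written Alpaca prompt
```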
The Dissection of Smaug (72B)
smaug-72b qwen-1.0 qwen-1.5 gpt-4 mistral-7b miqumaid wizardlm_evol_instruct_v2_196k openhermes-2.5 abacus-ai hugging-face nous-research laion thebloke lm-studio intel nvidia elevenlabs fine-tuning model-merging quantization web-ui model-conversion hardware-setup privacy image-generation optical-character-recognition prompt-engineering bindureddy
Abacus AI launched Smaug 72B, a large finetune of Qwen 1.0, which remains unchallenged on the Hugging Face Open LLM Leaderboard despite skepticism from Nous Research. LAION introduced a local voice assistant model named Bud-E with a notable demo. The TheBloke Discord community discussed model performance trade-offs between large models like GPT-4 and smaller quantized models, fine-tuning techniques using datasets like WizardLM_evol_instruct_V2_196k and OpenHermes-2.5, and challenges in web UI development and model merging involving Mistral-7b and MiquMaid. The LM Studio Discord highlighted issues with model conversion from PyTorch to gguf, hardware setups involving Intel Xeon CPUs and Nvidia P40 GPUs, privacy concerns, and limitations in image generation and web UI availability.
Qwen 1.5 Released
qwen-1.5 mistral-7b sparsetral-16x7b-v2 bagel-7b-v0.4 deepseek-math-7b-instruct deepseek qwen mistral-ai hugging-face meta-ai-fair quantization token-context multilinguality retrieval-augmented-generation agent-planning code-generation sparse-moe model-merging fine-tuning direct-preference-optimization character-generation ascii-art kanji-generation vr retinal-resolution light-field-passthrough frozen-networks normalization-layers
Chinese AI models Yi, Deepseek, and Qwen are gaining attention for strong performance, with Qwen 1.5 offering up to 32k token context and compatibility with Hugging Face transformers and quantized models. The TheBloke Discord discussed topics like quantization of a 70B LLM, the introduction of the Sparse MoE model Sparsetral based on Mistral, debates on merging vs fine-tuning, and Direct Preference Optimization (DPO) for character generation. The Nous Research AI Discord covered challenges in Japanese Kanji generation, AI scams on social media, and Meta's VR headset prototypes showcased at SIGGRAPH 2023. Discussions also included fine-tuning frozen networks and new models like bagel-7b-v0.4, DeepSeek-Math-7b-instruct, and Sparsetral-16x7B-v2.
Less Lazy AI
hamster-v0.2 flan-t5 miqu-1-120b-gguf qwen2 axolotl openai hugging-face nous-research h2oai apple model-merging fine-tuning quantization vram-optimization plugin-development chatbot-memory model-training bug-reporting api-compatibility philschmid
The AI Discord summaries for early 2024 cover various community discussions and developments. Highlights include 20 guilds, 308 channels, and 10449 messages analyzed, saving an estimated 780 minutes of reading time. Key topics include the Polymind Plugin Puzzle integrating the PubMed API, roleplay with HamSter v0.2, VRAM challenges in Axolotl training, fine-tuning tips for FLAN-T5, and innovative model merging strategies. The Nous Research AI community discussed GPT-4's lyricism issues, quantization techniques using llama.cpp, frankenmerging with models like miqu-1-120b-GGUF, anticipation for Qwen2, and tools like text-generation-webui and ExLlamaV2. The LM Studio community reported a bug where the app continues running after UI closure, with a workaround to forcibly terminate the process. These discussions reflect ongoing challenges and innovations in AI model training, deployment, and interaction.
The Core Skills of AI Engineering
miqumaid olmo aphrodite awq exl2 mistral-medium internlm ssd-1b lora qlora loftq ai2 hugging-face ai-engineering quantization fine-tuning open-source model-deployment data-quality tokenization prompt-adherence distillation ai-security batching hardware role-playing eugene-yan
AI Discords for 2/2/2024 analyzed 21 guilds, 312 channels, and 4782 messages saving an estimated 382 minutes of reading time. Discussions included Eugene Yan initiating a deep dive into AI engineering challenges, highlighting overlaps between software engineering and data science skills. The TheBloke Discord featured talks on MiquMaid, OLMo (an open-source 65B LLM by AI2 under Apache 2.0), Aphrodite model batching, AWQ quantization, and LoRA fine-tuning techniques like QLoRA and LoftQ. The LAION Discord discussed SSD-1B distillation issues, data quality optimization with captioning datasets like BLIP, COCO, and LLaVA, and tokenization strategies for prompt adherence in image generation. Other topics included AI security with watermarking, superconductors and carbon nanotubes for hardware, and deployment of LLMs via Hugging Face tools.
Trust in GPTs at all time low
llama-3 mistral-medium llava-1.6 miquella-120b-gguf tinymodels miqumaid harmony-4x7b-bf16 smaug-34b-v0.1 openai hugging-face mistral-ai nous-research bittensor context-management fine-tuning model-merging quantization gpu-servers visual-reasoning ocr dataset-release incentive-structures nick-dobos manojbh teknium arthurmensch
Discord communities were analyzed with 21 guilds, 312 channels, and 8530 messages reviewed, saving an estimated 628 minutes of reading time. Discussions highlighted challenges with GPTs and the GPT store, including critiques of the knowledge files capability and context management issues. The CUDA MODE Discord was introduced for CUDA coding support. Key conversations in the TheBloke Discord covered Xeon GPU server cost-effectiveness, Llama3 and Mistral Medium model comparisons, LLaVA-1.6's visual reasoning and OCR capabilities, and the leaked Miqu 70B model. Technical topics included fine-tuning TinyLlama and MiquMaid+Euryale models, and model merging with examples like Harmony-4x7B-bf16 and Smaug-34B-v0.1. The Nous Research AI Discord discussed style influence in LLMs, quantization issues, Bittensor incentives for AI model improvements, and the identification of MIQU as Mistral Medium. The release of the Open Hermes 2.5 dataset on Hugging Face was also announced. "Discussions pointed towards the need for better context management in GPTs, contrasting with OpenAI's no-code approach."
CodeLLama 70B beats GPT4 on HumanEval
codellama miqu mistral-medium llama-2-70b aphrodite-engine mixtral flatdolphinmaid noromaid rpcal chatml mistral-7b activation-beacon eagle-7b rwkv-v5 openhermes2.5 nous-hermes-2-mixtral-8x7b-dpo imp-v1-3b bakllava moondream qwen-vl meta-ai-fair ollama nous-research mistral-ai hugging-face ai-ethics alignment gpu-optimization direct-prompt-optimization fine-tuning cuda-programming optimizer-technology quantization multimodality context-length dense-retrieval retrieval-augmented-generation multilinguality model-performance open-source code-generation classification vision
Meta AI surprised the community with the release of CodeLlama 70B, an open-source model now available on platforms like Ollama and MLX for local use. The Miqu model sparked debate over its origins, possibly linked to Mistral Medium or a fine-tuned Llama-2-70b, alongside discussions on AI ethics and alignment risks. The Aphrodite engine showed strong performance on A6000 GPUs with specific configurations. Role-playing AI models such as Mixtral and Flatdolphinmaid faced challenges with repetitiveness, while Noromaid and Rpcal performed better, with ChatML and DPO recommended for improved responses. Learning resources like fast.ai's course were highlighted for ML/DL beginners, and fine-tuning techniques with optimizers like Paged 8bit Lion and Adafactor were discussed.
At Nous Research AI, the Activation Beacon project introduced a method for unlimited context length in LLMs using "global state" tokens, potentially transforming retrieval-augmented models. The Eagle-7B model, based on RWKV-v5, outperformed Mistral in benchmarks with efficiency and multilingual capabilities. OpenHermes2.5 was recommended for consumer hardware due to its quantization methods. Multimodal and domain-specific models like IMP v1-3b, Bakllava, Moondream, and Qwen-vl were explored for classification and vision-language tasks. The community emphasized centralizing AI resources for collaborative research.
RWKV "Eagle" v5: Your move, Mamba
rwkv-v5 mistral-7b miqu-1-70b mistral-medium llama-2 mistral-instruct-v0.2 mistral-tuna llama-2-13b kunoichi-dpo-v2-7b gpt-4 eleutherai mistral-ai hugging-face llamaindex nous-research rwkv lmsys fine-tuning multilinguality rotary-position-embedding model-optimization model-performance quantization speed-optimization prompt-engineering model-benchmarking reinforcement-learning andrej-karpathy
RWKV v5 Eagle was released with evaluation results surpassing Mistral 7B, trading some English performance for multilingual capabilities. The mysterious miqu-1-70b model sparked debate about its origins, possibly a leak or distillation of Mistral Medium or a fine-tuned Llama 2. Discussions highlighted fine-tuning techniques, including the effectiveness of 1,000 high-quality prompts over larger mixed-quality datasets, and tools like Deepspeed, Axolotl, and QLoRA. The Nous Research AI community emphasized the impact of Rotary Position Embedding (RoPE) theta settings on LLM extrapolation, improving models like Mistral Instruct v0.2. Speed improvements in Mistral Tuna kernels reduced token-processing costs. The launch of Eagle 7B with 7.52B parameters showcased strong multilingual performance, surpassing other 7B-class models.
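The RoPE theta discussion boils down to one base constant in the rotary embedding: a larger base slows the rotation frequencies, which is the knob behind most context-extension recipes. A minimal NumPy sketch of the mechanism (shapes and values are illustrative):

```python
# Rotary Position Embedding (RoPE): raising the theta base slows the per-position
# rotation, which is what long-context "RoPE theta" tuning adjusts.
import numpy as np

def rope_angles(positions: np.ndarray, head_dim: int, theta_base: float = 10_000.0) -> np.ndarray:
    """Rotation angles of shape (seq_len, head_dim // 2)."""
    inv_freq = 1.0 / (theta_base ** (np.arange(0, head_dim, 2) / head_dim))
    return np.outer(positions, inv_freq)

def apply_rope(x: np.ndarray, theta_base: float = 10_000.0) -> np.ndarray:
    """Rotate query/key vectors x of shape (seq_len, head_dim) pairwise."""
    seq_len, head_dim = x.shape
    angles = rope_angles(np.arange(seq_len), head_dim, theta_base)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# A larger base (e.g. 1e6 instead of 1e4) makes angles grow more slowly with position,
# so positions far beyond the training length still land in a familiar angular range.
q = np.random.randn(8, 64)
print(apply_rope(q, theta_base=10_000.0).shape, apply_rope(q, theta_base=1_000_000.0).shape)
```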
Adept Fuyu-Heavy: Multimodal model for Agents
fuyu-heavy fuyu-8b gemini-pro claude-2 gpt4v gemini-ultra deepseek-coder-33b yi-34b-200k goliath-120b mistral-7b-instruct-v0.2 mamba rwkv adept hugging-face deepseek mistral-ai nous-research multimodality visual-question-answering direct-preference-optimization benchmarking model-size-estimation quantization model-merging fine-tuning instruct-tuning rms-optimization heterogeneous-ai-architectures recurrent-llms contrastive-preference-optimization
Adept launched Fuyu-Heavy, a multimodal model focused on UI understanding and visual QA, outperforming Gemini Pro on the MMMU benchmark. The model uses Direct Preference Optimization (DPO), which is gaining attention as a leading tuning method. Fuyu-Heavy's size is undisclosed but estimated at between 20B and 170B parameters, smaller than rumored frontier models like Claude 2, GPT4V, and Gemini Ultra. Meanwhile, Mamba was rejected at ICLR over quality concerns. In Discord discussions, DeepSeek Coder 33B was claimed to outperform GPT-4 in coding tasks, and deployment strategies for large models like Yi-34B-200K and Goliath-120B were explored. Quantization debates highlighted mixed views on Q8 and EXL2 quants. Fine-tuning and instruct-tuning of Mistral 7B Instruct v0.2 were discussed, alongside insights on RMS optimization and heterogeneous AI architectures combining Transformers and Selective SSM (Mamba). The potential of recurrent LLMs like RWKV and techniques like Contrastive Preference Optimization (CPO) were also noted.
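DPO itself is compact: it trains the policy with a logistic loss over the policy-versus-reference log-probability margins of chosen and rejected responses. A minimal PyTorch sketch of that loss (the tensor values are toy numbers):

```python
# Direct Preference Optimization (DPO) loss on a batch of (chosen, rejected) pairs.
# Inputs are summed token log-probs of each response under the policy and the frozen
# reference model; beta controls how far the policy may drift from the reference.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Maximize the gap between the chosen and rejected margins.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy batch of four preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.1, -7.3]),
                torch.tensor([-14.2, -11.0, -25.4, -9.9]),
                torch.tensor([-13.0, -10.0, -21.0, -8.0]),
                torch.tensor([-13.5, -10.5, -24.0, -9.0]))
print(loss)
```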
RIP Latent Diffusion, Hello Hourglass Diffusion
gpt-4 latent-diffusion stable-diffusion meta-ai-fair openai hugging-face diffusion-models transformers image-generation model-efficiency fine-tuning quantization prompt-engineering roleplay training-optimization katherine-crowson lucidrains
Katherine Crowson from Stable Diffusion introduces a hierarchical pure transformer backbone for diffusion-based image generation that efficiently scales to megapixel resolutions with under 600 million parameters, improving upon the original ~900M parameter model. This architecture processes local and global image phenomena separately, enhancing efficiency and resolution without latent steps. Additionally, Meta's Self Rewarding LM paper has inspired lucidrains to begin an implementation. Discord summaries highlight GPT-4's robustness against quantification tricks, discussions on open-source GPT-0 alternatives, challenges in DPO training on limited VRAM with suggestions like QLoRA and rmsprop, and efforts to improve roleplay model consistency through fine-tuning and merging. Philosophical debates on AI sentience and GPT-4 customization for markdown and translation tasks were also noted.
Nightshade poisons AI art... kinda?
mistral-7b falcon-7b mistral-ai hugging-face mixture-of-experts gpu-parallelism quantization fine-tuning model-merging ai-detection role-playing benchmarking
Over the weekend of 1/19-20/2024, discussions in TheBloke Discord covered key topics including Mixture of Experts (MoE) model efficiency, GPU parallelism, and quantization strategies. Users debated the effectiveness of AI detection tools like GPTZero and explored fine-tuning challenges with models such as Mistral 7B and Falcon 7B. Community interest was strong in developing simpler, community-powered quantization services and understanding model merging techniques. Ethical considerations around AI applications like AI girlfriend sites were also discussed.
1/17/2024: Help crowdsource function calling datasets
mistral-7b dolphin-2.7-mixtral-8x7b mega-dolphin dolphin-2.6-mistral-7b-dpo llama-cpp lm-studio mistral-ai microsoft hugging-face apple function-calling quantization model-performance gpu-optimization model-selection closed-source memory-optimization linux-server api-fees headless-mode yagilb heyitsyorkie
LM Studio updated its FAQ to clarify that it is closed-source, remains free for personal use, and collects no data. The new beta release includes fixes and hints at upcoming 2-bit quantization support. For gaming, models like Dolphin 2.7 Mixtral 8x7B, MegaDolphin, and Dolphin 2.6 Mistral 7B DPO with Q4_K_M quantization were recommended. Discussions highlighted that a single powerful GPU outperforms multi-GPU setups due to bottlenecks, with older GPUs like the Tesla P40 being cost-effective. Microsoft's AutoGen Studio was introduced but has issues and requires API fees for open-source models. Linux users are advised to use llama.cpp over LM Studio due to the lack of a headless mode. Additional tools like LLMFarm for iOS and various Hugging Face repositories were also mentioned. "LM Studio must be running to use the local inference server as there is no headless mode available" and "matching model size to GPU memory is key for performance" were notable points.
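The "matching model size to GPU memory" rule of thumb is simple arithmetic over parameter count and bits per weight; a rough estimator is sketched below (the 1.2x overhead factor for KV cache and activations is a ballpark assumption):

```python
# Back-of-the-envelope memory estimate for holding model weights at a given bit width.
def weight_memory_gb(n_params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Approximate GiB for the weights, times a rough 1.2x fudge factor for
    KV cache and activations (the overhead value is an assumption)."""
    n_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return n_bytes * overhead / 1024**3

for bits in (16, 8, 4, 2):
    print(f"7B model @ {bits:>2}-bit ~ {weight_memory_gb(7, bits):5.1f} GiB")
# fp16 comes out around 16 GiB, a Q4_K_M-style 4-bit quant around 4 GiB, and a
# 2-bit quant around 2 GiB -- hence the interest in matching the quantization
# level to the GPU actually available.
```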
1/16/2024: ArtificialAnalysis - a new model/host benchmark site
mixtral hermes-2-mixtral openchat-7b byte-mistral nous-research nvidia hugging-face summarization fine-tuning byte-level-tokenization multimodality inference-speed-optimization dataset-sharing quantization swyx gabriel_syme manojbh carsonpoole fullstack6209
Artificial Analysis launched a new models-and-hosts comparison site, highlighted by swyx. The Nous Research AI Discord discussed innovative summarization techniques using NVIDIA 3090 and 2080 Ti GPUs to process around 100k tokens, and adapting prompts for smaller models like OpenChat 7B. The availability of Hermes 2 Mixtral on Hugging Face's HuggingChat was noted, alongside fine-tuning challenges with Mixtral using Axolotl. Discussions included byte-level tokenization experiments with Byte Mistral, multimodal training on COCO image bytes, and inference speed improvements using vLLM and llama.cpp. Calls for transparency in data sharing and open-sourcing the Hermes 2 Mixtral dataset were emphasized, with comparisons of DPO and SFT methods and quantized LLM use on an M1 MacBook Pro.
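Byte-level tokenization, as in the Byte Mistral experiments, replaces a learned vocabulary with raw UTF-8 bytes, so any text or binary payload maps to a 256-entry vocabulary; a minimal sketch of the encode/decode round trip:

```python
# Byte-level "tokenization": every UTF-8 byte is a token id in [0, 255], so any text
# (or raw image bytes, as in the COCO experiments) fits a 256-entry vocab.
def encode_bytes(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def decode_bytes(token_ids: list[int]) -> str:
    return bytes(token_ids).decode("utf-8", errors="replace")

ids = encode_bytes("Byte Mistral handles any script: こんにちは")
print(len(ids), "byte tokens; round-trip:", decode_bytes(ids))
# Trade-off: no out-of-vocabulary failures, but sequences are several times longer
# than with a subword tokenizer, which stresses context length and inference speed.
```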
1/16/2024: TIES-Merging
mixtral-8x7b nous-hermes-2 frankendpo-4x7b-bf16 thebloke hugging-face nous-research togethercompute oak-ridge-national-laboratory vast-ai runpod mixture-of-experts random-gate-routing quantization gptq exl2-quants reinforcement-learning-from-human-feedback supercomputing trillion-parameter-models ghost-attention model-fine-tuning reward-models sanjiwatsuki superking__ mrdragonfox _dampf kaltcit rombodawg technotech
TheBloke's Discord community actively discusses Mixture of Experts (MoE) models, focusing on random gate routing layers for training and the challenges of immediate model use. There is a robust debate on quantization methods, comparing GPTQ and EXL2 quants, with EXL2 noted for faster execution on specialized hardware. A new model, Nous Hermes 2, based on Mixtral 8x7B and trained with RLHF, claims benchmark superiority but shows some inconsistencies. The Frontier supercomputer at Oak Ridge National Laboratory is highlighted for training a trillion-parameter LLM with 14TB RAM, sparking discussions on open-sourcing government-funded AI research. Additionally, the application of ghost attention in the academicat model is explored, with mixed reactions from the community. "Random gate layer is good for training but not for immediate use," and "EXL2 might offer faster execution on specialized hardware," are key insights shared.
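The "random gate layer is good for training but not for immediate use" point concerns how an MoE layer assigns tokens to experts; a minimal PyTorch sketch contrasting random routing with learned top-2 gating (dimensions and expert count are illustrative):

```python
# Minimal Mixture-of-Experts layer: random routing vs. learned top-2 gating.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x, random_routing: bool = False):
        # x: (tokens, d_model)
        if random_routing:
            # Random gate: spreads tokens evenly so every expert gets gradient during
            # training, but the routing carries no information at inference time.
            logits = torch.rand(x.size(0), len(self.experts), device=x.device)
        else:
            logits = self.gate(x)                       # learned routing scores
        weights, idx = logits.topk(self.top_k, dim=-1)  # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

moe = TinyMoE()
tokens = torch.randn(10, 64)
print(moe(tokens, random_routing=True).shape, moe(tokens).shape)
```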
12/25/2023: Nous Hermes 2 Yi 34B for Christmas
nous-hermes-2 yi-34b nucleusx yayi-2 ferret teknim nous-research apple mixtral deepseek qwen huggingface wenge-technology quantization model-optimization throughput-metrics batch-processing parallel-decoding tensor-parallelization multimodality language-model-pretraining model-benchmarking teknium carsonpoole casper_ai pradeep1148 osanseviero metaldragon01
Teknium released Nous Hermes 2 on Yi 34B, positioning it as a top open model compared to Mixtral, DeepSeek, and Qwen. Apple introduced Ferret, a new open-source multimodal LLM. Discussions in the Nous Research AI Discord focused on AI model optimization and quantization techniques like AWQ, GPTQ, and AutoAWQ, with insights on proprietary optimization and throughput metrics. Additional highlights include the addition of the NucleusX model, a 30B model scoring 80 on MMLU, to transformers, and the YAYI 2 language model by Wenge Technology trained on 2.65 trillion tokens. "AutoAWQ outperforms vLLM up to batch size 8" was noted, and proprietary parallel decoding and tensor parallelization across GPUs were discussed for speed improvements.
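AWQ, GPTQ, and AutoAWQ all build on group-wise low-bit weight quantization; a minimal round-to-nearest sketch of that shared core (group size and bit width are illustrative, and the real methods add activation-aware scaling or rounding-error compensation on top):

```python
# Group-wise round-to-nearest weight quantization: the shared core that AWQ and GPTQ
# refine (AWQ rescales salient channels first, GPTQ compensates rounding error).
import torch

def quantize_groupwise(w: torch.Tensor, bits: int = 4, group_size: int = 128):
    """Quantize a (rows, cols) weight matrix in groups along the columns.
    Assumes cols is divisible by group_size."""
    qmax = 2 ** (bits - 1) - 1                               # e.g. 4-bit -> codes in [-8, 7]
    rows, cols = w.shape
    groups = w.reshape(rows, cols // group_size, group_size)
    scale = groups.abs().amax(dim=-1, keepdim=True) / qmax   # one scale per group
    q = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor, shape) -> torch.Tensor:
    return (q.float() * scale).reshape(shape)

w = torch.randn(256, 512)
q, s = quantize_groupwise(w)
print("mean abs error:", (w - dequantize(q, s, w.shape)).abs().mean().item())
```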
12/23/2023: NeurIPS Best Papers of 2023
gpt-4 palm2 hermes-2.5 mistral-7b nous-research hugging-face apple context-length malware-security video-content music-content linear-layers api-access large-language-models embedding vector-databases model-merging model-interpretability striped-hyena-architecture quantization rmsnorm attention-mechanisms
The Latent Space Pod released a 3-hour recap of the best NeurIPS 2023 papers. The Nous Research AI Discord community discussed optimizing AI performance with shorter context lengths, malware security concerns linked to HuggingFace, and shared insights on video and music content. Technical discussions included the DYAD research paper proposing a faster alternative to linear layers, Apple's ML Ferret machine learning tool, and accessing PALM2 via API. The community also explored Large Language Models focusing on specialized models, data scaling, embedding/vector databases, model merging, and interpretability, with mentions of Hermes 2.5, GPT-4, and Mistral. Additionally, there were conversations on the Striped Hyena Architecture, quantization challenges, and fixes related to RMSNorm and the "Attention is All You Need" paper.
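Since the RMSNorm fixes came up, here is the operation itself in a few lines of PyTorch: a standard formulation rather than the specific patch discussed.

```python
# RMSNorm: LayerNorm without mean-centering, normalizing by the root-mean-square.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel gain
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

norm = RMSNorm(4096)
print(norm(torch.randn(2, 16, 4096)).shape)  # shape is preserved
```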
12/11/2023: Mixtral beats GPT3.5 and Llama2-70B
mixtral-8x7b gpt-4 gpt-3.5-turbo llama-3 openhermes-2.5 llava-v1.5-13b-gptq mistral-ai openai huggingface sparse-mixture-of-experts fine-tuning quantization gpu-hardware transformers model-deployment open-source coding-datasets
Mistral AI announced the Mixtral 8x7B model featuring a Sparse Mixture of Experts (SMoE) architecture, sparking discussions on its potential to rival GPT-4. The community debated GPU hardware options for training and fine-tuning transformer models, including RTX 4070s, A4500, RTX 3090s with nvlink, and A100 GPUs. Interest was expressed in fine-tuning Mixtral and generating quantized versions, alongside curating high-quality coding datasets. Resources shared include a YouTube video on open-source model deployment, an Arxiv paper, GitHub repositories, and a blog post on Mixture-of-Experts. Discussions also touched on potential open-source releases of GPT-3.5 Turbo and llama-3, and running OpenHermes 2.5 on Mac M3 Pro with VRAM considerations.
12/9/2023: The Mixtral Rush
mixtral hermes-2.5 hermes-2 mistral-yarn ultrachat discoresearch fireworks-ai hugging-face mistral-ai benchmarking gpu-requirements multi-gpu quantization gptq chain-of-thought min-p-sampling top-p-sampling model-sampling model-merging model-performance small-models reasoning-consistency temperature-sampling bjoernp the_bloke rtyax kalomaze solbus calytrix
Mixtral's weights were released without code, prompting the Disco Research community and Fireworks AI to implement it rapidly. Despite efforts, no significant benchmark improvements were reported, limiting its usefulness for local LLM usage but marking progress for the small models community. Discussions in the DiscoResearch Discord covered Mixtral's performance compared to models like Hermes 2.5 and Hermes 2, with evaluations on benchmarks such as winogrande, truthfulqa_mc2, and arc_challenge. Technical topics included GPU requirements, multi-GPU setups, and quantization via GPTQ. Benchmarking strategies like grammar-based evaluation, chain of thought (CoT), and min_p sampling were explored, alongside model sampling techniques like Min P and Top P to enhance response stability and creativity. Users also discussed GPTs' learning limitations and the adaptability of models under varying conditions, emphasizing min_p sampling's role in enabling higher temperature settings for creativity.
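Min P and Top P differ in how they prune the next-token distribution before sampling: Top P keeps the smallest high-probability prefix, while Min P keeps everything above a fraction of the top token's probability, which is why it tolerates higher temperatures. A minimal sketch of both filters (the thresholds are illustrative):

```python
# Top-p (nucleus) vs. min-p filtering of next-token logits before sampling.
import torch

def top_p_filter(logits: torch.Tensor, top_p: float = 0.9) -> torch.Tensor:
    """Keep the smallest set of tokens whose probabilities sum to at least top_p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    keep_sorted = (cumulative - sorted_probs < top_p).float()  # always keeps the top token
    keep = torch.zeros_like(probs).scatter(-1, sorted_idx, keep_sorted)
    return logits.masked_fill(keep == 0, float("-inf"))

def min_p_filter(logits: torch.Tensor, min_p: float = 0.05) -> torch.Tensor:
    """Keep tokens whose probability is at least min_p * prob(top token); the cutoff
    scales with model confidence, which is what lets temperature go higher."""
    probs = torch.softmax(logits, dim=-1)
    threshold = min_p * probs.max(dim=-1, keepdim=True).values
    return logits.masked_fill(probs < threshold, float("-inf"))

# Toy next-token distribution over a 32k vocabulary, sampled at temperature 1.5.
logits = torch.randn(32000)
filtered = min_p_filter(logits) / 1.5
next_token = torch.multinomial(torch.softmax(filtered, dim=-1), num_samples=1)
print(next_token.item())
```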