All tags
Model: "mamba"
Test-Time Training, MobileLLM, Lilian Weng on Hallucination (Plus: Turbopuffer)
llama-2-7b codegeex4-all-9b mamba facebook-research meta-ai-fair tsinghua-university hallucination-detection anti-hallucination-methods on-device-ai model-architecture rnn long-context-modeling model-scaling expressive-hidden-states code-generation lilian-weng yann-lecun
Lilian Weng released a comprehensive literature review on hallucination detection and anti-hallucination methods including techniques like FactualityPrompt, SelfCheckGPT, and WebGPT. Facebook AI Research (FAIR) published MobileLLM, a sub-billion parameter on-device language model architecture achieving performance comparable to llama-2-7b with innovations like thin and deep models and shared weights. A new RNN-based LLM architecture with expressive hidden states was introduced, replacing attention mechanisms and scaling better than Mamba and Transformer models for long-context modeling. Additionally, Tsinghua University open sourced CodeGeeX4-ALL-9B, a multilingual code generation model excelling in code assistance.
Mamba-2: State Space Duality
mamba-2 mamba transformer++ llama-3-70b gpt-3 hugging-face state-space-models perplexity training-efficiency data-pruning benchmarking multimodality video-analysis _albertgu tri_dao arankomatsuzaki _akhaliq clementdelangue karpathy
Mamba-2, a new state space model (SSM), outperforms previous models like Mamba and Transformer++ in perplexity and wall-clock time, featuring 8x larger states and 50% faster training. It introduces the concept of state space duality (SSD) connecting SSMs and linear attention. The FineWeb-Edu dataset, a high-quality subset of the 15 trillion token FineWeb dataset, filtered using llama-3-70b for educational quality, enables better and faster LLM learning, potentially reducing tokens needed to surpass GPT-3 performance. Additionally, perplexity-based data pruning using a 125M parameter model improves downstream performance and reduces pretraining steps by up to 1.45x. The Video-MME benchmark evaluates multi-modal LLMs on video analysis across multiple visual domains and video lengths.
Adept Fuyu-Heavy: Multimodal model for Agents
fuyu-heavy fuyu-8b gemini-pro claude-2 gpt4v gemini-ultra deepseek-coder-33b yi-34b-200k goliath-120b mistral-7b-instruct-v0.2 mamba rwkv adept hugging-face deepseek mistral-ai nous-research multimodality visual-question-answering direct-preference-optimization benchmarking model-size-estimation quantization model-merging fine-tuning instruct-tuning rms-optimization heterogeneous-ai-architectures recurrent-llms contrastive-preference-optimization
Adept launched Fuyu-Heavy, a multimodal model focused on UI understanding and visual QA, outperforming Gemini Pro on the MMMU benchmark. The model uses DPO (Direct Preference Optimization), gaining attention as a leading tuning method. The size of Fuyu-Heavy is undisclosed but estimated between 20B-170B parameters, smaller than rumored frontier models like Claude 2, GPT4V, and Gemini Ultra. Meanwhile, Mamba was rejected at ICLR for quality concerns. In Discord discussions, DeepSeek Coder 33B was claimed to outperform GPT-4 in coding tasks, and deployment strategies for large models like Yi-34B-200K and Goliath-120B were explored. Quantization debates highlighted mixed views on Q8 and EXL2 quants. Fine-tuning and instruct-tuning of Mistral 7B Instruct v0.2 were discussed, alongside insights on RMS optimization and heterogeneous AI architectures combining Transformers and Selective SSM (Mamba). The potential of recurrent LLMs like RWKV and techniques like Contrastive Preference Optimization (CPO) were also noted.