Model: "gpt-3"
QwQ-32B claims to match DeepSeek R1-671B
qwen-2.5-plus qwq-32b deepseek-r1 gpt-4.5 gpt-3 davinci alibaba openai deepseek-ai reinforcement-learning math code-execution instruction-following alignment reasoning model-release model-benchmarking scaling performance inference-costs aidan_mclau sama scaling01 juberti polynoamial reach_vb
Alibaba Qwen released their QwQ-32B model, a 32 billion parameter reasoning model using a novel two-stage reinforcement learning approach: first scaling RL for math and coding tasks with accuracy verifiers and code execution servers, then applying RL for general capabilities like instruction following and alignment. Meanwhile, OpenAI rolled out GPT-4.5 to Plus users, with mixed feedback on coding performance and noted inference cost improvements. The QwQ model aims to compete with larger MoE models like DeepSeek-R1. "GPT-4.5 is unusable for coding" was a notable user critique, while others praised its reasoning improvements due to scaling pretraining.
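As a rough illustration of what an "accuracy verifier" for math tasks could look like, the sketch below gives a binary outcome reward by comparing a completion's final answer with a reference. The summary does not describe Qwen's actual verifier; the `\boxed{...}` answer convention, the exact-string comparison, and both function names are assumptions made for this example.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a completion, if any.

    Assumption: the model marks its final answer with \\boxed{...}.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def accuracy_reward(completion: str, reference: str) -> float:
    """Binary outcome reward: 1.0 if the final answer equals the reference."""
    answer = extract_final_answer(completion)
    if answer is None:
        # No parsable final answer counts as incorrect.
        return 0.0
    return 1.0 if answer == reference.strip() else 0.0
```

A production verifier would likely normalize equivalent forms (fractions, units, whitespace) before comparing; this sketch only checks exact string equality.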
o3 solves AIME, GPQA, Codeforces, makes 11 years of progress in ARC-AGI and 25% in FrontierMath
o3 o3-mini o1-mini gpt-3 gpt-4o o1 openai benchmarking math reasoning model-performance inference-speed cost-efficiency alignment safety-testing sama eric-wallace
OpenAI announced the o3 and o3-mini models with groundbreaking benchmark results, including a jump from 2% to 25% on the FrontierMath benchmark and 87.5% on the ARC-AGI reasoning benchmark, representing about 11 years of progress on the GPT-3 to GPT-4o scaling curve. The o1-mini model shows superior inference efficiency compared to o3-full, promising significant cost reductions on coding tasks. The announcement was accompanied by community discussions, safety testing applications, and detailed analyses. Sama highlighted the unusual cost-performance tradeoff, and Eric Wallace shared insights on the o-series deliberative alignment strategy.
Mamba-2: State Space Duality
mamba-2 mamba transformer++ llama-3-70b gpt-3 hugging-face state-space-models perplexity training-efficiency data-pruning benchmarking multimodality video-analysis _albertgu tri_dao arankomatsuzaki _akhaliq clementdelangue karpathy
Mamba-2, a new state space model (SSM), outperforms previous models like Mamba and Transformer++ in perplexity and wall-clock time, featuring 8x larger states and 50% faster training. It introduces the concept of state space duality (SSD) connecting SSMs and linear attention. The FineWeb-Edu dataset, a high-quality subset of the 15 trillion token FineWeb dataset, filtered using llama-3-70b for educational quality, enables better and faster LLM learning, potentially reducing tokens needed to surpass GPT-3 performance. Additionally, perplexity-based data pruning using a 125M parameter model improves downstream performance and reduces pretraining steps by up to 1.45x. The Video-MME benchmark evaluates multi-modal LLMs on video analysis across multiple visual domains and video lengths.
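Perplexity-based data pruning can be sketched as follows: score each document with a small reference model (125M parameters in the summary above), then keep only a fraction of the corpus ranked by perplexity. This is a minimal sketch, not the authors' pipeline. The per-token log-probabilities are assumed to be precomputed and non-empty, and whether to keep the high- or low-perplexity end is left as a parameter because the summary does not say which was used.

```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity is exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def prune_by_perplexity(
    docs: Dict[str, List[float]],
    keep_fraction: float,
    keep: str = "high",  # assumption: selection criterion is configurable
) -> List[str]:
    """Return the ids of the documents kept after pruning.

    docs maps a document id to its per-token log-probabilities under the
    small reference model.
    """
    ranked = sorted(
        docs,
        key=lambda doc_id: perplexity(docs[doc_id]),
        reverse=(keep == "high"),
    )
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:n_keep]
```

For example, `prune_by_perplexity(docs, keep_fraction=0.5)` keeps the half of the corpus that the reference model finds least predictable.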
Contextual Position Encoding (CoPE)
cope gemini-1.5-flash gemini-1.5-pro claude gpt-3 meta-ai-fair google-deepmind anthropic perplexity-ai langchain openai positional-encoding transformers counting copying language-modeling coding external-memory tool-use model-evaluation inference-speed model-benchmarking scaling research-synthesis jason-weston alexandr-wang karpathy arav-srinivas
Meta AI researcher Jason Weston introduced CoPE, a novel positional encoding method for transformers that incorporates context to create learnable gates, enabling improved handling of counting and copying tasks and better performance on language modeling and coding. The approach can potentially be extended with external memory for gate calculation. Google DeepMind released Gemini 1.5 Flash and Pro models optimized for fast inference. Anthropic announced general availability of tool use for Claude, enhancing its ability to orchestrate tools for complex tasks. Alexandr Wang launched SEAL Leaderboards for private, expert evaluations of frontier models. Karpathy reflected on the 4th anniversary of GPT-3, emphasizing scaling and practical improvements. Perplexity AI launched Perplexity Pages to convert research into visually appealing articles, described as an "AI Wikipedia" by Aravind Srinivas.
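The gating idea behind CoPE can be illustrated with a small sketch. It assumes the common reading of the method: each earlier token gets a gate computed from the current query and that token's key, a token's position is the sum of gates between it and the current token, and fractional positions are handled by interpolating between integer position embeddings. The function names and the plain-list representation are choices made for this example, not the reference implementation.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def contextual_positions(query: Vector, keys: List[Vector]) -> List[float]:
    """Fractional position of each key relative to the current (last) token.

    gate_j = sigmoid(query . key_j); position_j is the sum of the gates
    from token j up to and including the current token.
    """
    gates = [sigmoid(dot(query, key)) for key in keys]
    positions: List[float] = []
    running = 0.0
    for gate in reversed(gates):
        running += gate
        positions.append(running)
    positions.reverse()
    return positions


def interpolate_embedding(position: float, table: List[Vector]) -> Vector:
    """Linearly interpolate between the two nearest integer position embeddings."""
    position = min(position, len(table) - 1)  # clamp to the largest position
    low = math.floor(position)
    high = min(low + 1, len(table) - 1)
    frac = position - low
    return [(1 - frac) * lo + frac * hi for lo, hi in zip(table[low], table[high])]
```

Because the gates depend on content, a gate that opens only on, say, sentence-ending tokens would make the position count sentences instead of tokens, which is what helps with counting-style tasks.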
Life after DPO (RewardBench)
gpt-3 gpt-4 gpt-5 gpt-6 llama-3-8b llama-3 claude-3 gemini x-ai openai mistral-ai anthropic cohere meta-ai-fair hugging-face nvidia reinforcement-learning-from-human-feedback direct-preference-optimization reward-models rewardbench language-model-history model-evaluation alignment-research preference-datasets personalization transformer-architecture nathan-lambert chris-manning elon-musk bindureddy rohanpaul_ai nearcyan
xAI raised $6 billion at a $24 billion valuation, positioning it among the most highly valued AI startups, with expectations to fund GPT-5 and GPT-6 class models. The RewardBench tool, developed by Nathan Lambert, evaluates reward models (RMs) for language models, showing Cohere's RMs outperforming open-source alternatives. The discussion highlights the evolution of language models from Claude Shannon's 1948 model to GPT-3 and beyond, emphasizing the role of RLHF (Reinforcement Learning from Human Feedback) and the newer DPO (Direct Preference Optimization) method. Notably, some reward models built on Llama 3 8B are currently outperforming GPT-4, Cohere, Gemini, and Claude on the RewardBench leaderboard, raising questions about reward hacking. Future alignment research directions include improving preference datasets, DPO techniques, and personalization in language models. The report also compares xAI's valuation with OpenAI, Mistral AI, and Anthropic, noting speculation about xAI's spending on Nvidia hardware.
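Two of the ideas above can be written down compactly. The sketch below assumes a reward model is evaluated on pairs of a preferred ("chosen") and a dispreferred ("rejected") response and scored by how often it ranks the chosen one higher; it also shows the standard per-example DPO loss. The function names, the tie handling (ties count as misses), and the default `beta=0.1` are assumptions for illustration and are not taken from the RewardBench code.

```python
import math
from typing import List, Tuple


def pairwise_accuracy(scored_pairs: List[Tuple[float, float]]) -> float:
    """Fraction of (chosen_score, rejected_score) pairs where chosen wins."""
    wins = sum(1 for chosen, rejected in scored_pairs if chosen > rejected)
    return wins / len(scored_pairs)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumption: illustrative default
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * log-ratio margin)."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference model more than it raises the rejected one's, which is how DPO avoids training a separate reward model.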
12/12/2023: Towards LangChain 0.1
mixtral-8x7b phi-2 gpt-3 chatgpt gpt-4 langchain mistral-ai anthropic openai microsoft mixture-of-experts information-leakage prompt-engineering oauth2 logo-generation education-ai gaming-ai api-access model-maintainability scalability
The LangChain rearchitecture has been completed, splitting the repo for better maintainability and scalability while remaining backwards compatible. Mistral launched a new Discord community, and Anthropic is rumored to be raising another $3 billion. On the OpenAI Discord, discussions covered information leakage in AI training, mixture of experts (MoE) models like Mixtral 8x7B, advanced prompt engineering techniques, and issues with ChatGPT performance and API access. Users also explored AI applications in logo generation, education, and gaming, and shared solutions for OAuth2 authentication problems. Microsoft's new small language model, Phi-2, was also mentioned.