Model: "gpt-3"

QwQ-32B claims to match DeepSeek R1-671B

qwen-2.5-plus qwq-32b deepseek-r1 gpt-4.5 gpt-3 davinci alibaba openai deepseek-ai reinforcement-learning math code-execution instruction-following alignment reasoning model-release model-benchmarking scaling performance inference-costs aidan_mclau sama scaling01 juberti polynoamial reach_vb

Alibaba Qwen released their QwQ-32B model, a 32 billion parameter reasoning model using a novel two-stage reinforcement learning approach: first scaling RL for math and coding tasks with accuracy verifiers and code execution servers, then applying RL for general capabilities like instruction following and alignment. Meanwhile, OpenAI rolled out GPT-4.5 to Plus users, with mixed feedback on coding performance and noted inference cost improvements. The QwQ model aims to compete with larger MoE models like DeepSeek-R1. "GPT-4.5 is unusable for coding" was a notable user critique, while others praised its reasoning improvements due to scaling pretraining.

Dec 21, 2024

o3 solves AIME, GPQA, Codeforces, makes 11 years of progress in ARC-AGI and 25% in FrontierMath

o3 o3-mini o1-mini gpt-3 gpt-4o o1 openai benchmarking math reasoning model-performance inference-speed cost-efficiency alignment safety-testing sama eric-wallace

OpenAI announced the o3 and o3-mini models with groundbreaking benchmark results, including a jump from 2% to 25% on the FrontierMath benchmark and 87.5% on the ARC-AGI reasoning benchmark, representing about 11 years of progress on the GPT3 to GPT4o scaling curve. The o1-mini model shows superior inference efficiency compared to o3-full, promising significant cost reductions on coding tasks. The announcement was accompanied by community discussions, safety testing applications, and detailed analyses. Sama highlighted the unusual cost-performance tradeoff, and Eric Wallace shared insights on the o-series deliberative alignment strategy.

Jun 03, 2024

Mamba-2: State Space Duality

mamba-2 mamba transformer++ llama-3-70b gpt-3 hugging-face state-space-models perplexity training-efficiency data-pruning benchmarking multimodality video-analysis _albertgu tri_dao arankomatsuzaki _akhaliq clementdelangue karpathy

Mamba-2, a new state space model (SSM), outperforms previous models like Mamba and Transformer++ in perplexity and wall-clock time, featuring 8x larger states and 50% faster training. It introduces the concept of state space duality (SSD) connecting SSMs and linear attention. The FineWeb-Edu dataset, a high-quality subset of the 15 trillion token FineWeb dataset, filtered using llama-3-70b for educational quality, enables better and faster LLM learning, potentially reducing tokens needed to surpass GPT-3 performance. Additionally, perplexity-based data pruning using a 125M parameter model improves downstream performance and reduces pretraining steps by up to 1.45x. The Video-MME benchmark evaluates multi-modal LLMs on video analysis across multiple visual domains and video lengths.

May 31, 2024

Contextual Position Encoding (CoPE)

cope gemini-1.5-flash gemini-1.5-pro claude gpt-3 meta-ai-fair google-deepmind anthropic perplexity-ai langchain openai positional-encoding transformers counting copying language-modeling coding external-memory tool-use model-evaluation inference-speed model-benchmarking scaling research-synthesis jason-weston alexandr-wang karpathy arav-srinivas

Meta AI researcher Jason Weston introduced CoPE, a novel positional encoding method for transformers that incorporates context to create learnable gates, enabling improved handling of counting and copying tasks and better performance on language modeling and coding. The approach can potentially be extended with external memory for gate calculation. Google DeepMind released Gemini 1.5 Flash and Pro models optimized for fast inference. Anthropic announced general availability of tool use for Claude, enhancing its ability to orchestrate tools for complex tasks. Alexandr Wang launched SEAL Leaderboards for private, expert evaluations of frontier models. Karpathy reflected on the 4th anniversary of GPT-3, emphasizing scaling and practical improvements. Perplexity AI launched Perplexity Pages to convert research into visually appealing articles, described as an "AI Wikipedia" by Arav Srinivas.

May 28, 2024

Life after DPO (RewardBench)

gpt-3 gpt-4 gpt-5 gpt-6 llama-3-8b llama-3 claude-3 gemini x-ai openai mistral-ai anthropic cohere meta-ai-fair hugging-face nvidia reinforcement-learning-from-human-feedback direct-preference-optimization reward-models rewardbench language-model-history model-evaluation alignment-research preference-datasets personalization transformer-architecture nathan-lambert chris-manning elon-musk bindureddy rohanpaul_ai nearcyan

xAI raised $6 billion at a $24 billion valuation, positioning it among the most highly valued AI startups, with expectations to fund GPT-5 and GPT-6 class models. The RewardBench tool, developed by Nathan Lambert, evaluates reward models (RMs) for language models, showing Cohere's RMs outperforming open-source alternatives. The discussion highlights the evolution of language models from Claude Shannon's 1948 model to GPT-3 and beyond, emphasizing the role of RLHF (Reinforcement Learning from Human Feedback) and the newer DPO (Direct Preference Optimization) method. Notably, some Llama 3 8B reward model-focused models are currently outperforming GPT-4, Cohere, Gemini, and Claude on the RewardBench leaderboard, raising questions about reward hacking. Future alignment research directions include improving preference datasets, DPO techniques, and personalization in language models. The report also compares xAI's valuation with OpenAI, Mistral AI, and Anthropic, noting speculation about xAI's spending on Nvidia hardware.

Dec 13, 2023

12/12/2023: Towards LangChain 0.1

mixtral-8x7b phi-2 gpt-3 chatgpt gpt-4 langchain mistral-ai anthropic openai microsoft mixture-of-experts information-leakage prompt-engineering oauth2 logo-generation education-ai gaming-ai api-access model-maintainability scalability

The Langchain rearchitecture has been completed, splitting the repo for better maintainability and scalability, while remaining backwards compatible. Mistral launched a new Discord community, and Anthropic is rumored to be raising another $3 billion. On the OpenAI Discord, discussions covered information leakage in AI training, mixture of experts (MoE) models like mixtral 8x7b, advanced prompt engineering techniques, and issues with ChatGPT performance and API access. Users also explored AI applications in logo generation, education, and gaming, and shared solutions for Oauth2 authentication problems. A new small language model named Phi-2 was mentioned from Microsoft.