Topic: "model-evaluation"
not much happened today
deepseek-r1-0528 o3 gemini-2.5-pro claude-opus-4 deepseek_ai openai gemini meta-ai-fair anthropic x-ai ollama hugging-face alibaba bytedance xiaomi reasoning reinforcement-learning benchmarking quantization local-inference model-evaluation open-weights transparency post-training agentic-benchmarks long-context hallucination-detection teortaxestex wenfeng danielhanchen awnihannun reach_vb abacaj
DeepSeek R1-0528 release brings major improvements in reasoning, hallucination reduction, JSON output, and function calling, matching or surpassing closed models like OpenAI o3 and Gemini 2.5 Pro on benchmarks such as Artificial Analysis Intelligence Index, LiveBench, and GPQA Diamond. The model ranks #2 globally in open weights intelligence, surpassing Meta AI, Anthropic, and xAI. Open weights and technical transparency have fueled rapid adoption across platforms like Ollama and Hugging Face. Chinese AI labs including DeepSeek, Alibaba, ByteDance, and Xiaomi now match or surpass US labs in model releases and intelligence, driven by open weights strategies. Reinforcement learning post-training is critical for intelligence gains, mirroring trends seen at OpenAI. Optimized quantization techniques (1-bit, 4-bit) and local inference enable efficient experimentation on consumer hardware. New benchmarks like LisanBench test knowledge, planning, memory, and long-context reasoning, with OpenAI o3 and Claude Opus 4 leading. Discussions highlight concerns about benchmark contamination and overemphasis on RL-tuned gains.
not much happened today
hunyuan-turbos qwen3-235b-a22b o3 gpt-4.1-nano grok-3 gemini-2.5-pro seed1.5-vl kling-2.0 tencent openai bytedance meta-ai-fair nvidia deepseek benchmarking model-performance moe reasoning vision video-understanding vision-language multimodality model-evaluation model-optimization lmarena_ai artificialanlys gdb _jasonwei iScienceLuvr _akhaliq _philschmid teortaxesTex mervenoyann reach_vb
Tencent's Hunyuan-Turbos has risen to #8 on the LMArena leaderboard, showing strong performance across major categories and significant improvement since February. The Qwen3 model family, especially the Qwen3 235B-A22B (Reasoning) model, is noted for its intelligence and efficient parameter usage. OpenAI introduced HealthBench, a new health evaluation benchmark developed with input from over 250 physicians, where models like o3, GPT-4.1 nano, and Grok 3 showed strong results. ByteDance released Seed1.5-VL, a vision-language model with a 532M-parameter vision encoder and a 20B active parameter MoE LLM, achieving state-of-the-art results on 38 public benchmarks. In vision-language, Kling 2.0 leads image-to-video generation, and Gemini 2.5 Pro excels in video understanding with advanced multimodal capabilities. Meta's Vision-Language-Action framework and updates on VLMs for 2025 were also highlighted.
not much happened today
qwen3-14b qwen3-32b qwen3-235b phi-4-reasoning o3-mini command-a gemini-2.5-pro o4-mini olm-o2-1b o3 alibaba together-ai scaling01 microsoft deepseek cohere google epoch-ai-research inception-labs openai allenai quantization fine-tuning reinforcement-learning benchmarking video-generation diffusion-models model-performance model-evaluation model-release text-generation cline _philschmid iscienceluvr alexalbert__ _lewtun teortaxestex sarahookr reach_vb
Qwen model family released quantized versions of Qwen3 models including 14B, 32B, and 235B parameters, with promising coding capabilities in Qwen3-235B. Microsoft launched Phi-4-reasoning, a 14B parameter model distilled from OpenAI's o3-mini, emphasizing supervised fine-tuning and reinforcement learning, outperforming larger models in some benchmarks. Cohere's Command A leads SQL performance on Bird Bench. Google introduced the TRAJAN eval for video generation temporal consistency and updated the Gemini OpenAI compatibility layer. Inception Labs launched a diffusion LLM API claiming 5x speed improvements over autoregressive models. Community rankings show OpenAI's o3 model debuting strongly in web app-building tasks. Other releases include AllenAI's OLMo2 1B and additional Phi 4 variants. "Qwen3-235B shows promise for coding" and "Phi-4-reasoning tech report emphasizes SFT gains" highlight key advancements.
not much happened today
phi-4 phi-4-mini-reasoning qwen3-235b qwen3-moe-235b qwen3-moe-30b qwen3-dense-32b qwen3-dense-14b qwen3-dense-8b qwen3-dense-4b qwen3-dense-0.6b qwen2.5-omni-3b deepseek-prover-v2 llama llama-guard-4 prompt-guard-2 mimo-7b microsoft anthropic cursor alibaba togethercompute deepseek meta-ai-fair xiaomi openrouterai cohere reasoning model-fine-tuning model-evaluation benchmarking model-popularity open-source math model-scaling model-filtering jailbreak-prevention cline reach_vb vipulved akhaliq omarsar0 zhs05232838 huajian_xin mervenoyann karpathy random_walker sarahookr blancheminerva clefourrier
Microsoft released Phi-4-reasoning, a finetuned 14B reasoning model that trails QwQ slightly and is limited by data transparency and token efficiency issues. Anthropic introduced remote MCP server support and a 45-minute Research mode in Claude. Cursor published a model popularity list. Alibaba launched Qwen3-235B and other Qwen3 variants, highlighting budget-friendly coding and reasoning capabilities, with availability on Together AI API. Microsoft also released Phi-4-Mini-Reasoning with benchmark performance on AIME 2025 and OmniMath. DeepSeek announced DeepSeek-Prover V2 with state-of-the-art math problem solving, scaling to 671B parameters. Meta AI's Llama models hit 1.2 billion downloads, with new Llama Guard 4 and Prompt Guard 2 for input/output filtering and jailbreak prevention. Xiaomi released the open-source reasoning model MiMo-7B trained on 25 trillion tokens. Discussions on AI model evaluation highlighted issues with the LMArena leaderboard, data access biases favoring proprietary models, and challenges in maintaining fair benchmarking, with suggestions for alternatives like OpenRouterAI rankings. "LMArena slop and biased" and "61.3% of all data going to proprietary model providers" were noted concerns.
Grok 3 & 3-mini now API Available
grok-3 grok-3-mini gemini-2.5-flash o3 o4-mini llama-4-maverick gemma-3-27b openai llamaindex google-deepmind epochairesearch goodfireai mechanize agent-development agent-communication cli-tools reinforcement-learning model-evaluation quantization-aware-training model-compression training-compute hybrid-reasoning model-benchmarking
The Grok 3 API is now available, including a smaller version called Grok 3 mini, which offers competitive pricing and full reasoning traces. OpenAI released a practical guide for building AI agents, while LlamaIndex supports the Agent2Agent protocol for multi-agent communication. Codex CLI is gaining traction with new features and competition from Aider and Claude Code. Google DeepMind launched Gemini 2.5 Flash, a hybrid reasoning model topping the Chatbot Arena leaderboard. OpenAI's o3 and o4-mini models show emergent behaviors from large-scale reinforcement learning. Epoch AI Research updated its methodology, removing Llama 4 Maverick from its high-FLOP models after the model's training compute estimate dropped. GoodfireAI announced a $50M Series A for its Ember neural programming platform. Mechanize was founded to build virtual work environments and automation benchmarks. Google DeepMind's Quantization-Aware Training for Gemma 3 models reduces model size significantly, with open source checkpoints available.
lots of little things happened this week
llama-3-3-nemotron-super-49b-v1 claude anthropic nvidia sakana-ai meta-ai-fair reinforcement-learning reasoning benchmarks multi-turn-collaboration instruction-following dataset-release model-evaluation percy-liang
Anthropic introduced a novel 'think' tool enhancing instruction adherence and multi-step problem solving in agents, with combined reasoning and tool use demonstrated by Claude. NVIDIA's Llama-3.3-Nemotron-Super-49B-v1 ranked #14 on LMArena, noted for strong math reasoning and a 15M post-training dataset. Sakana AI launched a Sudoku-based reasoning benchmark to advance AI problem-solving capabilities. Meta AI released SWEET-RL, a reinforcement learning algorithm improving long-horizon multi-turn tasks by 6%, and introduced CollaborativeAgentBench, a benchmark for collaborative LLM agents working with humans on programming and design tasks. Percy Liang relaunched the HELM benchmark with 5 challenging datasets evaluating 22 top language models.
OpenAI launches Operator, its first Agent
operator deepseek-r1 videollama-3 llama-4 o1 claude openai anthropic deepseek-ai google-deepmind perplexity-ai computer-using-agent reasoning multimodality performance-benchmarks open-source ai-safety benchmarking video-generation model-evaluation sam-altman swyx
OpenAI launched Operator, a premium computer-using agent for web tasks like booking and ordering, available now for Pro users in the US with an API promised. It features long horizon remote VMs up to 20 minutes and video export, showing state-of-the-art agent performance but not yet human-level. Anthropic had launched a similar agent 3 months earlier as an open source demo. DeepSeek AI unveiled DeepSeek R1, an open-source reasoning model excelling on the Humanity's Last Exam dataset, outperforming models like LLaMA 4 and OpenAI's o1. Google DeepMind open-sourced VideoLLaMA 3, a multimodal foundation model for image and video understanding. Perplexity AI released Perplexity Assistant for Android with reasoning and search capabilities. The Humanity's Last Exam dataset contains 3,000 questions testing AI reasoning, with current models scoring below 10% accuracy, indicating room for improvement. OpenAI's Computer-Using Agent (CUA) shows improved performance on OSWorld and WebArena benchmarks but still lags behind humans. Anthropic AI introduced Citations for safer AI responses. Sam Altman and Swyx commented on Operator's launch and capabilities.
not much happened today
deepseek-v3 chatgpt-4 openai deepseek google qwen overfitting reasoning misguided-attention model-evaluation model-architecture finetuning open-source sam-altman
Sam Altman publicly criticizes DeepSeek and Qwen models, sparking debate about OpenAI's innovation claims and reliance on foundational research like the Transformer architecture. DeepSeek V3 shows significant overfitting issues in the Misguided Attention evaluation, solving only 22% of test prompts, raising concerns about its reasoning and finetuning. Despite skepticism about its open-source status, DeepSeek V3 is claimed to surpass ChatGPT4 as an open-source model, marking a milestone 1.75 years after ChatGPT4's release on March 14, 2023. The discussions highlight competitive dynamics in AI model performance and innovation sustainability.
not much happened today
llama mistral openai decagon sierra togethercompute vertical-saas funding protein-structure-prediction lora self-supervised-learning model-optimization neural-architecture-search model-evaluation ethics transformers multi-agent-systems long-context mira-murati demis-hassabis clement-delangue john-o-whitaker yann-lecun francois-chollet ajeya-cotra rohan-paul adcock-brett
Vertical SaaS agents are gaining rapid consensus as the future of AI applications, highlighted by Decagon's $100m funding and Sierra's $4b round. OpenAI alumni are actively raising venture capital and forming new startups, intensifying competition in the AI market. Demis Hassabis celebrated the Nobel Prize recognition for AlphaFold2, a breakthrough in protein structure prediction. Advances in AI models include techniques like LoRA projectors and annealing on high-quality data, while discussions emphasize the need for high-bandwidth sensory inputs beyond language for common sense learning. New methods like LoLCATs aim to optimize transformer models such as Llama and Mistral for efficiency. Ethical concerns about AI agents performing harmful tasks remain under investigation. The AI community continues to explore model evaluation challenges and optimization frameworks like LPZero for neural architecture search.
not much happened today
llama-3 o1 deepseek-2.5 gpt-4 claude-3.5-sonnet 3dtopia-xl cogvideox anthropic meta-ai-fair openai deepseek-ai llamaindex langchainai retrieval-augmented-generation prompt-caching multimodality multi-agent-systems reasoning diffusion-models image-to-video prompting enterprise-ai agentic-ai long-context model-evaluation caching model-cost-efficiency
Anthropic introduced a RAG technique called Contextual Retrieval that reduces retrieval failure rates by 67% using prompt caching. Meta is teasing multimodal Llama 3 ahead of Meta Connect. OpenAI is hiring for a multi-agent research team focusing on improved AI reasoning with their o1 models, which have sparked mixed reactions. DeepSeek 2.5 is noted as a cost-effective alternative to GPT-4 and Claude 3.5 Sonnet. New models like 3DTopia-XL for 3D asset generation and CogVideoX for image-to-video conversion were highlighted. Techniques to boost reasoning by re-reading questions and combining retrieval with prompt caching were shared. Industry insights emphasize the necessity of AI adoption in enterprises and the disruption of traditional ML businesses. Tools like LangChainAI's LangGraph Templates and LlamaIndex's LlamaParse Premium enhance agentic applications and multimodal content extraction. Discussions on LLM evals and caching highlight production challenges and improvements. "Companies not allowing developers to use AI are unlikely to succeed" was a key sentiment.
AIPhone 16: the Visual Intelligence Phone
reflection-70b llama-3-70b qwen-2-72b llama-3-1-405b claude gpt-4 gemini apple openai weights-biases vision video-understanding benchmarking planning model-evaluation privacy ai-integration instruction-following yann-lecun
Apple announced the new iPhone 16 lineup featuring Visual Intelligence, a new AI capability integrated with Camera Control, Apple Maps, and Siri, emphasizing privacy and default service use over third-party AI like OpenAI. Apple Photos now includes advanced video understanding with timestamp recognition. Meanwhile, Reflection-70B claims to be a top open-source model but benchmarks show it performs close to Llama 3 70B and slightly worse than Qwen 2 72B. Yann LeCun highlighted ongoing challenges with LLM planning abilities, noting models like Llama-3.1-405b and Claude show some skill, while GPT-4 and Gemini lag behind. Weights & Biases is sponsoring an event to advance LLM evaluation techniques with prizes and API access.
Reflection 70B, by Matt from IT Department
llama-3.1-70b llama-3 claude-3.5-sonnet hyperwrite glaive fine-tuning chain-of-thought instruction-following synthetic-data quantization model-evaluation prompt-engineering matt-shumer sahil-chaudhary
Reflection Tuning technique has been used by a two-person team from Hyperwrite and Glaive to finetune llama-3.1-70b, showing strong performance improvements with minimal synthetic data. The approach builds on the concept of adding thinking and reflection steps to outputs, related to the Chain of Thought method. Despite some criticisms like contamination concerns, worse coding performance, and reliance on system prompts, the model has received positive reception and comparisons to claude-3.5-sonnet. The work highlights efficient instruction tuning and synthetic data generation for large models.
not much happened today
llama-3-1 claude-3-5-sonnet llama-3-1-405b ltm-2-mini qwen2-vl gpt-4o-mini meta-ai-fair hugging-face magic-ai-labs lmsys alibaba openai long-context style-control multimodality ai-safety model-evaluation web-crawling pdf-processing ai-hype-cycles call-center-automation sam-altman ajeya-cotra fchollet rohanpaul_ai philschmid
Meta announced significant adoption of LLaMA 3.1 with nearly 350 million downloads on Hugging Face. Magic AI Labs introduced LTM-2-Mini, a long context model with a 100 million token context window, and a new evaluation method called HashHop. LMSys added style control to their Chatbot Arena leaderboard, improving rankings for models like Claude 3.5 Sonnet and LLaMA 3.1 405B. Alibaba released Qwen2-VL, a multimodal LLM under Apache 2.0 license, competitive with GPT-4o mini. OpenAI CEO Sam Altman announced collaboration with the US AI Safety Institute for pre-release model testing. Discussions on AI safety and potential AI takeover risks were highlighted by Ajeya Cotra. Tools like firecrawl for web crawling and challenges in PDF processing were noted. AI hype cycles and market trends were discussed by François Chollet, and potential AI disruption in call centers was shared by Rohan Paul.
Problems with MMLU-Pro
mmlu-pro llama-3-8b-q8 gpt4all-3.0 chatgpt claude llama gemini mobilellm runway-gen-3-alpha meta-3d-gen huggingface meta-ai-fair salesforce runway nomic-ai pineapple argil-ai benchmarking prompt-engineering model-evaluation model-performance multimodality automated-dataset-generation video-generation open-source-models ai-assistants text-to-3d deepfake transformers reasoning wenhu-chen danhendrycks clementine ylecun adcock_brett svpino rohanpaul_ai
MMLU-Pro is gaining attention as the successor to MMLU on the Open LLM Leaderboard V2 by HuggingFace, despite community concerns about evaluation discrepancies and prompt sensitivity affecting model performance, notably a 10-point improvement in Llama-3-8b-q8 with simple prompt tweaks. Meta's MobileLLM research explores running sub-billion parameter LLMs on smartphones using shared weights and deeper architectures. Salesforce's APIGen introduces an automated dataset generation system for function-calling tasks outperforming larger models. Runway Gen-3 Alpha launches an AI video generator for paid users creating realistic 10-second clips. Nomic AI's GPT4All 3.0 offers an open-source desktop app supporting thousands of local models. AI assistants with multimodal capabilities and affordable access to multiple LLMs like ChatGPT, Claude, Llama, and Gemini are emerging. Meta 3D Gen advances text-to-3D asset generation, while Argil AI enables deepfake video creation from text threads. Research on transformer grokking and reasoning highlights advances in robust reasoning capabilities.
Qdrant's BM42: "Please don't trust us"
claude-3.5-sonnet gemma-2 nano-llava-1.5 qdrant cohere stripe anthropic hugging-face stablequan_ai semantic-search benchmarking dataset-quality model-evaluation model-optimization vision fine-tuning context-windows nils-reimers jeremyphoward hamelhusain rohanpaul_ai
Qdrant attempted to replace BM25 and SPLADE with a new method called "BM42" combining transformer attention and collection-wide statistics for semantic and keyword search, but their evaluation using the Quora dataset was flawed. Nils Reimers from Cohere reran BM42 on better datasets and found it underperformed. Qdrant acknowledged the errors but still ran a suboptimal BM25 implementation. This highlights the importance of dataset choice and evaluation sanity checks in search model claims. Additionally, Stripe faced criticism for AI/ML model failures causing account and payment issues, prompting calls for alternatives. Anthropic revealed that Claude 3.5 Sonnet suppresses some answer parts with backend tags, sparking debate. Gemma 2 model optimizations allow 2x faster fine-tuning with 63% less memory and longer context windows, running up to 34B parameters on consumer GPUs. nanoLLaVA-1.5 was announced as a compact 1B parameter vision model with significant improvements.
The Last Hurrah of Stable Diffusion?
llama-3-8b llama-3 qwen-2 gpt-4 gpt-4o stability-ai togethercompute model-architecture fine-tuning benchmarks dataset-release model-evaluation reasoning model-training retrieval-augmented-generation multimodality emad-mostaque rohanpaul_ai fchollet mikeknoop micahgoldblum teknium1 rasbt percyliang
Stability AI launched Stable Diffusion 3 Medium with models ranging from 450M to 8B parameters, featuring the MMDiT architecture and T5 text encoder for image text rendering. The community has shown mixed reactions following the departure of key researchers like Emad Mostaque. On AI models, Llama 3 8B Instruct shows strong evaluation correlation with GPT-4, while Qwen 2 Instruct surpasses Llama 3 on MMLU benchmarks. The Mixture of Agents (MoA) framework outperforms GPT-4o on AlpacaEval 2.0. Techniques like Spectrum and QLoRA enable efficient fine-tuning with less VRAM. Research on grokking reveals transformers can transition from memorization to generalization through extended training. Benchmark initiatives include the $1M ARC Prize Challenge for AGI progress and LiveBench, a live LLM benchmark to prevent dataset contamination. The Character Codex Dataset offers open data on over 15,000 characters for RAG and synthetic data. The MLX 0.2 tool enhances LLM experience on Apple Silicon Macs with improved UI and faster retrieval-augmented generation.
Contextual Position Encoding (CoPE)
cope gemini-1.5-flash gemini-1.5-pro claude gpt-3 meta-ai-fair google-deepmind anthropic perplexity-ai langchain openai positional-encoding transformers counting copying language-modeling coding external-memory tool-use model-evaluation inference-speed model-benchmarking scaling research-synthesis jason-weston alexandr-wang karpathy arav-srinivas
Meta AI researcher Jason Weston introduced CoPE, a novel positional encoding method for transformers that incorporates context to create learnable gates, enabling improved handling of counting and copying tasks and better performance on language modeling and coding. The approach can potentially be extended with external memory for gate calculation. Google DeepMind released Gemini 1.5 Flash and Pro models optimized for fast inference. Anthropic announced general availability of tool use for Claude, enhancing its ability to orchestrate tools for complex tasks. Alexandr Wang launched SEAL Leaderboards for private, expert evaluations of frontier models. Karpathy reflected on the 4th anniversary of GPT-3, emphasizing scaling and practical improvements. Perplexity AI launched Perplexity Pages to convert research into visually appealing articles, described as an "AI Wikipedia" by Arav Srinivas.
Life after DPO (RewardBench)
gpt-3 gpt-4 gpt-5 gpt-6 llama-3-8b llama-3 claude-3 gemini x-ai openai mistral-ai anthropic cohere meta-ai-fair hugging-face nvidia reinforcement-learning-from-human-feedback direct-preference-optimization reward-models rewardbench language-model-history model-evaluation alignment-research preference-datasets personalization transformer-architecture nathan-lambert chris-manning elon-musk bindureddy rohanpaul_ai nearcyan
xAI raised $6 billion at a $24 billion valuation, positioning it among the most highly valued AI startups, with expectations to fund GPT-5 and GPT-6 class models. The RewardBench tool, developed by Nathan Lambert, evaluates reward models (RMs) for language models, showing Cohere's RMs outperforming open-source alternatives. The discussion highlights the evolution of language models from Claude Shannon's 1948 model to GPT-3 and beyond, emphasizing the role of RLHF (Reinforcement Learning from Human Feedback) and the newer DPO (Direct Preference Optimization) method. Notably, some Llama 3 8B reward model-focused models are currently outperforming GPT-4, Cohere, Gemini, and Claude on the RewardBench leaderboard, raising questions about reward hacking. Future alignment research directions include improving preference datasets, DPO techniques, and personalization in language models. The report also compares xAI's valuation with OpenAI, Mistral AI, and Anthropic, noting speculation about xAI's spending on Nvidia hardware.
Ten Commandments for Deploying Fine-Tuned Models
claude-3-opus claude-3 gpt-4o anthropic google openai fine-tuning prompt-engineering model-evaluation feature-alteration benchmarking model-performance open-source-models kyle-corbitt bindureddy alexalbert__
Gemini-in-Google-Slides is highlighted as a useful tool for summarizing presentations. Kyle Corbitt's talk on deploying fine-tuned models in production emphasizes avoiding fine-tuning unless necessary, focusing on prompting, data quality, appropriate model choice, and thorough evaluation. Anthropic showcased feature alteration in Claude AI, demonstrating control over model behavior and increased understanding of large language models. Open-source models like GPT-4o are approaching closed-source performance on benchmarks like MMLU for simple tasks, though advanced models remain necessary for complex automation.
Zero to GPT in 1 Year
gpt-4-turbo claude-3-opus mixtral-8x22b zephyr-141b medical-mt5 openai anthropic mistral-ai langchain hugging-face fine-tuning multilinguality tool-integration transformers model-evaluation open-source-models multimodal-llms natural-language-processing ocr model-training vik-paruchuri sam-altman greg-brockman miranda-murati abacaj mbusigin akhaliq clementdelangue
GPT-4 Turbo reclaimed the top leaderboard spot with significant improvements in coding, multilingual, and English-only tasks, now rolled out in paid ChatGPT. Despite this, Claude Opus remains superior in creativity and intelligence. Mistral AI released powerful open-source models like Mixtral-8x22B and Zephyr 141B suited for fine-tuning. LangChain enhanced tool integration across models, and Hugging Face introduced Transformers.js for running transformers in browsers. Medical domain-focused Medical mT5 was shared as an open-source multilingual text-to-text model. The community also highlighted research on LLMs as regressors and shared practical advice on OCR/PDF data modeling from Vik Paruchuri's journey.
Claude 3 is officially America's Next Top Model
claude-3-opus claude-3-sonnet claude-3-haiku gpt-4o-mini mistral-7b qwen-72b anthropic mistral-ai huggingface openrouter stable-diffusion automatic1111 comfyui fine-tuning model-merging alignment ai-ethics benchmarking model-performance long-context cost-efficiency model-evaluation mark_riedl ethanjperez stuhlmueller ylecun aravsrinivas
Claude 3 Opus outperforms GPT4T and Mistral Large in blind Elo rankings, with Claude 3 Haiku marking a new cost-performance frontier. Fine-tuning techniques like QLoRA on Mistral 7B and evolutionary model merging on HuggingFace models are highlighted. Public opinion shows strong opposition to ASI development. Research supervision opportunities in AI alignment are announced. The Stable Diffusion 3 (SD3) release raises workflow concerns for tools like ComfyUI and automatic1111. Opus shows a 5% performance dip on OpenRouter compared to the Anthropic API. A new benchmark stresses LLM recall at long contexts, with Mistral 7B struggling and Qwen 72b performing well.
Claude 3 just destroyed GPT 4 (see for yourself)
claude-3 claude-3-opus claude-3-sonnet claude-3-haiku gpt-4 anthropic amazon google claude-ai multimodality vision long-context model-alignment model-evaluation synthetic-data structured-output instruction-following model-speed cost-efficiency benchmarking safety mmitchell connor-leahy
Claude 3 from Anthropic launches in three sizes: Haiku (small, unreleased), Sonnet (medium, default on claude.ai, AWS, and GCP), and Opus (large, on Claude Pro). Opus outperforms GPT-4 on key benchmarks like GPQA, impressing benchmark authors. All models support multimodality with advanced vision capabilities, including converting a 2-hour video into a blog post. Claude 3 offers improved alignment, fewer refusals, and extended context length up to 1 million tokens with near-perfect recall. Haiku is noted for speed and cost-efficiency, processing dense research papers in under three seconds. The models excel at following complex instructions and producing structured outputs like JSON. Safety improvements reduce refusal rates, though some criticism remains from experts. Claude 3 is trained on synthetic data and shows strong domain-specific evaluation results in finance, medicine, and philosophy.
12/10/2023: not much happened today
mixtral-8x7b-32kseqlen mistral-7b stablelm-zephyr-3b openhermes-2.5-neural-chat-v3-3-slerp gpt-3.5 gpt-4 nous-research openai mistral-ai hugging-face ollama lm-studio fine-tuning mixture-of-experts model-benchmarking inference-optimization model-evaluation open-source decentralized-ai gpu-optimization community-engagement andrej-karpathy yann-lecun richard-blythman gabriel-syme pradeep1148 cyborg_1552
Nous Research AI Discord community discussed attending NeurIPS and organizing future AI events in Australia. Highlights include interest in open-source and decentralized AI projects, with Richard Blythman seeking co-founders. Users shared projects like Photo GPT AI and introduced StableLM Zephyr 3B. The Mixtral model, based on Mistral, sparked debate on performance and GPU requirements, with comparisons to GPT-3.5 and potential competitiveness with GPT-4 after fine-tuning. Tools like Tensorboard, Wandb, and Llamahub were noted for fine-tuning and evaluation. Discussions covered Mixture of Experts (MoE) architectures, fine-tuning with limited data, and inference optimization strategies for ChatGPT. Memes and community interactions referenced AI figures like Andrej Karpathy and Yann LeCun. The community also shared resources such as GitHub links and YouTube videos related to these models and tools.
12/8/2023 - Mamba v Mistral v Hyena
mistral-8x7b-moe mamba-3b stripedhyena-7b claude-2.1 gemini gpt-4 dialogrpt-human-vs-machine cybertron-7b-v2-gguf falcon-180b mistral-ai togethercompute stanford anthropic google hugging-face mixture-of-experts attention-mechanisms prompt-engineering alignment image-training model-deployment gpu-requirements cpu-performance model-inference long-context model-evaluation open-source chatbots andrej-karpathy tri-dao maxwellandrews raddka
Three new AI models are highlighted: Mistral's 8x7B MoE model (Mixtral), Mamba models up to 3B by Together, and StripedHyena 7B, a competitive subquadratic attention model from Stanford's Hazy Research. Discussions on Anthropic's Claude 2.1 focus on its prompting technique and alignment challenges. The Gemini AI from Google is noted as potentially superior to GPT-4. The community also explores Dreambooth for image training and shares resources like the DialogRPT-human-vs-machine model on Hugging Face. Deployment challenges for large language models, including CPU performance and GPU requirements, are discussed with references to Falcon 180B and transformer batching techniques. User engagement includes meme sharing and humor.
Is Google's Gemini... legit?
gemini gemini-pro gemini-ultra gpt-4 gpt-3.5 claude-2.1 palm2 google openai chain-of-thought context-windows prompt-engineering model-evaluation multimodality speech-processing chatbot-errors subscription-management swyx
Google's Gemini AI model is generating significant discussion and skepticism, especially regarding its 32-shot chain of thought MMLU claim and 32k context window. The community is comparing Gemini's performance and capabilities with OpenAI's GPT-4 and GPT-3.5, highlighting the upcoming Gemini Pro and Gemini Ultra models on the Bard platform. Users report various OpenAI service issues including chatbot errors and subscription problems. Discussions also cover prompt engineering techniques, AI model evaluation comparing GPT-4, Claude 2.1, and PaLM2, and improvements in speech and multimodal capabilities. The bot now supports reading and summarizing links from platforms like arXiv, Twitter, and YouTube, enhancing user interaction.