Model: "llama-3-8b"
not much happened today
gpt-4.5 claude-3.7-sonnet deepseek-r1 smolagents-codeagent gpt-4o llama-3-8b tinyr1-32b-preview r1-searcher forgetting-transformer nanomoe openai deepseek hugging-face mixture-of-experts reinforcement-learning kv-cache-compression agentic-ai model-distillation attention-mechanisms model-compression minimax model-pretraining andrej-karpathy cwolferesearch aymericroucher teortaxestex jonathanross321 akhaliq
The AI news recap highlights several key developments: nanoMoE, a PyTorch implementation of a mid-sized Mixture-of-Experts (MoE) model inspired by Andrej Karpathy's nanoGPT, enables pretraining on commodity hardware within a week. An agentic leaderboard ranks LLMs powering smolagents CodeAgent, with GPT-4.5 leading, followed by Claude-3.7-Sonnet. Discussions around DeepSeek-R1 emphasize AI model commoditization, with DeepSeek dubbed the "OpenAI of China." Q-Filters offer a training-free method for KV cache compression in autoregressive models, achieving 32x compression with minimal perplexity loss. The PokéChamp minimax language agent, powered by GPT-4o and Llama-3-8b, demonstrates strong performance in Pokémon battles. Other notable models include TinyR1-32B-Preview with Branch-Merge Distillation, R1-Searcher incentivizing search capability via reinforcement learning, and the Forgetting Transformer using a Forget Gate in softmax attention. These advancements reflect ongoing innovation in model architectures, compression, reinforcement learning, and agentic AI.
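To make the Mixture-of-Experts idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It is not nanoMoE's code: the recap does not describe that implementation, so the top-2 choice, the weight renormalization, and the names `route_top_k` and `moe_forward` are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: the selected weights are renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs for a scalar input x."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Hypothetical toy experts; a real model would use small feed-forward networks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
picked = route_top_k([0.1, 2.0, -1.0, 2.0], k=2)
output = moe_forward(3.0, [0.1, 2.0, -1.0, 2.0], experts, k=2)
```

Only the selected experts run for a given token, which is why an MoE model can hold many more parameters than it uses per forward pass.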
LLaDA: Large Language Diffusion Models
llada-8b llama-3-8b step-video-t2v-30b step-audio-chat-132b llama-2-7b stepfun-ai scale-ai cambridge llamaindex diffusion-models text-generation multimodality video-generation voice-processing benchmarking instruction-following model-scaling gpu-usage long-context multi-turn-dialogue arankomatsuzaki _akhaliq omarsar0 iscienceluvr gallabytes maximelabonne reach_vb
LLaDA (Large Language Diffusion Model) 8B is a diffusion-based language model that rivals Llama 3 8B while training on 7x fewer tokens (2 trillion tokens) and using 0.13 million H800 GPU hours. It takes a different approach to text generation, predicting uniformly masked tokens in a diffusion process, and supports multi-turn dialogue and instruction-following. Separately, StepFun AI released two major models: Step-Video-T2V 30B, a text-to-video model generating up to 204 frames with high coherence and motion quality, and Step-Audio-Chat 132B, a voice-to-voice model. Two challenging multimodal benchmarks, Scale AI's EnigmaEval and Cambridge's ZeroBench, on which current frontier models score zero, underline the difficulty of these tasks. The community also noted the return of diffusion models in language modeling, a previously speculative architecture now scaled successfully.
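The masking step behind "predicting uniformly masked tokens" can be sketched as follows. This is a rough illustration under stated assumptions, not LLaDA's training code: it assumes a masking ratio `t` drawn uniformly from [0, 1) and applied independently to each token, and the `[MASK]` symbol and function names are hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def mask_tokens(tokens, t, rng):
    """Independently replace each token with MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def training_example(tokens, rng):
    """Build one noisy training example.

    Returns the sampled ratio, the masked sequence, and the positions
    the model would be asked to predict (the masked ones).
    """
    t = rng.random()  # assumption: ratio sampled uniformly
    noisy = mask_tokens(tokens, t, rng)
    targets = [i for i, tok in enumerate(noisy) if tok == MASK]
    return t, noisy, targets


rng = random.Random(0)
sentence = ["diffusion", "models", "can", "generate", "text"]
t, noisy, targets = training_example(sentence, rng)
```

Generation would run this in reverse: start from a fully masked sequence and fill in tokens over several steps, rather than emitting them strictly left to right.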
not much happened this weekend
claude-3.5-sonnet llama-3 llama-3-8b notebookllama min-omni-2 moondream openai anthropic hugging-face mistral-ai google-deepmind langchain deepmind microsoft pattern-recognition reinforcement-learning prompt-optimization text-to-speech model-optimization tensor-parallelism hyperparameters multimodal modal-alignment multimodal-fine-tuning ai-productivity privacy generative-ai rag retrieval-augmentation enterprise-text-to-sql amanda-askell philschmid stasbekman francois-fleuret mervenoyann reach_vb dzhng aravsrinivas sama lateinteraction andrew-y-ng bindureddy jerryjliu0
Moondream, a 1.6B-parameter vision language model, secured seed funding, highlighting a trend in moon-themed tiny models alongside Moonshine (a 27-61M-parameter ASR model). Claude 3.5 Sonnet was used for AI Twitter recaps. Discussions included pattern recognition vs. intelligence in LLMs, reinforcement learning for prompt optimization, and NotebookLlama, an open-source NotebookLM variant using Llama models for tasks like text-to-speech. Advances in model optimization were noted, including async-TP in PyTorch for tensor parallelism and hyperparameter tuning. Mini-Omni 2 demonstrated multimodal capabilities across image, audio, and text for voice conversations, with emphasis on modal alignment and multimodal fine-tuning. AI productivity tools like an AI email writer and LlamaCloud-based research assistants were introduced. Practical skill development and privacy-conscious AI tool usage with Llama3-8B were also emphasized. Generative AI resources such as #AIPythonforBeginners and GenAI Agents with LangGraph were shared. Business insights covered rapid execution in AI product development and emerging AI-related job roles. Challenges in enterprise-grade text-to-SQL and advanced retrieval methods were discussed, with tutorials on RAG applications using LangChain and MongoDB.
not much happened this weekend
jamba-1.5 dream-machine-1.5 ideogram-v2 mistral-nemo-minitron-8b mistral-7b llama-3-8b nous-research cursor-ai gdm george-hotz agibot unitree eth-zurich disney uc-san-diego ai21-labs luma-labs ideogram nvidia mistral-ai meta-ai-fair distributed-ai optimizer inter-gpu-communication low-latency-training open-source humanoid-robots robotics physics-based-motion teleoperation multilingual-models long-context text-to-video text-to-image model-performance george-hotz adcock_brett aman
Nous Research announced DisTrO, a new optimizer that reduces inter-GPU communication by 1,000x to 10,000x, enabling efficient training on slow networks and offering an alternative to GDM's DiLoCo. Cursor AI gained viral attention from an 8-year-old user and announced a new fundraise, with co-host Aman returning to their podcast. George Hotz launched tinybox for sale. In robotics, AGIBOT revealed 5 new humanoid robots with open-source plans, and Unitree showcased its G1 humanoid robot nearing mass production at $16,000. ETH Zurich and Disney developed an AI system for physics-based robot motion generation from text or images. UC San Diego released ACE, an open-source teleoperation system for controlling multiple robots. AI21 Labs unveiled Jamba 1.5, a multilingual model with 256k context length and permissive licensing. Luma Labs released Dream Machine 1.5 for improved text-to-video generation. Ideogram launched v2 of its text-to-image model with near-perfect text generation. Nvidia and Mistral released Mistral-NeMo-Minitron 8B, a small model outperforming Mistral-7B and Llama 3 8B on the Open LLM leaderboard.
AlphaProof + AlphaGeometry2 reach 1 point short of IMO Gold
gemini alphageometry-2 alphaproof llama-3-1-405b llama-3-70b llama-3-8b mistral-large-2 google-deepmind meta-ai-fair mistral-ai neurosymbolic-ai mathematical-reasoning synthetic-data knowledge-sharing model-fine-tuning alpha-zero multilinguality context-windows model-scaling benchmarking performance-comparison tim-gowers guillaume-lample osanseviero
Search+Verifier highlights advances in neurosymbolic AI at the 2024 International Mathematical Olympiad (IMO). Google DeepMind's combination of AlphaProof and AlphaGeometry 2 solved four out of six IMO problems, with AlphaProof being a finetuned Gemini model using an AlphaZero approach, and AlphaGeometry 2 trained on significantly more synthetic data with a novel knowledge-sharing mechanism. Despite the impressive results, human judges noted that the AI required much more time than human competitors. Meanwhile, Meta AI released Llama 3.1 with a 405B parameter model and smaller variants, and Mistral AI launched Mistral Large 2 with 123B parameters and 128k context windows, outperforming Llama 3.1 on coding tasks and multilingual benchmarks. This marks significant progress in AI mathematical reasoning, model scaling, and multilingual capabilities.
Llama 3.1 Leaks: big bumps to 8B, minor bumps to 70b, and SOTA OSS 405b model
llama-3-1-405b llama-3-8b llama-3-70b llama-3-1-8b gpt-4o gpt-4o-mini claude-3-5 qwen-2 meta-ai-fair openai alibaba multilinguality code-generation context-windows model-training synthetic-data benchmarking reasoning fine-tuning model-performance dataset-release swyx philschmid jjitsev lewtun teknium1 adcock_brett
Llama 3.1 leaks reveal a 405B dense model with 128k context length, trained on 39.3M GPU hours using H100-80GB GPUs, and fine-tuned with over 25M synthetic examples. The model shows significant benchmark improvements, especially for the 8B and 70B variants, with some evals suggesting the 70B outperforms GPT-4o. GPT-4o Mini launched as a cost-efficient variant with strong performance but some reasoning weaknesses. Synthetic datasets like NuminaMath enable models such as Alibaba Qwen 2 to surpass GPT-4o and Claude 3.5 in math competitions. Discussions include reasoning task benchmarks and dataset building for improved reasoning.
Is this... OpenQ*?
deepseek-coder-v2 llama-3-8b nemotron-4-340b stable-diffusion-3-medium deepseek_ai anthropic runwayml openai apple nvidia stability-ai luma-labs reward-tampering test-time-search mathematical-reasoning process-supervision fine-tuning on-device-ai video-generation cost-efficiency context-length coding image-understanding multimodality adcock_brett clementdelangue svpino
DeepSeek-Coder-V2 promises performance beating GPT-4 Turbo at a fraction of the cost. Anthropic released new research on reward tampering. Runway launched Gen-3 Alpha, its video generation model and response to Sora. A series of papers explore "test-time" search techniques improving mathematical reasoning with models like Llama 3 8B. Apple announced Apple Intelligence with smarter Siri and image/document understanding, partnered with OpenAI to integrate ChatGPT into iOS 18, and released 20 new CoreML models with LoRA fine-tuning for specialization. NVIDIA released Nemotron-4 340B, an open model matching GPT-4 performance. DeepSeek-Coder-V2 excels in coding and math, supporting 338 programming languages and a 128K context length. Stability AI released Stable Diffusion 3 Medium weights. Luma Labs launched Dream Machine for 5-second video generation from text and images.
The Last Hurrah of Stable Diffusion?
llama-3-8b llama-3 qwen-2 gpt-4 gpt-4o stability-ai togethercompute model-architecture fine-tuning benchmarks dataset-release model-evaluation reasoning model-training retrieval-augmented-generation multimodality emad-mostaque rohanpaul_ai fchollet mikeknoop micahgoldblum teknium1 rasbt percyliang
Stability AI launched Stable Diffusion 3 Medium with models ranging from 450M to 8B parameters, featuring the MMDiT architecture and T5 text encoder for image text rendering. The community has shown mixed reactions following the departure of key figures such as former CEO Emad Mostaque. On AI models, Llama 3 8B Instruct shows strong evaluation correlation with GPT-4, while Qwen 2 Instruct surpasses Llama 3 on MMLU benchmarks. The Mixture of Agents (MoA) framework outperforms GPT-4o on AlpacaEval 2.0. Techniques like Spectrum and QLoRA enable efficient fine-tuning with less VRAM. Research on grokking reveals transformers can transition from memorization to generalization through extended training. Benchmark initiatives include the $1M ARC Prize Challenge for AGI progress and LiveBench, a live LLM benchmark to prevent dataset contamination. The Character Codex Dataset offers open data on over 15,000 characters for RAG and synthetic data. The MLX 0.2 tool enhances LLM experience on Apple Silicon Macs with improved UI and faster retrieval-augmented generation.
Not much happened today
gemini-1.5-flashmodel gemini-pro mixtral mamba-2 phi-3-medium phi-3-small gpt-3.5-turbo-0613 llama-3-8b llama-2-70b mistral-finetune twelve-labs livekit groq openai nea nvidia lmsys mistral-ai model-performance prompt-engineering data-curation ai-safety model-benchmarking model-optimization training sequence-models state-space-models daniel-kokotajlo rohanpaul_ai _arohan_ tri_dao _albertgu _philschmid sarahcat21 hamelhusain jachiam0 willdepue teknium1
Twelve Labs raised $50m in Series A funding co-led by NEA and NVIDIA's NVentures to advance multimodal AI. LiveKit secured $22m in funding. Groq announced running at 800k tokens/second. Daniel Kokotajlo resigned from OpenAI. Twitter users highlighted the Gemini 1.5 Flash model for high performance at low cost and Gemini Pro ranking #2 in Japanese language tasks. Mixtral models can run up to 8x faster on NVIDIA RTX GPUs using TensorRT-LLM. The Mamba-2 architecture introduces state space duality for larger states and faster training, outperforming previous models. Phi-3 Medium (14B) and Small (7B) models benchmark near GPT-3.5-Turbo-0613 and Llama 3 8B. Prompt engineering is emphasized for unlocking LLM capabilities. Data quality is critical for model performance, with upcoming masterclasses on data curation. Discussions on AI safety include a letter from frontier AI lab employees advocating whistleblower protections and debates on aligning AI to user intent versus the interests of humanity more broadly.
Life after DPO (RewardBench)
gpt-3 gpt-4 gpt-5 gpt-6 llama-3-8b llama-3 claude-3 gemini x-ai openai mistral-ai anthropic cohere meta-ai-fair hugging-face nvidia reinforcement-learning-from-human-feedback direct-preference-optimization reward-models rewardbench language-model-history model-evaluation alignment-research preference-datasets personalization transformer-architecture nathan-lambert chris-manning elon-musk bindureddy rohanpaul_ai nearcyan
xAI raised $6 billion at a $24 billion valuation, positioning it among the most highly valued AI startups, with expectations to fund GPT-5 and GPT-6 class models. The RewardBench tool, developed by Nathan Lambert, evaluates reward models (RMs) for language models, showing Cohere's RMs outperforming open-source alternatives. The discussion highlights the evolution of language models from Claude Shannon's 1948 model to GPT-3 and beyond, emphasizing the role of RLHF (Reinforcement Learning from Human Feedback) and the newer DPO (Direct Preference Optimization) method. Notably, some reward models built on Llama 3 8B are currently outperforming GPT-4, Cohere, Gemini, and Claude on the RewardBench leaderboard, raising questions about reward hacking. Future alignment research directions include improving preference datasets, DPO techniques, and personalization in language models. The report also compares xAI's valuation with OpenAI, Mistral AI, and Anthropic, noting speculation about xAI's spending on Nvidia hardware.
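For readers unfamiliar with DPO, the standard per-example loss can be sketched in a few lines. This is the commonly published form of the objective, not code from RewardBench or from the talk being summarized; the function name, argument names, and the default `beta=0.1` are assumptions for illustration. Inputs are summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    margin measures how much more the policy prefers the chosen response
    over the rejected one, relative to the reference model.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    z = beta * margin
    # Stable evaluation of log(1 + exp(-z)) for either sign of z.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: no preference signal yet.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has shifted probability toward the chosen response.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

Unlike RLHF with a separate reward model, this objective is computed directly from preference pairs, which is why the quality of the preference dataset matters so much.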
Not much happened today
command-r-35b goliath-120 miqu-120 llama-3-8b tensorrt-llm llama-cpp gpt2-chat gpt-4-turbo llama-3 deepmind-alphazero anthropic openai perplexity-ai amazon apple microsoft deepmind creative-writing context-windows benchmarking model-performance self-learning function-calling retrieval-augmented-generation ai-assistants on-device-ai ai-lobbying copyright-infringement code-reasoning image-generation
Anthropic released a team plan and iOS app about 4 months after OpenAI. The Command-R 35B model excels at creative writing, outperforming larger models like Goliath-120 and Miqu-120. The Llama-3 8B model now supports a 1 million token context window, improving long-context understanding with minimal training on a single 8xA800 GPU machine. TensorRT-LLM benchmarks show it is 30-70% faster than llama.cpp on consumer hardware. A benchmark suggests GPT2-Chat may have better reasoning than GPT-4-Turbo, though results are debated. Demos include a self-learning Llama-3 voice agent running locally on Jetson Orin and a Self-Learning Large Action Model (LAM). Amazon CodeWhisperer was renamed to Q Developer, expanding its generative AI assistant capabilities. Apple plans an AI-enabled Safari browser with an on-device LLM in iOS 18 and macOS 15. Big Tech dominates AI lobbying in Washington, while major U.S. newspapers sued OpenAI and Microsoft for copyright infringement. DeepMind's AlphaZero became the greatest chess player in 9 hours, and their Naturalized Execution Tuning (NExT) method improves LLM code reasoning by 14-26%. Stable Diffusion is used for diverse image generation applications.
OpenAI's Instruction Hierarchy for the LLM OS
phi-3-mini openelm claude-3-opus gpt-4-turbo gpt-3.5-turbo llama-3-70b rho-1 mistral-7b llama-3-8b llama-3 openai microsoft apple deepseek mistral-ai llamaindex wendys prompt-injection alignment benchmarking instruction-following context-windows model-training model-deployment inference performance-optimization ai-application career-advice drive-thru-ai
OpenAI published a paper introducing the concept of privilege levels for LLMs to address prompt injection vulnerabilities, improving defenses by 20-30%. Microsoft released the lightweight Phi-3-mini model with 4K and 128K context lengths. Apple open-sourced the OpenELM language model family with an open training and inference framework. An instruction accuracy benchmark compared 12 models, with Claude 3 Opus, GPT-4 Turbo, and Llama 3 70B performing best. The Rho-1 method enables training state-of-the-art models using only 3% of tokens, boosting models like Mistral. Wendy's deployed AI-powered drive-thru ordering, and a study found Gen Z workers prefer generative AI for career advice. Tutorials on deploying Llama 3 models on AWS EC2 highlight hardware requirements and inference server use.
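The privilege-level idea can be illustrated with a toy conflict resolver. This is only a sketch of the ordering concept, not OpenAI's method, which the summary does not detail and which concerns how a model is trained to behave rather than a rule engine. The three level names, their ordering, and the `resolve` helper are assumptions made for illustration.

```python
from enum import IntEnum


class Privilege(IntEnum):
    """Hypothetical privilege levels; higher values take precedence."""
    TOOL_OUTPUT = 0
    USER = 1
    SYSTEM = 2


def resolve(instructions):
    """Pick one value per setting, letting the highest privilege win.

    instructions: iterable of (Privilege, setting_name, value).
    On a tie in privilege, the earlier instruction is kept.
    """
    winners = {}
    for privilege, name, value in instructions:
        if name not in winners or privilege > winners[name][0]:
            winners[name] = (privilege, value)
    return {name: value for name, (_, value) in winners.items()}


settings = resolve([
    (Privilege.SYSTEM, "reveal_system_prompt", False),
    (Privilege.USER, "language", "English"),
    # Text injected through a retrieved web page tries to override both.
    (Privilege.TOOL_OUTPUT, "reveal_system_prompt", True),
    (Privilege.TOOL_OUTPUT, "language", "French"),
])
```

In this toy version, instructions arriving through tool output never override the system or user messages, which is the intuition behind treating prompt injection as a privilege problem.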
Perplexity, the newest AI unicorn
llama-3-8b llama-3-70b llama-3 llava-llama-3-8b-v1_1 phi-3 gpt-3.5 perplexity-ai meta-ai-fair hugging-face groq context-length fine-tuning quantization instruction-following model-comparison multimodality benchmarking memory-optimization model-performance daniel-gross aravind-srinivas
Perplexity doubled its valuation shortly after its Series B with a Series B-1 funding round. Significant developments around Llama 3 include context length extension to 16K tokens, new multimodal LLaVA models outperforming Llama 2, and fine-tuning improvements like QDoRA surpassing QLoRA. The Llama-3-70B model is praised for instruction following and performance across quantization formats. Microsoft's Phi-3 models, released in multiple sizes, show competitive benchmark results, with the 14B model achieving 78% on MMLU and the 3.8B model nearing GPT-3.5 performance.
Llama-3-70b is GPT-4-level Open Model
llama-3-70b llama-3-8b llama-3 llama-2-70b mistral-7b grok-3 stable-diffusion-3 vasa-1 meta-ai-fair groq nvidia amazon microsoft benchmarking model-performance fine-tuning function-calling arithmetic image-generation video-generation energy-usage gpu-demand political-bias ai-safety scaling context-windows tokenization elon-musk
Meta has released Llama 3, their most capable open large language model with 8B and 70B parameter versions supporting 8K context length and outperforming previous models including Llama 2 and Mistral 7B. Groq serves the Llama 3 70B model at 500-800 tokens/second, making it the fastest GPT-4-level token source. Discussions highlight AI scaling challenges with Elon Musk stating that training Grok 3 will require 100,000 Nvidia H100 GPUs, and AWS planning to acquire 20,000 B200 GPUs for a 27 trillion parameter model. Microsoft unveiled VASA-1 for lifelike talking face generation, while Stable Diffusion 3 and its extensions received mixed impressions. Concerns about AI energy usage and political bias in AI were also discussed.
Meta Llama 3 (8B, 70B)
llama-3-8b llama-3-70b llama-3-400b stable-diffusion-3 mixtral-8x22b-instruct-v0.1 vasa-1 meta-ai-fair stability-ai boston-dynamics microsoft mistral-ai hugging-face transformer tokenization model-training benchmarking robotics natural-language-processing real-time-processing synthetic-data dataset-cleaning behavior-trees ai-safety model-accuracy api model-release humor helen-toner
Meta released the 8B and 70B Llama 3 models, while a 400B variant, touted as the first GPT-4 level open-source model, is still in training. Stability AI launched the Stable Diffusion 3 API with model weights coming soon, showing competitive realism against Midjourney V6. Boston Dynamics unveiled an electric version of its humanoid robot Atlas, and Microsoft introduced the VASA-1 model generating lifelike talking faces at 40fps on an RTX 4090. Mistral AI, a European OpenAI rival, is seeking $5B funding, and its Mixtral-8x22B-Instruct-v0.1 model achieved 100% accuracy on 64K context benchmarks. AI safety discussions include calls from former OpenAI board member Helen Toner for audits of top AI companies, and the Mormon Church released AI usage principles. New AI development tools include Ctrl-Adapter for diffusion models, Distilabel 1.0.0 for synthetic dataset pipelines, Data Bonsai for data cleaning with LLMs, and Dendron for building LLM agents with behavior trees. Memes highlight AI development humor and cultural references. The Llama 3 release features improved reasoning, a 128K token vocabulary, 8K token sequences, and grouped query attention.
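Grouped query attention, mentioned above, lets several query heads share one key/value head, which shrinks the KV cache. The sketch below shows only the head-to-group mapping and the resulting cache ratio. The head counts in the example (32 query heads, 8 KV heads) are illustrative assumptions rather than figures from this summary, and the function names are hypothetical.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Return the index of the KV head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def kv_cache_ratio(n_query_heads, n_kv_heads):
    """KV cache size relative to standard multi-head attention."""
    return n_kv_heads / n_query_heads


# Illustrative configuration: 32 query heads sharing 8 KV heads.
mapping = [kv_head_for(h, 32, 8) for h in range(32)]
ratio = kv_cache_ratio(32, 8)
```

With `n_kv_heads` equal to `n_query_heads` this reduces to ordinary multi-head attention, and with `n_kv_heads = 1` it becomes multi-query attention.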