All tags
Topic: "transformers"
Mary Meeker is so back: BOND Capital AI Trends report
qwen-3-8b anthropic hugging-face deepseek attention-mechanisms inference arithmetic-intensity transformers model-optimization interpretability model-quantization training tri_dao fleetwood___ teortaxestex awnihannun lateinteraction neelnanda5 eliebakouch _akhaliq
Mary Meeker returns with a comprehensive 340-slide report on the state of AI, highlighting accelerating tech cycles, compute growth, and comparisons of ChatGPT to early Google and other iconic tech products. The report also covers enterprise traction and valuation of major AI companies. On Twitter, @tri_dao discusses an "ideal" inference architecture featuring attention variants like GTA, GLA, and DeepSeek MLA with high arithmetic intensity (~256), improving efficiency and model quality. Other highlights include the release of 4-bit DWQ of DSR1 Qwen3 8B on Hugging Face, AnthropicAI's open-source interpretability tools for LLMs, and discussions on transformer training and abstractions by various researchers.
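For readers wondering what an arithmetic intensity of ~256 buys, here is a back-of-the-envelope roofline check. The hardware numbers are assumed (roughly H100-class) and are not from the thread, and standard multi-head attention decode is approximated as roughly 1 FLOP per byte of KV-cache traffic:

```python
# Back-of-the-envelope roofline check for the ~256 FLOPs/byte figure quoted above.
# Hardware numbers are illustrative assumptions (roughly H100-class), not from the thread.
PEAK_FLOPS = 989e12        # assumed peak BF16 compute, FLOP/s
PEAK_BANDWIDTH = 3.35e12   # assumed HBM bandwidth, bytes/s

ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH   # FLOPs/byte needed before compute becomes the bottleneck

for name, intensity in [("standard MHA decode (approx.)", 1.0),
                        ("GTA/GLA/MLA-style decode (quoted)", 256.0)]:
    # Roofline model: attainable throughput = min(peak compute, bandwidth * arithmetic intensity)
    attainable = min(PEAK_FLOPS, PEAK_BANDWIDTH * intensity)
    print(f"{name}: {intensity:>6.1f} FLOPs/byte -> "
          f"{attainable / PEAK_FLOPS:.1%} of peak (ridge point ~{ridge_point:.0f} FLOPs/byte)")
```

Under the roofline model, attainable throughput is min(peak compute, bandwidth × intensity), so lifting decode-time intensity from ~1 toward the hardware ridge point is what turns a bandwidth-bound decode step into a nearly compute-bound one.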
not much happened today
nemotron-h nvidia-eagle-2.5 gpt-4o qwen2.5-vl-72b gemini-2.5-flash gemini-2.0-pro gemini-exp-1206 gemma-3 qwen2.5-32b deepseek-r1-zero-32b uni3c seedream-3.0 adobe-dragon kimina-prover qwen2.5-72b bitnet-b1.58-2b4t nvidia deepseek hugging-face alibaba bytedance adobe transformers model-optimization multimodality long-context reinforcement-learning torch-compile image-generation diffusion-models distributional-rewards model-efficiency model-training native-quantization sampling-techniques philschmid arankomatsuzaki osanseviero iScienceLuvr akhaliq
Nemotron-H model family introduces hybrid Mamba-Transformer models with up to 3x faster inference and variants including 8B, 56B, and a compressed 47B model. Nvidia Eagle 2.5 is a frontier VLM for long-context multimodal learning, matching GPT-4o and Qwen2.5-VL-72B on long-video understanding. Gemini 2.5 Flash shows improved dynamic thinking and cost-performance, outperforming previous Gemini versions. Gemma 3 now supports torch.compile for about 60% faster inference on consumer GPUs. SRPO using Qwen2.5-32B surpasses DeepSeek-R1-Zero-32B on benchmarks with reinforcement learning only. Alibaba's Uni3C unifies 3D-enhanced camera and human motion controls for video generation. Seedream 3.0 by ByteDance is a bilingual image generation model with high-resolution outputs up to 2K. Adobe DRAGON optimizes diffusion generative models with distributional rewards. Kimina-Prover Preview is an LLM trained with reinforcement learning from Qwen2.5-72B, achieving 80.7% pass@8192 on miniF2F. BitNet b1.58 2B4T is a native 1-bit LLM with 2B parameters trained on 4 trillion tokens, matching full-precision LLM performance with better efficiency. Antidistillation sampling counters unwanted model distillation by modifying reasoning traces from frontier models.
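As a rough idea of what the Gemma 3 torch.compile path looks like in practice, here is a minimal sketch; the checkpoint id google/gemma-3-1b-it, the bf16/CUDA setup, and the static-cache generation flag are assumptions, and the ~60% figure will vary with hardware and sequence length:

```python
# Minimal sketch of compiling a Gemma 3 checkpoint for faster decoding.
# Assumptions (not from the article): the checkpoint id "google/gemma-3-1b-it",
# a CUDA GPU with bf16 support, and a recent transformers / PyTorch 2.x stack.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

# Compile the forward pass; "reduce-overhead" uses CUDA graphs, which helps short decode steps.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
with torch.inference_mode():
    # A static KV cache keeps tensor shapes fixed so the compiled graph can be reused.
    out = model.generate(**inputs, max_new_tokens=32, cache_implementation="static")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The first few forward passes are slow while compilation and graph capture happen; the speedup shows up on subsequent decode steps.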
not much happened today
grok-3 grok-3-mini gpt-4.5 claude-3.7-sonnet quasar-alpha optimus-alpha gpt-4.1 kaleidoscope internvl3 internvit qwen2.5vl transmamba fantasytalking openai alibaba cmu reinforcement-learning reasoning benchmarks vision multilinguality multimodality transformers attention-mechanisms agents code-generation model-performance rasbt sarahookr mervenoyann gneubig svpino mathemagic1an
The AI news recap highlights independent evaluations showing Grok-3 outperforming models like GPT-4.5 and Claude 3.7 Sonnet on reasoning benchmarks, while Grok-3 mini excels in reasoning tasks. Research on reinforcement learning (RL) fine-tuning reveals potential improvements for small reasoning models but also notes instability in reported gains. Benchmark results suggest Quasar Alpha and Optimus Alpha may be versions of GPT-4.1. Vision and multimodal models like Kaleidoscope, supporting 18 languages, and InternVL3, built on InternViT and Qwen2.5VL, demonstrate advances in multilingual vision and reasoning. The fusion model TransMamba combines transformer precision with speed via SSM mechanisms. Alibaba's FantasyTalking generates realistic talking portraits. Agent-focused events at CMU and tools like FilmAgent AI for virtual film production and BrowseComp benchmark for browsing agents were announced. The coding assistant Augment supports multiple IDEs with code analysis and suggestions. Discussions also covered Google's new agent-to-agent protocol concept.
not much happened today
gemini-2.0-flash-thinking command-a qwq-32b gemma-3-27b gemma-3 shieldgemma-2 llama-3-70b deepseek-r1 o1-mini deepseek-v3 google-deepmind cohere meta-ai-fair alibaba hugging-face model-updates model-performance benchmarking reinforcement-learning transformers normalization-layers image-generation vision memory-efficiency context-windows fine-tuning yann-lecun
Google DeepMind announced updates to Gemini 2.0, including an upgraded Flash Thinking model with stronger reasoning and native image generation capabilities. Cohere launched Command A, a 111B parameter dense model with a 256K context window and competitive pricing, available on Hugging Face. Meta AI proposed Dynamic Tanh (DyT) as a replacement for normalization layers in Transformers, supported by Yann LeCun. Alibaba released QwQ-32B, a 32.5B parameter model excelling in math and coding, fine-tuned with reinforcement learning and freely available under the Apache 2.0 license. Google DeepMind also released Gemma 3 models ranging from 1B to 27B parameters with a 128K token context window and support for over 140 languages, plus ShieldGemma 2, an image safety checker. Benchmarking shows Gemma 3 27B has strong vision and memory efficiency but is outperformed by larger models like Llama 3.3 70B and DeepSeek V3 671B. The Hugging Face LLM leaderboard history was shared by @_lewtun.
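For context on the Dynamic Tanh proposal, here is a minimal sketch of a DyT layer as a drop-in for LayerNorm/RMSNorm: an element-wise tanh with a learnable scalar alpha plus per-channel scale and shift. The init value and shapes below are illustrative assumptions:

```python
# Minimal sketch of Dynamic Tanh (DyT), proposed as a drop-in replacement for
# LayerNorm/RMSNorm in Transformers: y = weight * tanh(alpha * x) + bias.
# The alpha init value is an assumption chosen for illustration.
import torch
import torch.nn as nn

class DyT(nn.Module):
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))  # learnable scalar
        self.weight = nn.Parameter(torch.ones(dim))               # per-channel scale
        self.bias = nn.Parameter(torch.zeros(dim))                # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * torch.tanh(self.alpha * x) + self.bias

x = torch.randn(2, 16, 64)   # (batch, sequence, hidden)
print(DyT(64)(x).shape)      # torch.Size([2, 16, 64])
```

The appeal is that, unlike LayerNorm, no per-token statistics need to be computed; the tanh squashes outliers while the learned parameters recover the scale.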
How To Scale Your Model, by DeepMind
qwen-0.5 google-deepmind deepseek hugging-face transformers inference high-performance-computing robotics sim2real mixture-of-experts reinforcement-learning bias-mitigation rust text-generation open-source omarsar0 drjimfan tairanhe99 guanyashi lioronai _philschmid awnihannun clementdelangue
Researchers at Google DeepMind (GDM) released a comprehensive "little textbook" titled "How To Scale Your Model" covering modern Transformer architectures, inference optimizations beyond O(N^2) attention, and high-performance computing concepts like rooflines. The resource includes practical problems and real-time comment engagement. On AI Twitter, several key updates include the open-sourced humanoid robotics model ASAP inspired by athletes like Cristiano Ronaldo, LeBron James, and Kobe Bryant; a new paper on Mixture-of-Agents proposing the Self-MoA method for improved LLM output aggregation; training of reasoning LLMs using the GRPO algorithm from DeepSeek demonstrated on Qwen 0.5; findings on bias in LLMs used as judges highlighting the need for multiple independent evaluations; and the release of mlx-rs, a Rust library for machine learning with examples including Mistral text generation. Additionally, Hugging Face launched an AI app store featuring over 400,000 apps with 2,000 new daily additions and 2.5 million weekly visits, enabling AI-powered app search and categorization.
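The GRPO algorithm mentioned above replaces a learned value model with group-relative rewards: several completions are sampled per prompt and each completion's reward is normalized against its group before weighting a PPO-style clipped objective. A minimal sketch of that advantage computation (the reward values are made up for illustration):

```python
# Sketch of GRPO's group-relative advantage: for each prompt, sample G completions,
# score them, and normalize rewards within the group; no separate critic is needed.
# The rewards below are made-up numbers for illustration.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) -> advantages of the same shape."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],    # prompt 1: two of four samples correct
                        [0.0, 0.0, 0.0, 1.0]])   # prompt 2: one of four samples correct
print(group_relative_advantages(rewards))
```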
OpenAI Sora Turbo and Sora.com
sora-turbo o1 claude-3.5-sonnet claude-3.5 gemini llama-3-3-euryale-v2.3 mistral-large behemoth endurance-v1.1 openai google nvidia hugging-face mistral-ai text-to-video-generation quantum-computing coding-capabilities transformers algorithmic-innovation storytelling roleplay model-parameter-tuning anti-monopoly-investigation sama sundarpichai bindureddy denny_zhou nrehiew_
OpenAI launched Sora Turbo, enabling text-to-video generation for ChatGPT Plus and Pro users with monthly generation limits and regional restrictions in Europe and the UK. Google announced a quantum computing breakthrough with the development of the Willow chip, potentially enabling commercial quantum applications. Discussions on O1 model performance highlighted its lag behind Claude 3.5 Sonnet and Gemini in coding tasks, with calls for algorithmic innovation beyond transformer scaling. The Llama 3.3 Euryale v2.3 model was praised for storytelling and roleplay capabilities, with users suggesting parameter tuning to reduce creative liberties and repetition. Alternatives like Mistral-Large, Behemoth, and Endurance v1.1 were also noted. Additionally, Nvidia faces an anti-monopoly investigation in China. Memes and humor around GPU issues and embargo mishaps were popular on social media.
not much happened today
llama-3-2-vision gpt-2 meta-ai-fair ollama amd llamaindex gemini gitpod togethercompute langchainai weights-biases stanfordnlp deeplearningai model-scaling neural-networks multi-gpu-support skip-connections transformers healthcare-ai automated-recruitment zero-trust-security small-language-models numerical-processing chain-of-thought optical-character-recognition multi-agent-systems agent-memory interactive-language-learning bindureddy fstichler stasbekman jxmnop omarsar0 giffmana rajammanabrolu
This week in AI news highlights Ollama 0.4 supporting Meta's Llama 3.2 Vision models (11B and 90B), with applications like handwriting recognition. Self-Consistency Preference Optimization (ScPO) was introduced to improve model consistency without human labels. Discussions covered model scaling, the resurgence of neural networks, and AMD's multi-GPU bandwidth challenges. The importance of skip connections in Transformers was emphasized. In healthcare, the combination of lighter regulation and AI could revolutionize the treatment of disease and aging. Tools like LlamaParse and Gemini aid automated resume insights. Gitpod Flex demonstrated zero-trust architecture for secure development environments. Research includes surveys on Small Language Models (SLMs), number understanding in LLMs, and DTrOCR using a GPT-2 decoder for OCR. Multi-agent systems in prediction markets were discussed by TogetherCompute and LangChainAI. Community events include NeurIPS Happy Hour, NLP seminars, and courses on Agent Memory with LLMs as operating systems.
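Since skip connections come up above, here is a minimal sketch of the pre-norm residual pattern used in most Transformer blocks; module sizes and choices are illustrative and not tied to any model in this summary:

```python
# Minimal pre-norm Transformer block showing the skip (residual) connections:
# each sublayer's output is added back onto its input, which keeps gradients
# flowing through deep stacks. Sizes and module choices are illustrative only.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # skip connection around attention
        x = x + self.mlp(self.norm2(x))                    # skip connection around the MLP
        return x

x = torch.randn(2, 10, 256)
print(Block()(x).shape)  # torch.Size([2, 10, 256])
```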
not much happened today
claude-3.5-sonnet claude-3.5-haiku o1-preview mochi-1 stable-diffusion-3.5 embed-3 kerashub differential-transformer anthropic openai cohere microsoft computer-use coding-performance video-generation fine-tuning multimodality transformers attention-mechanisms model-optimization alexalbert fchollet rasbt
Anthropic released upgraded Claude 3.5 Sonnet and Claude 3.5 Haiku models featuring a new computer use capability that allows interaction with computer interfaces via screenshots and actions like mouse movement and typing. The Claude 3.5 Sonnet achieved state-of-the-art coding performance on SWE-bench Verified with a 49% score, surpassing OpenAI's o1-preview. Anthropic focuses on teaching general computer skills rather than task-specific tools, with expected rapid improvements. Other releases include Mochi 1, an open-source video generation model, Stable Diffusion 3.5 with Large and Medium variants, and Embed 3 by Cohere, a multimodal embedding model for text and image search. KerasHub was launched by François Chollet, unifying KerasNLP and KerasCV with 37 pretrained models. Microsoft introduced the Differential Transformer to reduce attention noise via differential attention maps, and research on transformer attention layers was shared by @rasbt.
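For the Differential Transformer item, here is a minimal sketch of the core differential-attention idea: two softmax attention maps whose weighted difference cancels common-mode attention noise. The head splitting and the lambda handling are simplified relative to the paper:

```python
# Sketch of differential attention from the Differential Transformer paper:
# queries/keys are split into two halves, two softmax maps are computed, and
# their difference (weighted by a learnable lambda) attends over the values.
# The lambda parameterization here is simplified relative to the paper.
import math
import torch

def diff_attention(q1, k1, q2, k2, v, lam: float = 0.5):
    d = q1.shape[-1]
    a1 = torch.softmax(q1 @ k1.transpose(-1, -2) / math.sqrt(d), dim=-1)
    a2 = torch.softmax(q2 @ k2.transpose(-1, -2) / math.sqrt(d), dim=-1)
    return (a1 - lam * a2) @ v   # the differential map downweights attention mass shared by both heads

T, d = 8, 16
q1, k1, q2, k2 = (torch.randn(T, d) for _ in range(4))
v = torch.randn(T, 2 * d)
print(diff_attention(q1, k1, q2, k2, v).shape)  # torch.Size([8, 32])
```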
not much happened today
llama mistral openai decagon sierra togethercompute vertical-saas funding protein-structure-prediction lora self-supervised-learning model-optimization neural-architecture-search model-evaluation ethics transformers multi-agent-systems long-context mira-murati demis-hassabis clement-delangue john-o-whitaker yann-lecun francois-chollet ajeya-cotra rohan-paul adcock-brett
Vertical SaaS agents are gaining rapid consensus as the future of AI applications, highlighted by Decagon's $100m funding and Sierra's $4b round. OpenAI alumni are actively raising venture capital and forming new startups, intensifying competition in the AI market. Demis Hassabis celebrated the Nobel Prize recognition for AlphaFold2, a breakthrough in protein structure prediction. Advances in AI models include techniques like LoRA projectors and annealing on high-quality data, while discussions emphasize the need for high-bandwidth sensory inputs beyond language for common sense learning. New methods like LoLCATs aim to optimize transformer models such as Llama and Mistral for efficiency. Ethical concerns about AI agents performing harmful tasks remain under investigation. The AI community continues to explore model evaluation challenges and optimization frameworks like LPZero for neural architecture search.
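LoRA-style low-rank adapters come up twice above (the LoRA projectors and LoLCATs' efficiency work), so a minimal sketch of the basic idea may help: freeze the pretrained weight and learn a low-rank update. Rank, scaling, and dimensions are arbitrary illustrative choices:

```python
# Minimal LoRA sketch: freeze a pretrained weight W and learn a low-rank update
# B @ A, so the effective layer is y = W x + (alpha / r) * B A x.
# Rank, alpha, and dimensions below are arbitrary illustrative choices.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero-init => adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```

Only A and B are trained, so optimizer state and checkpoints stay small, and the low-rank update can be merged back into the base weight after training.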
Microsoft AgentInstruct + Orca 3
mistral-7b orca-2.5 microsoft-research apple tencent hugging-face synthetic-data fine-tuning instruction-following transformers model-performance hallucination-detection dataset-quality flashattention mixture-of-experts philschmid sama bindureddy rohanpaul_ai zachtratar dair_ai
Microsoft Research released AgentInstruct, the third paper in its Orca series, introducing a generative teaching pipeline that produces 25.8 million synthetic instructions to fine-tune mistral-7b, achieving significant performance gains: +40% AGIEval, +19% MMLU, +54% GSM8K, +38% BBH, +45% AlpacaEval, and a 31.34% reduction in hallucinations. This synthetic data approach follows the success of FineWeb and Apple's Rephrasing research in improving dataset quality. Additionally, Tencent claims to have generated 1 billion diverse personas for synthetic data. On AI Twitter, notable discussions included a shooting incident at a Trump rally and recent ML research highlights such as FlashAttention-3, RankRAG, and Mixture of A Million Experts.
Problems with MMLU-Pro
mmlu-pro llama-3-8b-q8 gpt4all-3.0 chatgpt claude llama gemini mobilellm runway-gen-3-alpha meta-3d-gen huggingface meta-ai-fair salesforce runway nomic-ai pineapple argil-ai benchmarking prompt-engineering model-evaluation model-performance multimodality automated-dataset-generation video-generation open-source-models ai-assistants text-to-3d deepfake transformers reasoning wenhu-chen danhendrycks clementine ylecun adcock_brett svpino rohanpaul_ai
MMLU-Pro is gaining attention as the successor to MMLU on HuggingFace's Open LLM Leaderboard V2, despite community concerns about evaluation discrepancies and prompt sensitivity affecting model performance; notably, simple prompt tweaks produced a 10-point improvement for Llama-3-8b-q8. Meta's MobileLLM research explores running sub-billion parameter LLMs on smartphones using shared weights and deeper architectures. Salesforce's APIGen introduces an automated dataset generation system for function-calling tasks outperforming larger models. Runway Gen-3 Alpha launches an AI video generator for paid users creating realistic 10-second clips. Nomic AI's GPT4All 3.0 offers an open-source desktop app supporting thousands of local models. AI assistants with multimodal capabilities and affordable access to multiple LLMs like ChatGPT, Claude, Llama, and Gemini are emerging. Meta 3D Gen advances text-to-3D asset generation, while Argil AI enables deepfake video creation from text threads. Research on transformer grokking and reasoning highlights advances in robust reasoning capabilities.
Contextual Position Encoding (CoPE)
cope gemini-1.5-flash gemini-1.5-pro claude gpt-3 meta-ai-fair google-deepmind anthropic perplexity-ai langchain openai positional-encoding transformers counting copying language-modeling coding external-memory tool-use model-evaluation inference-speed model-benchmarking scaling research-synthesis jason-weston alexandr-wang karpathy arav-srinivas
Meta AI researcher Jason Weston introduced CoPE, a novel positional encoding method for transformers that incorporates context to create learnable gates, enabling improved handling of counting and copying tasks and better performance on language modeling and coding. The approach can potentially be extended with external memory for gate calculation. Google DeepMind released Gemini 1.5 Flash and Pro models optimized for fast inference. Anthropic announced general availability of tool use for Claude, enhancing its ability to orchestrate tools for complex tasks. Alexandr Wang launched SEAL Leaderboards for private, expert evaluations of frontier models. Karpathy reflected on the 4th anniversary of GPT-3, emphasizing scaling and practical improvements. Perplexity AI launched Perplexity Pages to convert research into visually appealing articles, described as an "AI Wikipedia" by Arav Srinivas.
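Here is a rough sketch of the CoPE mechanism as described: sigmoid gates between the current query and each earlier key are cumulatively summed into context-dependent, fractional positions, which then index an interpolated position-embedding table added to the attention logits. Shapes, the maximum position, and the interpolation details below are simplified assumptions:

```python
# Sketch of Contextual Position Encoding (CoPE): gates g[i, j] = sigmoid(q_i . k_j)
# are summed from token j up to the current token i to give a context-dependent,
# fractional "position" p[i, j]; a learned embedding table is interpolated at that
# fractional position and added to the attention logits.
# Shapes, max position, and the causal handling are simplified assumptions.
import torch

def cope_logits(q, k, pos_emb):
    T = q.shape[0]
    causal = torch.tril(torch.ones(T, T))
    gates = torch.sigmoid(q @ k.T) * causal                  # (T, T), zero for future keys
    # p[i, j] = sum_{m=j..i} gates[i, m]  (reverse cumulative sum along the key axis)
    pos = torch.flip(torch.cumsum(torch.flip(gates, [-1]), -1), [-1]) * causal
    pos = pos.clamp(max=pos_emb.shape[0] - 1)
    lo, hi = pos.floor().long(), pos.ceil().long()
    frac = (pos - lo.float()).unsqueeze(-1)
    e = (1 - frac) * pos_emb[lo] + frac * pos_emb[hi]        # (T, T, d) interpolated embeddings
    return q @ k.T + torch.einsum("id,ijd->ij", q, e)        # content logits + positional logits

T, d, max_pos = 6, 16, 32
q, k = torch.randn(T, d), torch.randn(T, d)
print(cope_logits(q, k, torch.randn(max_pos, d)).shape)      # torch.Size([6, 6])
```

Because the gates depend on content, "position" can mean the i-th word, sentence, or matching token rather than the i-th token, which is what helps on counting and copying tasks.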
OpenAI's PR Campaign?
alphafold-3 xlstm gpt-4 openai microsoft google-deepmind memory-management model-spec scaling multimodality performance transformers dynamic-memory model-architecture demis-hassabis sama joanne-jang omarsar0 arankomatsuzaki drjimfan
OpenAI's new partnership with Stack Overflow drew backlash, with users deleting their data in protest, amid GDPR complaints and US newspaper lawsuits; meanwhile, the company addresses election-year concerns with efforts like the Media Manager tool for content opt-in/out by 2025 and source link attribution. Microsoft develops a top-secret airgapped GPT-4 AI service for US intelligence agencies. OpenAI releases the Model Spec outlining responsible AI content generation policies, including NSFW content handling and profanity use, emphasizing clear distinctions between bugs and design decisions. Google DeepMind announces AlphaFold 3, a state-of-the-art model predicting molecular structures with high accuracy, showcasing cross-domain AI techniques. New research on xLSTM proposes scaling LSTMs to billions of parameters, competing with transformers in performance and scaling. Microsoft introduces vAttention, a dynamic memory management method for efficient large language model serving without PagedAttention.
A quiet weekend
llama-3 dolphin-2.9 pixart-sigma llama-3-70b microsoft coca-cola uber lmsys nous-research mistral-ai ar-interfaces transformers algorithmic-tasks turing-test graph-algorithms embeddings generative-ai model-optimization llm-inference quantization model-deployment yann-lecun
Yann LeCun predicts a shift to AR interfaces with AI assistants in 10-15 years, moving away from smartphones. The Dolphin-2.9 model based on Llama-3 was released, improving quality issues. PixArt Sigma, a 0.6B parameter model, achieves Stable Diffusion 3.0 level performance with complete prompt adherence and local usability. Research shows transformers can use meaningless filler tokens for algorithmic tasks with dense supervision. AI-generated restaurant reviews can pass the Turing test, fooling humans and AI detectors. Uber uses graph algorithms and learned embeddings for ETA prediction. Coca-Cola and Microsoft announced a 5-year AI partnership to accelerate cloud and generative AI initiatives. The Llama-3 70B model can run on a single 4GB GPU using AirLLM optimization without quantization but is slow. Mistral.rs is introduced as a fast LLM inference platform with quantization and OpenAI API compatibility. Only about 5% of LLM applications make it from prototype to production due to challenges, especially in enterprise settings. EXL2 and GGUF quantization methods for Llama models show similar perplexity vs model size, with Llama-3 and Llama-2 degrading more under quantization compared to full precision.
Zero to GPT in 1 Year
gpt-4-turbo claude-3-opus mixtral-8x22b zephyr-141b medical-mt5 openai anthropic mistral-ai langchain hugging-face fine-tuning multilinguality tool-integration transformers model-evaluation open-source-models multimodal-llms natural-language-processing ocr model-training vik-paruchuri sam-altman greg-brockman mira-murati abacaj mbusigin akhaliq clementdelangue
GPT-4 Turbo reclaimed the top leaderboard spot with significant improvements in coding, multilingual, and English-only tasks, now rolled out in paid ChatGPT. Despite this, Claude Opus remains superior in creativity and intelligence. Mistral AI released the powerful open-source Mixtral-8x22B, with derivatives like Zephyr 141B well suited for fine-tuning. LangChain enhanced tool integration across models, and Hugging Face introduced Transformers.js for running transformers in browsers. Medical domain-focused Medical mT5 was shared as an open-source multilingual text-to-text model. The community also highlighted research on LLMs as regressors and shared practical advice on OCR/PDF data modeling from Vik Paruchuri's journey.
MetaVoice & RIP Bard
mixtral nous-mixtral-dpo miqu-70b gpt-4 llama-2-70b-instruct llama-2 llama-2-70b coqui metavoice google openai thebloke text-to-speech voice-cloning longform-synthesis prompt-engineering direct-preference-optimization lora-fine-tuning transformers gpu-acceleration apple-silicon content-authenticity metadata ai-censorship open-source-ai model-comparison usability model-limitations
In the wake of TTS startup Coqui shutting down, a small startup called MetaVoice released a new TTS model supporting voice cloning and longform synthesis. Google discontinued the Bard brand in favor of Gemini. On TheBloke Discord, discussions focused on AI training with models like Mixtral, Nous Mixtral DPO, and Miqu 70B, comparing them to OpenAI's GPT models, and debated prompt engineering, lorebooks, and removing safety features via LoRA fine-tuning on models such as Llama 2 70B Instruct. Technical topics included transformer layer offloading limitations and adapting Llama 2 for Apple Silicon. On OpenAI Discord, DALL-E images now include C2PA metadata for content authenticity, sparking debates on AI censorship, metadata manipulation, and open-source AI models versus commercial giants like GPT-4. Users discussed GPT-4 usability, limitations, and practical applications.
RIP Latent Diffusion, Hello Hourglass Diffusion
gpt-4 latent-diffusion stable-diffusion meta-ai-fair openai hugging-face diffusion-models transformers image-generation model-efficiency fine-tuning quantization prompt-engineering roleplay training-optimization katherine-crowson lucidrains
Katherine Crowson, known for her work on Stable Diffusion, introduces a hierarchical pure transformer backbone for diffusion-based image generation that efficiently scales to megapixel resolutions with under 600 million parameters, improving upon the original ~900M parameter model. This architecture processes local and global image phenomena separately, enhancing efficiency and resolution without latent steps. Additionally, Meta's Self Rewarding LM paper has inspired lucidrains to begin an implementation. Discord summaries highlight GPT-4's robustness against quantification tricks, discussions on open-source GPT-0 alternatives, challenges in DPO training on limited VRAM with suggestions like QLoRA and rmsprop, and efforts to improve roleplay model consistency through fine-tuning and merging. Philosophical debates on AI sentience and GPT-4 customization for markdown and translation tasks were also noted.
12/11/2023: Mixtral beats GPT3.5 and Llama2-70B
mixtral-8x7b gpt-4 gpt-3.5-turbo llama-3 openhermes-2.5 llava-v1.5-13b-gptq mistral-ai openai huggingface sparse-mixture-of-experts fine-tuning quantization gpu-hardware transformers model-deployment open-source coding-datasets
Mistral AI announced the Mixtral 8x7B model featuring a Sparse Mixture of Experts (SMoE) architecture, sparking discussions on its potential to rival GPT-4. The community debated GPU hardware options for training and fine-tuning transformer models, including RTX 4070s, A4500, RTX 3090s with NVLink, and A100 GPUs. Interest was expressed in fine-tuning Mixtral and generating quantized versions, alongside curating high-quality coding datasets. Resources shared include a YouTube video on open-source model deployment, an arXiv paper, GitHub repositories, and a blog post on Mixture-of-Experts. Discussions also touched on potential open-source releases of GPT-3.5 Turbo and Llama 3, and running OpenHermes 2.5 on Mac M3 Pro with VRAM considerations.
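For readers new to the Sparse Mixture of Experts design behind Mixtral 8x7B, here is a minimal sketch of top-2 routing over eight expert MLPs; the dimensions are illustrative and the auxiliary load-balancing loss is omitted:

```python
# Minimal sketch of Mixtral-style sparse MoE routing: a linear router scores 8
# expert MLPs per token, only the top-2 are executed, and their outputs are
# combined with renormalized router weights. Dimensions are illustrative and
# the auxiliary load-balancing loss is omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim: int = 128, hidden: int = 512, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (tokens, dim)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)  # (tokens, k) scores and expert ids
        weights = F.softmax(weights, dim=-1)                     # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                           # simple loop for clarity, not speed
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

print(SparseMoE()(torch.randn(16, 128)).shape)  # torch.Size([16, 128])
```

Because only two of the eight experts run per token, the active parameter count per token is far below the total, which is how Mixtral keeps inference cost closer to that of a much smaller dense model.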