All tags
Topic: "image-generation"
not much happened today
kernelllm-8b gpt-4o deepseek-v3 mistral-medium-3 qwen3 blip3-o xgen-small anisora stable-audio-open-small alphaevolve meta-ai-fair mistral-ai qwen deepseek salesforce bilibili stability-ai google benchmarking model-performance multilinguality hardware-optimization multimodality image-generation video-generation text-to-audio model-parallelism chain-of-thought instruction-following reasoning mitigation-strategies reach_vb lmarena_ai theadimeline adcock_brett jxmnop dair_ai omarsar0
Meta released KernelLLM 8B, outperforming GPT-4o and DeepSeek V3 on KernelBench-Triton Level 1. Mistral Medium 3 debuted strongly in multiple benchmarks. Qwen3 models introduced a unified framework with multilingual support. DeepSeek-V3 features hardware-aware co-design. BLIP3-o family released for multimodal tasks using diffusion transformers. Salesforce launched xGen-Small models excelling in long-context and math benchmarks. Bilibili released AniSORA for anime video generation. Stability AI open-sourced Stable Audio Open Small optimized for Arm devices. Google’s AlphaEvolve coding agent improved Strassen's algorithm for the first time since 1969. Research shows chain-of-thought reasoning can harm instruction-following ability, with mitigation strategies like classifier-selective reasoning being most effective, but reasoning techniques show high variance and limited generalization. "Chain-of-thought (CoT) reasoning can harm a model’s ability to follow instructions" and "Mitigation strategies such as few-shot in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning can counteract reasoning-induced failures".
Granola launches team notes, while Notion launches meeting transcription
gpt-4.1 gpt-4o-mini gpt-4.1-mini claude-opus claude-sonnet claude-o3 qwen3 seed1.5-vl llama-4 am-thinking-v1 openai anthropic alibaba meta-ai-fair huggingface granola coding instruction-following benchmarking model-releases reasoning image-generation collaborative-software model-performance kevinweil scaling01 steph_palazzolo andersonbcdefg reach_vb yuchenj_uw qtnx_ _akhaliq risingsayak
GPT-4.1 is now available in ChatGPT for Plus, Pro, and Team users, focusing on coding and instruction following, with GPT 4.1 mini replacing GPT 4o mini. Anthropic is releasing new Claude models including Claude Opus and Claude Sonnet, though some criticism about hallucinations in Claude O3 was noted. Alibaba shared the Qwen3 Technical Report with strong benchmark results from Seed1.5-VL. Meta FAIR announced new models and datasets but faced criticism on Llama 4. AM-Thinking-v1 launched on Hugging Face as a 32B scale reasoning model. Granola raised $43M in Series B and launched Granola 2.0 with a Notion-like UI. The AI ecosystem shows rapid iteration and cloning of ideas, emphasizing execution and distribution.
AI Engineer World's Fair: Second Run, Twice The Fun
gemini-2.5-pro google-deepmind waymo tesla anthropic braintrust retrieval-augmentation graph-databases recommendation-systems software-engineering-agents agent-reliability reinforcement-learning voice image-generation video-generation infrastructure security evaluation ai-leadership enterprise-ai mcp tiny-teams product-management design-engineering robotics foundation-models coding web-development demishassabis
The 2025 AI Engineer World's Fair is expanding with 18 tracks covering topics like Retrieval + Search, GraphRAG, RecSys, SWE-Agents, Agent Reliability, Reasoning + RL, Voice AI, Generative Media, Infrastructure, Security, and Evals. New focuses include MCP, Tiny Teams, Product Management, Design Engineering, and Robotics and Autonomy featuring foundation models from Waymo, Tesla, and Google. The event highlights the growing importance of AI Architects and enterprise AI leadership. Additionally, Demis Hassabis announced the Gemini 2.5 Pro Preview 'I/O edition', which leads coding and web development benchmarks on LMArena.
Cursor @ $9b, OpenAI Buys Windsurf @ $3b
llama-nemotron-ultra llama-nemotron-super llama-nemotron-nano qwen3-235b-a22b prover-v2 phi-4-reasoning ernie-4.5-turbo ernie-x1-turbo suno-v4.5 gen-4-references o1-mini openai cursor nvidia alibaba deepseek microsoft baidu suno runway keras reasoning inference-efficiency open-license moe-models math-reasoning theorem-proving model-performance music-generation image-generation recommender-systems tpu-optimization _akhaliq adcock_brett lmarena_ai fchollet
OpenAI is reportedly close to closing a deal with Windsurf, coinciding with Cursor's $900M funding round at a $9B valuation. Nvidia launched the Llama-Nemotron series featuring models from 8B to 253B parameters, praised for reasoning and inference efficiency. Alibaba released the Qwen3 family with MoE and dense models up to 235B parameters, ranking highly in coding and math benchmarks. DeepSeek introduced Prover-V2, an open-source AI for math reasoning with an 88.9% pass rate on MiniF2F-test. Microsoft released reasoning-focused Phi-4 models, outperforming OpenAI's o1-mini. Baidu debuted turbo versions of ERNIE 4.5 and X1 for faster, cheaper inference. Suno v4.5 added advanced AI music generation features, while Runway Gen-4 References enable placing characters into scenes with high consistency. KerasRS, a new recommender system library optimized for TPUs, was released by Fran ois Chollet.
not much happened today
gpt-image-1 o3 o4-mini gpt-4.1 dam openai google anthropic epoch ai research image-generation model-benchmarks vision-language-models music-ai ai-experiences ai-research supercomputers
AI news for April 23-24, 2025, covering new model releases, benchmarks, and research developments from companies like openai, google deepmind, anthropic, and epoch ai research.
gpt-image-1 - ChatGPT's imagegen model, confusingly NOT 4o, now available in API
gpt-image-1 o3 o4-mini gpt-4.1 eagle-2.5-8b gpt-4o qwen2.5-vl-72b openai nvidia hugging-face x-ai image-generation content-moderation benchmarking long-context multimodality model-performance supercomputing virology video-understanding model-releases kevinweil lmarena_ai _philschmid willdepue arankomatsuzaki epochairesearch danhendrycks reach_vb mervenoyann _akhaliq
OpenAI officially launched the gpt-image-1 API for image generation and editing, supporting features like alpha channel transparency and a "low" content moderation policy. OpenAI's models o3 and o4-mini are leading in benchmarks for style control, math, coding, and hard prompts, with o3 ranking #1 in several categories. A new benchmark called Vending-Bench reveals performance variance in LLMs on extended tasks. GPT-4.1 ranks in the top 5 for hard prompts and math. Nvidia's Eagle 2.5-8B matches GPT-4o and Qwen2.5-VL-72B in long-video understanding. AI supercomputer performance doubles every 9 months, with xAI's Colossus costing an estimated $7 billion and the US dominating 75% of global performance. The Virology Capabilities Test shows OpenAI's o3 outperforms 94% of expert virologists. Nvidia also released the Describe Anything Model (DAM), a multimodal LLM for detailed image and video captioning, now available on Hugging Face.
not much happened today
nemotron-h nvidia-eagle-2.5 gpt-4o qwen2.5-vl-72b gemini-2.5-flash gemini-2.0-pro gemini-exp-1206 gemma-3 qwen2.5-32b deepseek-r1-zero-32b uni3c seedream-3.0 adobe-dragon kimina-prover qwen2.5-72b bitnet-b1.58-2b4t nvidia deepseek hugging-face alibaba bytedance adobe transformers model-optimization multimodality long-context reinforcement-learning torch-compile image-generation diffusion-models distributional-rewards model-efficiency model-training native-quantization sampling-techniques philschmid arankomatsuzaki osanseviero iScienceLuvr akhaliq
Nemotron-H model family introduces hybrid Mamba-Transformer models with up to 3x faster inference and variants including 8B, 56B, and a compressed 47B model. Nvidia Eagle 2.5 is a frontier VLM for long-context multimodal learning, matching GPT-4o and Qwen2.5-VL-72B on long-video understanding. Gemini 2.5 Flash shows improved dynamic thinking and cost-performance, outperforming previous Gemini versions. Gemma 3 now supports torch.compile for about 60% faster inference on consumer GPUs. SRPO using Qwen2.5-32B surpasses DeepSeek-R1-Zero-32B on benchmarks with reinforcement learning only. Alibaba's Uni3C unifies 3D-enhanced camera and human motion controls for video generation. Seedream 3.0 by ByteDance is a bilingual image generation model with high-resolution outputs up to 2K. Adobe DRAGON optimizes diffusion generative models with distributional rewards. Kimina-Prover Preview is an LLM trained with reinforcement learning from Qwen2.5-72B, achieving 80.7% pass@8192 on miniF2F. BitNet b1.58 2B4T is a native 1-bit LLM with 2B parameters trained on 4 trillion tokens, matching full-precision LLM performance with better efficiency. Antidistillation sampling counters unwanted model distillation by modifying reasoning traces from frontier models.
DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level
deepcoder-14b o3-mini o1 gemini-2.5-pro kimi-vl-a3b gpt-4o llama-4-scout maverick behemoth gen-4-turbo imagen-3 together-ai agentica opena bytedance google-deepmind moonshot-ai meta-ai-fair runway open-source reinforcement-learning code-generation multimodality model-training mixture-of-experts l2-normalization image-generation model-performance context-windows philschmid lepikhin reach_vb akhaliq yuchenj_uw epochairesearch danielhanchen c_valenzuelab
Together AI and Agentica released DeepCoder-14B, an open-source 14B parameter coding model rivaling OpenAI's o3-mini and o1 on coding benchmarks, trained with an open-source RL framework from ByteDance and costing about $26,880. Google DeepMind launched Gemini 2.5 Pro with experimental "Flash" versions available to subscribers. Moonshot AI introduced Kimi-VL-A3B, a multimodal model with 128K context outperforming gpt-4o on vision and math benchmarks. Meta AI released Llama 4 Scout and Maverick, with a larger Behemoth model in training, featuring mixture-of-experts and L2 norm techniques. Runway launched Gen-4 Turbo with 10x better results than Gen-3 at the same cost. Google announced Imagen 3, a high-quality text-to-image model now in Vertex AI, enabling easier object removal. The report highlights open-source contributions, reinforcement learning training optimizations, and significant model performance improvements across coding, multimodal, and image generation domains.
not much happened today
gpt-2 r1 gemma-3 gemmacoder3-12b qwen2.5-omni openai deepseek berkeley alibaba togethercompute nvidia azure runway langchain bmw amazon open-source function-calling benchmarking code-reasoning multimodality inference-speed image-generation voice-generation animation robotics realtime-transcription webrtc sama clémentdelangue lioronai scaling01 cognitivecompai osanseviero jack_w_rae ben_burtenshaw theturingpost vipulved kevinweil tomlikesrobots adcock_brett juberti
OpenAI plans to release its first open-weight language model since GPT-2 in the coming months, signaling a move towards more open AI development. DeepSeek launched its open-source R1 model earlier this year, challenging perceptions of China's AI progress. Gemma 3 has achieved function calling capabilities and ranks on the Berkeley Function-Calling Leaderboard, while GemmaCoder3-12b improves code reasoning performance on LiveCodeBench. Alibaba_Qwen's Qwen2.5-Omni introduces a novel Thinker-Talker system and TMRoPE for multimodal input understanding. The TogetherCompute team achieved 140 TPS on a 671B parameter model, outperforming Azure and DeepSeek API on Nvidia GPUs. OpenAI also expanded ChatGPT features with image generation for all free users and a new voice release. Runway Gen-4 enhances animation for miniature dioramas, and LangChain launched a chat-based generative UI agent. Commercial deployment of Figure 03 humanoid robots at BMW highlights advances in autonomy and manufacturing scaling. New tools include OpenAI's realtime transcription API with WebRTC support and Amazon's Nova Act AI browser agent.
not much happened today
gpt-4o deepseek-v3 claude-3.7-sonnet o3-mini gemini-2.5-pro openai deepseek anthropic google-deepmind togethercompute hypertecgroup coreweave cursor-ai windsurf-ai coding instruction-following image-generation policy-compliance long-context audio-processing video-processing gpu-clusters ai-infrastructure api-access sama kevinweil joannejang nrehiew_ giffmana _philschmid scaling01 saranormous
GPT-4o was praised for its improved coding, instruction following, and freedom, becoming the leading non-reasoning coding model surpassing DeepSeek V3 and Claude 3.7 Sonnet in coding benchmarks, though it still lags behind reasoning models like o3-mini. Concerns about policy compliance in image generation were noted, with efforts to improve adherence. Gemini 2.5 Pro was highlighted for its advanced audio and video understanding, long context capabilities, and integration with platforms like Cursor AI and Windsurf AI. AI infrastructure developments include a partnership between Together AI and Hypertec Group to deliver large-scale GPU clusters, and CoreWeave's IPO was celebrated for advancing AI infrastructure. GPU and TPU usage is expected to increase significantly. "GPT-4o's transparency and background generation feature" and "Gemini 2.5 Pro scored above 50% on Simple-Bench AI Explanation" were key highlights.
not much happened today
gpt-4o deepseek-v3-0324 gemini-2.5-pro gemini-3 claude-3.7-sonnet openai hugging-face sambanova google-cloud instruction-following image-generation content-filtering model-performance api coding model-deployment benchmarking model-release abacaj nrehiew_ sama joannejang giffmana lmarena_ai _philschmid
OpenAI announced the new GPT-4o model with enhanced instruction-following, complex problem-solving, and native image generation capabilities. The model shows improved performance in math, coding, and creativity, with features like transparent background image generation. Discussions around content filtering and policy for image generation emphasize balancing creative freedom and harm prevention. DeepSeek V3-0324 APIs, available on Hugging Face and powered by SambaNovaAI, outperform benchmarks and models like Gemini 2.0 Pro and Claude 3.7 Sonnet. Gemini 2.5 Pro is recommended for coding, and Gemini 3 can be deployed easily on Google Cloud Vertex AI via the new Model Garden SDK. The Gemma 3 Technical Report has been released on arXiv.
not much happened today
gemini-2.0-flash imagen-3 mistral-small-3.1 mistral-3 gpt-4o-mini claude-3.5-haiku olm0-32b qwen-2.5 shieldgemma-2 julian fasttransform nvidia google mistral-ai allen-ai anthropic langchainai perplexity-ai kalshi stripe qodoai multimodality image-generation context-windows model-pricing open-source-models image-classification frameworks python-libraries partnerships jeremyphoward karpathy abacaj mervenoyann
At Nvidia GTC Day 1, several AI updates were highlighted: Google's Gemini 2.0 Flash introduces image input/output but is not recommended for text-to-image tasks, with Imagen 3 preferred for that. Mistral AI released Mistral Small 3.1 with 128k token context window and competitive pricing. Allen AI launched OLMo-32B, an open LLM outperforming GPT-4o mini and Qwen 2.5. ShieldGemma 2 was introduced for image safety classification. LangChainAI announced multiple updates including Julian powered by LangGraph and integration with AnthropicAI's MCP. Jeremy Howard released fasttransform, a Python library for data transformations. Perplexity AI partnered with Kalshi for NCAA March Madness predictions.
not much happened today
gemini-2.0-flash-thinking command-a qwq-32b gemma-3-27b gemma-3 shieldgemma-2 llama-3-70b deepseek-r1 o1-mini deepseek-v3 google-deepmind cohere meta-ai-fair alibaba hugging-face model-updates model-performance benchmarking reinforcement-learning transformers normalization-layers image-generation vision memory-efficiency context-windows fine-tuning yann-lecun
Google DeepMind announced updates to Gemini 2.0, including an upgraded Flash Thinking model with stronger reasoning and native image generation capabilities. Cohere launched Command A, a 111B parameter dense model with a 256K context window and competitive pricing, available on Hugging Face. Meta AI proposed Dynamic Tanh (DyT) as a replacement for normalization layers in Transformers, supported by Yann LeCun. Alibaba released QwQ-32B, a 32.5B parameter model excelling in math and coding, fine-tuned with reinforcement learning and freely available under Apache 2.0 license. Google DeepMind also released Gemma 3 models ranging from 1B to 27B parameters with a 128K token context window and over 140 language support, plus ShieldGemma 2, an image safety checker. Benchmarking shows Gemma 3 27B has strong vision and memory efficiency but is outperformed by larger models like Llama 3.3 70B and DeepSeek V3 671B. The Hugging Face LLM leaderboard history was shared by @_lewtun.
Gemma 3 beats DeepSeek V3 in Elo, 2.0 Flash beats GPT4o with Native Image Gen
gemma-3 gemini-1.5-pro gemini-2 o1-preview o3-mini-high deepseek-v3 claude-3.7-sonnet qwen-2.5-max google-deepmind openai multimodality multilinguality context-window quantization image-generation model-benchmarking model-performance vision reach_vb _philschmid danielhanchen lmarena_ai osanseviero
Google DeepMind launched the Gemma 3 family of models featuring a 128k context window, multimodal input (image and video), and multilingual support for 140+ languages. The Gemma 3-27B model ranks among the top open models on LMArena benchmarks, outperforming several competitors and matching Gemini-1.5-Pro on benchmarks. Additionally, Gemini 2 introduced Flash Native Image Generation with advanced image editing capabilities, a feature teased by OpenAI but not launched. The updates highlight significant advances in context length, multimodality, and model efficiency via quantization.
not much happened today
aya-vision-8b aya-vision-32b llama-3-2-90b-vision molmo-72b phi-4-mini phi-4-multimodal cogview4 wan-2-1 weights-and-biases coreweave cohereforai microsoft alibaba google llamaindex weaviate multilinguality vision multimodality image-generation video-generation model-releases benchmarking funding agentic-ai model-performance mervenoyann reach_vb jayalammar sarahookr aidangomez nickfrosst dair_ai akhaliq bobvanluijt jerryjliu0
Weights and Biases announced a $1.7 billion acquisition by CoreWeave ahead of CoreWeave's IPO. CohereForAI released the Aya Vision models (8B and 32B parameters) supporting 23 languages, outperforming larger models like Llama-3.2 90B Vision and Molmo 72B. Microsoft introduced Phi-4-Mini (3.8B parameters) and Phi-4-Multimodal models, excelling in math, coding, and multimodal benchmarks. CogView4, a 6B parameter text-to-image model with 2048x2048 resolution and Apache 2.0 license, was released. Alibaba launched Wan 2.1, an open-source video generation model with 720p output and 16 fps generation. Google announced new AI features for Pixel devices including Scam Detection and Gemini integrations. LlamaCloud reached General Availability and raised $19M Series A funding, serving over 100 Fortune 500 companies. Weaviate launched the Query Agent, the first of three Weaviate Agents.
not much happened today
deepseek-r1 qwen-2.5 qwen-2.5-max deepseek-v3 deepseek-janus-pro gpt-4 nvidia anthropic openai deepseek huawei vercel bespoke-labs model-merging multimodality reinforcement-learning chain-of-thought gpu-optimization compute-infrastructure compression crypto-api image-generation saranormous zizhpan victormustar omarsar0 markchen90 sakanaailabs reach_vb madiator dain_mclau francoisfleuret garygodchaux arankomatsuzaki id_aa_carmack lavanyasant virattt
Huawei chips are highlighted in a diverse AI news roundup covering NVIDIA's stock rebound, new open music foundation models like Local Suno, and competitive AI models such as Qwen 2.5 Max and Deepseek V3. The release of DeepSeek Janus Pro, a multimodal LLM with image generation capabilities, and advancements in reinforcement learning and chain-of-thought reasoning are noted. Discussions include GPU rebranding with NVIDIA's H6400 GPUs, data center innovations, and enterprise AI applications like crypto APIs in hedge funds. "Deepseek R1's capabilities" and "Qwen 2.5 models added to applications" are key highlights.
$200 ChatGPT Pro and o1-full/pro, with vision, without API, and mixed reviews
o1 o1-pro claude-3.5-sonnet pali-gemma-2 openai google llamaindex multimodality vision fine-tuning benchmarking model-performance image-generation document-processing model-release sama bindureddy mervenoyann fchollet
OpenAI launched the o1 model with multimodal capabilities, faster reasoning, and image input support, marking it as a state-of-the-art model despite some bugs and mixed community reviews. The new o1-pro tier offers unlimited access for $200/month with notable benchmark improvements but some performance trade-offs compared to claude-3.5-sonnet. Google released the PaliGemma 2 vision-language model family in sizes 3B, 10B, and 28B, excelling in visual question answering, image segmentation, and OCR, with day-0 support for fine-tuning. LlamaIndex announced discounts and feature updates for large-scale document processing. The AI community also reacted humorously to the new pricing tiers and model comparisons. "o1 can see now, which makes it the SOTA multimodal model" and "most users will be best served by free/Plus tiers" were notable sentiments.
not much happened today
o1-full sora gpt-4.5 gpt-4 claude-3.5-sonnet llama-3-1-nemotron-51b llama-3-1 llama-3 nemotron-51b openai google-deepmind anthropic nvidia huggingface vision model-performance neural-architecture-search model-optimization multimodality model-release model-training reinforcement-learning image-generation lucas-beyer alexander-kolesnikov xiaohua-zhai aidan_mclau giffmana joannejang sama
OpenAI announced their "12 Days of OpenAI" event with daily livestreams and potential releases including the O1 full model, Sora video model, and GPT-4.5. Google DeepMind released the GenCast weather model capable of 15-day forecasts in 8 minutes using TPU chips, and launched Genie 2, a model generating playable 3D worlds from single images. Leading vision researchers Lucas Beyer, Alexander Kolesnikov, and Xiaohua Zhai moved from DeepMind to OpenAI, which is opening a Zürich office. Criticism arose over OpenAI's strategy and model quality compared to Anthropic and Claude 3.5 Sonnet. On Reddit, a modified llama.cpp supports Nvidia's Llama-3_1-Nemotron-51B, matching performance of larger 70B models via NAS optimization.
Vision Everywhere: Apple AIMv2 and Jina CLIP v2
aimv2-3b jina-clip-v2 tulu-3 llama-3-1 claude-3-5 llama-3-1-70b apple jina allen_ai autoregressive-objectives vision multilinguality multimodality image-generation model-training model-optimization reinforcement-learning fine-tuning model-benchmarking
Apple released AIMv2, a novel vision encoder pre-trained with autoregressive objectives that achieves 89.5% accuracy on ImageNet and integrates joint visual and textual objectives. Jina launched Jina CLIP v2, a multimodal embedding model supporting 89 languages and high-resolution images with efficient Matryoshka embeddings reducing dimensions by 94% with minimal accuracy loss. Allen AI introduced Tülu 3 models based on Llama 3.1 with 8B and 70B parameters, offering 2.5x faster inference and alignment via SFT, DPO, and RLVR methods, competing with Claude 3.5 and Llama 3.1 70B. These developments highlight advances in autoregressive training, vision encoders, and multilingual multimodal embeddings.
Perplexity starts Shopping for you
pixtral-large-124b llama-3.1-405b claude-3.6 claude-3.5 stripe perplexity-ai mistral-ai hugging-face cerebras anthropic weights-biases google vllm-project multi-modal image-generation inference context-windows model-performance model-efficiency sdk ai-integration one-click-checkout memory-optimization patrick-collison jeff-weinstein mervenoyann sophiamyang tim-dettmers omarsar0 akhaliq aravsrinivas
Stripe launched their Agent SDK, enabling AI-native shopping experiences like Perplexity Shopping for US Pro members, featuring one-click checkout and free shipping via the Perplexity Merchant Program. Mistral AI released the Pixtral Large 124B multi-modal image model, now on Hugging Face and supported by Le Chat for image generation. Cerebras Systems offers a public inference endpoint for Llama 3.1 405B with a 128k context window and high throughput. Claude 3.6 shows improvements over Claude 3.5 but with subtle hallucinations. The Bi-Mamba 1-bit architecture improves LLM efficiency. The wandb SDK is preinstalled on Google Colab, and Pixtral Large is integrated into AnyChat and supported by vLLM for efficient model usage.
Common Corpus: 2T Open Tokens with Provenance
qwen-2.5-coder claude-3.5-sonnet janusflow-1.3b ocronos-vintage pleais huggingface langchainai deepseek alibaba anthropic provenance ocr multilingual-datasets prompt-engineering multimodality image-generation code-generation quantization model-scaling inference-efficiency tim-dettmers tom-doerr omarsar0 swyx madiator reach_vb
Pleais via Huggingface released Common Corpus, the largest fully open multilingual dataset with over 2 trillion tokens including detailed provenance information. They also introduced OCRonos-Vintage, a 124M-parameter OCR correction model that efficiently fixes digitization errors on CPU and GPU, unlocking knowledge from PDFs. On AI tools, LangChainAI launched Prompt Canvas for collaborative prompt engineering, while DeepSeek released JanusFlow 1.3B, a unified multimodal LLM integrating autoregressive and rectified flow models for enhanced image understanding and generation. Alibaba Cloud announced Qwen2.5-Coder, a code-focused LLM with advanced coding capabilities, and Claude 3.5 Sonnet was highlighted for superior code generation. Discussions on quantization challenges and scaling laws for precision by Tim Dettmers and others emphasized the impact of low-precision training on model scalability and inference efficiency. "Scaling Laws for Precision" paper insights and alternative efficiency methods were also noted.
s{imple|table|calable} Consistency Models
llama-3-70b llama-3-405b llama-3-1 stable-diffusion-3.5 gpt-4 stability-ai tesla cerebras cohere langchain model-distillation diffusion-models continuous-time-consistency-models image-generation ai-hardware inference-speed multilingual-models yang-song
Model distillation significantly accelerates diffusion models, enabling near real-time image generation with only 1-4 sampling steps, as seen in BlinkShot and Flux Schnell. Research led by Yang Song introduced simplified continuous-time consistency models (sCMs), achieving under 10% FID difference in just 2 steps and scaling up to 1.5B parameters for higher quality. On AI hardware, Tesla is deploying a 50k H100 cluster potentially capable of completing GPT-4 training in under three weeks, while Cerebras Systems set a new inference speed record on Llama 3.1 70B with their wafer-scale AI chips. Stability AI released Stable Diffusion 3.5 and its Turbo variant, and Cohere launched new multilingual models supporting 23 languages with state-of-the-art performance. LangChain also announced ecosystem updates.
DeepSeek Janus and Meta SpiRit-LM: Decoupled Image and Expressive Voice Omnimodality
nemotron-70b claude claude-3.5-sonnet gpt-4o deepseek meta-ai-fair wandb nvidia anthropic hugging-face perplexity-ai multimodality image-generation speech-synthesis fine-tuning model-merging benchmarking open-source model-optimization reinforcement-learning bindureddy aravsrinivas danielhanchen clementdelangue cwolferesearch
DeepSeek Janus and Meta SpiRit-LM are two notable multimodality AI models recently released, showcasing advances in image generation and speech synthesis respectively. DeepSeek Janus separates vision encoders for image understanding and generation, achieving better results in both tasks. Meta's SpiRit-LM introduces an expressive speech and writing model generating pitch and style units, improving over standard TTS. Additionally, W&B Weave offers comprehensive LLM observability and multimodality fine-tuning tools. Industry updates include Nvidia's Nemotron 70b model underperforming, Meta open-sourcing Movie Gen Bench for media generation benchmarking, Perplexity launching internal search with multi-step reasoning, and Anthropic updating Claude apps. Open source progress includes Hugging Face's gradient accumulation fix in transformers and advocacy for open source AI to prevent Big Tech dominance. "Model merging for combining skills of multiple models" is also highlighted.
Replit Agent - How did everybody beat Devin to market?
jpeg-lm avc-lm replit anthropic togethercompute document-retrieval retrieval-augmented-generation ai-agents image-generation video-generation context-windows gpu-pricing enterprise-ai self-healing text-to-music andrej-karpathy mervenoyann bindureddy rohanpaul_ai leptonai teortaxestex
Replit Agent launched as a fully integrated Web IDE enabling text-to-app generation with planning and self-healing, available immediately to paid users without a waitlist. Other notable developments include Melodio, a new text-to-music model, and Together AI's kernel and speculative decoding work. Anthropic AI announced a new enterprise plan featuring a 500K context window and enhanced security. Discussions on JPEG-LM and AVC-LM models for improved image and video generation, and GPU market trends around the H100 GPU pricing were highlighted. Influential voices like Andrej Karpathy shared insights on AI agents and automation.
Ideogram 2 + Berkeley Function Calling Leaderboard V2
llama-3-70b gpt-4 phi-3.5 functionary-llama-3-70b llama-3 ideogram midjourney berkeley openai hugging-face microsoft meta-ai-fair baseten kai claude functionary function-calling benchmarking image-generation model-optimization vision multimodality model-performance fine-tuning context-windows cybersecurity code-analysis ai-assisted-development
Ideogram returns with a new image generation model featuring color palette control, a fully controllable API, and an iOS app, reaching a milestone of 1 billion images created. Meanwhile, Midjourney released a Web UI but still lacks an API. In function calling, the Berkeley Function Calling Leaderboard (BFCL) updated to BFCL V2 • Live, adding 2251 live, user-contributed function documentation and queries to improve evaluation quality. GPT-4 leads the leaderboard, but the open-source Functionary Llama 3-70B finetune from Kai surpasses Claude. On AI model releases, Microsoft launched three Phi-3.5 models with impressive reasoning and context window capabilities, while Meta AI FAIR introduced UniBench, a unified benchmark suite for over 50 vision-language model tasks. Baseten improved Llama 3 inference speed by up to 122% using Medusa. A new cybersecurity benchmark, Cyberbench, featuring 40 CTF tasks, was released. Additionally, Codegen was introduced as a tool for programmatic codebase analysis and AI-assisted development. "Multiple functions > parallel functions" was highlighted as a key insight in function calling.
The DSPy Roadmap
dspy litel-lm gemini chatgpt-4o grok-2 hermes-3 databricks mit google openai x-ai nous-research astribot apple sakana-ai model-optimization fine-tuning optimizers interactive-optimization robotics autonomous-systems voice image-generation open-source-models scientific-research streaming caching omar-khattab giffmana
Omar Khattab announced joining Databricks before his MIT professorship and outlined the roadmap for DSPy 2.5 and 3.0+, focusing on improving core components like LMs, signatures, optimizers, and assertions with features such as adopting LiteLLM to reduce code and enhance caching and streaming. The roadmap also includes developing more accurate, cost-effective optimizers, building tutorials, and enabling interactive optimization tracking. On AI Twitter, Google launched Gemini Live, a mobile conversational AI with voice and 10 voices, alongside Pixel Buds Pro 2 with a custom Tensor A1 chip. OpenAI updated ChatGPT-4o, reclaiming the top spot on LMSYS Arena. xAI released Grok-2 in beta, achieving SOTA in image generation with FLUX 1. Nous Research released open-source Hermes 3 models in 8B, 70B, and 405B sizes, with the 405B model achieving SOTA. Robotics updates include Astribot's humanoid robot and Apple's tabletop robot with Siri voice commands. Sakana AI introduced "The AI Scientist," an autonomous AI research system.
not much happened today
llama-3 llama-3-1 grok-2 claude-3.5-sonnet gpt-4-turbo nous-research nvidia salesforce goodfire-ai anthropic x-ai google-deepmind box langchain fine-tuning prompt-caching mechanistic-interpretability model-performance multimodality agent-frameworks software-engineering-agents api document-processing text-generation model-releases vision image-generation efficiency scientific-discovery fchollet demis-hassabis
GPT-5 delayed again amid a quiet news day. Nous Research released Hermes 3 finetune of Llama 3 base models, rivaling FAIR's instruct tunes but sparking debate over emergent existential crisis behavior with 6% roleplay data. Nvidia introduced Minitron finetune of Llama 3.1. Salesforce launched a DEI agent scoring 55% on SWE-Bench Lite. Goodfire AI secured $7M seed funding for mechanistic interpretability work. Anthropic rolled out prompt caching in their API, cutting input costs by up to 90% and latency by 80%, aiding coding assistants and large document processing. xAI released Grok-2, matching Claude 3.5 Sonnet and GPT-4 Turbo on LMSYS leaderboard with vision+text inputs and image generation integration. Claude 3.5 Sonnet reportedly outperforms GPT-4 in coding and reasoning. François Chollet defined intelligence as efficient operationalization of past info for future tasks. Salesforce's DEI framework surpasses individual agent performance. Google DeepMind's Demis Hassabis discussed AGI's role in scientific discovery and safe AI development. Dora AI plugin generates landing pages in under 60 seconds, boosting web team efficiency. Box AI API beta enables document chat, data extraction, and content summarization. LangChain updated Python & JavaScript integration docs.
Cursor reaches >1000 tok/s finetuning Llama3-70b for fast file editing
gpt-4 gpt-4o gpt-4-turbo gpt-4o-mini llama bloom stable-diffusion cursor openai anthropic google-deepmind huggingface speculative-decoding code-edits multimodality image-generation streaming tool-use fine-tuning benchmarking mmlu model-performance evaluation synthetic-data context-windows sama abacaj imjaredz erhartford alexalbert svpino maximelabonne _philschmid
Cursor, an AI-native IDE, announced a speculative edits algorithm for code editing that surpasses GPT-4 and GPT-4o in accuracy and latency, achieving speeds of over 1000 tokens/s on a 70b model. OpenAI released GPT-4o with multimodal capabilities including audio, vision, and text, noted to be 2x faster and 50% cheaper than GPT-4 turbo, though with mixed coding performance. Anthropic introduced streaming, forced tool use, and vision features for developers. Google DeepMind unveiled Imagen Video and Gemini 1.5 Flash, a small model with a 1M-context window. HuggingFace is distributing $10M in free GPUs for open-source AI models like Llama, BLOOM, and Stable Diffusion. Evaluation insights highlight challenges with LLMs on novel problems and benchmark saturation, with new benchmarks like MMLU-Pro showing significant drops in top model performance.
Google I/O in 60 seconds
gemini-1.5-pro gemini-flash gemini-ultra gemini-pro gemini-nano gemma-2 llama-3-70b paligemma imagen-3 veo google google-deepmind youtube tokenization model-performance fine-tuning vision multimodality model-release model-training model-optimization ai-integration image-generation watermarking hardware-optimization voice video-understanding
Google announced updates to the Gemini model family, including Gemini 1.5 Pro with 2 million token support, and the new Gemini Flash model optimized for speed with 1 million token capacity. The Gemini suite now includes Ultra, Pro, Flash, and Nano models, with Gemini Nano integrated into Chrome 126. Additional Gemini features include Gemini Gems (custom GPTs), Gemini Live for voice conversations, and Project Astra, a live video understanding assistant. The Gemma model family was updated with Gemma 2 at 27B parameters, offering near-llama-3-70b performance at half the size, plus PaliGemma, a vision-language open model inspired by PaLI-3. Other launches include DeepMind's Veo, Imagen 3 for photorealistic image generation, and a Music AI Sandbox collaboration with YouTube. SynthID watermarking now extends to text, images, audio, and video. The Trillium TPUv6 codename was revealed. Google also integrated AI across its product suite including Workspace, Email, Docs, Sheets, Photos, Search, and Lens. "The world awaits Apple's answer."
Not much happened today
command-r-35b goliath-120 miqu-120 llama-3-8b tensorrt-llm llama-cpp gpt2-chat gpt-4-turbo llama-3 deepmind-alphazero anthropic openai perplexity-ai amazon apple microsoft deepmind creative-writing context-windows benchmarking model-performance self-learning function-calling retrieval-augmented-generation ai-assistants on-device-ai ai-lobbying copyright-infringement code-reasoning image-generation
Anthropic released a team plan and iOS app about 4 months after OpenAI. The Command-R 35B model excels at creative writing, outperforming larger models like Goliath-120 and Miqu-120. The Llama-3 8B model now supports a 1 million token context window, improving long-context understanding with minimal training on a single 8xA800 GPU machine. TensorRT-LLM benchmarks show it is 30-70% faster than llama.cpp on consumer hardware. A benchmark suggests GPT2-Chat may have better reasoning than GPT-4-Turbo, though results are debated. Demos include a self-learning Llama-3 voice agent running locally on Jetson Orin and a Self-Learning Large Action Model (LAM). Amazon CodeWhisperer was renamed to Q Developer, expanding its generative AI assistant capabilities. Apple plans an AI-enabled Safari browser with an on-device LLM in iOS 18 and macOS 15. Big Tech dominates AI lobbying in Washington, while major U.S. newspapers sued OpenAI and Microsoft for copyright infringement. DeepMind's AlphaZero became the greatest chess player in 9 hours, and their Naturalized Execution Tuning (NExT) method improves LLM code reasoning by 14-26%. Stable Diffusion is used for diverse image generation applications.
LLMs-as-Juries
gpt-4 gpt-3.5 sdxl ponyxl openai cohere financial-times memory training-data model-usage-limits data-cleansing ai-voice-assistants interface-agents image-generation model-extensions multi-agent-systems
OpenAI has rolled out the memory feature to all ChatGPT Plus users and partnered with the Financial Times to license content for AI training. Discussions on OpenAI's profitability arise due to paid training data licensing and potential GPT-4 usage limit reductions. Users report issues with ChatGPT's data cleansing after the memory update. Tutorials and projects include building AI voice assistants and interface agents powered by LLMs. In Stable Diffusion, users seek realistic SDXL models comparable to PonyXL, and new extensions like Hi-diffusion and Virtuoso Nodes v1.1 enhance ComfyUI with advanced image generation and Photoshop-like features. Cohere finds that multiple agents outperform single agents in LLM judging tasks, highlighting advances in multi-agent systems.
Snowflake Arctic: Fully Open 10B+128x4B Dense-MoE Hybrid LLM
snowflake-arctic phi-3 llama-3-70b llama-3 stable-diffusion-3 sd3-turbo gpt-3.5-turbo snowflake databricks deepseek deepspeed nvidia stable-diffusion adobe apple llamaindex lmsys openai mixture-of-experts curriculum-learning model-release image-generation video-upscaling quantization inference-speed benchmarking model-comparison open-source on-device-ai
Snowflake Arctic is a notable new foundation language model released under Apache 2.0, claiming superiority over Databricks in data warehouse AI applications and adopting a mixture-of-experts architecture inspired by DeepSeekMOE and DeepSpeedMOE. The model employs a 3-stage curriculum training strategy similar to the recent Phi-3 paper. In AI image and video generation, Nvidia introduced the Align Your Steps technique improving image quality at low step counts, while Stable Diffusion 3 and SD3 Turbo models were compared for prompt understanding and image quality. Adobe launched an AI video upscaling project enhancing blurry videos to HD, though with some high-resolution artifacts. Apple released open-source on-device language models with code and training logs, diverging from typical weight-only releases. The Llama-3-70b model ties for first place on the LMSYS leaderboard for English queries, and Phi-3 (4B params) outperforms GPT-3.5 Turbo in the banana logic benchmark. Fast inference and quantization of Llama 3 models were demonstrated on MacBook devices.
Llama-3-70b is GPT-4-level Open Model
llama-3-70b llama-3-8b llama-3 llama-2-70b mistral-7b grok-3 stable-diffusion-3 vasa-1 meta-ai-fair groq nvidia amazon microsoft benchmarking model-performance fine-tuning function-calling arithmetic image-generation video-generation energy-usage gpu-demand political-bias ai-safety scaling context-windows tokenization elon-musk
Meta has released Llama 3, their most capable open large language model with 8B and 70B parameter versions supporting 8K context length and outperforming previous models including Llama 2 and Mistral 7B. Groq serves the Llama 3 70B model at 500-800 tokens/second, making it the fastest GPT-4-level token source. Discussions highlight AI scaling challenges with Elon Musk stating that training Grok 3 will require 100,000 Nvidia H100 GPUs, and AWS planning to acquire 20,000 B200 GPUs for a 27 trillion parameter model. Microsoft unveiled VASA-1 for lifelike talking face generation, while Stable Diffusion 3 and its extensions received mixed impressions. Concerns about AI energy usage and political bias in AI were also discussed.
Mergestral, Meta MTIAv2, Cohere Rerank 3, Google Infini-Attention
mistral-8x22b command-r-plus rerank-3 infini-attention llama-3 sd-1.5 cosxl meta-ai-fair mistral-ai cohere google stability-ai hugging-face ollama model-merging training-accelerators retrieval-augmented-generation linear-attention long-context foundation-models image-generation rag-pipelines model-benchmarking context-length model-performance aidan_gomez ylecun swyx
Meta announced their new MTIAv2 chips designed for training and inference acceleration with improved architecture and integration with PyTorch 2.0. Mistral released the 8x22B Mixtral model, which was merged back into a dense model to effectively create a 22B Mistral model. Cohere launched Rerank 3, a foundation model enhancing enterprise search and retrieval-augmented generation (RAG) systems supporting 100+ languages. Google published a paper on Infini-attention, an ultra-scalable linear attention mechanism demonstrated on 1B and 8B models with 1 million sequence length. Additionally, Meta's Llama 3 is expected to start rolling out soon. Other notable updates include Command R+, an open model surpassing GPT-4 in chatbot performance with 128k context length, and advancements in Stable Diffusion models and RAG pipelines.
ReALM: Reference Resolution As Language Modeling
flan-t5 gpt-4 apple openai hugging-face stability-ai reference-resolution finetuning quantization retrieval-augmented-generation open-source coding-agents podcast-generation image-generation ai-industry-trends takuto-takizawa
Apple is advancing in AI with a new approach called ReALM: Reference Resolution As Language Modeling, which improves understanding of ambiguous references using three contexts and finetunes a smaller FLAN-T5 model that outperforms GPT-4 on this task. In Reddit AI news, an open-source coding agent SWE-agent achieves 12.29% on the SWE-bench benchmark, and RAGFlow introduces a customizable retrieval-augmented generation engine. A new quantization method, QuaRot, enables efficient 4-bit inference. AI applications include a t-shirt design generator, podgenai for GPT-4 based podcast generation, and an open-source model from HuggingFace that runs without a GPU. Industry discussions focus on the impact of large language models on the AI field and efforts to decentralize AI development. Takuto Takizawa joins Stability AI Japan as Head of Sales & Partnerships.
AdamW -> AaronD?
claude-3-opus llama-3 llama-3-300m bert-large stable-diffusion-1.5 wdxl openai hugging-face optimizer machine-learning-benchmarks vision time-series-forecasting image-generation prompt-injection policy-enforcement aaron-defazio
Aaron Defazio is gaining attention for proposing a potential tuning-free replacement of the long-standing Adam optimizer, showing promising experimental results across classic machine learning benchmarks like ImageNet ResNet-50 and CIFAR-10/100. On Reddit, Claude 3 Opus has surpassed all OpenAI models on the LMSys leaderboard, while a user pretrained a LLaMA-based 300M model outperforming bert-large on language modeling tasks with a modest budget. The new MambaMixer architecture demonstrates promising results in vision and time series forecasting. In image generation, Stable Diffusion 1.5 with LoRAs achieves realistic outputs, and the WDXL release showcases impressive capabilities. AI applications include an AI-generated Nike spec ad and a chatbot built with OpenAI models that may resist prompt injections. OpenAI is reportedly planning a ban wave targeting policy violators and jailbreak users. "The high alpha seems to come from Aaron Defazio," highlighting his impactful work in optimizer research.
Jamba: Mixture of Architectures dethrones Mixtral
jamba dbrx mixtral animatediff fastsd sdxs512-0.9 b-lora supir ai21-labs databricks together-ai hugging-face midjourney mixture-of-experts model-architecture context-windows model-optimization fine-tuning image-generation video-generation cpu-optimization style-content-separation high-resolution-upscaling
AI21 labs released Jamba, a 52B parameter MoE model with 256K context length and open weights under Apache 2.0 license, optimized for single A100 GPU performance. It features a unique blocks-and-layers architecture combining transformer and MoE layers, competing with models like Mixtral. Meanwhile, Databricks introduced DBRX, a 36B active parameter MoE model trained on 12T tokens, noted as a new standard for open LLMs. In image generation, advancements include Animatediff for video-quality image generation and FastSD CPU v1.0.0 beta 28 enabling ultra-fast image generation on CPUs. Other innovations involve style-content separation using B-LoRA and improvements in high-resolution image upscaling with SUPIR.
Andrew likes Agents
gpt-3.5 gpt-4 cyberrealistic_v40 platypus-xl sdxl-lightning openai stability-ai agents human-eval-benchmark fine-tuning local-llm-deployment inference-speed image-generation lora upscaling workflow-optimization andrew-ng lilian-weng emad
Andrew Ng's The Batch writeup on Agents highlighted the significant improvement in coding benchmark performance when using an iterative agent workflow, with GPT-3.5 wrapped in an agent loop achieving up to 95.1% correctness on HumanEval, surpassing GPT-4 zero-shot at 67.0%. The report also covers new developments in Stable Diffusion models like Cyberrealistic_v40, Platypus XL, and SDXL Lightning for Naruto-style image generation, alongside innovations in LoRA and upscaling techniques. Discussions on local LLM deployment and optimization focus on hardware setups and finetuning strategies for efficient inference and multi-user serving. Emad's departure from Stability AI and new Sora videos from OpenAI were also noted.
Not much happened today
claude-3 claude-3-opus claude-3-sonnet gpt-4 gemma-2b anthropic perplexity langchain llamaindex cohere accenture mistral-ai snowflake together-ai hugging-face european-space-agency google gpt4all multimodality instruction-following out-of-distribution-reasoning robustness enterprise-ai cloud-infrastructure open-datasets model-deployment model-discoverability generative-ai image-generation
Anthropic released Claude 3, replacing Claude 2.1 as the default on Perplexity AI, with Claude 3 Opus surpassing GPT-4 in capability. Debate continues on whether Claude 3's performance stems from emergent properties or pattern matching. LangChain and LlamaIndex added support for Claude 3 enabling multimodal and tool-augmented applications. Despite progress, current models still face challenges in out-of-distribution reasoning and robustness. Cohere partnered with Accenture for enterprise AI search, while Mistral AI and Snowflake collaborate to provide LLMs on Snowflake's platform. Together AI Research integrates Deepspeed innovations to accelerate generative AI infrastructure. Hugging Face and the European Space Agency released a large earth observation dataset, and Google open sourced Gemma 2B, optimized for smartphones via the MLC-LLM project. GPT4All improved model discoverability for open models. The AI community balances excitement over new models with concerns about limitations and robustness, alongside growing enterprise adoption and open-source contributions. Memes and humor continue to provide social commentary.
Stable Diffusion 3 — Rombach & Esser did it again!
stable-diffusion-3 claude-3 orca dolphincoder-starcoder2-15b stability-ai anthropic microsoft latitude perplexity-ai llamaindex tripo-ai diffusion-models multimodality benchmarking human-evaluation text-generation image-generation 3d-modeling fine-tuning roleplay coding dataset-release soumith-chintala bill-peebles swyx kevinafischer jeremyphoward akhaliq karinanguyen_ aravsrinivas
Over 2500 new community members joined following Soumith Chintala's shoutout, highlighting growing interest in SOTA LLM-based summarization. The major highlight is the detailed paper release of Stable Diffusion 3 (SD3), showcasing advanced text-in-image control and complex prompt handling, with the model outperforming other SOTA image generation models in human-evaluated benchmarks. The SD3 model is based on an enhanced Diffusion Transformer architecture called MMDiT. Meanwhile, Anthropic released Claude 3 models, noted for human-like responses and emotional depth, scoring 79.88% on HumanEval but costing over twice as much as GPT-4. Microsoft launched new Orca-based models and datasets, and Latitude released DolphinCoder-StarCoder2-15b with strong coding capabilities. Integration of image models by Perplexity AI and 3D CAD generation by PolySpectra powered by LlamaIndex were also highlighted. "SD3's win rate beats all other SOTA image gen models (except perhaps Ideogram)" and "Claude 3 models are very good at generating d3 visualizations from text descriptions."
Google AI: Win some (Gemma, 1.5 Pro), Lose some (Image gen)
gemma-2b gemma-7b gemma gemini-pro-1.5 llama-2 llama-3 mistral google hugging-face nvidia benchmarking license-policies image-generation video-understanding long-context dataset-editing model-integration gpu-hardware bug-fixes quantization
Google's Gemma open models (2-7B parameters) outperform Llama 2 and Mistral in benchmarks but face criticism for an unusual license and poor image generation quality, which Google partially acknowledges. The upcoming Gemini Pro 1.5 model features a 1 million token context window, excelling in video understanding and needle-in-haystack tasks. Discord communities like TheBloke and LM Studio discuss mixed reception of Gemma models, anticipation for Llama 3 release, challenges in dataset editing, and hardware considerations such as NVIDIA GeForce RTX 3090 and RTX 4090 GPUs. LM Studio users report issues with version 0.2.15 Beta and ongoing integration of Gemma models, with resources shared on Hugging Face.
The Dissection of Smaug (72B)
smaug-72b qwen-1.0 qwen-1.5 gpt-4 mistral-7b miqumaid wizardlm_evol_instruct_v2_196k openhermes-2.5 abacus-ai hugging-face nous-research laion thebloke lm-studio intel nvidia elevenlabs fine-tuning model-merging quantization web-ui model-conversion hardware-setup privacy image-generation optical-character-recognition prompt-engineering bindureddy
Abacus AI launched Smaug 72B, a large finetune of Qwen 1.0, which remains unchallenged on the Hugging Face Open LLM Leaderboard despite skepticism from Nous Research. LAION introduced a local voice assistant model named Bud-E with a notable demo. The TheBloke Discord community discussed model performance trade-offs between large models like GPT-4 and smaller quantized models, fine-tuning techniques using datasets like WizardLM_evol_instruct_V2_196k and OpenHermes-2.5, and challenges in web UI development and model merging involving Mistral-7b and MiquMaid. The LM Studio Discord highlighted issues with model conversion from PyTorch to gguf, hardware setups involving Intel Xeon CPUs and Nvidia P40 GPUs, privacy concerns, and limitations in image generation and web UI availability.
RIP Latent Diffusion, Hello Hourglass Diffusion
gpt-4 latent-diffusion stable-diffusion meta-ai-fair openai hugging-face diffusion-models transformers image-generation model-efficiency fine-tuning quantization prompt-engineering roleplay training-optimization katherine-crowson lucidrains
Katherine Crowson from Stable Diffusion introduces a hierarchical pure transformer backbone for diffusion-based image generation that efficiently scales to megapixel resolutions with under 600 million parameters, improving upon the original ~900M parameter model. This architecture processes local and global image phenomena separately, enhancing efficiency and resolution without latent steps. Additionally, Meta's Self Rewarding LM paper has inspired lucidrains to begin an implementation. Discord summaries highlight GPT-4's robustness against quantification tricks, discussions on open-source GPT-0 alternatives, challenges in DPO training on limited VRAM with suggestions like QLoRA and rmsprop, and efforts to improve roleplay model consistency through fine-tuning and merging. Philosophical debates on AI sentience and GPT-4 customization for markdown and translation tasks were also noted.
1/10/2024: All the best papers for AI Engineers
chatgpt gpt-4 dall-e-3 stable-diffusion deepseek-moe openai deepseek-ai prompt-engineering model-release rate-limiting ethics image-generation moe collaborative-workspaces data-privacy abdubs darthgustav
OpenAI launched the GPT Store featuring over 3 million custom versions of ChatGPT accessible to Plus, Team, and Enterprise users, with weekly highlights of impactful GPTs like AllTrails. The new ChatGPT Team plan offers advanced models including GPT-4 and DALL·E 3, alongside collaborative tools and enhanced data privacy. Discussions around AI-generated imagery favored DALL·E and Stable Diffusion, while users faced rate limit challenges and debated the GPT Store's SEO and categorization. Ethical considerations in prompt engineering were raised with a three-layer framework called 'The Sieve'. Additionally, DeepSeek-MoE was noted for its range of Mixture of Experts (MoE) model sizes. "The Sieve," a three-layer ethical framework for AI, was highlighted in prompt engineering discussions.
1/6-7/2024: LlaMA Pro - an alternative to PEFT/RAG??
llama-3 llama-3-1-1b llama-3-8-3b gpt-4 gpt-3.5 dall-e openai mistral-ai llamaindex langchain fine-tuning model-expansion token-limits privacy multilinguality image-generation security custom-models model-training yannic-kilcher
New research papers introduce promising Llama Extensions including TinyLlama, a compact 1.1B parameter model pretrained on about 1 trillion tokens for 3 epochs, and LLaMA Pro, an 8.3B parameter model expanding LLaMA2-7B with additional training on 80 billion tokens of code and math data. LLaMA Pro adds layers to avoid catastrophic forgetting and balances language and code tasks but faces scrutiny for not using newer models like Mistral or Qwen. Meanwhile, OpenAI Discord discussions reveal insights on GPT-4 token limits, privacy reassurances, fine-tuning for GPT-3.5, challenges with multi-language image recognition, custom GPT creation requiring ChatGPT Plus, and security concerns in GPT deployment. Users also share tips on dynamic image generation with DALL-E and logo creation.
1/2/2024: Smol tweaks to Smol Talk
claude-2 bard copilot meta-ai gemini-ultra chatgpt openai meta-ai-fair perplexity-ai prompt-engineering api json yaml markdown chatbot image-generation vpn browser-compatibility personality-tuning plugin-issues
OpenAI Discord discussions highlight a detailed comparison of AI search engines including Perplexity, Copilot, Bard, and Claude 2, with Bard and Claude 2 trailing behind. Meta AI chatbot by Meta is introduced, available on Instagram and Whatsapp, featuring image generation likened to a free GPT version. Users report multiple browser issues with ChatGPT, including persistent captchas when using VPNs and plugin malfunctions. Debates cover prompt engineering, API usage, and data formats like JSON, YAML, and Markdown. Discussions also touch on ChatGPT's personality tuning and model capability variations. "Meta AI includes an image generation feature, which he likened to a free version of GPT."
12/7/2023: Anthropic says "skill issue"
claude-2.1 gpt-4 gpt-3.5 gemini-pro gemini-ultra gpt-4.5 chatgpt bingchat dall-e gpt-5 anthropic openai google prompt-engineering model-performance regulation language-model-performance image-generation audio-processing midi-sequence-analysis subscription-issues network-errors
Anthropic fixed a glitch in their Claude 2.1 model's needle in a haystack test by adding a prompt. Discussions on OpenAI's Discord compared Google's Gemini Pro and Gemini Ultra models with OpenAI's GPT-4 and GPT-3.5, with some users finding GPT-4 superior in benchmarks. Rumors about a GPT-4.5 release circulated without official confirmation. Concerns were raised about "selective censorship" affecting language model performance. The EU's potential regulation of AI, including ChatGPT, was highlighted. Users reported issues with ChatGPT Plus message limits and subscription upgrades, and shared experiences with BingChat and DALL-E. The community discussed prompt engineering techniques and future applications like image generation and MIDI sequence analysis, expressing hopes for GPT-5.