All tags
Model: "gpt-4"
not much happened today
gpt-4.5 gpt-4 gpt-4o o1 claude-3.5-sonnet claude-3.7 claude-3-opus deepseek-v3 grok-3 openai anthropic perplexity-ai deepseek scaling01 model-performance humor emotional-intelligence model-comparison pricing context-windows model-size user-experience andrej-karpathy jeremyphoward abacaj stevenheidel yuchenj_uw aravsrinivas dylan522p random_walker
GPT-4.5 sparked mixed reactions on Twitter, with @karpathy noting that poll respondents preferred GPT-4 despite his personal preference for GPT-4.5's creativity and humor. Critics like @abacaj highlighted GPT-4.5's slowness and questioned its practical value and pricing compared to other models. Performance-wise, GPT-4.5 ranks above GPT-4o but below o1 and Claude 3.5 Sonnet; Claude 3.7 outperforms it on many tasks, yet GPT-4.5 is praised for its humor and "vibes." Speculation puts GPT-4.5's size at around 5 trillion parameters. Discussions also touched on pricing disparities, with Perplexity Deep Research at $20/month versus ChatGPT Pro at $200/month. The emotional intelligence and humor of models like Claude 3.7 were also noted.
not much happened today
deepseek-r1 qwen-2.5 qwen-2.5-max deepseek-v3 deepseek-janus-pro gpt-4 nvidia anthropic openai deepseek huawei vercel bespoke-labs model-merging multimodality reinforcement-learning chain-of-thought gpu-optimization compute-infrastructure compression crypto-api image-generation saranormous zizhpan victormustar omarsar0 markchen90 sakanaailabs reach_vb madiator dain_mclau francoisfleuret garygodchaux arankomatsuzaki id_aa_carmack lavanyasant virattt
A diverse AI news roundup highlights Huawei chips alongside NVIDIA's stock rebound, new open music foundation models (a "local Suno"), and competitive models such as Qwen 2.5 Max and DeepSeek V3. The release of DeepSeek Janus Pro, a multimodal LLM with image generation capabilities, and advances in reinforcement learning and chain-of-thought reasoning are noted. Discussions include GPU rebranding with NVIDIA's H6400 GPUs, data center innovations, and enterprise AI applications such as crypto APIs in hedge funds. DeepSeek R1's capabilities and the addition of Qwen 2.5 models to applications are key highlights.
not much happened today
o1-full sora gpt-4.5 gpt-4 claude-3.5-sonnet llama-3-1-nemotron-51b llama-3-1 llama-3 nemotron-51b openai google-deepmind anthropic nvidia huggingface vision model-performance neural-architecture-search model-optimization multimodality model-release model-training reinforcement-learning image-generation lucas-beyer alexander-kolesnikov xiaohua-zhai aidan_mclau giffmana joannejang sama
OpenAI announced their "12 Days of OpenAI" event with daily livestreams and potential releases including the O1 full model, Sora video model, and GPT-4.5. Google DeepMind released the GenCast weather model capable of 15-day forecasts in 8 minutes using TPU chips, and launched Genie 2, a model generating playable 3D worlds from single images. Leading vision researchers Lucas Beyer, Alexander Kolesnikov, and Xiaohua Zhai moved from DeepMind to OpenAI, which is opening a Zürich office. Criticism arose over OpenAI's strategy and model quality compared to Anthropic and Claude 3.5 Sonnet. On Reddit, a modified llama.cpp supports Nvidia's Llama-3_1-Nemotron-51B, matching performance of larger 70B models via NAS optimization.
not much happened to end the week
gemini deepseek-r1 o1 chatgpt gpt-4 claude-3.5-sonnet o1-preview o1-mini gpt4o qwq-32b google-deepmind deeplearningai amazon tesla x-ai alibaba ollama multimodality benchmarking quantization reinforcement-learning ai-safety translation reasoning interpretability model-comparison humor yoshua-bengio kevinweil ylecun
AI News for 11/29/2024-11/30/2024 covers key updates including the Gemini multimodal model advancing in musical structure understanding, a new quantized SWE-Bench for benchmarking at 1.3 bits per task, and the launch of the DeepSeek-R1 model focusing on transparent reasoning as an alternative to o1. The establishment of the 1st International Network of AI Safety Institutes highlights global collaboration on AI safety. Industry updates feature Amazon's Olympus AI model, Tesla's Optimus, and experiments with ChatGPT as a universal translator. Community reflections emphasize the impact of large language models on daily life and medical AI applications. Discussions include scaling sparse autoencoders to GPT-4 and the need for transparency in reasoning LLMs. The report also notes humor around ChatGPT's French nickname.
Gemini (Experimental-1114) retakes #1 LLM rank with 1344 Elo
claude-3-sonnet gpt-4 gemini-1.5 claude-3.5-sonnet anthropic openai langchain meta-ai-fair benchmarking prompt-engineering rag visuotactile-perception ai-governance theoretical-alignment ethical-alignment jailbreak-robustness model-releases alignment richardmcngo andrewyng philschmid
Anthropic released a jailbreak-robustness benchmark for Claude 3.5 Sonnet, emphasizing adaptive defenses. OpenAI enhanced GPT-4 with a new RAG technique for contiguous chunk retrieval. LangChain launched Promptim for prompt optimization. Meta AI introduced NeuralFeels with neural fields for visuotactile perception. Richard Ngo (@RichardMCNgo) resigned from OpenAI, highlighting concerns about AI governance and theoretical alignment. Discussions emphasized the importance of truthful public information and ethical alignment in AI deployment. The latest Gemini update marks a new #1 LLM amid alignment challenges. The AI community continues to focus on benchmarking, prompt engineering, and alignment issues.
s{imple|table|calable} Consistency Models
llama-3-70b llama-3-405b llama-3-1 stable-diffusion-3.5 gpt-4 stability-ai tesla cerebras cohere langchain model-distillation diffusion-models continuous-time-consistency-models image-generation ai-hardware inference-speed multilingual-models yang-song
Model distillation significantly accelerates diffusion models, enabling near real-time image generation with only 1-4 sampling steps, as seen in BlinkShot and Flux Schnell. Research led by Yang Song introduced simplified continuous-time consistency models (sCMs), achieving under 10% FID difference in just 2 steps and scaling up to 1.5B parameters for higher quality. On AI hardware, Tesla is deploying a 50k H100 cluster potentially capable of completing GPT-4 training in under three weeks, while Cerebras Systems set a new inference speed record on Llama 3.1 70B with their wafer-scale AI chips. Stability AI released Stable Diffusion 3.5 and its Turbo variant, and Cohere launched new multilingual models supporting 23 languages with state-of-the-art performance. LangChain also announced ecosystem updates.
a calm before the storm
o1 o1-mini qwen2.5 gpt-4 llama-2-70b llama-7b anthropic openai alibaba microsoft blackrock groq aramco disney eth-zurich pudu-robotics slack long-context kv-cache-quantization diffusion-models reinforcement-learning robotics ai-integration multilinguality model-benchmarking model-performance model-optimization adcock_brett philschmid rohanpaul_ai jvnixon kateclarktweets sama
Anthropic is raising funds at a valuation up to $40 billion ahead of anticipated major releases. OpenAI launched new reasoning models o1 and o1-mini, with increased rate limits and a multilingual MMLU benchmark. Alibaba released the open-source Qwen2.5 model supporting 29+ languages, showing performance competitive with GPT-4 at lower cost. Microsoft and BlackRock plan to invest $30 billion in AI data centers, with Groq partnering with Aramco to build the world's largest AI inference center. Robotics advances include Disney Research and ETH Zurich's diffusion-based motion generation for robots and Pudu Robotics' semi-humanoid robot. Slack and Microsoft introduced AI-powered agents integrated into their platforms. Research highlights include long-context scaling for Llama-2-70B using Dual Chunk Attention and KV cache quantization enabling a 1-million-token context on Llama-7B models.
not much happened today
llama-3 o1 deepseek-2.5 gpt-4 claude-3.5-sonnet 3dtopia-xl cogvideox anthropic meta-ai-fair openai deepseek-ai llamaindex langchainai retrieval-augmented-generation prompt-caching multimodality multi-agent-systems reasoning diffusion-models image-to-video prompting enterprise-ai agentic-ai long-context model-evaluation caching model-cost-efficiency
Anthropic introduced a RAG technique called Contextual Retrieval that reduces retrieval failure rates by 67% using prompt caching. Meta is teasing multimodal Llama 3 ahead of Meta Connect. OpenAI is hiring for a multi-agent research team focusing on improved AI reasoning with their o1 models, which have sparked mixed reactions. DeepSeek 2.5 is noted as a cost-effective alternative to GPT-4 and Claude 3.5 Sonnet. New models like 3DTopia-XL for 3D asset generation and CogVideoX for image-to-video conversion were highlighted. Techniques to boost reasoning by re-reading questions and combining retrieval with prompt caching were shared. Industry insights emphasize the necessity of AI adoption in enterprises and the disruption of traditional ML businesses. Tools like LangChainAI's LangGraph Templates and LlamaIndex's LlamaParse Premium enhance agentic applications and multimodal content extraction. Discussions on LLM evals and caching highlight production challenges and improvements. "Companies not allowing developers to use AI are unlikely to succeed" was a key sentiment.
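For readers unfamiliar with the technique, here is a minimal sketch of the contextual-retrieval idea: each chunk is prefixed with a short LLM-generated blurb situating it in the source document before indexing, and prompt caching keeps the repeated full-document prefix cheap. The `generate_context` callable below is a hypothetical stand-in for the actual LLM call, not Anthropic's implementation.

```python
def contextualize_chunks(document: str, chunks: list[str], generate_context) -> list[str]:
    """Prefix every chunk with a short situating context before embedding/BM25 indexing.

    generate_context(document, chunk) stands in for an LLM call; with prompt
    caching, the full `document` prefix is cached so each per-chunk call only
    pays for the chunk plus the short generated context.
    """
    contextualized = []
    for chunk in chunks:
        context = generate_context(document, chunk)      # e.g. 1-2 situating sentences
        contextualized.append(f"{context}\n\n{chunk}")   # index this string, not the raw chunk
    return contextualized


# toy stand-in so the sketch runs without an API key
doc = "Q2 report for ACME Corp. Revenue grew 3% over Q1..."
chunks = ["Revenue grew 3% over Q1...", "Operating costs fell 1%..."]
fake_llm = lambda document, chunk: "This chunk is from ACME Corp's Q2 report."
print(contextualize_chunks(doc, chunks, fake_llm)[0])
```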
AIPhone 16: the Visual Intelligence Phone
reflection-70b llama-3-70b qwen-2-72b llama-3-1-405b claude gpt-4 gemini apple openai weights-biases vision video-understanding benchmarking planning model-evaluation privacy ai-integration instruction-following yann-lecun
Apple announced the new iPhone 16 lineup featuring Visual Intelligence, a new AI capability integrated with Camera Control, Apple Maps, and Siri, emphasizing privacy and default service use over third-party AI like OpenAI. Apple Photos now includes advanced video understanding with timestamp recognition. Meanwhile, Reflection-70B claims to be a top open-source model but benchmarks show it performs close to Llama 3 70B and slightly worse than Qwen 2 72B. Yann LeCun highlighted ongoing challenges with LLM planning abilities, noting models like Llama-3.1-405b and Claude show some skill, while GPT-4 and Gemini lag behind. Weights & Biases is sponsoring an event to advance LLM evaluation techniques with prizes and API access.
Ideogram 2 + Berkeley Function Calling Leaderboard V2
llama-3-70b gpt-4 phi-3.5 functionary-llama-3-70b llama-3 ideogram midjourney berkeley openai hugging-face microsoft meta-ai-fair baseten kai claude functionary function-calling benchmarking image-generation model-optimization vision multimodality model-performance fine-tuning context-windows cybersecurity code-analysis ai-assisted-development
Ideogram returns with a new image generation model featuring color palette control, a fully controllable API, and an iOS app, reaching a milestone of 1 billion images created. Meanwhile, Midjourney released a Web UI but still lacks an API. In function calling, the Berkeley Function Calling Leaderboard (BFCL) updated to BFCL V2 • Live, adding 2,251 live, user-contributed function definitions and queries to improve evaluation quality. GPT-4 leads the leaderboard, but the open-source Functionary Llama 3-70B finetune from MeetKai surpasses Claude. On AI model releases, Microsoft launched three Phi-3.5 models with impressive reasoning and context window capabilities, while Meta AI FAIR introduced UniBench, a unified benchmark suite covering over 50 vision-language model tasks. Baseten improved Llama 3 inference speed by up to 122% using Medusa. A new cybersecurity benchmark, Cybench, featuring 40 CTF tasks, was released. Additionally, Codegen was introduced as a tool for programmatic codebase analysis and AI-assisted development. "Multiple functions > parallel functions" was highlighted as a key insight in function calling.
not much happened today
grok-2 claude-3.5-sonnet claude-3.5 gpt-4 chatgpt-4o-latest anthropic x-ai google-deepmind openai mistral-ai meta-ai-fair salesforce box prompt-caching model-performance vision fine-tuning multilinguality ai-safety design-automation document-processing ai-agents ai-integration ai-job-market ai-acceleration humor demis-hassabis francois-chollet
Anthropic rolled out prompt caching in its API, reducing input costs by up to 90% and latency by 80%, enabling instant fine-tuning with longer prompts. xAI released Grok-2, a new model competing with frontier models from Google DeepMind, OpenAI, Anthropic, Mistral AI, and Meta AI FAIR, supporting vision and text inputs and integrating external image generation models. Claude 3.5 Sonnet is reported to outperform GPT-4 in coding and reasoning, while ChatGPT-4o-latest shows reasoning improvements. François Chollet proposed a theory defining intelligence as the efficiency of operationalizing past information for future tasks. The Aya project involves 3,000 collaborators building multilingual AI datasets. Demis Hassabis discussed AI hype and safe AI development in a podcast. Tools like Dora AI for Figma and Box's AI API enhance design automation and document processing. Salesforce released DEI, an open framework of AI software-engineering agents with a 55% resolve rate on SWE-Bench Lite. Industry trends highlight rapid AI integration, the importance of networking in the AI job market, and potential OpenAI GPT-4 expansion in response to competitors. Memes include humor about Apple Vision Pro.
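As a point of reference, a minimal sketch of how prompt caching is invoked through the Anthropic Python SDK: a long, rarely-changing prefix (here part of the system prompt) is marked with `cache_control` so later requests reuse it. The model id, document contents, and beta header date are illustrative; at launch the feature sat behind a prompt-caching beta header.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

long_reference_doc = "..."  # large, stable context you want cached across calls

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",              # illustrative model id
    max_tokens=512,
    system=[
        {"type": "text", "text": "You answer questions about the attached report."},
        {
            "type": "text",
            "text": long_reference_doc,
            "cache_control": {"type": "ephemeral"},  # everything up to this block gets cached
        },
    ],
    messages=[{"role": "user", "content": "Summarize the key risks."}],
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},  # beta header used at launch
)
print(response.content[0].text)
```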
a quiet weekend
sam-2 qwen2-math gpt-4 claude-3.5 figure deepmind boston-dynamics alibaba llamaindex robotics object-segmentation real-time-processing disease-prediction speech-recognition cli-tools model-performance adcock_brett rasbt hamel-husain rohanpaul_ai
Figure unveiled Figure 02, claimed as the most advanced humanoid robot, operating autonomously at BMW's Plant Spartanburg. DeepMind developed a table tennis robot achieving 100% wins against beginners and 55% against intermediates. Boston Dynamics showcased the dexterity of its fully-electric Atlas robot performing pushups and burpees. An autonomous dental robot performed the world's first dental procedure on a human, reducing a 2-hour process to 15 minutes using a 3D volumetric scanner. SAM 2 was introduced as an open model for real-time object segmentation without custom adaptation. Alibaba released Qwen2-Math, outperforming GPT-4 and Claude 3.5 in math capabilities. A new Listening-While-Speaking Language Model (LSLM) enables simultaneous listening and speaking in real-time. Researchers developed a disease prediction AI with 95% accuracy for diseases like coronary artery disease, type 2 diabetes, and breast cancer. Tools like LlamaParse CLI and MLX Whisper package enhance PDF parsing and speech recognition, with the latter running 40X faster than realtime on M1 Max. The news highlights significant advancements in robotics, AI models, and practical AI tools.
SciCode: HumanEval gets a STEM PhD upgrade
gpt-4 claude-3.5-sonnet llama-3-7b llama-3 dolphin-2.9.3-yi-1.5-34b-32k-gguf anthropic hugging-face nvidia benchmarks coding model-training gpu-optimization model-performance synthetic-data compiler-optimization zero-shot-learning yi-tay rohanpaul_ai alexalbert__ tri_dao abacaj
PhD-level benchmarks highlight the difficulty of coding scientific problems for LLMs, with GPT-4 and Claude 3.5 Sonnet scoring under 5% on the new SciCode benchmark. Anthropic doubled the max output token limit for Claude 3.5 Sonnet to 8192 tokens. The Q-GaLore method enables training LLaMA-7B on a single 16GB GPU. The Mosaic compiler now generates efficient code for NVIDIA H100 GPUs. The Dolphin 2.9.3-Yi-1.5-34B-32k-GGUF model on Hugging Face has over 111k downloads. Llama 3 shows strong performance, achieving 90% zero-shot accuracy on the MATH dataset. Discussions continue on the limitations and forms of synthetic data for model training.
RouteLLM: RIP Martian? (Plus: AINews Structured Summaries update)
gpt-4 gemma-2-27b gemma-2-9b lmsys openai llm-routing cost-efficiency model-performance model-optimization data-augmentation syntax-based-routing mixture-of-experts inference-throughput software-2.0 computer-vision karpathy bindureddy armand-joulin
LMSys introduces RouteLLM, an open-source router framework trained on preference data from Chatbot Arena, achieving cost reductions over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K while maintaining 95% of GPT-4's performance. This approach surpasses previous task-specific routing by using syntax-based Mixture of Experts (MoE) routing and data augmentation, beating commercial solutions by 40%. The update highlights advances in LLM routing, cost-efficiency, and model performance optimization across multiple models rather than single-model or MoE-level improvements. Additionally, the AI Twitter recap notes the Gemma 2 model family as a top open model, the Block Transformer architecture for improved inference throughput, and a proposal for a fully Software 2.0 computer vision system by karpathy.
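The core routing idea is simple: a learned model estimates how likely the strong model is actually needed for a query, and a cost/quality threshold decides where the query goes. A toy sketch follows, with a keyword heuristic standing in for RouteLLM's preference-trained router; model names and the heuristic are illustrative only.

```python
STRONG, WEAK = "gpt-4", "mixtral-8x7b"

def win_probability(query: str) -> float:
    """Stand-in for a router trained on Chatbot Arena preference data:
    estimates the probability that the strong model's answer would be
    preferred over the weak model's for this query."""
    hard_markers = ("prove", "derive", "refactor", "multi-step", "legal")
    return 0.9 if any(m in query.lower() for m in hard_markers) else 0.2

def route(query: str, threshold: float = 0.5) -> str:
    # lowering the threshold buys quality; raising it buys cost savings
    return STRONG if win_probability(query) >= threshold else WEAK

print(route("What is the capital of France?"))          # -> mixtral-8x7b
print(route("Prove the inequality holds for all n."))   # -> gpt-4
```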
Mozilla's AI Second Act
llama-3 claude-3-opus gemini-1.5 deepseek-coder-v2 gpt-4 mozilla llamaindex anthropic etched-ai sohu deepseek openai vector-search inference-speed hardware-benchmarks context-windows open-source-models coding reasoning model-benchmarking gpu-inference agentic-ai justine-tunney stephen-hood tim-dettmers bindureddy
Mozilla showcased detailed live demos of llamafile and announced sqlite-vec for vector search integration at the AIE World's Fair. LlamaIndex launched llama-agents. Anthropic introduced new UI features and Projects for Claude with a 200K context window. Etched AI revealed Sohu, a specialized transformer inference chip claiming 500k tokens/sec and 15 agent trajectories/sec, though the benchmark claims are questioned. Tim Dettmers shared theoretical GPU inference limits of ~300k tokens/sec for 8xB200 NVLink on 70B Llama. DeepSeek Coder V2 outperforms Gemini and GPT-4 variants in coding and reasoning. The PyTorch documentary launched to little attention.
Hybrid SSM/Transformers > Pure SSMs/Pure Transformers
mamba-2-hybrid gpt-4 qwen-72b table-llava-7b nvidia lamini-ai sakana-ai luma-labs mixture-of-experts benchmarking fine-tuning multimodality text-to-video model-performance memory-optimization preference-optimization video-understanding multimodal-tables bryan-catanzaro bindureddy ylecun ctnzr corbtt realsharonzhou andrew-n-carr karpathy _akhaliq omarsar0
NVIDIA's Bryan Catanzaro highlights a new paper on Mamba models, showing that mixing Mamba and Transformer blocks outperforms either alone, with optimal attention below 20%. Mixture-of-Agents (MoA) architecture improves LLM generation quality, scoring 65.1% on AlpacaEval 2.0 versus GPT-4 Omni's 57.5%. The LiveBench AI benchmark evaluates reasoning, coding, writing, and data analysis. A hybrid Mamba-2-Hybrid model with 7% attention surpasses a Transformer on MMLU accuracy, jumping from 50% to 53.6%. GPT-4 performs better at temperature=1. Qwen 72B leads open-source models on LiveBench AI. LaminiAI Memory Tuning achieves 95% accuracy on a SQL agent task, improving over instruction fine-tuning. Sakana AI Lab uses evolutionary strategies for preference optimization. Luma Labs Dream Machine demonstrates advanced text-to-video generation. The MMWorld benchmark evaluates multimodal video understanding, and Table-LLaVa 7B competes with GPT-4V on multimodal table tasks.
The Last Hurrah of Stable Diffusion?
llama-3-8b llama-3 qwen-2 gpt-4 gpt-4o stability-ai togethercompute model-architecture fine-tuning benchmarks dataset-release model-evaluation reasoning model-training retrieval-augmented-generation multimodality emad-mostaque rohanpaul_ai fchollet mikeknoop micahgoldblum teknium1 rasbt percyliang
Stability AI launched Stable Diffusion 3 Medium with models ranging from 450M to 8B parameters, featuring the MMDiT architecture and T5 text encoder for image text rendering. The community has shown mixed reactions following the departure of key researchers like Emad Mostaque. On AI models, Llama 3 8B Instruct shows strong evaluation correlation with GPT-4, while Qwen 2 Instruct surpasses Llama 3 on MMLU benchmarks. The Mixture of Agents (MoA) framework outperforms GPT-4o on AlpacaEval 2.0. Techniques like Spectrum and QLoRA enable efficient fine-tuning with less VRAM. Research on grokking reveals transformers can transition from memorization to generalization through extended training. Benchmark initiatives include the $1M ARC Prize Challenge for AGI progress and LiveBench, a live LLM benchmark to prevent dataset contamination. The Character Codex Dataset offers open data on over 15,000 characters for RAG and synthetic data. The MLX 0.2 tool enhances LLM experience on Apple Silicon Macs with improved UI and faster retrieval-augmented generation.
Francois Chollet launches $1m ARC Prize
gpt-4 chatgpt openai apple togethercompute benchmarking agi pattern-recognition skill-acquisition privacy on-device-ai mixed-precision-quantization mixture-of-experts multimodality agentic-ai francois-chollet karpathy svpino philschmid clementdelangue sama gdb miramurati kevin-weil sarah-friar
François Chollet critiques current paths to AGI, emphasizing the importance of benchmarks that resist saturation and focus on skill acquisition and open-ended problem solving. The ARC-AGI puzzles exemplify "easy for humans, hard for AI" challenges to measure progress toward AGI. Meanwhile, Apple announces integration of ChatGPT into iOS, iPadOS, and macOS through a partnership with OpenAI, enabling AI-powered features like document summarization and photo analysis with privacy-preserving measures. Discussions highlight Apple's focus on deep AI integration and on-device models optimized with techniques like mixed-precision quantization, though some skepticism remains about their AI capabilities compared to GPT-4. Additionally, Together Compute introduces a Mixture of Agents approach achieving strong performance on AlpacaEval 2.0.
HippoRAG: First, do know(ledge) Graph
qwen-2 gpt-4 hipporag alibaba openai knowledge-graphs personalized-pagerank multi-hop-retrieval chain-of-thought implicit-reasoning sparse-autoencoders model-interpretability model-efficiency model-architecture fine-tuning reinforcement-learning rohanpaul_ai omarsar0 nabla_theta huybery
Alibaba released new open-source Qwen2 models ranging from 0.5B to 72B parameters, achieving SOTA results on benchmarks like MMLU and HumanEval. Researchers introduced Sparse Autoencoders to interpret GPT-4 neural activity, improving feature representation. The HippoRAG paper proposes a hippocampus-inspired retrieval augmentation method using knowledge graphs and Personalized PageRank for efficient multi-hop reasoning. New techniques like Stepwise Internalization enable implicit chain-of-thought reasoning in LLMs, enhancing accuracy and speed. The Buffer of Thoughts (BoT) method improves reasoning efficiency with significant cost reduction. A novel scalable MatMul-free LLM architecture competitive with SOTA Transformers at billion-parameter scale was also presented. "Single-Step, Multi-Hop retrieval" is highlighted as a key advancement in retrieval speed and cost.
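A rough sketch of the Personalized PageRank step at the heart of HippoRAG, using networkx: entities extracted from the query seed the personalization vector, PPR spreads relevance across a graph linking entities to passages, and passages are ranked by their resulting scores. The graph contents and entity-extraction step below are toy stand-ins, not the paper's pipeline.

```python
import networkx as nx

# toy knowledge graph: entity nodes linked to the passages that mention them
G = nx.Graph()
G.add_edges_from([
    ("Stanford", "passage_1"), ("Alzheimer's", "passage_1"),
    ("Stanford", "passage_2"), ("protein_X", "passage_2"),
    ("protein_X", "passage_3"), ("Alzheimer's", "passage_3"),
])

query_entities = ["Stanford", "Alzheimer's"]   # stand-in for LLM entity extraction

# personalization vector: restart mass concentrated on the query entities
personalization = {node: 0.0 for node in G}
for entity in query_entities:
    personalization[entity] = 1.0

scores = nx.pagerank(G, alpha=0.85, personalization=personalization)

passages = sorted(
    (n for n in G if n.startswith("passage_")),
    key=lambda p: scores[p],
    reverse=True,
)
print(passages)  # multi-hop-relevant passages rank highest in a single retrieval step
```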
Qwen 2 beats Llama 3 (and we don't know how)
qwen-2 llama-3 llama-3-70b gpt-4 nllb alibaba groq meta-ai-fair multilinguality benchmarking inference-speed sparse-autoencoders scaling-laws post-training instruction-following rejection-sampling execution-feedback model-release multilingual-models model-training philschmid huybery jonathanross321 awnihannun gdb nabla_theta ylecun
Alibaba released Qwen 2 models under the Apache 2.0 license, claiming to outperform Llama 3 among open models, with multilingual support for 29 languages and strong benchmark scores (MMLU 82.3, HumanEval 86.0). Groq demonstrated ultra-fast inference on Llama-3 70B at 40,792 tokens/s, processing 4 Wikipedia articles in 200 ms. Research on sparse autoencoders (SAEs) for interpreting GPT-4 neural activity showed new training methods, metrics, and scaling laws. Meta AI announced the No Language Left Behind (NLLB) model capable of high-quality translations between 200 languages, including low-resource ones. Qwen's team noted that "Our post-training phase is designed with the principle of scalable training with minimal human annotation," highlighting techniques like rejection sampling for math and execution feedback for coding.
Life after DPO (RewardBench)
gpt-3 gpt-4 gpt-5 gpt-6 llama-3-8b llama-3 claude-3 gemini x-ai openai mistral-ai anthropic cohere meta-ai-fair hugging-face nvidia reinforcement-learning-from-human-feedback direct-preference-optimization reward-models rewardbench language-model-history model-evaluation alignment-research preference-datasets personalization transformer-architecture nathan-lambert chris-manning elon-musk bindureddy rohanpaul_ai nearcyan
xAI raised $6 billion at a $24 billion valuation, positioning it among the most highly valued AI startups, with expectations to fund GPT-5 and GPT-6 class models. The RewardBench tool, developed by Nathan Lambert, evaluates reward models (RMs) for language models, showing Cohere's RMs outperforming open-source alternatives. The discussion highlights the evolution of language models from Claude Shannon's 1948 model to GPT-3 and beyond, emphasizing the role of RLHF (Reinforcement Learning from Human Feedback) and the newer DPO (Direct Preference Optimization) method. Notably, some Llama 3 8B reward model-focused models are currently outperforming GPT-4, Cohere, Gemini, and Claude on the RewardBench leaderboard, raising questions about reward hacking. Future alignment research directions include improving preference datasets, DPO techniques, and personalization in language models. The report also compares xAI's valuation with OpenAI, Mistral AI, and Anthropic, noting speculation about xAI's spending on Nvidia hardware.
Cursor reaches >1000 tok/s finetuning Llama3-70b for fast file editing
gpt-4 gpt-4o gpt-4-turbo gpt-4o-mini llama bloom stable-diffusion cursor openai anthropic google-deepmind huggingface speculative-decoding code-edits multimodality image-generation streaming tool-use fine-tuning benchmarking mmlu model-performance evaluation synthetic-data context-windows sama abacaj imjaredz erhartford alexalbert svpino maximelabonne _philschmid
Cursor, an AI-native IDE, announced a speculative edits algorithm for code editing that surpasses GPT-4 and GPT-4o in accuracy and latency, achieving speeds of over 1000 tokens/s on a 70b model. OpenAI released GPT-4o with multimodal capabilities including audio, vision, and text, noted to be 2x faster and 50% cheaper than GPT-4 turbo, though with mixed coding performance. Anthropic introduced streaming, forced tool use, and vision features for developers. Google DeepMind unveiled Imagen Video and Gemini 1.5 Flash, a small model with a 1M-context window. HuggingFace is distributing $10M in free GPUs for open-source AI models like Llama, BLOOM, and Stable Diffusion. Evaluation insights highlight challenges with LLMs on novel problems and benchmark saturation, with new benchmarks like MMLU-Pro showing significant drops in top model performance.
OpenAI's PR Campaign?
alphafold-3 xlstm gpt-4 openai microsoft google-deepmind memory-management model-spec scaling multimodality performance transformers dynamic-memory model-architecture demis-hassabis sama joanne-jang omarsar0 arankomatsuzaki drjimfan
OpenAI faces user data deletion backlash over its new partnership with StackOverflow amid GDPR complaints and US newspaper lawsuits, while addressing election year concerns with efforts like the Media Manager tool for content opt-in/out by 2025 and source link attribution. Microsoft develops a top-secret airgapped GPT-4 AI service for US intelligence agencies. OpenAI releases the Model Spec outlining responsible AI content generation policies, including NSFW content handling and profanity use, emphasizing clear distinctions between bugs and design decisions. Google DeepMind announces AlphaFold 3, a state-of-the-art model predicting molecular structures with high accuracy, showcasing cross-domain AI techniques. New research on xLSTM proposes scaling LSTMs to billions of parameters, competing with transformers in performance and scaling. Microsoft introduces vAttention, a dynamic memory management method for efficient large language model serving without PagedAttention.
Kolmogorov-Arnold Networks: MLP killers or just spicy MLPs?
gpt-5 gpt-4 dall-e-3 openai microsoft learnable-activations mlp function-approximation interpretability inductive-bias-injection b-splines model-rearrangement parameter-efficiency ai-generated-image-detection metadata-standards large-model-training max-tegmark ziming-liu bindureddy nptacek zacharynado rohanpaul_ai svpino
Ziming Liu, a grad student of Max Tegmark, published a paper on Kolmogorov-Arnold Networks (KANs), claiming they outperform MLPs in interpretability, inductive bias injection, function approximation accuracy, and scaling, despite being 10x slower to train but 100x more parameter efficient. KANs use learnable activation functions modeled by B-splines on edges rather than fixed activations on nodes. However, it was later shown that KANs can be mathematically rearranged back into MLPs with similar parameter counts, sparking debate on their interpretability and novelty. Meanwhile, on AI Twitter, there is speculation about a potential GPT-5 release with mixed impressions, OpenAI's adoption of the C2PA metadata standard for detecting AI-generated images with high accuracy for DALL-E 3, and Microsoft training a large 500B parameter model called MAI-1, potentially previewed at Build conference, signaling increased competition with OpenAI. "OpenAI's safety testing for GPT-4.5 couldn't finish in time for Google I/O launch" was also noted.
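To make the architectural contrast concrete, here is a heavily simplified KAN-style layer in PyTorch: every edge carries its own learnable univariate function, parameterized here as a mixture of Gaussian RBFs rather than the B-splines used in the paper, and outputs are sums of these edge functions with no fixed node activations. This is an illustrative sketch under those simplifications, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SimpleKANLayer(nn.Module):
    """Simplified KAN layer: each edge (input i -> output j) applies a learnable
    univariate function, modeled as a mixture of Gaussian RBFs (a stand-in for
    the paper's B-splines); the outputs are summed over the inputs."""
    def __init__(self, in_dim, out_dim, num_basis=8, grid_range=(-2.0, 2.0)):
        super().__init__()
        self.register_buffer("centers", torch.linspace(*grid_range, num_basis))
        self.width = (grid_range[1] - grid_range[0]) / num_basis
        # one coefficient vector per edge: (out_dim, in_dim, num_basis)
        self.coef = nn.Parameter(torch.randn(out_dim, in_dim, num_basis) * 0.1)

    def forward(self, x):  # x: (batch, in_dim)
        # evaluate every basis function at every input: (batch, in_dim, num_basis)
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.width) ** 2)
        # phi[b, o] = sum over inputs i and bases k of coef[o, i, k] * basis[b, i, k]
        return torch.einsum("bik,oik->bo", basis, self.coef)

net = nn.Sequential(SimpleKANLayer(2, 5), SimpleKANLayer(5, 1))
print(net(torch.randn(16, 2)).shape)  # -> torch.Size([16, 1])
```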
DeepSeek-V2 beats Mixtral 8x22B with >160 experts at HALF the cost
deepseek-v2 llama-3-120b llama-3-400b gpt-4 mistral phi claude gemini mai-1 med-gemini deepseek-ai mistral-ai microsoft openai scale-ai tesla nvidia google-deepmind mixture-of-experts multi-head-attention model-inference benchmarking overfitting robotics teleoperation open-source multimodality hallucination-detection fine-tuning medical-ai model-training erhartford maximelabonne bindureddy adcock_brett drjimfan clementdelangue omarsar0 rohanpaul_ai
DeepSeek V2 introduces a new state-of-the-art MoE model with 236B parameters and a novel Multi-Head Latent Attention mechanism, achieving faster inference and surpassing GPT-4 on AlignBench. Llama 3 120B shows strong creative writing skills, while Microsoft is reportedly developing a 500B parameter LLM called MAI-1. Research from Scale AI highlights overfitting issues in models like Mistral and Phi, whereas GPT-4, Claude, Gemini, and Llama maintain benchmark robustness. In robotics, Tesla Optimus advances with superior data collection and teleoperation, LeRobot marks a move toward open-source robotics AI, and Nvidia's DrEureka automates robot skill training. Multimodal LLM hallucinations are surveyed with new mitigation strategies, and Google's Med-Gemini achieves SOTA on medical benchmarks with fine-tuned multimodal models.
$100k to predict LMSYS human preferences in a Kaggle contest
llama-3-70b llama-3 gpt-4 claude-3-opus prometheus-2 groq openai lmsys scale-ai ai2 nvidia benchmarking datasets fine-tuning reinforcement-learning model-alignment hallucination parameter-efficient-fine-tuning scalable-training factuality chatbot-performance bindureddy drjimfan percyliang seungonekim mobicham clefourrier
Llama 3 models are making breakthroughs with Groq's 70B model achieving record low costs per million tokens. A new Kaggle competition offers a $100,000 prize to develop models predicting human preferences from a dataset of over 55,000 user-LLM conversations. Open source evaluator LLMs like Prometheus 2 outperform proprietary models such as GPT-4 and Claude 3 Opus in judgment tasks. New datasets like WildChat1M provide over 1 million ChatGPT interaction logs with diverse and toxic examples. Techniques like LoRA fine-tuning show significant performance gains, and NVIDIA's NeMo-Aligner toolkit enables scalable LLM alignment across hundreds of GPUs. Factuality-aware alignment methods are proposed to reduce hallucinations in LLM outputs.
Evals: The Next Generation
gpt-4 gpt-5 gpt-3.5 phi-3 mistral-7b llama-3 scale-ai mistral-ai reka-ai openai moderna sanctuary-ai microsoft mit meta-ai-fair benchmarking data-contamination multimodality fine-tuning ai-regulation ai-safety ai-weapons neural-networks model-architecture model-training model-performance robotics activation-functions long-context sam-altman jim-fan
Scale AI highlighted issues with data contamination in benchmarks like MMLU and GSM8K, proposing a new benchmark where Mistral overfits and Phi-3 performs well. Reka released the VibeEval benchmark for multimodal models addressing multiple choice benchmark limitations. Sam Altman of OpenAI discussed GPT-4 as "dumb" and hinted at GPT-5 with AI agents as a major breakthrough. Researchers jailbroke GPT-3.5 via fine-tuning. Global calls emerged to ban AI-powered weapons, with US officials urging human control over nuclear arms. Ukraine launched an AI consular avatar, while Moderna partnered with OpenAI for medical AI advancements. Sanctuary AI and Microsoft collaborate on AI for general-purpose robots. MIT introduced Kolmogorov-Arnold networks with improved neural network efficiency. Meta AI is training Llama 3 models with over 400 billion parameters, featuring multimodality and longer context.
LLMs-as-Juries
gpt-4 gpt-3.5 sdxl ponyxl openai cohere financial-times memory training-data model-usage-limits data-cleansing ai-voice-assistants interface-agents image-generation model-extensions multi-agent-systems
OpenAI has rolled out the memory feature to all ChatGPT Plus users and partnered with the Financial Times to license content for AI training. Discussions on OpenAI's profitability arise due to paid training data licensing and potential GPT-4 usage limit reductions. Users report issues with ChatGPT's data cleansing after the memory update. Tutorials and projects include building AI voice assistants and interface agents powered by LLMs. In Stable Diffusion, users seek realistic SDXL models comparable to PonyXL, and new extensions like Hi-diffusion and Virtuoso Nodes v1.1 enhance ComfyUI with advanced image generation and Photoshop-like features. Cohere finds that multiple agents outperform single agents in LLM judging tasks, highlighting advances in multi-agent systems.
FineWeb: 15T Tokens, 12 years of CommonCrawl (deduped and filtered, you're welcome)
llama-3-70b llama-3 wizardlm-2-8x22b claude-opus mistral-8x7b gpt-4 huggingface meta-ai-fair dbrx reka-ai mistral-ai lmsys openai datasets benchmarking quantization zero-shot-learning reasoning code-error-detection token-generation security
2024 has seen a significant increase in dataset sizes for training large language models, with RedPajama 2 offering up to 30T tokens, DBRX at 12T tokens, Reka Core/Flash/Edge with 5T tokens, and Llama 3 trained on 15T tokens. Hugging Face released an open dataset containing 15T tokens from 12 years of filtered CommonCrawl data, enabling training of models like Llama 3 if compute resources are available. On Reddit, WizardLM-2-8x22b outperformed other open LLMs including Llama-3-70b-instruct in reasoning and math benchmarks. Claude Opus demonstrated strong zero-shot code error spotting, surpassing Llama 3. Benchmarks revealed limitations in the LMSYS chatbot leaderboard due to instruction-tuned models gaming the system, and a new RAG benchmark showed Llama 3 70B underperforming compared to GPT-4, while Mistral 8x7B remained strong. Efficient quantized versions of Llama 3 models are available on Hugging Face, with users reporting token generation limits around 9600 tokens on a 3090 GPU. Safety concerns include a UK sex offender being banned from using AI tools and GPT-4 demonstrating an 87% success rate at exploiting real-world vulnerabilities.
Multi-modal, Multi-Aspect, Multi-Form-Factor AI
gpt-4 idefics-2-8b mistral-instruct apple-mlx gpt-5 reka-ai cohere google rewind apple mistral-ai microsoft paypal multimodality foundation-models embedding-models gpu-performance model-comparison enterprise-data open-source performance-optimization job-impact agi-criticism technical-report arthur-mensch dan-schulman chris-bishop
Between April 12 and 15, Reka launched Reka Core, a new GPT-4-class multimodal foundation model, with a detailed technical report described as "full Shazeer." Cohere Compass introduced a foundation embedding model for indexing and searching multi-aspect enterprise data like emails and invoices. The open-source IDEFICS 2-8B model continues the reproduction of Google's Flamingo multimodal model. Rewind pivoted to a multi-platform app called Limitless, moving away from spyware. Reddit discussions highlighted Apple MLX outperforming Ollama and Mistral Instruct on M2 Ultra GPUs, GPU choices for LLMs and Stable Diffusion, and AI-human comparisons by Microsoft Research's Chris Bishop. Former PayPal CEO Dan Schulman predicted GPT-5 will drastically reduce job scopes by 80%. Mistral CEO Arthur Mensch criticized the obsession with AGI as "creating God."
ReALM: Reference Resolution As Language Modeling
flan-t5 gpt-4 apple openai hugging-face stability-ai reference-resolution finetuning quantization retrieval-augmented-generation open-source coding-agents podcast-generation image-generation ai-industry-trends takuto-takizawa
Apple is advancing in AI with a new approach called ReALM: Reference Resolution As Language Modeling, which improves understanding of ambiguous references using three contexts and finetunes a smaller FLAN-T5 model that outperforms GPT-4 on this task. In Reddit AI news, an open-source coding agent SWE-agent achieves 12.29% on the SWE-bench benchmark, and RAGFlow introduces a customizable retrieval-augmented generation engine. A new quantization method, QuaRot, enables efficient 4-bit inference. AI applications include a t-shirt design generator, podgenai for GPT-4 based podcast generation, and an open-source model from HuggingFace that runs without a GPU. Industry discussions focus on the impact of large language models on the AI field and efforts to decentralize AI development. Takuto Takizawa joins Stability AI Japan as Head of Sales & Partnerships.
DBRX: Best open model (just not most efficient)
dbrx grok mixtral llama-2 mpt-7b gpt-4 databricks hugging-face mistral-ai mosaicml openai mixture-of-experts model-efficiency tokenization model-training code-generation model-architecture open-source-models benchmarking fine-tuning
Databricks Mosaic has released a new open-source model called DBRX that outperforms Grok, Mixtral, and Llama 2 on evaluations while being about 2x more efficient than Llama 2 and Grok. The model was trained on 12 trillion tokens using 3,000 H100 GPUs over 2 months, with an estimated compute cost of $10 million. It uses OpenAI's 100k-vocabulary tiktoken tokenizer and shows strong zero-shot code generation performance, even beating GPT-4 on the HumanEval benchmark. DBRX also upstreamed work to the MegaBlocks open-source project. Despite its scale and efficiency, DBRX's performance on MMLU is only slightly better than Mixtral's, raising questions about its scaling efficiency. The focus of DBRX is on enabling users to train models efficiently, with MoE training being about 2x more FLOP-efficient than dense models, achieving similar quality with nearly 4x less compute than previous MPT models. This release is part of the ongoing competition for open-source AI leadership, including models like Dolly, MPT, and Mistral. "If it activates 36B params, the model's perf should be equivalent to a 72B dense model or even 80B," says Qwen's tech lead.
Andrew likes Agents
gpt-3.5 gpt-4 cyberrealistic_v40 platypus-xl sdxl-lightning openai stability-ai agents human-eval-benchmark fine-tuning local-llm-deployment inference-speed image-generation lora upscaling workflow-optimization andrew-ng lilian-weng emad
Andrew Ng's The Batch writeup on Agents highlighted the significant improvement in coding benchmark performance when using an iterative agent workflow, with GPT-3.5 wrapped in an agent loop achieving up to 95.1% correctness on HumanEval, surpassing GPT-4 zero-shot at 67.0%. The report also covers new developments in Stable Diffusion models like Cyberrealistic_v40, Platypus XL, and SDXL Lightning for Naruto-style image generation, alongside innovations in LoRA and upscaling techniques. Discussions on local LLM deployment and optimization focus on hardware setups and finetuning strategies for efficient inference and multi-user serving. Emad's departure from Stability AI and new Sora videos from OpenAI were also noted.
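The agentic pattern behind those HumanEval numbers is essentially a generate-test-reflect loop. A minimal sketch follows, with `generate` standing in for any chat-model call (GPT-3.5 in the writeup); the prompt wording and toy model are illustrative only.

```python
def iterative_code_agent(task: str, tests: list[tuple], generate, max_iters: int = 5) -> str:
    """generate(prompt) -> Python source defining solve(); stands in for an LLM call."""
    feedback, code = "", ""
    for _ in range(max_iters):
        prompt = f"Task: {task}\nDefine solve(x).\nFeedback from last attempt: {feedback or 'none'}"
        code = generate(prompt)
        try:
            namespace: dict = {}
            exec(code, namespace)                       # run the candidate solution
            for x, expected in tests:
                assert namespace["solve"](x) == expected, f"solve({x!r}) != {expected!r}"
            return code                                 # all tests pass: done
        except Exception as exc:                        # reflect: feed the error back in
            feedback = f"{type(exc).__name__}: {exc}"
    return code

# toy stand-in model that "fixes" its answer once it sees an error message
def toy_model(prompt: str) -> str:
    if "none" in prompt:                                # first attempt: buggy
        return "def solve(x):\n    return x + 1"
    return "def solve(x):\n    return x * 2"            # corrected after seeing the assertion error

print(iterative_code_agent("double a number", [(3, 6)], toy_model))
```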
World_sim.exe
gpt-4 gpt-4o grok-1 llama-cpp claude-3-opus claude-3 gpt-5 nvidia nous-research stability-ai hugging-face langchain anthropic openai multimodality foundation-models hardware-optimization model-quantization float4 float6 retrieval-augmented-generation text-to-video prompt-engineering long-form-rag gpu-optimization philosophy-of-ai agi-predictions jensen-huang yann-lecun sam-altman
NVIDIA announced Project GR00T, a foundation model for humanoid robot learning using multimodal instructions, built on their tech stack including Isaac Lab, OSMO, and Jetson Thor. They revealed the DGX Grace-Blackwell GB200 with over 1 exaflop of compute, capable of training a GPT-4-scale 1.8T-parameter model in 90 days on 2,000 Blackwell GPUs. Jensen Huang confirmed GPT-4 has 1.8 trillion parameters. The new GB200 GPU supports float4/6 precision at ~3 bits per parameter and achieves 40,000 TFLOPs on fp4 with 2x sparsity.
Open source highlights include the release of Grok-1, a 314B-parameter model, and Stability AI's SV3D, an open model generating 3D orbital videos from single images. Nous Research collaborated on implementing steering vectors in llama.cpp.
In Retrieval Augmented Generation (RAG), a new 5.5-hour tutorial builds a pipeline using open-source HF models, and LangChain released a video on query routing and announced integration with NVIDIA NIM for GPU-optimized LLM inference.
Prominent opinions include Yann LeCun distinguishing language from other cognitive abilities, Sam Altman predicting AGI arrival in 6 years with a leap from GPT-4 to GPT-5 comparable to GPT-3 to GPT-4, and discussions on the philosophical status of LLMs like Claude. There is also advice against training models from scratch for most companies.
The world's first fully autonomous AI Engineer
gpt-4 devin cognition-labs openai reinforcement-learning fine-tuning long-term-reasoning planning ai-agents software-engineering model-integration asynchronous-chat ide agentic-ai patrick-collison fred-ehrsam tim-dettmers
Cognition Labs's Devin is highlighted as a potentially groundbreaking AI software engineer agent capable of learning unfamiliar technologies, addressing bugs, deploying frontend apps, and fine-tuning its own AI models. It integrates OpenAI's GPT-4 with reinforcement learning and features tools like asynchronous chat, browser, shell access, and an IDE. The system claims advanced long-term reasoning and planning abilities, attracting praise from investors like Patrick Collison and Fred Ehrsam. The technology is noted for its potential as one of the most advanced AI agents, sparking excitement about agents and AGI.
Fixing Gemma
gemma claude-3-opus claude-3 mistral-large gpt-4 google unsloth anthropic mistral-ai finetuning numerical-precision benchmarking structured-data-extraction adaptive-equalizer information-theory hallucination-detection model-stability daniel-han yann-lecun francois-chollet arav-srinivas _aidan_clark_
Google's Gemma model was found unstable for finetuning until Daniel Han from Unsloth AI fixed 8 bugs, improving its implementation. Yann LeCun explained technical details of a pseudo-random bit sequence for adaptive equalizers, while François Chollet discussed the low information bandwidth of the human visual system. Aravind Srinivas reported that Claude 3 Opus showed no hallucinations in extensive testing, outperforming GPT-4 and Mistral-Large in benchmarks. Reflections from Yann LeCun highlight ongoing AI progress toward human-level intelligence. The community is shifting pipelines to work better with Claude models, and emotional experiences in ML development were shared by Aidan Clark.
FSDP+QLoRA: the Answer to 70b-scale AI for desktop class GPUs
qlora fsdp inflection-2.5 gpt-4 answer.ai hugging-face meta-ai-fair nvidia inflectionai model-training quantization memory-optimization gradient-checkpointing cpu-offloading fine-tuning model-sharding reinforcement-learning chain-of-thought benchmarking jeremy_howard tim_dettmers yann_lecun
Jeremy Howard and collaborators released a new tool combining FSDP, QLoRA, and HQQ to enable training 70B-parameter models on affordable consumer GPUs like RTX 4090s with only 24GB of VRAM, overcoming traditional memory constraints that required expensive data center GPUs costing over $150k. The approach shards quantized models across multiple GPUs and uses techniques like gradient checkpointing and CPU offloading to achieve efficient training on desktop-class hardware. The blogpost details challenges and solutions in integrating these methods, highlighting a significant cost reduction from $150k to under $2.5k for training large language models. Additionally, Twitter recaps mention Inflection AI's Inflection-2.5 model rivaling GPT-4 in benchmarks with less compute, and Grok improving speed by 3x. Yann LeCun discusses multi-step reasoning training for LLMs.
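A hedged sketch of the QLoRA side of that recipe using Hugging Face transformers/peft/bitsandbytes: the base weights load in 4-bit NF4, gradient checkpointing is enabled, and small LoRA adapters are the only trainable parameters. The `bnb_4bit_quant_storage` setting is the detail that lets FSDP shard the quantized weights; the actual Answer.AI release adds its own FSDP wrapping and CPU-offload logic on top, which this sketch omits. Model id and LoRA hyperparameters are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,   # store quantized blocks in a shardable dtype for FSDP
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",             # illustrative; any causal LM works
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
model.gradient_checkpointing_enable()        # trade recompute for activation memory

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)          # only the LoRA adapters require gradients
model.print_trainable_parameters()
```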
Inflection-2.5 at 94% of GPT4, and Pi at 6m MAU
inflection-2.5 claude-3-sonnet claude-3-opus gpt-4 yi-9b mistral inflection anthropic perplexity-ai llamaindex mistral-ai langchain retrieval-augmented-generation benchmarking ocr structured-output video-retrieval knowledge-augmentation planning tool-use evaluation code-benchmarks math-benchmarks mustafa-suleyman amanda-askell jeremyphoward abacaj omarsar0
Mustafa Suleyman announced Inflection 2.5, which achieves more than 94% the average performance of GPT-4 despite using only 40% the training FLOPs. Pi's user base is growing about 10% weekly, with new features like realtime web search. The community noted similarities between Inflection 2.5 and Claude 3 Sonnet. Claude 3 Opus outperformed GPT-4 in a 1.5:1 vote and is now the default for Perplexity Pro users. Anthropic added experimental tool calling support for Claude 3 via LangChain. LlamaIndex released LlamaParse JSON Mode for structured PDF parsing and added video retrieval via VideoDB, enabling retrieval-augmented generation (RAG) pipelines. A paper proposed knowledge-augmented planning for LLM agents. New benchmarks like TinyBenchmarks and the Yi-9B model release show strong code and math performance, surpassing Mistral.
Not much happened today
claude-3 claude-3-opus claude-3-sonnet gpt-4 gemma-2b anthropic perplexity langchain llamaindex cohere accenture mistral-ai snowflake together-ai hugging-face european-space-agency google gpt4all multimodality instruction-following out-of-distribution-reasoning robustness enterprise-ai cloud-infrastructure open-datasets model-deployment model-discoverability generative-ai image-generation
Anthropic released Claude 3, replacing Claude 2.1 as the default on Perplexity AI, with Claude 3 Opus surpassing GPT-4 in capability. Debate continues on whether Claude 3's performance stems from emergent properties or pattern matching. LangChain and LlamaIndex added support for Claude 3 enabling multimodal and tool-augmented applications. Despite progress, current models still face challenges in out-of-distribution reasoning and robustness. Cohere partnered with Accenture for enterprise AI search, while Mistral AI and Snowflake collaborate to provide LLMs on Snowflake's platform. Together AI Research integrates Deepspeed innovations to accelerate generative AI infrastructure. Hugging Face and the European Space Agency released a large earth observation dataset, and Google open sourced Gemma 2B, optimized for smartphones via the MLC-LLM project. GPT4All improved model discoverability for open models. The AI community balances excitement over new models with concerns about limitations and robustness, alongside growing enterprise adoption and open-source contributions. Memes and humor continue to provide social commentary.
Claude 3 just destroyed GPT 4 (see for yourself)
claude-3 claude-3-opus claude-3-sonnet claude-3-haiku gpt-4 anthropic amazon google claude-ai multimodality vision long-context model-alignment model-evaluation synthetic-data structured-output instruction-following model-speed cost-efficiency benchmarking safety mmitchell connor-leahy
Claude 3 from Anthropic launches in three sizes: Haiku (small, unreleased), Sonnet (medium, default on claude.ai, AWS, and GCP), and Opus (large, on Claude Pro). Opus outperforms GPT-4 on key benchmarks like GPQA, impressing benchmark authors. All models support multimodality with advanced vision capabilities, including converting a 2-hour video into a blog post. Claude 3 offers improved alignment, fewer refusals, and extended context length up to 1 million tokens with near-perfect recall. Haiku is noted for speed and cost-efficiency, processing dense research papers in under three seconds. The models excel at following complex instructions and producing structured outputs like JSON. Safety improvements reduce refusal rates, though some criticism remains from experts. Claude 3 is trained on synthetic data and shows strong domain-specific evaluation results in finance, medicine, and philosophy.
Welcome Interconnects and OpenRouter
mistral-large miqu mixtral gpt-4 mistral-7b mistral-ai openai perplexity-ai llamaindex qwen langchain model-comparison model-optimization quantization role-playing story-writing code-clarity ai-assisted-decompilation asynchronous-processing quantum-computing encoder-based-diffusion open-source hardware-experimentation rag-systems nathan-lambert alex-atallah
Analysis of 22 Discord guilds, 349 channels, and 12,885 messages revealed active discussions on model comparisons and optimizations involving Mistral AI, Miqu, and GGUF-quantized models. Highlights include comparing Mistral Large with GPT-4, focusing on cost-effectiveness and performance, and exploring quantization techniques like GPTQ and QLoRA to reduce VRAM usage. Advanced applications such as role-playing, story-writing, code clarity, and AI-assisted decompilation were emphasized, alongside development of tools like an asynchronous summarization script for Mistral 7B. The intersection of quantum computing and AI was discussed, including DARPA-funded projects and encoder-based diffusion techniques for image processing. Community efforts featured new Spanish LLM announcements, hardware experimentation, and open-source initiatives, with platforms like Perplexity AI and LlamaIndex noted for innovation and integration. Speculation about Mistral AI's open-source commitment and tools like R2R for rapid RAG deployment highlighted the collaborative spirit.
Karpathy emerges from stealth?
mistral-7b mixtral-8x7b zephyr-7b gpt-4 llama-2 intel mistral-ai audiogen thebloke tokenization quantization model-optimization fine-tuning model-merging computational-efficiency memory-optimization retrieval-augmented-generation multi-model-learning meta-reasoning dataset-sharing open-source ethical-ai community-collaboration andrej-karpathy
Andrej Karpathy released a comprehensive 2-hour tutorial on tokenization, detailing techniques up to GPT-4's tokenizer and noting the complexity of Llama 2 tokenization with SentencePiece. Discussions in AI Discord communities covered model optimization and efficiency, focusing on quantization of models like Mistral 7B and Zephyr-7B to reduce memory usage for consumer GPUs, including Intel's new weight-only quantization algorithm. Efforts to improve computational efficiency included selective augmentation reducing costs by 57.76% and memory token usage versus kNN for Transformers. Challenges in hardware compatibility and software issues were shared, alongside fine-tuning techniques such as LoRA and model merging. Innovative applications of LLMs in retrieval-augmented generation (RAG), multi-model learning, and meta-reasoning were explored. The community emphasized dataset sharing, open-source releases like SDXL VAE encoded datasets and Audiogen AI codecs, and ethical AI use with censorship and guardrails. Collaboration and resource sharing remain strong in these AI communities.
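The heart of that tutorial is byte-pair encoding: start from raw bytes and repeatedly merge the most frequent adjacent pair into a new token id. A minimal training loop in the spirit of the lecture is sketched below (this is an illustrative reimplementation, not Karpathy's code).

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> dict:
    ids = list(text.encode("utf-8"))           # start from raw bytes, as GPT-style tokenizers do
    merges, next_id = {}, 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))      # count adjacent pairs
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]       # most frequent pair becomes a new token
        merges[pair] = next_id
        new_ids, i = [], 0
        while i < len(ids):                     # replace every occurrence of the pair
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                new_ids.append(next_id)
                i += 2
            else:
                new_ids.append(ids[i])
                i += 1
        ids = new_ids
        next_id += 1
    return merges

print(train_bpe("aaabdaaabac" * 10, num_merges=5))
```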
The Dissection of Smaug (72B)
smaug-72b qwen-1.0 qwen-1.5 gpt-4 mistral-7b miqumaid wizardlm_evol_instruct_v2_196k openhermes-2.5 abacus-ai hugging-face nous-research laion thebloke lm-studio intel nvidia elevenlabs fine-tuning model-merging quantization web-ui model-conversion hardware-setup privacy image-generation optical-character-recognition prompt-engineering bindureddy
Abacus AI launched Smaug 72B, a large finetune of Qwen 1.0, which remains unchallenged on the Hugging Face Open LLM Leaderboard despite skepticism from Nous Research. LAION introduced a local voice assistant model named Bud-E with a notable demo. The TheBloke Discord community discussed model performance trade-offs between large models like GPT-4 and smaller quantized models, fine-tuning techniques using datasets like WizardLM_evol_instruct_V2_196k and OpenHermes-2.5, and challenges in web UI development and model merging involving Mistral-7b and MiquMaid. The LM Studio Discord highlighted issues with model conversion from PyTorch to gguf, hardware setups involving Intel Xeon CPUs and Nvidia P40 GPUs, privacy concerns, and limitations in image generation and web UI availability.
MetaVoice & RIP Bard
mixtral nous-mixtral-dpo miqu-70b gpt-4 llama-2-70b-instruct llama-2 llama-2-70b llama-2-70b-instruct coqui metavoice google openai thebloke text-to-speech voice-cloning longform-synthesis prompt-engineering direct-preference-optimization lora-fine-tuning transformers gpu-acceleration apple-silicon content-authenticity metadata ai-censorship open-source-ai model-comparison usability model-limitations
Coqui, a TTS startup that recently shut down, has inspired a new TTS model from a small startup called MetaVoice, supporting voice cloning and longform synthesis. Google discontinued the Bard brand in favor of Gemini. On TheBloke Discord, discussions focused on AI training with models like Mixtral, Nous Mixtral DPO, and Miqu 70B, comparing them to OpenAI's GPT models, and debated prompt engineering, lorebooks, and removing safety features via LoRA fine-tuning on models such as Llama 2 70B Instruct. Technical topics included transformer layer offloading limitations and adapting LLaMA 2 for Apple Silicon. On OpenAI Discord, DALL-E images now include C2PA metadata for content authenticity, sparking debates on AI censorship, metadata manipulation, and open-source AI models versus commercial giants like GPT-4. Users discussed GPT-4 usability, limitations, and practical applications.
RWKV "Eagle" v5: Your move, Mamba
rwkv-v5 mistral-7b miqu-1-70b mistral-medium llama-2 mistral-instruct-v0.2 mistral-tuna llama-2-13b kunoichi-dpo-v2-7b gpt-4 eleutherai mistral-ai hugging-face llamaindex nous-research rwkv lmsys fine-tuning multilinguality rotary-position-embedding model-optimization model-performance quantization speed-optimization prompt-engineering model-benchmarking reinforcement-learning andrej-karpathy
RWKV v5 Eagle was released with better-than-Mistral-7B evaluation results, trading some English performance for multilingual capabilities. The mysterious miqu-1-70b model sparked debate about its origins, possibly a leak or distillation of Mistral Medium or a fine-tuned Llama 2. Discussions highlighted fine-tuning techniques, including the effectiveness of 1,000 high-quality prompts over larger mixed-quality datasets, and tools like DeepSpeed, Axolotl, and QLoRA. The Nous Research AI community emphasized the impact of Rotary Position Embedding (RoPE) theta settings on LLM extrapolation, improving models like Mistral Instruct v0.2. Speed improvements in Mistral Tuna kernels reduced token processing costs, enhancing efficiency. The launch of Eagle 7B with 7.52B parameters showcased strong multilingual performance, surpassing other 7B-class models.
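The RoPE theta discussion comes down to one formula: rotation frequencies scale as theta^(-2i/d), so a larger base theta slows the rotations and keeps positions distinguishable over longer contexts. A small sketch of computing those angles (standard RoPE math, not any particular model's code):

```python
import torch

def rope_angles(head_dim: int, max_pos: int, theta: float = 10_000.0):
    # inverse frequencies: theta^(-2i/d) for i = 0 .. d/2 - 1
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_pos).float()
    angles = torch.outer(positions, inv_freq)        # (max_pos, head_dim/2)
    return torch.cos(angles), torch.sin(angles)      # applied as pairwise rotations to q/k

# raising theta (e.g. 10k -> 1M) slows the rotation, which is why higher
# RoPE theta settings help models extrapolate to longer contexts
cos_base, _ = rope_angles(128, 32_768, theta=10_000.0)
cos_long, _ = rope_angles(128, 32_768, theta=1_000_000.0)
print(cos_base.shape, cos_long.shape)
```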
GPT4Turbo A/B Test: gpt-4-1106-preview
gpt-4-turbo gpt-4 gpt-3.5 openhermes-2.5-mistral-7b-4.0bpw exllamav2 llama-2-7b-chat mistral-instruct-v0.2 mistrallite llama2 openai huggingface thebloke nous-research mistral-ai langchain microsoft azure model-loading rhel dataset-generation llm-on-consoles fine-tuning speed-optimization api-performance prompt-engineering token-limits memory-constraints text-generation nlp-tools context-window-extension sliding-windows rope-theta non-finetuning-context-extension societal-impact
OpenAI released a new GPT-4 Turbo version, prompting a natural experiment in summarization comparing the November 2023 and January 2024 versions. The TheBloke Discord discussed troubleshooting model-loading errors with OpenHermes-2.5-Mistral-7B-4.0bpw and exllamav2, debates on RHEL in ML, dataset generation for understanding GPT flaws, and running LLMs like Llama and Mistral on consoles. LangChain fine-tuning challenges for Llama 2 were also noted. The OpenAI Discord highlighted GPT-4 speed inconsistencies, API vs. web performance, prompt engineering with GPT-3.5 and GPT-4 Turbo, and DALL-E typo issues in image text. Discussions included NLP tools like semantic-text-splitter and collaboration concerns with GPT-4 Vision on Azure. The Nous Research AI Discord focused on extending context windows with Mistral Instruct v0.2, MistralLite, and LLaMA-2-7B-Chat reaching a 16,384-token context, plus alternatives like SelfExtend for context extension without fine-tuning. The societal impact of AI technology was also considered.
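As a rough illustration of the SelfExtend idea mentioned above, the toy function below remaps relative positions so that nearby keys keep their exact distances while distant keys are bucketed into coarse groups. The neighbor window and group size are illustrative values, not the paper's defaults; this is a sketch of the mechanism, not a drop-in implementation.

```python
def self_extend_rel_pos(q_pos: int, k_pos: int, neighbor: int = 512, group: int = 4) -> int:
    """Toy relative-position remapping in the spirit of SelfExtend.

    Keys within `neighbor` tokens of the query keep their exact relative
    distance (normal attention); keys farther away are merged into coarse
    groups of size `group` (grouped attention), so the largest relative
    position stays near the pretrained window and no fine-tuning is needed.
    """
    rel = q_pos - k_pos
    if rel <= neighbor:
        return rel
    return neighbor + (rel - neighbor) // group

# A 16k-token lookback is squeezed into a far smaller effective distance:
print(self_extend_rel_pos(q_pos=16384, k_pos=0))   # 512 + (16384 - 512) // 4 = 4480
```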
RIP Latent Diffusion, Hello Hourglass Diffusion
gpt-4 latent-diffusion stable-diffusion meta-ai-fair openai hugging-face diffusion-models transformers image-generation model-efficiency fine-tuning quantization prompt-engineering roleplay training-optimization katherine-crowson lucidrains
Katherine Crowson, known for her work on Stable Diffusion, introduces the Hourglass Diffusion architecture: a hierarchical pure-transformer backbone for diffusion-based image generation that scales efficiently to megapixel resolutions with under 600 million parameters, improving on the original ~900M-parameter model. The architecture processes local and global image structure separately, improving efficiency and resolution without a latent stage. Additionally, Meta's Self-Rewarding LM paper has inspired lucidrains to begin an implementation. Discord summaries highlight GPT-4's robustness against quantization tricks, discussions of open-source GPT-0 alternatives, challenges with DPO training on limited VRAM with suggestions like QLoRA and rmsprop, and efforts to improve roleplay model consistency through fine-tuning and merging. Philosophical debates on AI sentience and GPT-4 customization for markdown and translation tasks were also noted.
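To make the hourglass idea concrete, here is a heavily simplified PyTorch sketch: tokens are processed at full resolution, merged down so that global attention becomes affordable, then upsampled with a skip connection. The real architecture uses neighborhood attention at the fine levels; full attention is used at both levels here purely to keep the toy short, and all module sizes are made up for illustration.

```python
import torch
import torch.nn as nn

class GlobalBlock(nn.Module):
    """Full self-attention over all tokens (cheap only at low resolution)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                              # x: (B, N, C)
        h = self.norm(x)
        return x + self.attn(h, h, h, need_weights=False)[0]

class ToyHourglass(nn.Module):
    """Fine tokens -> downsample -> global attention -> upsample + skip."""
    def __init__(self, dim=256):
        super().__init__()
        self.fine = GlobalBlock(dim)                   # stand-in for local/window attention
        self.down = nn.Conv2d(dim, dim, kernel_size=2, stride=2)        # 2x token merge
        self.mid = GlobalBlock(dim)
        self.up = nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2)

    def forward(self, x):                              # x: (B, C, H, W) feature map
        B, C, H, W = x.shape
        t = self.fine(x.flatten(2).transpose(1, 2))    # attention at full resolution
        skip = t.transpose(1, 2).reshape(B, C, H, W)
        x = self.down(skip)                            # half resolution, 4x fewer tokens
        t = self.mid(x.flatten(2).transpose(1, 2))     # global attention is affordable here
        x = t.transpose(1, 2).reshape(B, C, H // 2, W // 2)
        return self.up(x) + skip                       # hourglass skip connection

print(ToyHourglass()(torch.randn(1, 256, 16, 16)).shape)  # torch.Size([1, 256, 16, 16])
```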
Sama says: GPT-5 soon
gpt-5 mixtral-7b gpt-3.5 gemini-pro gpt-4 llama-cpp openai codium thebloke amd hugging-face mixture-of-experts fine-tuning model-merging 8-bit-optimization gpu-acceleration performance-comparison command-line-ai vector-stores embeddings coding-capabilities sam-altman ilya-sutskever itamar andrej-karpathy
Sam Altman at Davos said his top priority is launching the new model, likely called GPT-5, while expressing uncertainty about Ilya Sutskever's employment status. Itamar from Codium introduced the concept of Flow Engineering with AlphaCodium, gaining attention from Andrej Karpathy. On the TheBloke Discord, engineers discussed a multi-specialty mixture-of-experts (MoE) model combining seven distinct 7-billion-parameter models specialized in areas such as law, finance, and medicine. Debates on 8-bit fine-tuning and using bitsandbytes with GPU support were prominent. Discussions also covered model merging using tools like Mergekit and compatibility with the Alpaca format. Interest in optimizing AI models on AMD hardware using the AOCL BLAS and LAPACK libraries with llama.cpp was noted. Users experimented with AI for command-line tasks, and the Mixtral MoE model was refined to surpass larger models in coding ability. Comparisons among LLMs such as GPT-3.5, Mixtral, Gemini Pro, and GPT-4 focused on knowledge depth, problem-solving, and speed, especially for coding tasks.
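For context on the 8-bit fine-tuning debate, here is a minimal sketch of loading a model in 8-bit via bitsandbytes through the transformers API; the checkpoint name is just an example, and in practice a LoRA/PEFT adapter would then be trained on top of the quantized weights.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load a 7B model with bitsandbytes 8-bit quantization so it fits on a single
# consumer GPU; only small adapter weights (e.g. LoRA) are trained afterwards.
model_id = "mistralai/Mistral-7B-v0.1"   # example checkpoint
quant_cfg = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_cfg,
    device_map="auto",            # let accelerate place layers on available GPUs
    torch_dtype=torch.float16,
)
print(model.get_memory_footprint() / 1e9, "GB")
```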
1/10/2024: All the best papers for AI Engineers
chatgpt gpt-4 dall-e-3 stable-diffusion deepseek-moe openai deepseek-ai prompt-engineering model-release rate-limiting ethics image-generation moe collaborative-workspaces data-privacy abdubs darthgustav
OpenAI launched the GPT Store featuring over 3 million custom versions of ChatGPT accessible to Plus, Team, and Enterprise users, with weekly highlights of impactful GPTs like AllTrails. The new ChatGPT Team plan offers advanced models including GPT-4 and DALL·E 3, alongside collaborative tools and enhanced data privacy. Discussions around AI-generated imagery favored DALL·E and Stable Diffusion, while users faced rate-limit challenges and debated the GPT Store's SEO and categorization. Ethical considerations in prompt engineering were raised via "The Sieve," a three-layer ethical framework. Additionally, DeepSeek-MoE was noted for its range of Mixture of Experts (MoE) model sizes.
1/6-7/2024: LLaMA Pro - an alternative to PEFT/RAG??
llama-3 llama-3-1-1b llama-3-8-3b gpt-4 gpt-3.5 dall-e openai mistral-ai llamaindex langchain fine-tuning model-expansion token-limits privacy multilinguality image-generation security custom-models model-training yannic-kilcher
New research papers introduce promising Llama Extensions including TinyLlama, a compact 1.1B parameter model pretrained on about 1 trillion tokens for 3 epochs, and LLaMA Pro, an 8.3B parameter model expanding LLaMA2-7B with additional training on 80 billion tokens of code and math data. LLaMA Pro adds layers to avoid catastrophic forgetting and balances language and code tasks but faces scrutiny for not using newer models like Mistral or Qwen. Meanwhile, OpenAI Discord discussions reveal insights on GPT-4 token limits, privacy reassurances, fine-tuning for GPT-3.5, challenges with multi-language image recognition, custom GPT creation requiring ChatGPT Plus, and security concerns in GPT deployment. Users also share tips on dynamic image generation with DALL-E and logo creation.
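A minimal sketch of the block-expansion idea behind LLaMA Pro follows, assuming the Hugging Face Llama module layout (model.model.layers, self_attn.o_proj, mlp.down_proj) and an example checkpoint: copied decoder blocks are zero-initialized on their output projections so they start as identity mappings, the original blocks are frozen, and only the inserted blocks are trained on the new code/math data.

```python
import copy
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example base
layers = model.model.layers                   # ModuleList of 32 decoder blocks
expand_every = 4                              # insert one new block per 4 original ones

new_layers = torch.nn.ModuleList()
for i, layer in enumerate(layers):
    layer.requires_grad_(False)               # freeze the pretrained blocks
    new_layers.append(layer)
    if (i + 1) % expand_every == 0:
        block = copy.deepcopy(layer)
        # Zero the output projections so the new block is an identity map at
        # init (its residual branch contributes nothing), avoiding any
        # immediate degradation of the pretrained model.
        torch.nn.init.zeros_(block.self_attn.o_proj.weight)
        torch.nn.init.zeros_(block.mlp.down_proj.weight)
        block.requires_grad_(True)            # only expanded blocks are trained
        new_layers.append(block)

model.model.layers = new_layers
model.config.num_hidden_layers = len(new_layers)
# (A real implementation would also fix per-layer indices for KV caching.)
print(f"expanded from {len(layers)} to {len(new_layers)} layers")
```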
12/23/2023: NeurIPS Best Papers of 2023
gpt-4 palm2 hermes-2.5 mistral-7b nous-research hugging-face apple context-length malware-security video-content music-content linear-layers api-access large-language-models embedding vector-databases model-merging model-interpretability striped-hyena-architecture quantization rmsnorm attention-mechanisms
The Latent Space Pod released a 3-hour recap of the best NeurIPS 2023 papers. The Nous Research AI Discord community discussed optimizing AI performance with shorter context lengths, malware security concerns linked to Hugging Face, and shared insights on video and music content. Technical discussions included the DYAD research paper proposing a faster alternative to linear layers, Apple's ML Ferret machine learning tool, and accessing PaLM 2 via API. The community also explored large language models with a focus on specialized models, data scaling, embedding/vector databases, model merging, and interpretability, with mentions of Hermes 2.5, GPT-4, and Mistral. Additionally, there were conversations on the Striped Hyena architecture, quantization challenges, and fixes related to RMSNorm and the "Attention is All You Need" paper.
12/22/2023: Anyscale's Benchmark Criticisms
gpt-4 gpt-3.5 bard anyscale openai microsoft benchmarking performance api prompt-engineering bug-tracking model-comparison productivity programming-languages storytelling
Anyscale launched their LLMPerf leaderboard to benchmark large language model inference performance, but it faced criticism for lacking detailed metrics like cost per token and throughput, and for comparing public LLM endpoints without accounting for batching and load. In OpenAI Discord discussions, users reported issues with Bard and preferred Microsoft Copilot for storytelling, noting fewer hallucinations. There was debate on the value of upgrading from GPT-3.5 to GPT-4, with many finding paid AI models worthwhile for coding productivity. Bugs and performance issues with OpenAI APIs were also highlighted, including slow responses and message limits. Future AI developments like GPT-6 and concerns about OpenAI's transparency and profitability were discussed. Prompt engineering for image generation was another active topic, emphasizing clear positive prompts and the desire for negative prompts.
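The metrics critics wanted are straightforward to compute per request; the sketch below (with made-up numbers) derives time to first token, inter-token latency, throughput, and cost per token from a single request trace.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    request_start: float        # seconds (monotonic clock)
    first_token_time: float
    end_time: float
    output_tokens: int
    price_per_1k_tokens: float  # endpoint's advertised output price, USD

def summarize(t: RequestTrace) -> dict:
    """Per-request metrics of the kind the LLMPerf critics asked for."""
    ttft = t.first_token_time - t.request_start
    gen_time = t.end_time - t.first_token_time
    return {
        "time_to_first_token_s": ttft,
        "inter_token_latency_ms": 1000 * gen_time / max(t.output_tokens - 1, 1),
        "throughput_tok_per_s": t.output_tokens / (t.end_time - t.request_start),
        "cost_usd": t.output_tokens / 1000 * t.price_per_1k_tokens,
    }

print(summarize(RequestTrace(0.0, 0.35, 4.1, 256, 0.002)))
```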
12/21/2023: The State of AI (according to LangChain)
mixtral gpt-4 chatgpt bard dall-e langchain openai perplexity-ai microsoft poe model-consistency model-behavior response-quality chatgpt-usage-limitations error-handling user-experience model-comparison hallucination-detection prompt-engineering creative-ai
LangChain published their first State of AI report, based on LangSmith usage stats, charting which models and tools hold the most mindshare. On OpenAI's Discord, users raised issues about the Mixtral model, noting inconsistencies and comparing it to Poe's Mixtral. There were reports of declining output quality and unpredictable behavior in GPT-4 and ChatGPT, with discussions on differences between Playground GPT-4 and ChatGPT GPT-4. Users also reported anomalous behavior in the Bing and Bard AI models, including hallucinations and strange assertions. Other user concerns included message limits on GPT-4, response-completion errors, chat lags, inaccessible voice settings, password-reset failures, 2FA issues, and subscription restrictions. Techniques for guiding GPT-4 outputs and creative uses of DALL-E were also discussed, along with financial constraints affecting subscriptions and questions about earning with ChatGPT and token costs.
12/20/2023: Project Obsidian - Multimodal Mistral 7B from Nous
gpt-4 gpt-3.5 dall-e-3 nous-research teknium openai multimodality image-detection security-api bias facial-recognition healthcare-ai gpu-optimization prompt-engineering vision
Project Obsidian is a multimodal model being trained publicly, tracked by Teknium on the Nous Discord. Discussions include 4M: Massively Multimodal Masked Modeling and Reason.dev, a TypeScript framework for LLM applications. The OpenAI Discord community discussed hardware specs for running TensorFlow JS for image detection, security API ideas for filtering inappropriate images, and concerns about racial and cultural bias in AI, especially in facial recognition and healthcare. Challenges with GPT-3.5 and GPT-4 in word puzzle games were noted, along with GPU recommendations prioritizing VRAM for AI inference. Users also debated GPT-4's vision capabilities, limitations of DALL·E 3, platform access issues, and prompting strategies for better outputs.
12/19/2023: Everybody Loves OpenRouter
gpt-4 gpt-3.5 mixtral-8x7b-instruct dolphin-2.0-mistral-7b gemini openai mistral-ai google hugging-face performance memory-management api prompt-engineering local-language-models translation censorship video-generation
OpenRouter offers an easy OpenAI-compatible proxy for Mixtral-8x7b-instruct. Discord discussions highlight GPT-4 performance and usability issues compared to GPT-3.5, including memory management and accessibility problems. Users debate local language models versus OpenAI API usage, with mentions of Dolphin 2.0 Mistral 7B and Google's video generation project. Prompt engineering and custom instructions for GPT models are also key topics. Concerns about censorship on models like Gemini and translation tool preferences such as DeepL were discussed.
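Concretely, "OpenAI-compatible" means the standard OpenAI SDK works once the base URL and key point at OpenRouter; the model slug and environment variable below are illustrative.

```python
import os
from openai import OpenAI

# The regular OpenAI client, pointed at OpenRouter's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="mistralai/mixtral-8x7b-instruct",   # example OpenRouter model slug
    messages=[{"role": "user", "content": "Summarize RLHF in two sentences."}],
)
print(resp.choices[0].message.content)
```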
12/16/2023: ByteDance suspended by OpenAI
claude-2.1 gpt-4-turbo gemini-1.5-pro gpt-5 gpt-4.5 gpt-4 openai google-deepmind anthropic hardware gpu api-costs coding model-comparison subscription-issues payment-processing feature-confidentiality ai-art-generation organizational-productivity model-speculation
The OpenAI Discord community discussed hardware options like Mac racks and the A6000 GPU, highlighting their value for AI workloads. They compared Claude 2.1 and GPT-4 Turbo on coding tasks, with GPT-4 Turbo outperforming Claude 2.1. The benefits of the Bard API for Gemini Pro were noted, including a free quota of 60 queries per minute. Users shared experiences with ChatGPT Plus membership issues and payment problems, and speculated about the upcoming GPT-5 and the rumored GPT-4.5. Discussions also covered the confidentiality of the Alpha feature, AI art generation policies, and improvements in organizational work features. The community expressed mixed feelings about GPT-4's performance and awaited future model updates.
12/15/2023: Mixtral-Instruct beats Gemini Pro (and matches GPT3.5)
mixtral gemini-pro gpt-3.5 gpt-4.5 gpt-4 chatgpt lmsys openai deepseek cloudflare huggingface performance context-window prompt-engineering privacy local-gpu cloud-gpu code-generation model-comparison model-usage api-errors karpathy
Thanks to a Karpathy shoutout, LMSYS now has enough data to rank Mixtral and Gemini Pro. The discussion highlights the impressive performance of state-of-the-art open-source models like Mixtral that can run on laptops. In the OpenAI Discord, users compared AI tools like Perplexity and ChatGPT's browsing tool, favoring Perplexity for its superior data gathering, pricing, and usage limits. Interest was shown in AI's ability to convert large code files, with DeepSeek Coder recommended. Debates on the privacy implications of AI advancement and the challenges of running LLMs on local and cloud GPUs were prominent. Users reported issues with ChatGPT including performance problems, loss of access to custom GPTs, and unauthorized access. Discussions also covered prompt engineering for large context windows and speculation about future GPT-4.5 and GPT-4 developments.
12/14/2023: $1e7 for Superalignment
gemini bard gpt-4 gpt-4.5 llama-2 openai llamaindex perplexity-ai prompt-engineering api custom-gpt json bug-fixes chatbots performance tts code-generation image-recognition jan-leike patrick-collison
Jan Leike is launching a new $10M fast-grants initiative for superalignment research, inspired by Patrick Collison's Fast Grants. OpenAI introduced a new developers Twitter handle @OpenAIDevs for community updates. Discussions of Google's Gemini and Bard chatbots highlight their ability to read each other's instructions and offer unique coding solutions. Users reported various issues with GPT-4, including performance problems, customization difficulties, and a resolved bug in image recognition. There are ongoing conversations about prompt-engineering challenges and new JSON mode support in Convo-lang for API use. Concerns about misuse of chatbots for illegal activities and alternatives like Llama 2 models and the Perplexity chatbot were also discussed.
12/13/2023 SOLAR10.7B upstages Mistral7B?
solar-10.7b llama-2 mistral-7b phi-2 gpt-4 gemini upstage nous-research openai mistral-ai microsoft depth-up-scaling pretraining synthetic-data gpu-training api-usage model-integration agi asi chat-models vision model-performance fine-tuning
Upstage released the SOLAR-10.7B model, which uses a novel Depth Up-Scaling technique built on the Llama 2 architecture and initialized with Mistral 7B weights, followed by continued pretraining. The Nous community finds it promising but not exceptional. Additionally, weights for the Phi-2 base model were released, trained on 1.4 trillion tokens including synthetic texts created by GPT-3 and filtered by GPT-4, using 96 A100 GPUs over 14 days. On OpenAI's Discord, users discussed challenges with various GPT models, including incoherent outputs, API usage limitations, and issues with the GPT-4 Vision API. Conversations also covered understanding AGI and ASI, concerns about OpenAI's partnership with Axel Springer, and pricing changes for GPT Plus. Discussions included the Gemini chat model integrated into Bard and comparisons with GPT-4 performance.
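A rough sketch of the Depth Up-Scaling recipe follows, assuming the Hugging Face Llama/Mistral module layout and the paper's numbers (32-layer base, 8 layers dropped from each copy, 48 layers total); the checkpoint name is an example and the continued pretraining that heals the seam is omitted.

```python
import copy
import torch.nn as nn
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # example 32-layer base
layers = base.model.layers
n, m = len(layers), 8                       # 32 layers, drop 8 from each copy

# Depth Up-Scaling: [first n-m layers] + [last n-m layers] -> 2*(n-m) = 48 layers.
front = [copy.deepcopy(l) for l in layers[: n - m]]
back = [copy.deepcopy(l) for l in layers[m:]]

base.model.layers = nn.ModuleList(front + back)
base.config.num_hidden_layers = len(base.model.layers)
# (A real implementation would also fix per-layer cache indices before training.)
print(base.config.num_hidden_layers)        # 48 -- continued pretraining then recovers quality
```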
12/12/2023: Towards LangChain 0.1
mixtral-8x7b phi-2 gpt-3 chatgpt gpt-4 langchain mistral-ai anthropic openai microsoft mixture-of-experts information-leakage prompt-engineering oauth2 logo-generation education-ai gaming-ai api-access model-maintainability scalability
The Langchain rearchitecture has been completed, splitting the repo for better maintainability and scalability, while remaining backwards compatible. Mistral launched a new Discord community, and Anthropic is rumored to be raising another $3 billion. On the OpenAI Discord, discussions covered information leakage in AI training, mixture of experts (MoE) models like mixtral 8x7b, advanced prompt engineering techniques, and issues with ChatGPT performance and API access. Users also explored AI applications in logo generation, education, and gaming, and shared solutions for Oauth2 authentication problems. A new small language model named Phi-2 was mentioned from Microsoft.
12/11/2023: Mixtral beats GPT3.5 and Llama2-70B
mixtral-8x7b gpt-4 gpt-3.5-turbo llama-3 openhermes-2.5 llava-v1.5-13b-gptq mistral-ai openai huggingface sparse-mixture-of-experts fine-tuning quantization gpu-hardware transformers model-deployment open-source coding-datasets
Mistral AI announced the Mixtral 8x7B model featuring a Sparse Mixture of Experts (SMoE) architecture, sparking discussions on its potential to rival GPT-4. The community debated GPU hardware options for training and fine-tuning transformer models, including RTX 4070s, the A4500, RTX 3090s with NVLink, and A100 GPUs. Interest was expressed in fine-tuning Mixtral and producing quantized versions, alongside curating high-quality coding datasets. Resources shared include a YouTube video on open-source model deployment, an arXiv paper, GitHub repositories, and a blog post on Mixture-of-Experts. Discussions also touched on potential open-source releases of GPT-3.5 Turbo and Llama 3, and running OpenHermes 2.5 on a Mac M3 Pro with VRAM considerations.
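As a refresher on what a Sparse Mixture of Experts layer does, here is a toy top-2-of-8 router in PyTorch, in the spirit of Mixtral's FFN blocks but with made-up dimensions; per token, only the two selected experts run, which is why active compute per token is much smaller than the total parameter count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySparseMoE(nn.Module):
    """Minimal top-2-of-8 mixture-of-experts FFN sketch."""
    def __init__(self, dim=512, hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, dim)
        logits = self.router(x)                  # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):              # only top_k experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out

print(ToySparseMoE()(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```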
12/10/2023: not much happened today
mixtral-8x7b-32kseqlen mistral-7b stablelm-zephyr-3b openhermes-2.5-neural-chat-v3-3-slerp gpt-3.5 gpt-4 nous-research openai mistral-ai hugging-face ollama lm-studio fine-tuning mixture-of-experts model-benchmarking inference-optimization model-evaluation open-source decentralized-ai gpu-optimization community-engagement andrej-karpathy yann-lecun richard-blythman gabriel-syme pradeep1148 cyborg_1552
The Nous Research AI Discord community discussed attending NeurIPS and organizing future AI events in Australia. Highlights include interest in open-source and decentralized AI projects, with Richard Blythman seeking co-founders. Users shared projects like Photo GPT AI and introduced StableLM Zephyr 3B. The Mixtral model, based on Mistral, sparked debate on performance and GPU requirements, with comparisons to GPT-3.5 and potential competitiveness with GPT-4 after fine-tuning. Tools like TensorBoard, Wandb, and LlamaHub were noted for fine-tuning and evaluation. Discussions covered Mixture of Experts (MoE) architectures, fine-tuning with limited data, and inference-optimization strategies for ChatGPT. Memes and community interactions referenced AI figures like Andrej Karpathy and Yann LeCun. The community also shared resources such as GitHub links and YouTube videos related to these models and tools.
12/8/2023 - Mamba v Mistral v Hyena
mistral-8x7b-moe mamba-3b stripedhyena-7b claude-2.1 gemini gpt-4 dialogrpt-human-vs-machine cybertron-7b-v2-gguf falcon-180b mistral-ai togethercompute stanford anthropic google hugging-face mixture-of-experts attention-mechanisms prompt-engineering alignment image-training model-deployment gpu-requirements cpu-performance model-inference long-context model-evaluation open-source chatbots andrej-karpathy tri-dao maxwellandrews raddka
Three new AI models are highlighted: Mistral's 8x7B MoE model (Mixtral), Mamba models up to 3B by Together, and StripedHyena 7B, a competitive subquadratic attention model from Stanford's Hazy Research. Discussions on Anthropic's Claude 2.1 focus on its prompting technique and alignment challenges. The Gemini AI from Google is noted as potentially superior to GPT-4. The community also explores Dreambooth for image training and shares resources like the DialogRPT-human-vs-machine model on Hugging Face. Deployment challenges for large language models, including CPU performance and GPU requirements, are discussed with references to Falcon 180B and transformer batching techniques. User engagement includes meme sharing and humor.
12/7/2023: Anthropic says "skill issue"
claude-2.1 gpt-4 gpt-3.5 gemini-pro gemini-ultra gpt-4.5 chatgpt bingchat dall-e gpt-5 anthropic openai google prompt-engineering model-performance regulation language-model-performance image-generation audio-processing midi-sequence-analysis subscription-issues network-errors
Anthropic fixed a glitch in their Claude 2.1 model's needle-in-a-haystack test by adding a line to the prompt. Discussions on OpenAI's Discord compared Google's Gemini Pro and Gemini Ultra models with OpenAI's GPT-4 and GPT-3.5, with some users finding GPT-4 superior in benchmarks. Rumors about a GPT-4.5 release circulated without official confirmation. Concerns were raised about "selective censorship" affecting language model performance. The EU's potential regulation of AI, including ChatGPT, was highlighted. Users reported issues with ChatGPT Plus message limits and subscription upgrades, and shared experiences with BingChat and DALL-E. The community discussed prompt engineering techniques and future applications like image generation and MIDI sequence analysis, expressing hopes for GPT-5.
Is Google's Gemini... legit?
gemini gemini-pro gemini-ultra gpt-4 gpt-3.5 claude-2.1 palm2 google openai chain-of-thought context-windows prompt-engineering model-evaluation multimodality speech-processing chatbot-errors subscription-management swyx
Google's Gemini AI model is generating significant discussion and skepticism, especially regarding its 32-shot chain of thought MMLU claim and 32k context window. The community is comparing Gemini's performance and capabilities with OpenAI's GPT-4 and GPT-3.5, highlighting the upcoming Gemini Pro and Gemini Ultra models on the Bard platform. Users report various OpenAI service issues including chatbot errors and subscription problems. Discussions also cover prompt engineering techniques, AI model evaluation comparing GPT-4, Claude 2.1, and PaLM2, and improvements in speech and multimodal capabilities. The bot now supports reading and summarizing links from platforms like arXiv, Twitter, and YouTube, enhancing user interaction.