Frozen AI News archive

Creating a LLM-as-a-Judge

**Anthropic** released details on Claude 3.5 SWEBench+SWEAgent, while **OpenAI** introduced SimpleQA and **DeepMind** launched NotebookLM. **Apple** announced new M4 MacBooks, and a new SOTA image model, Recraft v3, emerged. Hamel Husain presented a detailed 6,000-word treatise on creating LLM judges using a method called **critique shadowing** to align LLMs with domain experts, addressing the problem of untrusted and unused data in AI teams. The workflow involves expert-reviewed datasets and iterative prompt refinement. Additionally, **Zep** introduced a temporal knowledge graph memory layer to improve AI agent memory and reduce hallucinations. **Anthropic** also integrated Claude 3.5 Sonnet with GitHub Copilot, expanding access to Copilot Chat users.

AI News for 10/29/2024-10/30/2024. We checked 7 subreddits, 433 Twitters and 32 Discords (231 channels and 2558 messages) for you. Estimated reading time saved (at 200wpm): 241 minutes. You can now tag @smol_ai for AINews discussions!

On a day when Anthropic (Claude 3.5 SWEBench+SWEAgent details), OpenAI (SimpleQA), DeepMind (NotebookLM), Apple (M4 MacBooks), and a mysterious new SOTA image model (Recraft v3) all have releases, it is rare to focus on news from a smaller name, but we love news you can use.

After his hit Your AI Product Needs Evals (our coverage here), Hamel Husain is back with an epic 6,000-word treatise on Creating a LLM-as-a-Judge That Drives Business Results, with a clear problem statement: AI teams have too much data they don't trust and don't use.

image.png

There are a lot of standard themes echoed in Hamel's AI.Engineer talk (as well as the very fun Weights & Biases one), but this piece is notable for its strong recommendation of critique shadowing as a way to create few-shot examples that align LLM judges with domain experts:

image.png

Critique Shadowing TLDR:

The final workflow looks like this:

image.png

Handy, and, as Hamel mentions in the article, this is our critique-and-domain-expert-heavy iterative process for building AINews as well!
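To make the critique shadowing idea a bit more concrete, here is a minimal Python sketch of the two pieces the description implies: turning expert-reviewed examples into few-shot prompts for a judge, and measuring how often the judge agrees with the expert so the prompt can be refined iteratively. Everything named here (`ExpertReview`, `build_judge_prompt`, `agreement_rate`, the binary pass/fail verdict, the prompt wording) is our own assumption for illustration, not code from Hamel's article, and the judge is passed in as a plain callable so no particular LLM API is implied.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExpertReview:
    """One domain-expert judgment (hypothetical schema): verdict plus critique."""
    user_input: str
    model_output: str
    passed: bool
    critique: str


def build_judge_prompt(
    examples: List[ExpertReview], user_input: str, model_output: str
) -> str:
    """Render expert critiques as few-shot examples, then append the new case."""
    parts = [
        "You are evaluating an AI response. "
        "Write a critique, then answer PASS or FAIL."
    ]
    for ex in examples:
        verdict = "PASS" if ex.passed else "FAIL"
        parts.append(
            f"Input: {ex.user_input}\nResponse: {ex.model_output}\n"
            f"Critique: {ex.critique}\nVerdict: {verdict}"
        )
    # The judge is expected to continue from here with its own critique.
    parts.append(f"Input: {user_input}\nResponse: {model_output}\nCritique:")
    return "\n\n".join(parts)


def agreement_rate(
    judge: Callable[[str], bool],
    few_shot: List[ExpertReview],
    held_out: List[ExpertReview],
) -> float:
    """Fraction of held-out expert verdicts that the judge reproduces.

    `judge` is any callable mapping a prompt to True (pass) or False (fail);
    in practice it would wrap an LLM call and parse the verdict.
    """
    if not held_out:
        raise ValueError("held_out must contain at least one expert review")
    matches = 0
    for review in held_out:
        prompt = build_judge_prompt(few_shot, review.user_input, review.model_output)
        if judge(prompt) == review.passed:
            matches += 1
    return matches / len(held_out)
```

In this sketch, a low agreement rate would be the signal to revisit the few-shot examples or the prompt wording with the domain expert and measure again.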


[Sponsored by Zep] Why do AI agents need a memory layer, anyway? Well, including the full interaction history in prompts leads to hallucinations, poor recall, and costly LLM calls. Plus, most RAG pipelines struggle with temporal data, where facts change over time. Zep is a new service that tackles these problems using a unique structure called a temporal knowledge graph. Get up and running in minutes with the quickstart.

swyx's commentary: the docs for the 4 memory APIs of Zep also helped me better understand the scope of what Zep does/doesn't do and helped give a better mental model of what a chatbot memory API should look like agnostic of Zep. Worthwhile!
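For readers unfamiliar with the term, the sketch below illustrates the general idea of a temporal knowledge graph in Python: each fact carries a validity interval, so a query can ask what was true as of a given date. This is not Zep's API or implementation. The class and method names (`TemporalFact`, `TemporalGraph`, `assert_fact`, `query`) are hypothetical, and the sketch assumes facts arrive in chronological order and that each subject/predicate pair holds one value at a time.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class TemporalFact:
    """An edge (subject, predicate, obj) that holds during [valid_from, valid_to)."""
    subject: str
    predicate: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact is still current


class TemporalGraph:
    """Toy in-memory store of time-scoped facts (illustrative only)."""

    def __init__(self) -> None:
        self.facts: List[TemporalFact] = []

    def assert_fact(self, subject: str, predicate: str, obj: str, when: date) -> None:
        """Record a new fact and close any open fact it supersedes."""
        for fact in self.facts:
            if (
                fact.subject == subject
                and fact.predicate == predicate
                and fact.valid_to is None
            ):
                fact.valid_to = when  # the old value stops being true here
        self.facts.append(TemporalFact(subject, predicate, obj, when))

    def query(self, subject: str, predicate: str, as_of: date) -> Optional[str]:
        """Return the value that held on `as_of`, or None if nothing was known."""
        for fact in self.facts:
            if (
                fact.subject == subject
                and fact.predicate == predicate
                and fact.valid_from <= as_of
                and (fact.valid_to is None or as_of < fact.valid_to)
            ):
                return fact.obj
        return None
```

The point of the design is that an update never overwrites history: the superseded fact is kept with a closed interval, so older context can still be answered correctly.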


{% if medium == 'web' %}

Table of Contents

[TOC]

{% else %}

The Table of Contents and Channel Summaries have been moved to the web version of this email: [{{ email.subject }}]({{ email_url }})!

{% endif %}


AI Twitter Recap

all recaps done by Claude 3.5 Sonnet, best of 4 runs.

GitHub Copilot and AI Integration

AI Advancements and Research

AI Applications and Tools

Programming Languages and Tools

Memes and Humor


AI Reddit Recap

/r/LocalLlama Recap

Theme 1. Apple's M4 Mac Mini: A New Contender for AI Development

Theme 2. Stable Diffusion 3.5 Medium Released on Hugging Face

Theme 3. AI Safety and Alignment: Debates and Criticisms

Other AI Subreddit Recap

r/machinelearning, r/openai, r/stablediffusion, r/ArtificialInteligence, /r/LLMDevs, /r/Singularity

AI Research and Techniques

AI Applications and Impacts

AI Model Releases and Improvements

AI Industry and Business

AI Ethics and Societal Impact

Memes and Humor


AI Discord Recap

A summary of Summaries of Summaries by O1-preview

Theme 1: Apple's M4 Chips Supercharge AI Performance

Theme 2: AI Models Stir Up the Community with Updates and Controversies

Theme 3: Fine-Tuning and Training Hurdles Challenge AI Developers

Theme 4: AI Disrupts Software Engineering and Automation Tools Flourish

Theme 5: OpenAI Tackles Factuality and Enhances User Experience


PART 1: High-level Discord summaries

LM Studio Discord


HuggingFace Discord


Unsloth AI (Daniel Han) Discord


OpenAI Discord


Perplexity AI Discord


aider (Paul Gauthier) Discord


OpenRouter (Alex Atallah) Discord


Stability.ai (Stable Diffusion) Discord


Nous Research AI Discord


Eleuther Discord


Interconnects (Nathan Lambert) Discord


Notebook LM Discord Discord


GPU MODE Discord


Torchtune Discord


Cohere Discord


Latent Space Discord


tinygrad (George Hotz) Discord


Modular (Mojo 🔥) Discord


LlamaIndex Discord


DSPy Discord


OpenInterpreter Discord


LangChain AI Discord


OpenAccess AI Collective (axolotl) Discord


LAION Discord


LLM Agents (Berkeley MOOC) Discord


Mozilla AI Discord


Gorilla LLM (Berkeley Function Calling) Discord


The Alignment Lab AI Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The LLM Finetuning (Hamel + Dan) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The DiscoResearch Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The AI21 Labs (Jamba) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


PART 2: Detailed by-Channel summaries and links

{% if medium == 'web' %}

LM Studio ▷ #general (161 messages🔥🔥):

  • Apple's MacBook Pro Announcement
  • M3 Max Performance
  • Model Access and Inference
  • H100 GPU Rental Pricing
  • Local vs. Remote Model Usage

LM Studio ▷ #hardware-discussion (600 messages🔥🔥🔥):

  • M4 Ultra expectations
  • Comparison of Apple Silicon
  • GPU performance discussion
  • AI model fitting on GPUs
  • Windows vs. Linux for AI tasks

HuggingFace ▷ #general (326 messages🔥🔥):

  • Hugging Face API Usage
  • Image Analysis with Ollama
  • Machine Learning Education
  • Transformers and Attention Models
  • Docker Spaces and Private Images

HuggingFace ▷ #today-im-learning (8 messages🔥):

  • Llama-3.1 70B compatibility
  • Parallel computing setup
  • Fine-tuning datasets on Hugging Face
  • LLM recommendations for Q&A and sentiment analysis

Link mentioned: What if my dataset isn’t on the Hub? - Hugging Face NLP Course: no description found


HuggingFace ▷ #cool-finds (17 messages🔥):

  • Latent Space Regularization
  • Anthropic Agent in LlamaIndex
  • Computational Modeling Guidelines
  • Turing's Contributions
  • Nomic Atlas Insights

HuggingFace ▷ #i-made-this (3 messages):

  • Transformer Tokenizer Updates
  • GPUs and Docker Integration
  • New Blog Post on Docker
  • Dstack Task Configurations

Link mentioned: Using Docker and Docker Compose inside GPU-enabled containers - dstack: The latest release of dstack allows for the direct use of Docker and Docker Compose within run configurations.


HuggingFace ▷ #computer-vision (2 messages):

  • User Engagement
  • Future Discussions

HuggingFace ▷ #NLP (10 messages🔥):

  • Qwen 2 model issues
  • Langchain SQL agent with GPT-4
  • Mini Omni 2 feedback

HuggingFace ▷ #diffusion-discussions (3 messages):

  • Diffusion models for non-standard data
  • FoldingDiff project
  • Consistency Models in AI

Link mentioned: GitHub - microsoft/foldingdiff: Diffusion models of protein structure; trigonometry and attention are all you need!: Diffusion models of protein structure; trigonometry and attention are all you need! - microsoft/foldingdiff


Unsloth AI (Daniel Han) ▷ #general (89 messages🔥🔥):

  • Gradient Accumulation Issues
  • Apple's New Mac Mini Release
  • Training Foundation Models
  • Dataset Preparation in ML
  • Vision Fine-Tuning Delay

Unsloth AI (Daniel Han) ▷ #off-topic (41 messages🔥):

  • Quitting school early
  • Experiences of aging
  • R&D in tech
  • Job searching after a master's
  • Collaborative work offers

Link mentioned: Brain Dog Brian Dog GIF - Brain dog Brian dog Cooked - Discover & Share GIFs: Click to view the GIF


Unsloth AI (Daniel Han) ▷ #help (28 messages🔥):

  • Continued Pretraining with Unsloth
  • Installation Issues with Unsloth
  • Fine-Tuning Models with Custom Datasets
  • Instruct Fine-Tuning with Llama Models
  • GPU VRAM Management during Fine-Tuning

Unsloth AI (Daniel Han) ▷ #research (12 messages🔥):

  • ThunderKittens Update
  • Rickroll in Research
  • Community Reactions
  • Paper on ThunderKittens

OpenAI ▷ #ai-discussions (129 messages🔥🔥):

  • AGI Development
  • Model Efficiency
  • Quantization Techniques
  • AI Tools and Integration
  • Nvidia GPUs for AI

Link mentioned: Meta Releases Llama3.2 1B/3B Quantized Models: Accelerated Edge Inference, Reduced Memory Usage: Meta launches Llama3.2 quantized models with 2-4x faster inference and reduced memory usage, optimized for mobile devices.


OpenAI ▷ #gpt-4-discussions (11 messages🔥):

  • Open Source LLM Tool
  • Custom GPT Data Uploads
  • Performance of RAG
  • Cave Johnson AI

OpenAI ▷ #prompt-engineering (2 messages):

  • Stochasticity
  • Prompt Generation Tool

OpenAI ▷ #api-discussions (2 messages):

  • Stochasticity
  • Prompt generation
  • Playground tools

Perplexity AI ▷ #announcements (1 message):

  • Perplexity Supply
  • New Shipping Options

Perplexity AI ▷ #general (124 messages🔥🔥):

  • File Upload Issues
  • Pro Subscription Promo Codes
  • Spaces and Collections Changes
  • NFL Widgets Introduction
  • Comparison of Perplexity and Consensus for Research

Perplexity AI ▷ #sharing (9 messages🔥):

  • Earth's Temporary New Moon
  • Python Module Headers
  • Search History Viewing
  • Jang Gyuri
  • Discord Registration

Perplexity AI ▷ #pplx-api (5 messages):

  • Differences in Playground and API results
  • API for Perplexity Spaces
  • Perplexity API usage for development

aider (Paul Gauthier) ▷ #general (107 messages🔥🔥):

  • Aider commands
  • Haiku 3.5 release status
  • Qodo AI vs Cline
  • Skyvern AI automation
  • Gemini usage in development

aider (Paul Gauthier) ▷ #questions-and-tips (12 messages🔥):

  • Aider Configuration
  • DeepSeek Coder with Ollama
  • Input Caching Efficiency
  • Consistent Code Generation Guidelines
  • Connection Issues with Copilot

Link mentioned: Specifying coding conventions: Tell aider to follow your coding conventions when it works on your code.


aider (Paul Gauthier) ▷ #links (3 messages):

  • New Bash Tools from Claude Anthropic
  • Integration of New Tools with Code Assistants

Link mentioned: GitHub - disler/anthropic-computer-use-bash-and-files: Contribute to disler/anthropic-computer-use-bash-and-files development by creating an account on GitHub.


OpenRouter (Alex Atallah) ▷ #announcements (1 message):

  • Oauth issue
  • API key creation

OpenRouter (Alex Atallah) ▷ #app-showcase (1 message):

  • Flexible Chat App for macOS
  • Alpha Testing
  • User Feedback

Link mentioned: imgur.com: Discover the magic of the internet at Imgur, a community powered entertainment destination. Lift your spirits with funny jokes, trending memes, entertaining gifs, inspiring stories, viral videos, and ...


OpenRouter (Alex Atallah) ▷ #general (114 messages🔥🔥):

  • OpenRouter Key Issues
  • Model Selection in OpenRouter
  • Haiku 3.5 Release
  • Prompt Caching for Models
  • OpenRouter Chat Functionality

OpenRouter (Alex Atallah) ▷ #beta-feedback (5 messages):

  • Integration Feature Access

Stability.ai (Stable Diffusion) ▷ #general-chat (117 messages🔥🔥):

  • GPU Comparisons
  • Training Models
  • Stable Diffusion Issues
  • Using Auto1111 vs Comfy UI
  • Recent Model Developments

Nous Research AI ▷ #general (67 messages🔥🔥):

  • Microsoft's Control Over OpenAI
  • Latency Issues with AI Models
  • Deployment of Autonomous Twitter Agents
  • Flash Attention and CUDA Compatibility
  • Performance of Hermes 3 vs Other Models

Nous Research AI ▷ #ask-about-llms (12 messages🔥):

  • Function Calling Datasets in Spanish
  • Effectiveness of Hermes 3
  • Data Retention Policies of API-based AIs
  • Apple's Private Cloud Compute
  • Concerns about Data Privacy

Nous Research AI ▷ #interesting-links (8 messages🔥):

  • AI-generated Code at Google
  • NotebookLM
  • Code Metrics
  • Continuum App

Link mentioned: Tweet from Andrew Curran (@AndrewCurran_): Sundar Pichai said on the earnings call today that more than 25% of all new code at Google is now generated by AI.


Nous Research AI ▷ #reasoning-tasks (3 messages):

  • Stocks
  • Meme coin simulation
  • Synthetic datasets

Eleuther ▷ #general (16 messages🔥):

  • Running multiple instances on a single GPU
  • Testing RAG with CSV data
  • Entity extraction and temperature settings
  • Submissions and rebuttals for COLING
  • Variability in harmful instructions across benchmarks

Eleuther ▷ #research (44 messages🔥):

  • Modular Duality in Optimization
  • Comparison of Optimization Papers
  • Training Diffusion Models
  • Limitations of Diffusion Models
  • Operator Norms in Neural Networks

Eleuther ▷ #interpretability-general (7 messages):

  • sae_dashboard
  • AI paper on concept geometry

Link mentioned: Tweet from Max Tegmark (@tegmark): Our new AI paper reveals surprising geometric structure in the LLM-learned concepts: 1) They form brain-like "lobes", 2) they form "semantic crystals" much more precise than it first ...


Eleuther ▷ #lm-thunderdome (15 messages🔥):

  • Freezing Embedding Layer
  • Multiple Choice Prompt Format
  • Winogrande Context Handling
  • Eval Harness API Issues
  • Answer Matching Heuristics

Interconnects (Nathan Lambert) ▷ #news (16 messages🔥):

  • Elon Musk xAI funding talks
  • Cursor premium discussion
  • Robonato embodiment talks
  • Creative Writing Arena insights

Interconnects (Nathan Lambert) ▷ #random (54 messages🔥):

  • Claude 3 Tokenizer
  • AI2 New Office
  • MacBook Pro Pricing
  • AI2 Cringe Video
  • Pacific Northwest Scenery

Interconnects (Nathan Lambert) ▷ #memes (3 messages):

  • 420gunna's reign
  • Impressive financial milestone
  • Timeliness commentary

Interconnects (Nathan Lambert) ▷ #posts (8 messages🔥):

  • Voiceover feedback
  • Email mishaps

Notebook LM Discord ▷ #announcements (1 message):

  • NotebookLM usability study
  • Audio Overviews feedback
  • Participant incentives
  • Remote chat opportunities

Link mentioned: Participate in an upcoming Google UXR study!: Hello, I’m contacting you with a short questionnaire to verify your eligibility for an upcoming usability study with Google. This study is an opportunity to provide feedback on something that's c...


Notebook LM Discord ▷ #use-cases (38 messages🔥):

  • Simli real-time avatars
  • Pictory for podcast videos
  • Voice splitting techniques
  • NotebookLM podcast capabilities
  • Hedra character generation

Notebook LM Discord ▷ #general (40 messages🔥):

  • Podcast Generation Limitations
  • Issues with Language Switching
  • Notebook Features Request
  • Audio Segmentation Techniques
  • Interruption Issues in Podcasts

GPU MODE ▷ #general (4 messages):

  • AI Impact on Jobs
  • Interest in Deep Tech
  • Advertisement Messages

GPU MODE ▷ #torch (16 messages🔥):

  • FSDP2 API Updates
  • Memory Profiling in Rust with PyTorch
  • torchao Optimizers and SR Support
  • CUDA Kernel Debugging with C++
  • Early Pruning Configs in Triton Kernels

GPU MODE ▷ #beginner (12 messages🔥):

  • int8 vs fp16 tensor cores
  • GPU options for compute-heavy tasks
  • Cloud GPU vs Local GPU Deployment
  • Performance overheads in tensor operations

GPU MODE ▷ #torchao (1 message):

starsupernova: Will take a look!!


GPU MODE ▷ #rocm (15 messages🔥):

  • HIP memory error issues
  • GFX1100 contention problems
  • ROCM version concerns

GPU MODE ▷ #sparsity-pruning (7 messages):

  • Variably sized block pruning
  • Structured pruning methods
  • Unstructured sparsity methods
  • Lottery Ticket Hypothesis
  • Structured sparse winning tickets

Link mentioned: Coarsening the Granularity: Towards Structurally Sparse Lottery Tickets: The lottery ticket hypothesis (LTH) has shown that dense models contain highly sparse subnetworks (i.e., winning tickets) that can be trained in isolation to match full accuracy. Despite many exciting...


GPU MODE ▷ #thunderkittens (7 messages):

  • ThunderKittens talk schedule
  • Livestream on CUDA and ThunderKittens
  • TK vs Triton and CUTLASS
  • TK library approach
  • Mamba-2 kernel complexity

Link mentioned: CUDA + ThunderKittens, but increasingly drunk.: My friend Quinn (x.com/qamcintyre) asked me to teach him CUDA and ThunderKittens, and I agreed on the condition that we film it so that I can make new studen...


Torchtune ▷ #general (27 messages🔥):

  • Llama 3.2 QLoRA Training
  • Quantization Issues
  • Activation Checkpointing Impact
  • Adapter Weights Saving Time
  • Checkpointing Performance Discrepancies

Link mentioned: improve resume from checkpoint · Issue #1551 · pytorch/torchtune: The current experience with resume from checkpoint can improve. A few potential ways: good defaults: Resuming from checkpoint should have as default using the last checkpoint saved, so the user can...


Torchtune ▷ #dev (20 messages🔥):

  • kv cache implementation
  • dynamic cache resizing
  • multi-query attention
  • PyTorch 2.5 enhancements

Cohere ▷ #discussions (4 messages):

  • SOAP Optimizer
  • Account issues

Cohere ▷ #questions (17 messages🔥):

  • Cohere Command R Model performance
  • Rate limit issues
  • Support contact for assistance
  • Enterprise use cases focus
  • Budget software model application

Cohere ▷ #api-discussions (15 messages🔥):

  • Embed V3 Comparisons
  • Structured Output Augmentation
  • Model Type Display Issues
  • Fine-tuning Model Training Errors
  • Gamification Ideas

Cohere ▷ #projects (1 message):

  • Invites status
  • Application rejections

Cohere ▷ #cohere-toolkit (1 message):

sssandra: <@1132196995361157171> hi! is this an error you're getting using toolkit?


Latent Space ▷ #ai-general-chat (36 messages🔥):

  • Browserbase Funding
  • ChatGPT Chat History Search
  • LLM Evaluation Challenges
  • Realtime API Updates
  • SimpleQA Benchmark

tinygrad (George Hotz) ▷ #general (11 messages🔥):

  • Ethos NPU Opinions
  • Evaluation Kit Ownership
  • Tinygrad Development Questions
  • Tinygrad Font Inquiry

tinygrad (George Hotz) ▷ #learn-tinygrad (18 messages🔥):

  • Training Jobs on Tinybox
  • Qwen2's Base Building Blocks
  • EfficientNet OpenCL Issues
  • Exporting Models to ONNX
  • Testing Time Training Approaches

Modular (Mojo 🔥) ▷ #general (6 messages):

  • Idiom of Mojo vs Python
  • Learning Resources for Mojo
  • Contributing to NuMojo and Basalt Projects
  • Linear Algebra Implementation in Mojo
  • GPU Utilization in Mojo

Modular (Mojo 🔥) ▷ #mojo (17 messages🔥):

  • Mojo architecture
  • C++ compatibility
  • Syntax proposal
  • C++ macros
  • Custom decorators

LlamaIndex ▷ #blog (2 messages):

  • create-llama app
  • ToolhouseAI tools
  • hackathon insights

LlamaIndex ▷ #general (14 messages🔥):

  • Multi-agent query pipelines
  • LlamaIndex workflows
  • RecursiveRetriever class issues
  • Parallel agents with memory

Link mentioned: multi-agent-concierge/video_tutorial_materials at main · run-llama/multi-agent-concierge: An example of multi-agent orchestration with llama-index - run-llama/multi-agent-concierge


LlamaIndex ▷ #ai-discussion (1 message):

  • RAG with LlamaIndex
  • Text-to-SQL integration

DSPy ▷ #papers (5 messages):

  • Extreme Multi-Label Classification
  • DSPy Programming Model
  • Online Search for Labels

Link mentioned: In-Context Learning for Extreme Multi-Label Classification: Multi-label classification problems with thousands of classes are hard to solve with in-context learning alone, as language models (LMs) might lack prior knowledge about the precise classes or how to ...


DSPy ▷ #general (4 messages):

  • DSPy Structure Enforcements
  • Structured Outputs
  • MIPROv2 Integration

OpenInterpreter ▷ #general (4 messages):

  • Job Automation Predictions
  • Open Interpreter vs Claude
  • Restoring Specific Chat Profiles

OpenInterpreter ▷ #ai-content (4 messages):

  • ChatGPT Chat History Search
  • Digitizing Scent
  • Scent Teleportation
  • Limited Release Fragrance

LangChain AI ▷ #general (2 messages):

  • invoke function response time
  • FastAPI routes efficiency
  • Hugging Face Transformers documentation

LangChain AI ▷ #share-your-work (1 message):

  • Knowledge Nexus AI
  • KNAI Discord Community
  • KNAI Publication on Medium
  • Decentralized Knowledge Systems
  • Knowledge Graphs and Semantic Web

LangChain AI ▷ #tutorials (1 message):

  • OppyDev Plugin System
  • Enhancing AI Output

OpenAccess AI Collective (axolotl) ▷ #general (1 message):

duh_kola: True but the ones I want to train are instruction versions lol


OpenAccess AI Collective (axolotl) ▷ #general-help (2 messages):

  • LoRA finetuning
  • H100 GPUs
  • BitsAndBytes issue

LAION ▷ #general (2 messages):

  • Image Decoding Issues

LAION ▷ #research (1 message):

thejonasbrothers: https://arxiv.org/abs/2410.20424


LLM Agents (Berkeley MOOC) ▷ #mooc-questions (2 messages):

  • LLM Agents Quizzes
  • LLM Agents Hackathon
  • Course Sign Up
  • Discord Channel

Link mentioned: Large Language Model Agents MOOC: MOOC, Fall 2024


Mozilla AI ▷ #announcements (1 message):

  • Transformer Labs Event
  • Lumigator Tech Talk

Gorilla LLM (Berkeley Function Calling) ▷ #leaderboard (1 message):

  • Llama-3.1-8B-Instruct (FC)
  • Llama-3.1-8B-Instruct (Prompting)
  • Function Calling Performance
  • Model Comparison





{% else %}

The full channel by channel breakdowns have been truncated for email.

If you want the full breakdown, please visit the web version of this email: [{{ email.subject }}]({{ email_url }})!

If you enjoyed AINews, please share with a friend! Thanks in advance!

{% endif %}