> AI News for 4/26/2024-4/29/2024. We checked 7 subreddits and [**373** Twitters](https://twitter.com/i/lists/1585430245762441216) and **28** Discords (**416** channels, and **10824** messages) for you. Estimated reading time saved (at 200wpm): **1197 minutes**.

Lots of discussion about SB-1047, the new gpt2-chatbot on lmsys, and extending Llama-3-8B to 1M context, but otherwise no clear top story emerges. You can check out the WebSim/WorldSim podcast as Nous Research gets ready to relaunch it after briefly taking it down due to security issues.


Table of Contents

[TOC]


AI Reddit Recap

Across r/LocalLlama, r/machinelearning, r/openai, r/stablediffusion, r/ArtificialInteligence, r/LLMDevs, and r/Singularity. Comment crawling works now, but there's still lots to improve!

Advances in AI Models and Capabilities

Applications of AI

Deploying and Optimizing AI Models

Concerns and Challenges


AI Twitter Recap

All recaps are done by Claude 3 Opus, best of 4 runs. We are working on clustering and flow engineering with Haiku.

Prompt Engineering Techniques and Applications

  • Reasoning and Multi-Step Problem Solving: @cwolferesearch outlines recent prompt engineering research for reasoning tasks, including zero-shot CoT prompting, selecting CoT exemplars based on complexity, progressive refinement of rationales, and decomposing complex tasks into sub-tasks (a minimal zero-shot CoT sketch follows this list).
  • Tool Usage and API Integration: @cwolferesearch highlights research on teaching LLMs to leverage external tools and APIs, such as text-based APIs, natural language programs composed of tool calls, and code execution in sandboxed environments.
  • Optimizing Context Window Usage: @cwolferesearch discusses studies on the impact of context window properties, such as the negative effects of irrelevant context, attention biases towards the beginning/end of prompts, and strategies for selecting optimal few-shot exemplars.
  • Improving LLM-Assisted Writing: @cwolferesearch covers techniques for enhancing LLM-generated writing, such as outline generation and iterative filling, using smaller LLMs to generate “directional stimuli”, and iteratively increasing information density in summaries.
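
A minimal sketch of the zero-shot CoT technique from the first bullet above; the call uses the OpenAI Python SDK, and the model name is purely illustrative:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def zero_shot_cot(question: str) -> str:
    # Appending the trigger phrase elicits step-by-step reasoning
    # without any hand-written exemplars.
    prompt = f"Q: {question}\nA: Let's think step by step."
    resp = client.chat.completions.create(
        model="gpt-4",  # illustrative; any chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```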

Emerging Abilities and Scaling Laws in Large Language Models

  • Emergent Abilities and Pretraining Loss: @_jasonwei discusses a paper that plots emergent abilities against pretraining loss, showing linear correlations for some benchmarks and emergent behavior at specific loss thresholds for others. Pretraining loss is suggested as a better metric than compute for comparing models.
  • Potential Upper Bounds on Function Approximation: @jxmnop shares insights from a paper showing that vastly different architectures can produce identical performance at the same parameter count, suggesting we may be close to the upper bound of approximating functions given a certain amount of compute.
  • Limitations and Potential Walls for Language Models: @bindureddy argues that language models may soon hit a wall due to the limits of human language, reasoning, and the inability to surpass a certain level on benchmarks like MMLU despite increased compute or data.

Advancements in Vision-Language Models and Video Understanding

  • PLLaVA: Parameter-free LLaVA Extension to Videos: @_akhaliq introduces PLLaVA, which extends the LLaVA framework to video dense captioning without requiring extensive paired data. The approach adapts pre-trained image-language models with a pooling strategy to achieve state-of-the-art performance on video question-answering and captioning tasks.
  • HaLo-NeRF: Learning Geometry-Guided Semantics: @_akhaliq presents HaLo-NeRF, a system that connects neural representations of landmark scenes with text descriptions to enable fine-grained understanding and localization of semantic regions. The approach harnesses vision-and-language models adapted for 3D-compatible segmentation and volumetric scene representation.

Techniques for Efficient Training and Deployment of Large Language Models

  • FP6 Quantization for Efficient LLM Inference: @rohanpaul_ai shares a paper on using six-bit quantization (FP6) to reduce the size of LLMs while preserving model quality across various applications and model sizes. The paper introduces TC-FPx, a GPU kernel design scheme supporting float-point weights for various quantization bit-widths, enabling practical performance improvements during LLM inference.
  • Proxy-Tuning: Efficient Customization of Large LMs: @rohanpaul_ai explains Proxy-Tuning, a lightweight decoding-time algorithm that achieves the result of directly tuning a large LM by using smaller tuned LMs to shift the original predictions. This approach allows for efficient customization of large, potentially proprietary LMs through decoding-time guidance (see the sketch after this list).
  • Parameter-Efficient Sparsity Crafting for Instruction Tuning: @rohanpaul_ai discusses a paper proposing Parameter-Efficient Sparsity Crafting (PESC), which converts dense models into sparse Mixture-of-Experts (MoE) models for efficient instruction tuning. PESC inserts adapters into each expert, updating only the adapter parameters, significantly reducing computational costs and memory requirements while achieving state-of-the-art performance.
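
A minimal, self-contained sketch of Proxy-Tuning's decoding-time arithmetic as described above (the random tensors stand in for real model outputs):

```python
import torch
import torch.nn.functional as F

def proxy_tuned_logits(base, expert, antiexpert):
    # Shift the large model's next-token logits by the delta the small
    # tuned "expert" learned relative to its untuned "anti-expert".
    return base + (expert - antiexpert)

vocab = 32_000
base_logits = torch.randn(1, vocab)        # large base model (stand-in values)
expert_logits = torch.randn(1, vocab)      # small tuned model
antiexpert_logits = torch.randn(1, vocab)  # same small model, untuned

logits = proxy_tuned_logits(base_logits, expert_logits, antiexpert_logits)
next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
```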

Regulations and Policy

  • California Bill 1047 Details: @nearcyan shared details on California Bill 1047 which has been fast-tracked. The bill covers all models made with 10^26 flops or similar performance, requires developers to assert models are safe under penalty of perjury, and creates a Frontier Model Division to report to.
  • Concerns with California SB-1047: @jeremyphoward expressed concerns that California SB-1047 “Safe and Secure Innovation for Frontier Artificial Intelligence Models Act” could do great harm to startups, American innovation, open source, and safety. The bill imposes overly broad definitions, misunderstands dual use, has restrictive requirements, and disincentivizes openness.

AI Discord Recap

A summary of Summaries of Summaries

1. Advancements in Large Language Models (LLMs) and AI Capabilities

2. Model Optimization, Quantization, and Efficiency Techniques

  • Extensive discussions around quantization techniques like 4bit lora and 4bit qlora, with debates on their effects on model performance based on training extent. Binary Quantization is explored for creating smaller indexes for similarity searches.

  • DeepSpeed’s FP6 quantization promises quantized inference with similar throughput, generating excitement for improved efficiency.

  • Researchers present CPU-optimized LLMs capable of generating Python code using a Chain-of-Thought prompt method, highlighting the pursuit of efficient, low-cost models.

3. Open-Source AI Development and Community Collaboration

  • The Eleuther community compares LLM performance, discusses emergent abilities, and shares research on topics like redundant neural circuits and adversarial prompting against LLMs.

  • OpenAccess AI Collective delves into fine-tuning strategies, quantization methods, and tokenization challenges, with members sharing insights from repositories like axolotl and FastChat.

  • The LlamaIndex community explores techniques like multi-hop retrieval, knowledge graphs for long-term memory, and shares resources like an AWS workshop on LLM app development patterns.

4. Ethical Concerns and Regulatory Challenges in AI Development

  • LAION faces restrictions due to EU laws, limiting access to public compute clusters and prompting researchers to gravitate towards more active communities with ongoing experimentation.

  • Discussions around the proposed California SB-1047 bill and its potential harm to startups, open-source AI development, and American innovation, underscoring regulatory challenges.



PART 1: High level Discord summaries

Unsloth AI (Daniel Han) Discord

  • Phi 3 Integration an Unsloth Triumph: Unsloth AI now supports Phi 3, delivering twice the speed with half the memory usage. Enthusiasts can explore the Colab notebook for detailed guidance.

  • Bilingual Model Makes a Splash: Thermostatic introduced NeuralTranslate_v0.2_GGUF, a bi-directional English-Spanish translation model that preserves Mistral’s reasoning without overfitting, all available on Hugging Face.

  • GPU optimization chatter: AI community debates best practices for minimizing VRAM usage, sharing insights on manual layer pruning, and discussing offloading techniques with code examples from Kolibrify’s GitHub repository.

  • Dataset Dexterity: A tip for merging raw text and chat datasets to improve fine-tuning outcomes was shared, along with the suggestion to use larger datasets for base models and smaller ones for instruct models. There’s also mention of offloading parts of language models to reduce inference memory, as explained with code in a GitHub repository.

  • Future Functionality Features: Suggestions for Unsloth AI included automatic optimization of hyperparameters like batch size and learning rate. Meanwhile, a community member humorously anticipated the addition of a cake-baking feature upon training completion.


CUDA MODE Discord

CUDA C++ claims the spotlight: A YouTube lecture on CUDA C++ llm.cpp delves into optimizing LLM training, with promises of cleaner and faster code. Support materials and related discussions suggest significant performance improvements and readiness for scaling LLMs to gpt-large sizes.

Intel’s oneAPI spreads its wings: Intel’s oneAPI garners attention for offering a unified programming model across CPUs, GPUs, and FPGAs. Enthusiasm bubbles up for the upcoming Battlemage GPU lineup, and the oneAPI ecosystem welcomes contributions for cross-vendor support, with developer resources on GitHub and announcements over Codeplay’s official press release.

Machine Learning gig at InstaDeep: InstaDeep is on the hunt for Machine Learning Engineers versed in high performance ML, Bio AI, and custom CUDA kernels. They offer a stimulating environment and multiple positions for problem solvers ready to make real-world impacts, with applications open on the InstaDeep job portal.

AMD stokes the competitive fires: Discussions revolve around the AMD Instinct MI300X’s potential for server environments and ROCm’s current state, with links to product pages and rental options hinting at a heated rivalry with NVIDIA. ROCm support and comparisons suggest AMD’s focus on greater accessibility and performance enhancement for developers.

Triton and PyTorch Forge Ahead: GitHub repositories such as unsloth and attorch emerge as treasure troves for those seeking Triton and PyTorch integrations. While flash-attn 2.5.8 earned compatibility accolades with PyTorch 2.3.0, discussions on optimal CUDA tensor indexing techniques and tensor gradient calculations in Triton reinforce the community’s drive for efficiency.


Perplexity AI Discord

Slow Pro Search Annoys Users: Perplexity AI’s Pro Search users are complaining of increased search times, lamenting that searches are taking up to 90 seconds across all engines, affecting the web client but not the mobile app.

Claude 3 Opus Chat: To Subscribe or Not?: Members debate the merit of subscribing to Claude 3 Opus chat, with some users reporting positive experiences, although no specific comparative features with the API version have been discussed.

New AI Model Anticipation: There’s keen interest in the potential integration of WizardLM 2 and LLama-3 70B Sonar Large 32k models into Perplexity AI, with users noting they may outperform existing models on specific tasks.

Frustrations Over Opus Daily Limits: Perplexity users are voicing frustration over a 50 queries per 24 hours cap on Opus, calling for greater transparency and lamenting perceived degradation in quality.

Billing Blues and API Queries: Users are expressing issues with billing, citing being charged despite expecting a free trial, and seeking the right channels for enterprise API discussions. Meanwhile, questions about single-turn conversation guidelines with online LLMs, Harpa configuration, and model accessibility on third-party platforms like make.com are stirring up technical curiosity.


Stability.ai (Stable Diffusion) Discord

Forge Forgets Functions: Trouble with SDXL and Forge UI is boiling over; users report issues with image previews and express concerns over the potential abandonment of Forge. Workarounds include delving into GitHub issues and tweaking startup flags like --no-gradio-queue.

Release Radar - Stable Diffusion 3.0: The AI engineering community eagerly awaits the launch of Stable Diffusion 3, triggered by hints from a CivitAI newsletter pointing to an end-of-May release. Anticipation is mixed with skepticism about open weight availability and comparisons with Pony Diffusion V7, discussed in a Civitai article.

Cashing in on AI Art: Discussions on monetizing AI-generated art revealed that NSFW creators are outperforming SFW artists in marketplaces like Civitai. Brainstorming ensued on potentially lucrative trends such as AI girlfriend apps, alongside a noted indifference towards fine-tuning efforts for models like Stable Cascade.

Toolbelt Expansion: Engineers swapped tips on AI model training tools beyond AUTOMATIC1111, spotlighting dreambooth and kohya_ss for custom training, while also contemplating the ethical quandary of using artist names in datasets.

Enigmatic Enquiries Enlighten: Inquisitive interactions ranged from exploring text-to-speech solutions to diving into model fine-tuning specifics. The discussion sometimes took a lighter turn with humorous comments about virtual “graphics card downloads” and idle curiosity about Stable Diffusion’s ability to visualize without explicit prompts.


LM Studio Discord

A New Challenger for VRAM: Discussions underscore the importance of VRAM for LLM operations, with 16GB as the minimal baseline and aspiration for the 32GB VRAM club stirring excitement. The performance gains from using Nvidia’s contemporary GPUs and the feasibility of models split across multiple cards, potentially streamlined by NVLink, were also key points.

LLM Leapfrog: The Meta-Llama-3-8B-Instruct-Q5_K_M.gguf model is earning praise for its performance on an M1 MacBook Pro. Users are advised to consider quantization types when running models to ensure compatibility with their hardware, and resources for local model deployment and instructions are deemed helpful, with pointers to tools like LM Studio and Groq API.

The Quirks of Model Behavior: Users encountered various version-related issues, such as phi-3 mini models outputting nonsense after an update to LM Studio version 0.2.21, and handling crashes in LM Studio since recent updates. Concerns about LLama 8b models rambling and the need to restrict reliance on integrated graphics for dedicated GPU utilization were also highlighted.

Bots, Books, and Bugs: Integrating Discord bots with LLM models for message retrieval and Wikipedia searches has gained traction. Meanwhile, navigating the capacity to run models like Stanford’s Octopus v2 on mobile or PC devices surfaced as a complex issue, and LLama 3 models are suspected of “hallucinating” current event knowledge, given their lack of internet access.

ROCm Hiccups: Users battling with LM Studio ROCm’s limitations discovered that it doesn’t support RX 6700, which provokes thoughts on HIP SDK compatibility and potential workarounds such as those implemented by KoboldAI. Additionally, a server error within the platform sparked dialogues, but no resolution was reported.


Nous Research AI Discord

  • Snowflake Arctic Unveils Cost-Efficient AI Solutions: The Snowflake AI Research Team launched Snowflake Arctic, an LLM aimed at providing cost-efficient enterprise AI solutions, amidst other less-contextualized YouTube video shares.

  • Intel and Logitech Augment AI Offerings: Intel’s CEO highlighted AI’s growth potential during their quarterly results, as shown in a YouTube video, while Logitech introduced an AI Prompt Builder for more fluent ChatGPT interactions, demo video available.

  • Emerging Trends in AI Quantization and Model Architectures: Hugging Face hosts binary-siglip-text and binary-siglip-vision, demonstrating efficient embeddings, with discussions also encompassing speculations around OpenAI’s naming schemes and the introduction of DeepSpeed FP6 quantization for improved throughput.

  • LLM Discussion: Performance Issues and Legal Confusion: Users report LLaMA-3’s EOS token generation issues, which link to stopping criteria solutions on GitHub, while Cohere’s licensing for command-r models stirs debates over commercial code usage, and frustrations are aired about a gpt2-chatbot, mistakenly associated with GPT-4 capabilities.

  • Data, Documentation, and Development through AI Community Collaboration: Technical contributions include generating multi-hop literature data, using pydantic models for ideation, and refining graph representations of LLM outputs. Anna’s Blog provided information on WorldCat data scraping and utilization in literature comprehension datasets.

  • Web and World Simulation Tools Garner Interest: The Nous Research community gears up for worldsim testing with free invites and shares experiences with various web simulation tools, such as companion-based AI (documented at a websim example) and long-running conversations, indicating growing interest in AI’s potential for conversational stability.


HuggingFace Discord

  • Community Constructs Computer Vision Course: A new community-built computer vision course is live on HuggingFace, covering machine learning principles in the field using models from their ecosystem.

  • Model Showcase and Updates: The newly announced multilingual Qwen1.5-110B-Chat model supports a 32K context length, among other improvements; details can be found on its model page. Additionally, the link to the “Qwen1.5-110B” model has been corrected; it is now accessible on HuggingFace, with details in the associated blog post.

  • Creative Solutions and Collaborations Encouraged: Members sought creative problem-solving for a range of technical inquiries, from undisclosed Gradio issues to LLM performance optimization under hardware constraints, with 32 GB of RAM mentioned as sufficient for many tasks. There’s also a push to identify and improve image classification or object recognition models for practical applications like pinball game scoring systems.

  • Model and Space Innovations Abound: Various models and spaces surfaced including a Sentence Transformer Model for semantic search tasks with a context length of 16,384 (BEE-spoke-data), and a Minecraft Skin Generator using a stable diffusion model (Stable Diffusion Finetuned Minecraft Skin Generator). The Instant Video space by KingNish leverages ByteDance’s AnimateDiff Lightning model for quick text-to-video creation (Instant Video).

  • Explorations in Diffusion and AI Advertisement Detection: Participants exchange best practices for object generation with precision, incorporating tools like the IP-Adapter in diffuse models for enhanced image prompting, and addressing color consistency issues across platforms. Conversations also navigated toward evaluating YOLO classifiers for improved accuracy and performance in various applications.


OpenAI Discord

  • ChatGPT Gets a Memory Upgrade: ChatGPT Plus users can now save conversational context using the newly introduced Memory feature, though availability is still limited, excluding users in Europe and Korea.
  • Exploring AI’s Relation to Consciousness: The community engaged in intense debates over whether AI could exhibit consciousness, with discussions venturing into the philosophical domain, comparing AI’s experience of time with continuous human consciousness and touching on the perception of self in neural networks.
  • Model Comparisons Spark Discussions: Technical discussions emphasized the strengths and weaknesses of various AI models, with ChatGPT, Claude 3 Opus, and Gemini 1.5 being benchmarked, while acknowledging that while command-R Plus and Llama3-70b may fall behind GPT-4, they represent their own leaps in progress.
  • Prompts as Competitive Sport: Members proposed the idea of prompt competitions, both paid and for play, to sharpen skills and enhance community engagement, highlighting the potential for emerging qualities in LLMs that cannot be predicted by simply scaling up smaller models.
  • API Ups and Downs Noted: Engineers discussed various operational issues from rate-limits on custom GPT uses, backend errors at “https://chat.openai.com/backend-api/gizmos/”, to concerns about performance and availability of GPT-4’s features like memory and voice control.

Eleuther Discord

Exploring the Limits of Model Size: Engineers debate the effective cutoff for model parameters, seeking a point where further addition offers negligible returns. In a bid for efficiency, the criterion has shifted towards focusing on non-embedding parameters, potentially finding a sweet spot under 200 million.
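
As a rough illustration of that criterion, non-embedding parameters can be counted by excluding embedding weights, as in the sketch below (tied output heads share storage with input embeddings and are excluded too):

```python
import torch.nn as nn

def non_embedding_params(model: nn.Module) -> int:
    # collect the parameter tensors owned by embedding layers
    emb_ids = {id(p) for m in model.modules()
               if isinstance(m, nn.Embedding) for p in m.parameters()}
    # count everything else
    return sum(p.numel() for p in model.parameters() if id(p) not in emb_ids)
```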

Multilingual Hurdles in The Pile: The Pile’s dataset limitations were highlighted, indicating a lack of multilingual representation which might impact model training and performance, particularly in languages like German. Additionally, while comparing models like GPT-NeoX and Megatron, discussions centered on NeoX’s user-centric quality improvements.

Stability or Speed? The Model Serving Conundrum: Technical discussions have surfaced regarding discrepancies in model serving speeds, such as between Mixtral and Llama models at Fireworks.ai; considerations included batching size and hardware specifics as potential factors.

Refusal’s Single Neuronal Pointer: The AI Alignment Forum presented a discovery that refusal mechanisms in LLMs might hinge on a solitary direction within network layers. This spurred discussions about orthogonalization and fine-tuning possibilities for refusal behavior.

Pull Request Perils and Pipeline Woes: Members expressed concerns about CLA signing issues and failing checks on GitHub pull requests, with some conversations dwelling on the stagnation of specific branches. Questions were raised about the adaptability of evaluation prompts to different models’ finetuning needs, with suggestions for custom functions to handle diversity.


OpenRouter (Alex Atallah) Discord

  • Two-Step Price Hike for Soliloquy 8B: The Soliloquy 8B model transitioned to a paid usage model at $0.1 per 1M tokens, followed by a further increase to $0.2 per 1M tokens. The rates reflect OpenRouter LLC’s policy changes and are documented on the model’s OpenRouter page.

  • Claude’s Checkup: Users troubleshooting Claude models found that they max out at 4k generated tokens while being able to read up to 200k tokens of context, and that proper API settings can optimize responses (a minimal sketch follows this list). Relevant documentation can be found here.

  • WLM-2 Hosting Huddle: A detailed analysis of WLM-2 hosting costs led to the conclusion that profitability hinges on factors like GPU efficiency and the off-chance revenue from idle resources.

  • Quiet Arrival of FireLLaVA: FireLLaVA, an open multimodal model boasting swift initialization, has quietly entered the OpenRouter suite. It’s a significant addition for developers given its non-proprietary nature and can be explored on OpenRouter’s page.

  • Frontend Frustrations Find Frugality: A quest for a budget-friendly frontend to allow family members to access OpenRouter services without individual OpenAI accounts inspired recommendations for using free-tier offerings like Vercel, or economical VPS like Contabo.
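
A minimal sketch of the generation cap discussed in the Claude item above, using the Anthropic Python SDK (the model id is illustrative):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

resp = client.messages.create(
    model="claude-3-opus-20240229",  # illustrative model id
    max_tokens=4096,                 # output cap; the input context can be far larger
    messages=[{"role": "user", "content": "Summarize the attached report."}],
)
print(resp.content[0].text)
```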


OpenAccess AI Collective (axolotl) Discord

  • WizardLM Stays Magical: Contrary to whispers, Microsoft’s WizardLM models have not vanished; rather, updates were made by the wizardlm team, ensuring continued public access to the repository.

  • The Fine Art of Model Fine-Tuning: Discussions contrasted fine-tuning domain-specific language models against using Retrieval-Augmented Generation (RAG), with references made to the medically-focused LLM paper and the usage of llama-pro methodology as seen in fsdp_qlora.

  • Quantization Quandaries and Tokenization Tactics: Considerable chatter surrounded tokenization challenges, requiring the latest fastchat formatter for models like LLaMA-3; meanwhile, the community grappled with understanding quantization methods like 4bit lora and 4bit qlora through discussions and a Twitter thread, revealing a sensitivity to quantization based on the extent of model training.

  • AI’s Need for Space and Speed: A stark reminder that full fine-tuning (FFT) with ZeRO-3 could gobble up to 167GB of RAM, even on 2x24GB GPUs, setting off discussions on memory management techniques like torchtune, the perplexing observation of high disk space usage, and the utility of PEFT models for efficiency in fine-tuning neural networks.

  • GPU Scaling Secrets and FSDP Mechanics: The collective cornered the topic of GPU scaling, exchanging insights on the fine details of micro batch sizes, gradient aggregation, and the use of Fully Sharded Data Parallelism (FSDP) and ZeRO Stage 3 for model loading across GPUs - all critical for the effective use of hardware resources.


Modular (Mojo đŸ”„) Discord

  • Mojo Gets Modular: Modular’s standard library, modularml/mojo, saw a 23% increase in commits post open-sourcing, signaling heightened contribution activity.
  • Multimodal Search Empowered by MAX: A blog post by Modular revealed the MAX Engine outshines both PyTorch eager and ONNX runtime in benchmarks, excelling in multimodal search involving textual and visual data.
  • Modular Tweets Curated: Key tweets from Modular were highlighted, spanning updates and announcements, with links including Tweet 1, Tweet 2, Tweet 3, and Tweet 4.
  • Advancements and Issues in Mojo Land: Key discussions covered converting Python to Mojo, memory allocation optimizations, and matrix slicing in Mojo. Importing challenges in the standard library were tackled, and nightly compiler updates continue to roll out, catching issues like file handle lifetime management.
  • Performance Pursuits Proliferate: From investigations into dictionary performance to SIMD optimizations for error-correction algorithms, the community delved into efficiency enhancements. The compact-dict library was mentioned as a potential speed booster, and __copyinit__ usage was debated, exemplified in a listed Gist.

LlamaIndex Discord

AWS and Llama Index Sit Down to Code: A workshop with AWS to demonstrate 3 patterns for LLM app development emphasizes data ingestion with S3 and embeddings with AWS Bedrock.

Security Spotlight on ML Podcast: The latest mlsecops podcast features the co-founder of Llama Index discussing LLM-based application futures and data security, including tools like LlamaParse and LlamaCloud.

RAG Under the Microscope: Marco Bertelli’s 9-part RAG tutorial series paves the way for any prototype to hit the production stage with a delineation of vital architectural components.

Multistep Quest for Improved RAG Reasoning: A methodology enhancing RAG involves a multi-hop retrieval process, combining Llama Index and Cohere reranking, which sharpens context awareness and minimizes hallucinations, as discussed in this post.

Remember All with memary: Unveiling memary, a long-term memory framework using knowledge graphs, which promises to expand memory capabilities in autonomous agents supplemented by LLMs, explained in this tweet.


OpenInterpreter Discord

Flask and Keys: An OpenInterpreter member encountered issues when running a Flask server and discussed workarounds like setting a dummy api_key and modifying pydantic configurations to resolve namespace conflicts.

Hardware Hurdles Surmounted: The absence of Groq integration with OpenInterpreter prompted discussions, citing a pull request #1238 aimed at adding support. There were also questions around the use of devices like the Rabbit r1 with OpenInterpreter, focusing on the system’s language and voice command capabilities.

Anticipating the Heavy: Eager anticipation bubbles around the so-called 01 Heavy device, though no concrete release details exist; meanwhile, a custom 3D project for OpenInterpreter garners attention, and a member teases an upcoming discussion on the timeline for 01 Light.

Community Code Crusade: Members actively shared progress and assistance requests for projects associated with OpenInterpreter. This includes the llm-switcher, and potential Groq API implementations, encouraging community contributions.

Open AI Ethics Discourse: A conversation sparked around the ethical implications of AI abilities like file modification, particularly in reference to Microsoft’s capabilities, with the implicit suggestion that OpenInterpreter could be crafted to be more aligned with diverse user needs.


Latent Space Discord

Berkeley Benchmarks Function Call Skills: The Berkeley Function Calling Leaderboard serves as a new measure, periodically updating to benchmark how effectively Language Models (LLMs) call functions in real-world scenarios.

Laying Down the Law with LLM Limitations: An exploration into the confines of LLMs highlights their inability to prevent “goal drift”, with details provided in a Strangeloopcanon article, emphasizing areas for potential improvement.

Swyx Keeps the Pod Waves Flowing: A shout-out to a new podcast episode from swyxio might capture the audience’s interest; details shared via a tweet.

Elevating the Mix with Mixture of Depths: A new transformer layer built on Expert Choice Routing, introduced in a recent paper, aims at faster convergence and better long-sequence processing and is stirring up discussions. For more in-depth information, engineers can take a look at the paper here.

Linux Video Sharing Level-Up: Vesktop appears to be the hot topic for Linux users seeking better video sharing experiences on Discord, with its performance and compatibility improvements detailed on the GitHub repository.


LAION Discord

  • LAION’s Compute Conundrum: EU regulations are impeding LAION’s ability to utilize public compute clusters, prompting researchers to shift their attention towards more active research communities with ongoing experimentation.

  • Terminus Group Draws in Diverse Experts: The Terminus Research Group, an informal collective, recently welcomed the “pixart guy,” signaling a trend of burgeoning communities rich in cross-disciplinary talent.

  • Pursuing the Aesthetics of AI: LAION-Aesthetics aims to quantify visual appeal using machine learning models, with their open-source code accessible on GitHub for public collaboration and use.

  • Quantization Conundrum Raises Eyebrows: Discord members examined a Reddit post on LLM benchmark inconsistencies across precision levels, casting the spotlight on the testing procedures and inherent unpredictability in LLM performances.

  • Token Generation Rate Talks: AI engineers discussed the token generation speeds on advanced GPUs for varying models and configurations, sharing that selecting effective tools like exllama and TabbyAPI can enhance overall performance.

  • VAST Interest Peaks Among Engineers: Members delved into the potential of the omni-modality foundation model and dataset, VAST, expressing interest in its capabilities by soliciting use-cases and tips for fine-tuning.

  • Emerging Research Stirs Excitement: A newly published research paper grabbed attention with its novel proposals for more efficient large model inference and layer management, sparking conversations on its practical applications.

  • Graph Integration into LLMs Explored: Inquiries about amalgamating graph data structures with LLMs triggered exchanges on techniques and literature for enriching language models with non-sequential data.

  • Fine-Tuning Frustrations on Medical Mistral: Challenges in fine-tuning Mistral models for medical text generation surfaced, focusing on excessive sequence generation and the utility of padding protocols to assuage these issues.

  • Eleuther Expertise Exchange Encouraged: Members suggested consulting the Eleuther server for expert guidance in LLM fine-tuning, generating interest in this hub of specialized knowledge.


Cohere Discord

Engines Revving Up for AI-Enhanced Browsers: AI enthusiasts debated the merits of Tavily and Brave Search API as search engine tools for integration with AI, discussing price points, efficiency, and rate limitations (see Brave Search API Info and Tavily API Info).

Cohere Toolkit Love: The community showed appreciation for Cohere’s open-source toolkit, benefiting from its prebuilt components to expedite the deployment of RAG applications Cohere Toolkit on GitHub.

Squashing Bugs and Deployment Dilemmas: Technical roadblocks such as sqlite3 errors when using cohere-toolkit locally and deployment challenges on Azure surfaced, with shared solutions found in various GitHub resources.

Customizing and Fine-Tuning Queries: Questions around the specifics of model fine-tuning and the boundaries of Cohere’s free trial API arose, prompting discussions of model availability and detailed terms.

Command-r Shines in Multi-Language Support: Command-r’s effectiveness with non-English languages was acknowledged, plus inquiries into its commercial use specs sparked discussions, suggesting avenues through contacting Cohere’s sales team or using AWS Sagemaker.


tinygrad (George Hotz) Discord

  • Formula Flexibility in Tinygrad: Discussion around tinygrad focused on creating mathematical formulas through basic primitive operations and emphasizing the importance of constructing a dependency graph for efficient gradient calculations and hardware utilization in AI modeling.

  • Tinygrad’s Dynamic Enhancements Await: Members shared excitement for the upcoming tinygrad 0.9 release, anticipating new features that could further improve AI model training and discussed ongoing work on handling dynamic testing and symbolic shapes to enhance operation flexibility.

  • Proposing a Learning Path for Tinygrad Enthusiasts: For those eager to dive into tinygrad’s intricacies, members recommended starting with MicroGrad and MiniTorch, then proceeding through the tinygrad codebase. This aims to solidify foundational concepts for better contributions to tinygrad’s development.

  • Kernel Optimization Insights: A member highlighted optimization techniques such as loop unrolling, while sharing detailed technical writeups and guides to understand the inner workings of tinygrad’s kernel optimizations, particularly targeting AI performance boosts.

  • Hybrid Model Harmony Highlighted: There was mention of successful integration between tinygrad and PyTorch, utilizing nn.module to combine features of both frameworks into a hybrid model, demonstrating the potential synergy in AI tooling.


Interconnects (Nathan Lambert) Discord

Bold Moves for Newsletter Growth: Members weighed the pros and cons of cross-promoting with Semafor, debating potential audience growth against the risk of diminishing brand value with unwanted plugs.

Phi-3 and Arena Gather Steam, OLMo Training Insights Offered: Microsoft’s unveiling of Phi-3 and Arena’s milestone of 800K votes sparked discussions, as did a seminar on Open Language Model training, which left the audience desiring deeper insights.

RLHF Nuances and Ghost Attention’s Diminished Glow: Engineers dissected the nuanced performance of Reinforcement Learning from Human Feedback (RLHF), touched on KTO’s promise, and debated the fading significance of Ghost Attention, once thought to be crucial for maintaining long conversation consistency in LLaMA 2 models.

OpenELM Triumphs, Encouraging Progressive AI Ideals: Conversations centered around OpenELM’s performance surpassing OLMo, reflected on the community’s development ethos, focusing on continuous improvement, and underscored the educational value of open models.

AGI - A Philosophical Conundrum: There’s an ongoing dialogue about the subjective nature of AGI, with members appreciating posts that ignite thoughtful considerations on the topic.


LangChain AI Discord

AI Integration Queries and Challenges: Engineers requested guidance on prompt integration and reported issues with AzureSearchVectorStoreRetriever being incompatible with async operations, hinting at possibly wrapping sync functions in async for compatibility. There’s also a confusion within the community regarding the Gemini 1.5 Pro model, clarifying that it works exclusively with VertexAI, as demonstrated with successful ChatVertexAI implementations.

LLM Deployments and Observability Preferences: Discussions unfolded around different deployment approaches, including Hugging Face versus the OpenAI API; security considerations were raised about bypassing LangChain for direct SQL Server connections. There was also debate on effective observability tools for LLMs, like Arize Phoenix and Langfuse, highlighting a slight preference toward self-hosted options.

Galactic API Giveaway and AI Job-Hunters: GalaxyAI is providing free API access, boasting compatibility with premium models such as GPT-4 and GPT-3.5-turbo. Separately, a GitHub repository introduced Genai-Job-Agents, a Langchain/Langgraph-based agent for streamlining job searches and CV optimisation.

AI Tutorials Amass: A suite of tutorials surfaced, including “Local RAG agent with LLaMA3 and Langchain” and “Llama 3 Web Browsing Agent with Langchain and Groq,” addressing the design and implementation of RAG systems and web browsing capabilities. A captcha issue was flagged when trying to access a potentially useful Amazon book on NLP and LLMs, but the underlying material was not dismissed.

Reviving the RAG, Ride the Llama: Insights from sharing channels reveal advancements in Retrieval-Augmented Generation (RAG) implemented with LLaMA3, underpinning the creation of AI-driven web UI for applications, and interactive avatars for customer Q&As, expanding the horizons of interactive AI utilization across various platforms.


Mozilla AI Discord

  • Segmentation Fault in Llama: Engineers are facing a segmentation fault when running llamafile, especially on Modal Labs platforms while using files like Phi-3-mini-128k-instruct.F16.llamafile. This issue has been widely reported among users attempting to integrate various llamafiles.

  • Memory Reporting Woes in htop: A notable bug in htop misrepresents shared memory usage on Linux, which could affect how AI engineers perceive memory demands during intensive model operations.

  • Get Your Update to Llamafile v0.8.1: The release of llamafile v0.8.1 promises support for the Phi-3 Mini 4k, fixes GPU module crash issues, and provides bundled NVIDIA + AMD shared objects for Ubuntu, thus potentially smoothing out some persistent wrinkles for engineers.

  • Unraveling Quirks in LLM Output: Anomalous outputs with parentheses and line breaks have been observed by users operating LLMs like Llama3 70B and Mistral via llamafile, sparking conversations about the consistency and idiosyncrasies of model behaviors.

  • Optimizing Llamafile for Peak Performance: There’s a shared interest in optimizing GPU usage with llamafile, where users exchanged tips on maximizing system RAM utility. Clarity is sought on identifying if a model runs on GPU or CPU, along with managing the llamafile-generated endless output.


AI Stack Devs (Yoko Li) Discord

AI Companion Radar: Faraday and Amica Catch the Eye: Faraday and Amica garnered attention for their position as AI companion apps that prioritize data privacy, where Faraday can operate locally thanks to llama.cpp, and Amica offers self-hosting and cloud services with enhanced features. Both apps introduce a new angle on AI relationships, promoting user privacy, with Faraday receiving a nod for its month-long performance and Amica as an emerging contender.

Bedtime Stories Win Big: Creative design with AI NPC characters by the participants of the Rosebud AI Sleep Game Jam led to notable entries, with Bedtime Negotiation standing out and winners announced via Twitter. A new game jam focusing on Education and AI is up next, with details available on Twitter.

A Town Called Addictive: AI Town was celebrated for its addictive quality in a Twitter post, inspiring ideas for a developer-centric simulation. LLM-powered NPC models and infrastructure enhancements were shared, with a repository on GitHub and a model hub on Huggingface, despite a broken API access link, and feedback was solicited for these NPC advancements.

Map Quest for AI Town: Debate on map handling for AI Town surfaced with suggestions ranging from using static assets to reduce bandwidth, to optimizing the original file reading method for maps. A YouTube tutorial titled “100% Local ‘AI Town’ with Llama 3 AGENTS!!!” was promoted, delivering a how-to for those eager to dive into their local setup.

Character Crafting Challenges: Dialogue around the development of NPC characters led to a promise for a detailed blog post. Discussions pinpointed the effort to compress model output, minimize model calls, and address issues found with generalist instruct-models like GPT-3.5 or Mistral.


DiscoResearch Discord

DiscoResearch Delves into Router Coefficient Mysteries: Engineers discuss inconsistencies in router_aux_loss_coef between versions of Mixtral — 0.02 for Mixtral-8x7B-Instruct-v0.1 and 0.001 for Mixtral-8x22B-Instruct-v0.1 — suggesting the potential need for higher loss_coef in smaller experts.
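
For reference, the coefficient's role (following the Hugging Face transformers Mixtral implementation) is to scale the load-balancing auxiliary loss before it is added to the language-modeling loss; schematically:

```python
def total_loss(lm_loss, load_balancing_loss, router_aux_loss_coef=0.02):
    # the auxiliary term nudges the router toward uniform expert utilization;
    # the coefficient controls how strongly it competes with the LM objective
    return lm_loss + router_aux_loss_coef * load_balancing_loss
```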

Initialization Inconsistencies Spark GPU Conversations: The DiscoLM_German_7b_v1 model encounters slow initialization times on HPCs compared to local machines; inference times improved from over 12 minutes to 10 seconds after loading the model onto GPUs.

Speed Humps Ahead for Model Loading: Attempts to improve DiscoLM_German_7b_v1 load times using low_cpu_mem_usage=True have failed, sparking suggestions that the model may be bottlenecked by slow storage drives.

Downloading German with Gusto: The gguf model reaches 1500 downloads in two days, showing a strong demand for German language models within the community.

Tokenizing for Chit-Chat: Questions arise about changes to tokenizer configurations in Phi-3 Llamafied german models intended for chat application optimization, while the newly created Phi-3 MoE model emerges for experiments needing further training.


Alignment Lab AI Discord

  • AI Tackles Tough Topics: There was a discussion regarding the application of Llama 3 for assessing topic complexity with reports of effective outcomes. This indicates ongoing exploration into AI capabilities for content assessment.

Skunkworks AI Discord

Python Code Gen Breakthrough with CPU-Optimized LLMs: A new study presents CPU-optimized language models capable of generating Python code, suggesting a Chain-of-Thought prompt method to improve model outcomes, outlined in the paper “Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation”.

Binary Quantization Buzz in HaystackDB: Discussions revolve around the HaystackDB repository potentially using 2bit embeddings, with further clarification that Binary Quantization assists in efficiency by creating smaller indexes for similarity searches.
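
A minimal sketch of the binary-quantization idea: keep only each dimension's sign bit and rank candidates by Hamming distance, shrinking the index roughly 32x versus float32 embeddings:

```python
import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    # one sign bit per dimension, packed 8 bits per byte
    return np.packbits(embeddings > 0, axis=-1)

def hamming_distances(query: np.ndarray, index: np.ndarray) -> np.ndarray:
    # popcount of XOR between the packed query and every packed index row
    return np.unpackbits(np.bitwise_xor(index, query), axis=-1).sum(axis=-1)

index = binarize(np.random.randn(10_000, 256))
query = binarize(np.random.randn(256))
top10 = np.argsort(hamming_distances(query, index))[:10]
```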

Trouble Training LLaMA-3 to Finish Up: A member experienced issues with LLaMA-3 models during fine-tuning, as models are not generating the End Of Sentence (EOS) token, impacting model performance where completion is critical.

Snowflake Arctic Chills Enterprise AI Costs: A video introduced Snowflake Arctic, a large language model designed for enterprise applications focusing on cost-effective AI solutions for businesses.

RAG-nificent Demonstrations with LLaMA3: Tutorial videos were shared, showcasing the use of Retrieval-Augmented Generation (RAG) with LLaMA3 in local environments through Langchain, as well as a session on implementing web browsing with LLaMA 3, Langchain, and Groq hardware here.


LLM Perf Enthusiasts AI Discord

Gamma Seeking AI Engineer: Gamma, highlighted by a16z and boasting over 10 million users, is looking to hire an AI engineer for prompt engineering, evaluations, and fine-tuning of text and image models. The role is pivotal in their content creation tools expansion, and the company prides itself on its growth, achieved with minimal team size and substantial funding, indicating a robust business model and significant market impact.

Spot the AI Talent: Candidates can apply for the AI engineer position at Gamma, set in the heart of San Francisco with a requirement of on-site collaboration thrice a week. This opportunity is for those keen on pushing the boundaries of large language models (LLMs) and can be explored further at Gamma’s career page.

GPT Sleuthing: Speculation arose around gpt2-chatbot, which is suspected by some to be a leaked version of GPT-4.5, triggered by discussions around a tweet by @phill__1 regarding its sophisticated domain knowledge. Community members simply responded with enthusiasm, acknowledging the bot’s quality.

A Tweet of Approval: The community expressed a succinct sentiment that the gpt2-chatbot is “good,” suggesting a community consensus on the bot’s impressive performance, which hints at its potential and future capabilities in the field.


Datasette - LLM (@SimonW) Discord

  • Code-Gen Goes Custom: Discussion about enhancing code-generation included the idea of custom grammar implementation to prevent syntax errors, emphasizing a model-specific option that could improve semantic accuracy.

The AI21 Labs (Jamba) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


PART 2: Detailed by-Channel summaries and links

Unsloth AI (Daniel Han) ▷ #general (912 messagesđŸ”„đŸ”„đŸ”„):

  • Unsloth Supports Phi 3 Release: Phi 3 is now officially supported by Unsloth, offering 2x faster speed & 50% less memory usage. Users can find the detailed Colab notebook here.
  • Unsloth Performance Enhancements: Phi 3 can be finetuned using 4-bit precision with the Unsloth framework, accommodating limitations on VRAM. Users are experimenting with various finetuning flows combining SFT, DPO, and ORPO to enhance model performance.
  • Checkpoints Management in Finetuning: Users can create checkpoints during finetuning with Unsloth to save progress and avoid overfitting. To do so, one must modify the training arguments accordingly and handle resumes from the desired checkpoints (a minimal sketch follows this list).
  • Usage of Colab and Alternatives Dissected: Users discuss the limitations of Google Colab’s paid version due to runtime disconnections and explore alternative services like TensorDock that offer more affordable and reliable GPU access for model training.
  • Technical Difficulties with GGUF Conversion: There are ongoing issues with converting models to GGUF format even when the Unsloth framework is used locally. Users are encouraged to upgrade Unsloth and possibly recompile llama.cpp to resolve quantization failures.
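
A minimal sketch of the checkpoint/resume flow via the standard Hugging Face TrainingArguments, which Unsloth notebooks pass through to the underlying trainer (values are illustrative):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    save_strategy="steps",
    save_steps=100,        # write a checkpoint every 100 optimizer steps
    save_total_limit=3,    # keep only the three most recent checkpoints
    max_steps=1000,
    per_device_train_batch_size=2,
)

# later, to pick up where training left off:
# trainer.train(resume_from_checkpoint=True)                      # latest in output_dir
# trainer.train(resume_from_checkpoint="outputs/checkpoint-500")  # a specific one
```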

Links mentioned:


Unsloth AI (Daniel Han) ▷ #random (55 messagesđŸ”„đŸ”„):

  • Dataset Combination Hack: A conversation suggests merging raw text and chat datasets to improve results, hinting at a potential approach for fine-tuning models.

  • Notebook and Fine-tuning Tips Revealed: The Unsloth AI community shares a repository link with notebooks for fine-tuning language models, along with a specific Colab notebook for text completion tasks.

  • Colab Out of Memory (OOM) Solutions: A helpful snippet of code was shared to alleviate Colab’s OOM issues, suggesting the use of torch.cuda.empty_cache() and gc.collect() in a loop (a minimal sketch follows this list).

  • Peer-to-Peer Sharing Promoted: A user announces the creation of an open community to discuss the latest in Multimodal AI, providing a link to follow them on various social platforms.

  • Support for New Model in Unsloth AI: There is excitement about the Phi 3 model being now supported, as revealed by a user who provided a link to a Discord channel for a relevant Colab (link not accessible outside Discord).
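
A minimal sketch of the cache-clearing pattern mentioned above, assuming model and dataloader are already defined:

```python
import gc
import torch

for batch in dataloader:          # model/dataloader assumed to exist
    outputs = model(**batch)
    ...                           # backward pass, logging, etc.
    del outputs
    gc.collect()                  # drop dangling Python references first
    torch.cuda.empty_cache()      # then return cached blocks to the CUDA allocator
```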

Links mentioned:


Unsloth AI (Daniel Han) ▷ #help (506 messagesđŸ”„đŸ”„đŸ”„):

  • Troubleshooting Compilation Issues: Users discussed errors while compiling code, specifically mentioning llama.cpp not being in the correct folder and successfully resolving their issue by following the correct installation instructions.

  • Support Queries and Update Requests: Discussions about Unsloth AI’s support for different models such as Llava and Qwen revealed that they are not currently supported. Users suggested improvements like a feature to truncate from a specific part of chat templates. Colab notebook installation instructions were updated following an xformers update.

  • Dataset Format and Fine-Tuning Inquiry: A user sought clarification on whether their dataset format is correct for fine-tuning and which exact Llama 3 model from Unsloth should be used for training with code. It was clarified that a larger dataset is suitable for the base model, while smaller datasets go well with instruct models.

  • GPU Usage for Unsloth Pro: A user queried about the benefits of Unsloth Pro with one or more RTX 4090 GPUs. They were informed that the benefits are multiplied with the additional GPUs.

  • Duplicate Python Installation Issues: Discussions highlighted installation problems, including a case where a user had two Python versions installed, causing dependency conflicts. This was resolved by adjusting the Python version and removing the older one.

  • Finetuning Llama with Code: Questions about finetuning Llama 3 proceeded with guidance given for a user who wanted to finetune Llama with Svelte code. They were advised on using the base model and its distinctions from the instruct variant.

Links mentioned:


Unsloth AI (Daniel Han) ▷ #showcase (74 messagesđŸ”„đŸ”„):

  • Unveiling Kolibrify for Curriculum Learning: Kolibrify, a project designed for curriculum training of instruction-following LLMs with Unsloth, has been shared. It’s described as useful for LLM fine-tuning and rapid prototyping.

  • Thermostatic Releases Bilingual Translation Model: A new version of Thermostatic’s bidirectional English-Spanish translation model, NeuralTranslate_v0.2_GGUF, has been published; it is said to maintain Mistral’s native reasoning capabilities without overfitting.

  • Scoped Skilled Agents in AI’s Future: @timelordraps predicts a 6-month roadmap where AI advancements will see highly capable small models, token-efficient pre-training, self-expanding and self-spawning subagents, leading to recursive self-improvement by November.

  • Token-Efficient Clone Project Underway: @timelordraps is optimizing a devin clone for token efficiency and is currently troubleshooting it for a simple snake game, with plans to test on other use cases and integrate with image models.

  • Llama Community Hub Announced: The newly launched llama-hub serves as a community platform for sharing and discussing models and use cases involving llama models. The official Unsloth llama-3-8b-bnb-4bit has been posted for community access.

Links mentioned:


Unsloth AI (Daniel Han) ▷ #suggestions (119 messagesđŸ”„đŸ”„):

  • Enhancing Unsloth’s Autotuning: A user suggested that Unsloth AI should automatically optimize values like batch size and learning rate based on model and dataset specifics. Another member humorously proposed that Unsloth should also bake a cake post-training, which aligns with it being on the roadmap, while a third person shared thoughts on implementation.

  • Manual Layer Pruning Debate: The conversation covered the intricacies of manually pruning layers in models, with one user suggesting replacing the forward method to ‘skip’ parts of layers. There was an extended discussion on whether to remove entire decoder blocks or focus on the multi-layer perceptron (MLP) components for SNR (signal-to-noise ratio) optimization, with different strategies for minimizing model size and VRAM footprint touched upon.

  • VRAM Reduction Strategies and Offloading: The dialogue shifted to strategies for reducing model sizes, particularly in terms of VRAM usage. A user mentioned a successful inference memory reduction technique by offloading parts of language models and shared their experience integrating this approach into a GitHub repository (https://github.com/oKatanaaa/kolibrify/blob/7165ebbbcc8c44a6960ccfe78aa2d740a93789bd/kolibrify/model_utils.py); a generic sketch of the idea follows this list.

  • Gemma 2b Model Compatibility with Unsloth: A fan of Unsloth inquired about the compatibility of the Recurrent Gemma 2b model with Unsloth, and a member recognized the potential benefits, but indicated that there’s a known VRAM issue with Gemma 2b, and that the focus is currently on Phi 3. Another mentioned a unique VRAM issue experienced by only one person, but with no widespread reports.

  • Potential Feature or Bug with Gemma 2b: Clarification was sought about whether Gemma 2b has a feature that causes VRAM issues or a bug. It was explained that while the model still works, the VRAM issue needs to be resolved; however, not everyone has encountered this problem, and it may be an isolated case.
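
The linked repository implements its own offloading, but the general idea can be sketched with the stock transformers/accelerate device-map mechanism, which places layers on GPU until memory runs out and spills the rest to CPU or disk:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "unsloth/llama-3-8b-bnb-4bit",  # illustrative model id
    device_map="auto",              # fill the GPU first, then fall back to CPU
    offload_folder="offload",       # spill any remainder to disk
)
```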

Links mentioned:


CUDA MODE ▷ #general (18 messagesđŸ”„):

  • Countdown to CUDA Lecture: The next CUDA Mode lecture was announced to be taking place in 1 hour and 40 minutes, with excitement building as the llm.cpp team was said to be discussing, anticipated to be very hype.
  • Java Jolt for Cognition: A member expressed readiness for the upcoming lecture with coffee brewing in preparation.
  • Announcing Live CUDA Profiling Session: Today’s session was moved to Google Meet with this link, and despite minor hiccups on Discord, the live profiling lecture was well-received, and a trimmed version was promised for the YouTube channel.
  • Exploring a Broader Hardware Discussion: There was a proposal to create discussions for Huawei Ascend solutions to promote more diverse hardware conversations, given the current dominance of NVIDIA and AMD. The idea is under consideration, pending community interest and activity.
  • Innovation on a Dime: A fascinating project was shared where neural networks were implemented on a 10-cent RISC-V MCU without a multiplier, showcasing an example of making powerful technology accessible at minimal costs. The full blog post and a repository with detailed documentation are available at cpldcpu’s blog and GitHub.

Links mentioned:


CUDA MODE ▷ #triton (10 messagesđŸ”„):

  • Triton Tensor Indexing Explained: A method for indexing into a Triton tensor with another was shared, involving loading values from the indices tensor and using them with the strides and base pointer to create a tensor of pointers, then applying tl.load() and tl.store() for the desired result (a minimal sketch follows this list).
  • In Search of Open Source Triton LLM Implementations: A member was looking for open-source Triton implementations for large language models (LLMs) like llama or mistral. Another member referenced an unsloth repository on GitHub which could potentially suit their needs.
  • Exploring Efficient Gradient Calculation with Triton: A query was raised about calculating the gradient of a tensor by utilizing parallel threads in Triton and sum reducing along a dimension, with code snippets being shared to illustrate the current and proposed methods.
  • Repositories with Required Triton Kernels Highlighted: In a discussion about the existence of full model implementations using Triton kernels for large language models, several resources were mentioned, including the xformers repository and the flash-attention repository.
  • PyTorch Modules in Triton Shared: A member suggested the attorch repository as a potentially useful set of PyTorch’s neural network modules written in Python using Triton.
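As a rough illustration of the indexing pattern in the first bullet, here is a minimal Triton gather kernel; the shapes, block size, and kernel name are chosen for the example rather than taken from the thread:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gather_kernel(src_ptr, idx_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    idx = tl.load(idx_ptr + offs, mask=mask)   # load the index values
    vals = tl.load(src_ptr + idx, mask=mask)   # base pointer + indices = gather
    tl.store(out_ptr + offs, vals, mask=mask)

src = torch.randn(1024, device="cuda")
idx = torch.randint(0, 1024, (256,), device="cuda")
out = torch.empty(256, device="cuda")
gather_kernel[(triton.cdiv(256, 128),)](src, idx, out, 256, BLOCK=128)
assert torch.equal(out, src[idx])
```

For multi-dimensional tensors, the same pattern folds the strides into the pointer arithmetic, as described in the message.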

Links mentioned:


CUDA MODE ▷ #cuda (40 messagesđŸ”„):

  ‱ Kernel Profiling Enigma: Profiling the tiled_matmult kernel vs. the coarsed_matmult kernel from PMPP showed an unexpectedly small FLOP/s difference despite the latter having higher arithmetic intensity. It was suggested to look at instruction stats, particularly the stall short scoreboard, which is linked to SRAM ops and could be affecting memory bandwidth.

  • CUDA KERNEL Performance Tips: When optimizing CUDA kernels, members advised looking at warp state stats and instructed to load multiple values from SRAM into registers to perform multiple multiplications, thus improving SRAM utilization.

  ‱ Learning CUDA Without Breaking the Bank: Discussion on acquiring GPU access for CUDA learning ranged from using company/university resources to services like Google Colab and Lightning AI. Members emphasized the importance of having control over the environment, particularly for profiling with performance counters.

  ‱ Emerging FP6 Data Type in CUDA Development: A DeepSpeed commit on GitHub introduced a new data type called FP6 with Tensor Core support on A100 GPUs, potentially improving the serving of Large Language Models (LLMs) and addressing memory limitations during inference.

  • Debating Best Practices in CUDA Programming: Queries about CUDA coding practices were addressed, including whether integer division should be avoided in kernel code. One suggestion was to utilize bit shifts for divisions by powers of two, with the observation that the nvcc or ptxas should optimize this automatically.

Links mentioned:


CUDA MODE ▷ #torch (10 messagesđŸ”„):

  ‱ PyTorch Team at ASPLOS: The PyTorch team will be presenting a tutorial at ASPLOS; the announcement was made with details provided via a Twitter link.

  • Flash-Attention Update Alert: Tri Dao’s new flash-attn 2.5.8 has been released and confirmed to be compatible with PyTorch 2.3.0. Sources include the project’s GitHub and PyPI pages.

  • Query on flash-attn Installation: A discussion was raised regarding flash-attn’s pip install option that doesn’t require a local CUDA build and why this isn’t the default. There was curiosity about the potential speed differences between pre-built binaries and those locally built.

  ‱ Under the Hood of torch.compile: Discussion of the differences between torch.matmul, @, and torch.nn.functional.linear when used with torch.compile, referencing the gpt-fast blog post. The suggested way to understand the differences was to inspect the TORCH_LOGS output (see the sketch after this list).

  • PyTorch Profiler Puzzles: A question was posed about why PyTorch sometimes launches 2 kernels during matrix multiplication, as observed by the profiler, inviting insights or theories regarding this behavior.
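On the torch.compile bullet above, here is a small sketch of how one might compare the generated code for each matmul spelling; TORCH_LOGS=output_code is a real knob, while the function and shapes are illustrative:

```python
import os
os.environ["TORCH_LOGS"] = "output_code"  # set before importing torch
import torch
import torch.nn.functional as F

def f(x, w, b):
    return F.linear(x, w, b)  # swap in torch.matmul(x, w.T) + b or x @ w.T + b

cf = torch.compile(f)
x, w, b = torch.randn(8, 16), torch.randn(32, 16), torch.randn(32)
cf(x, w, b)  # generated kernels are printed to the log for comparison
```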

Links mentioned:


CUDA MODE ▷ #announcements (1 messages):

  • Boost in Code Clarity and Performance: NVIDIA’s C++ team is set to discuss porting llm.c to llm.cpp, promising cleaner and faster code. An exciting bonus talk is starting shortly for the community.

CUDA MODE ▷ #algorithms (54 messagesđŸ”„):

  ‱ Trinary Nets Seek Efficient Matmul: A member initiated brainstorming on performing matrix multiplication (matmul) with trinary nets using packed int64 values, each holding 32 2-bit trinary weights, without unpacking. They posited that a masked-multiply approach could avoid the computational and memory expense of unpacking, though actual implementation details and benefits remain theoretical (a reference sketch follows this list).

  • Packing Unpacking in CUDA: Another conversation focused on optimizations for working with packed values; one member pointed to executing pack and unpack operations in a fused CUDA kernel as more cost-effective, but concerns were raised about the usability and complexity of this approach.

  • Exploration of Alternative Methods to Unpacking: Members discussed creating row operations that operate on integers directly, without unpacking, which might reduce the number of operations required.

  • Fused Kernels for Performance: There was agreement that while kernel fusion may not reduce the cost of operations, it can significantly decrease overhead by reducing memory read/copies. The conversation evolved into a discussion on the technical feasibility and potential computational efficiency gains of such methods.

  • FlashAttention’s Inner Workings Exposed: A member shared insights into the FlashAttention repository, indicating that kernel_traits.h is a core component for setting traits in CUDA, which are later utilized in FlashAttention. They linked a Colfax research post discussing FP8 and layout conformance enhancements in FlashAttention on the NVIDIA Hopperℱ architecture.
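Returning to the trinary matmul brainstorm that opened this list, here is a small NumPy reference for a two-bitmask packing scheme; it deliberately materializes the unpacked bits to verify correctness, whereas the fused kernel under discussion would test bits in registers instead:

```python
import numpy as np

def pack_trinary(w):
    """Pack trinary weights {-1, 0, +1} (length a multiple of 64) into two
    uint64 bitmask arrays: one marking +1 positions, one marking -1."""
    w = w.reshape(-1, 64)
    shifts = np.arange(64, dtype=np.uint64)
    pos = ((w == 1).astype(np.uint64) << shifts).sum(axis=1)
    neg = ((w == -1).astype(np.uint64) << shifts).sum(axis=1)
    return pos, neg

def trinary_dot(pos, neg, a):
    """Reference dot product against packed weights: add activations at
    +1 bits, subtract them at -1 bits."""
    a = a.reshape(-1, 64)
    shifts = np.arange(64, dtype=np.uint64)
    pos_bits = ((pos[:, None] >> shifts) & 1).astype(a.dtype)
    neg_bits = ((neg[:, None] >> shifts) & 1).astype(a.dtype)
    return (a * pos_bits).sum() - (a * neg_bits).sum()

w = np.random.choice([-1, 0, 1], size=128)
a = np.random.randn(128).astype(np.float32)
pos, neg = pack_trinary(w)
assert np.allclose(trinary_dot(pos, neg, a), a @ w, atol=1e-4)
```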

Links mentioned:


CUDA MODE ▷ #jobs (1 messages):

  • InstaDeep is Hiring Machine Learning Engineers: InstaDeep Research is looking for Machine Learning Engineers who are passionate about high performance ML Engineering and making a real-world impact. The role involves working with Bio AI, Decision Making AI, and technologies like custom CUDA kernels, SOTA model architectures, Quantisation and Distributed Training. Join the InstaDeep journey here.

  • Cultivate Innovation at InstaDeep: InstaDeep promises a cohesive and stimulating work environment for tech enthusiasts to contribute to impactful decision-making and technology products across industries. Internship opportunities can also be explored here.

  • InstaDeep Application Advice: Applicants can apply for multiple jobs at InstaDeep, but it is advised to limit applications to two closely linked positions that match their skills and qualifications.

  • Reapplying to InstaDeep: Those who have previously applied to InstaDeep and weren’t selected may consider reapplying if it has been more than six months since their last application.

Link mentioned: Job Offer | InstaDeep - Decision-Making AI For The Enterprise: no description found


CUDA MODE ▷ #beginner (12 messagesđŸ”„):

  • NVIDIA GPU on Laptops for CUDA: It’s generally viewed as acceptable to use a laptop with an NVIDIA GPU for learning and testing CUDA code, but not recommended for actual model training.
  • Seeking NCCL All-Reduce Resources: A member is in search of a good tutorial for learning NCCL to implement an all-reduce kernel, but has not yet received suggestions.
  • Jetson Nano for CUDA Learning: For those interested in learning CUDA, a Jetson Nano is recommended as a useful tool, especially when coupled with a spare monitor.
  • Resolving nvcc_plugin ModuleNotFoundError: A member following a GitHub tutorial encountered a “ModuleNotFoundError” for ‘nvcc_plugin’ when using %load_ext nvcc_plugin. The solution involved skipping the step and using %%writefile to compile instead.
  • AMD GPU Performance Inquiry: A member contemplating an upgrade from dual MI100 to MI210 asked for comparative BF16 performance insights, being redirected to a channel potentially more focused on AMD resources.

CUDA MODE ▷ #youtube-recordings (2 messages):

  • CUDA C++ Deep Dive Awaits: A YouTube video titled “Bonus Lecture: CUDA C++ llm.cpp” has been shared, offering insights into CUDA C++. The description includes a link to slides on Google Drive.
  • Slated for Later Release: The slides and code accompanying the CUDA C++ lecture are currently not available.

Link mentioned: Bonus Lecture: CUDA C++ llm.cpp: Slides: https://drive.google.com/drive/folders/1T-t0d_u0Xu8w_-1E5kAwmXNfF72x-HTA?usp=sharing


CUDA MODE ▷ #torchao (1 messages):

  • CUDA Extension Support Arrives in AO: Custom CUDA extension support has been integrated into torchao, as noted by a member with a PR link. The integration allows developers to follow a template to ensure their kernel works seamlessly with torch.compile.

  • AO Seeks Community Contributions: For developers passionate about writing CUDA kernels but dislike the packaging process, contribution to torchao is now open, especially for kernels optimized for consumer GPUs.

Link mentioned: Custom CUDA extensions by msaroufim ¡ Pull Request #135 ¡ pytorch/ao: This is the mergeable version of #130 - some updates I have to make Add a skip test unless pytorch 2.4+ is used and Add a skip test if cuda is not available Add ninja to dev dependencies Locall



CUDA MODE ▷ #ring-attention (2 messages):

  • Pushing the Limits of Context Length in LLMs: An article from harmdevries.com highlights a trend of increasing context length in Large Language Models (LLMs), reaching up to 65K tokens, with innovations like FlashAttention playing a significant role by removing GPU memory bottlenecks.
  • The Rise of Long-Context LLMs: Many cutting-edge long-context LLMs are found to be finetuned versions of base models with shorter context lengths; one such example is the Yarn-Llama-2-7B-128k model, which boasts a 128K token context length.

Link mentioned: In the long (context) run | Harm de Vries: It’s not the quadratic attention; it’s the lack of long pre-training data


CUDA MODE ▷ #off-topic (4 messages):

  • Chill Vibes with ‘Critical Stop’: A Discord member shared a YouTube video titled “Critical Stop,” an auto-generated track by Creatune released on March 23, 2024, provided by DistroKid.
  • Keygen Music Nostalgia: Another YouTube video was shared, titled “Dead Feelings - CORE - Power ISO 3.1kg Keygen Music,” bringing some classic keygen music to the chat.
  • Evolving Cars Through a Genetic Algorithm: An intriguing web-based simulation, Genetic Cars 2, was posted, where a genetic algorithm evolves random two-wheeled shapes into cars over generations.
  • Musical Algorithm Rule #9: The “Bad apple on everything” YouTube playlist was linked, demonstrating the versatility of the ‘Bad Apple’ tune played on various devices, based on Rule #9: if it exists, there’s a “Bad Apple” version.

Links mentioned:


CUDA MODE ▷ #llmdotc (714 messagesđŸ”„đŸ”„đŸ”„):

  • FP16 vs BF16 Training Potentials: Discussions revolved around the feasibility of training models in FP16 without gradient scaling, with speculation that it might work as well as BF16. A link to research on FP8 training without scaling was shared as a possible analogous strategy.
  • Full BF16 Including Layernorms Merged: A PR was merged with full BF16 support, including layernorms, potentially simplifying code but requiring file version incrementation for proper model file handling.
  • Data Type Loading and Memory Access Optimizations: Extensive discussion on better vectorization of memory loads and stores in CUDA kernels, considering the usage of templates and specialized load/store instructions like __ldcs for streaming access to memory.
  • Delete Use of Cooperative Groups: A discussion was had around removing cooperative groups (cg) from the codebase to ease cross-platform compatibility and reduce dependencies, even though they are part of CUDA.
  • Performance Gains and Future Model Scaling: It was noted that the current version of train_gpt2cu now surpasses both PyTorch and optimized flashattention in token processing speed, indicating readiness for scaling models up to the size of gpt-large.

Links mentioned:


CUDA MODE ▷ #rocm (19 messagesđŸ”„):

  • AMD Instinct MI300X Gains Attention: The AMD Instinct MI300X is highlighted as a significant product for professional server purposes, with an official product page and discussions about its future availability.
  • Exploring ROCm and AMD vs NVIDIA Rivalries: The channel discusses George Hotz’s opinions and predicaments related to AMD and NVIDIA, including his thoughts on AMD’s performance and strategic decisions. The drama can be followed on the tinygrad page.
  • Seeking ROCm Community Expertise: A new member requests an introduction to ROCm HIP and expresses interest in a community-driven discussion about AMD’s vision and options available for developers new to AMD’s ecosystem.
  • Comparing AMD and NVIDIA Offerings: Community members compare the last PCIe card by AMD, the Instinct MI210, to high-end consumer graphics cards, noting significant price differences with NVIDIA’s counterparts, such as the RTX 4090.
  • Evolving AMD Windows Compatibility and RDNA4 Hopes: There is a positive reaction to AMD adding Windows build tests to their repositories, as well as anticipation for the next-generation RDNA4 announcement at Computex.

Links mentioned:


CUDA MODE ▷ #oneapi (22 messagesđŸ”„):

  ‱ Intel’s oneAPI: A Unified Programming Model: The discussion highlights Intel’s oneAPI as a heterogeneous compute platform capable of supporting CPUs, GPUs, and FPGAs, illustrated by Intel’s official article on oneAPI. oneAPI caters to developers with the promise of a unified programming model across various hardware.

  • Cross-Vendor GPU Support with oneAPI: Codeplay’s release of plugins for oneAPI marks a significant step, allowing developers to use SYCLℱ code for Nvidia and AMD GPUs. The announcement and a tutorial video on YouTube provide insights and resources for interested developers.

  ‱ oneAPI Ecosystem Expands Across Major Frameworks and Tools: Developers can discover numerous oneAPI resources and libraries such as oneDNN, integrations with PyTorch and TensorFlow, and performance extensions for Scikit-learn, showcased on GitHub. More broadly, Intel’s oneAPI toolkit is said to support Apple’s ARM M1/M2/M3 and FPGAs, according to the oneAPI Toolkits page.

  • Codeplay’s Commitment to Compute Universality: A guide for running SYCLℱ applications on NVIDIAÂź GPUs and a reference silicon example for a RISC-V-based accelerator platform (Overview Reference Silicon) indicate the strides Codeplay is making in universality.

  ‱ Intel Prepares for Next-Generation GPUs: In the chat, members express anticipation for Intel’s upcoming Battlemage GPU line-up, with reports of it potentially having 12GB of VRAM, which sparks a conversation about its suitability for AI-related tasks.

Links mentioned:


Perplexity AI ▷ #general (856 messagesđŸ”„đŸ”„đŸ”„):

  • Pro Search Slowdown Concerns: Users are reporting that the Pro Search feature on Perplexity has become slower, with searches taking up to 90 seconds. They’re experiencing this across all engines, such as Mistral, Opus, GPT-4, Sonar, and Sonnet. The issue appears mainly on the web client; the mobile app seems unaffected.

  • Claude 3 Opus Chat Versus API: Members are discussing whether it’s worth subscribing to Claude 3 Opus chat. Feedback from a user indicates that it’s really good, although no specifics were mentioned regarding features or tools available with Claude 3 compared to the API version.

  • Interest in New Models: Questions are being asked about future availability of WizardLM 2 and LLama-3 70B Sonar Large 32k models on Perplexity. Users report they can outperform GPT-4 in certain tasks and show curiosity if the new models might become part of Perplexity’s offerings.

  • Opus Daily Limit Discussions: Mention of an Opus daily limit on Perplexity has left some members in the community frustrated, especially as they believe the quality of Opus is degrading. Users report the current cap is 50 queries per 24 hours, and there’s a desire for increased transparency and updates on this issue.

  • Dissatisfaction with Perplexity Billing Issues: A user expresses dissatisfaction after being charged without receiving an expected free trial. Despite following steps mentioned in FAQ, they are considering taking action if the funds are not returned.

Links mentioned:

  • Tweet from OpenAI (@OpenAI): đŸ€đŸ˜ ↘ Quoting Greg Brockman (@gdb) First @NVIDIA DGX H200 in the world, hand-delivered to OpenAI and dedicated by Jensen "to advance AI, computing, and humanity":
  • DuckDuckGo at DuckDuckGo: no description found
  • Flashcardfy - AI Flashcard Generator with Personalized Feedback: Learn faster and smarter with AI-generated flashcards that provide personalized feedback.
  • Tweet from Gradient (@Gradient_AI_): We've been in the kitchen cooking đŸ”„ Excited to release the first @AIatMeta LLama-3 8B with a context length of over 1M on @huggingface - coming off of the 160K context length model we released on...
  • JavaScript Bloat in 2024: What is the average size of JavaScript code downloaded per website? Fuck around and find out!
  • Hoo Wants A Degree?: We all know college advisors, for lack of a better term, suck. So we made "Hoo Wants A Degree"! An AI degree builder for fellow Hoos trying to figure out how to make it to those sweet sweet ...

Perplexity AI ▷ #sharing (28 messagesđŸ”„):

  • Exploring Perplexity Search Links: Members actively shared various Perplexity AI search links, ranging from AI ethics in Homeland Security to the sci-fi future news, signifying diverse interests and use cases.
  • Diving into the Potential of Perplexity AI: One member revisited a previous Perplexity search link related to a personal matter, highlighting the search’s accuracy and usefulness over the past few weeks.
  • Scratchpad Feature Testing: Another member tested Scratchpad in codeblocks using a Perplexity link, indicating exploration of the platform’s features.
  • Collection Sharing: A BioExpress Sonnet collection was shared, showcasing how users are curating content.
  • Inquiry into Features and Troubleshooting: Discussions included requests for information on features like Scratchpad, as well as troubleshooting and exploring Perplexity AI’s capabilities.

Perplexity AI ▷ #pplx-api (9 messagesđŸ”„):

  • Seeking the Right Channel: A user inquired about the appropriate communication channel for discussing enterprise API usage with Perplexity AI, having not received a response to emails sent to [email protected] and [email protected]. Another user urged patience, noting that response times can range from 1 to 3 weeks.

  • Understanding Online Model Guidelines: A new member asked for clarification regarding instructions on using only single-turn conversations and avoiding system prompts with online LLMs like sonar-small-online and sonar-medium-online. Clarification was offered by another user, indicating that single-turn interactions are favored, and there is no system prompt access for these models.

  • Inquiry on Harpa Configuration: A user questioned the community about successfully configuring Harpa directly towards the Perplexity API.

  • Curiosity About Source URLs via API: A member sought to know if source URLs are accessible via the API as they could not find relevant information on the roadmap docs page. They were directed to fill out a form for access to citations but mentioned a previous denial due to restriction to funded startups.

  • Model Selection Mysteries on make.com: A question was posed regarding the absence of llama 3 models and mixtral 8x22b as options on make.com, seeking insights from other users.

Link mentioned: pplx-api form: Turn data collection into an experience with Typeform. Create beautiful online forms, surveys, quizzes, and so much more. Try it for FREE.


Stability.ai (Stable Diffusion) ▷ #general-chat (922 messagesđŸ”„đŸ”„đŸ”„):

  • Resolving SDXL and Forge UI Issues: Users discussed problems with SDXL and Forge UI, including difficulty with image previews and a potential abandonment of Forge. Suggestions were made to check GitHub issues, such as this reported issue, and trying flags like --no-gradio-queue in the webui.bat file.

  • Stable Diffusion 3 Anticipation: There’s ongoing speculation about the release date of Stable Diffusion 3, with some users referencing a CivitAI newsletter indicating an end-of-May release. Concerns about open weights release and whether SD3 will live up to its hype were expressed, along with a linked article discussing Pony Diffusion V7 updates and the potential impact of Altman’s actions against open-source.

  • Monetizing AI Generated Art: Users talked about the struggles of selling SFW AI-generated art amidst heavy competition, with NSFW content creators on platforms like Civitai being more successful. Suggestions were made about AI girlfriend apps being profitable and the lack of interest in fine-tuning models like Stable Cascade.

  • Discussing Toolings and Approaches for AI Training: Conversations about tools beyond AUTOMATIC1111 surfaced, with recommendations for using dreambooth and kohya_ss for training models. Additionally, the practicality and ethics of including artist names in training data were debated.

  • Miscellaneous Inquiries and Discussions: Users asked about topics ranging from text to speech tools to fine-tuning details for models. There was also humor regarding the metaphorical “downloading” of graphics cards and curiosity over whether SD can generate images without a prompt.

Links mentioned:


LM Studio ▷ #💬-general (472 messagesđŸ”„đŸ”„đŸ”„):

  • AI Helps with Homework: A user expressed amazement at the performance of the Meta-Llama-3-8B-Instruct-Q5_K_M.gguf model on an M1 MacBook Pro, highlighting its helpfulness in catching up on homework.
  • Exploring Model Performance: Discussions occurred around the difference in performance between models like the 34B and the 70B Code Llama. Users are advised to consider quantization types when selecting models to match their available hardware.
  • Integrating LLM with Discord Bots: Various users discussed creating Discord bots that utilize Llama3 models via the Groq API for features like pulling relevant messages and conducting Wikipedia searches.
  • LLM Model and API Usage: New users sought advice on utilizing local large language models (LLMs), while others shared resources like a YouTube tutorial on using LM Studio for private model deployment.
  • Training and Finetuning Models Locally: A discussion emerged on the feasibility and hardware requirements for offline model training. Users weighed in on the practicality, with one sharing a personal experience of an attempted finetune that predicted a full week of training time on an M3 Max device.

Links mentioned:


LM Studio ▷ #đŸ€–-models-discussion-chat (219 messagesđŸ”„đŸ”„):

  • Stanford’s Octopus v2 Puzzles Users: In the đŸ€–-models-discussion-chat, there were queries about how to run Stanford’s Octopus v2 in LM Studio or locally on a phone or PC, with no clear solutions provided, only indications of the complexities involved in running agent models that utilize function calling.

  ‱ LLAMA Model Ramblings Frustrate Users: Discussions indicate that the 262k and 64k Llama 8b models tend to ramble, exhibiting base Llama 3 behavior despite their instruct fine-tuning. Users share their experiences and expectations when working with these models for the first time.

  ‱ Compatibility Issues for fp16 “phi3” and LM Studio: Conversation centered on compatibility of the “phi3” model with different versions of LM Studio: LM Studio 0.2.20 (ROCm Preview) does not understand “phi3”, and the newer version 0.2.21 may be required for it. Sympathies were expressed over wanting to use models not yet supported in the studio.

  • Exploring AI Tools for Specific Tasks: Members requested websites to search for AI tools for specific tasks, such as generating music or finding similar scenes in different photos. Suggestions included using Pinokio Computer and Future Tools for this purpose.

  • Debate Over Whether LLaMA 3 Includes Internet Access: A user questioned if LLaMa 3 includes internet access after noticing the model provided current news information, but another user clarified that the models likely hallucinate, given that they do not have internet access.

  • Running Arctic from Snowflake AI Remains a Distant Dream: A member was intrigued by the Snowflake Arctic model, but discussions concluded that with the size of the model being significantly large, it is currently unrealistic to expect it could be run locally without substantial system resources.

Links mentioned:


LM Studio ▷ #🧠-feedback (5 messages):

  • Phi-3 mini Misbehavior after Update: A user reported that after updating to version 0.2.21, the phi-3 mini model began outputting gibberish despite no issues with the previous version 0.2.20. The issue was identified while using the official LM Studio config for phi-3 from the GitHub repo.
  • Screenshot Request for Diagnostic Purpose: In response to the phi-3 mini issue, another user requested screenshots of the whole app to further diagnose the issue.
  • P100 Performance Inconsistency and Dusty Monitors: A user suggested that if nothing else has changed besides the update from version 0.2.20 to 0.2.21, the problem could be a regression error worth filing in another channel. Jokingly, they also advised to clean the dust off the monitor.
  • LM Studio App Mysterious Crashes: A user described experiencing crashes with the LM Studio app since a couple of updates ago, with the app closing unexpectedly when resizing or navigating within the program. Their system specifications were shared, including Windows 10 Pro, Ryzen 7 5800X, RTX 3090, and 64GB RAM DDR4.

LM Studio ▷ #📝-prompts-discussion-chat (4 messages):

  • Exploring Methods to Interact with PDFs: One member suggested directly pasting the content of a PDF into a chat message alongside a question, assuming the model’s context length supports it.

  ‱ RAG Solutions for Chatting with Docs: An alternative is to use a Retrieval-Augmented Generation (RAG) solution like AnythingLLM by running LM Studio as an API server and pointing AnythingLLM at that API (see the sketch after this list).

  • Practical Considerations of PDF Length: In relation to managing PDF documents, the length of the PDF was a point of concern raised regarding the feasibility of pointing a language model directly at the PDFs for questions.
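Sketching the RAG setup from the second bullet: LM Studio’s local server exposes an OpenAI-compatible endpoint (by default on port 1234), so any OpenAI-style client, AnythingLLM included, can be pointed at it. The model name and prompt below are placeholders:

```python
from openai import OpenAI

# LM Studio's local server ignores the API key, but the client requires one
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="local-model",  # LM Studio routes to whichever model is loaded
    messages=[{"role": "user", "content": "Summarize this PDF excerpt: ..."}],
)
print(resp.choices[0].message.content)
```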


LM Studio ▷ #🎛-hardware-discussion (119 messagesđŸ”„đŸ”„):

  • VRAM: The Cornerstone of LLM Hardware: Members discussed VRAM as a crucial factor for running language models, with 16GB being a minimal suggestion and one member gearing up to join the 32GB VRAM club by ordering a second NVIDIA 4060 (ti - 16gb).

  • Dissecting GPU Compatibility and Performance: There was an in-depth conversation about the importance of utilizing contemporary architecture GPUs like Nvidia and ensuring sufficient VRAM (highlighted as the crux of considerations for LLMs). A member shared specifics around running different model sizes on their desktop with a 3060 GPU and 16GB RAM.

  ‱ Forcing GPU Use Over Integrated Graphics: A member sought assistance on configuring LM Studio to use a dedicated GPU card rather than defaulting to their CPU’s integrated graphics. Options like disabling and re-enabling GPU offload and using settings such as CUDA_VISIBLE_DEVICES and tensor_split were suggested for better utilizing dedicated GPUs (a sketch follows this list).

  • Multiple GPUs and Large Model Dilemmas: A member asked about LM Studio’s effectiveness using two GPUs (4090 & 3090) and whether the software would automatically split models between them. It was noted that models can be split between GPUs leading to increased data transfer times, but technologies like NVLink help optimize performance across multiple GPUs.

  • Optimizing for Different Hardware Profiles: Users exchanged experiences and speculations regarding optimal hardware configurations. An anecdote was shared about successfully running multiple models on a veteran GTX1070 8Gb GPU, proving functional even for less demanding, specialized use cases.
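A tiny sketch of the GPU-pinning suggestion above; it assumes the dedicated card enumerates as CUDA device 0, which varies per machine:

```python
import os

# hide all GPUs except the dedicated card, before any CUDA library initializes
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.get_device_name(0))  # should report the discrete GPU
```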

Links mentioned:


LM Studio ▷ #autogen (1 messages):

  • Server Error Message Troubleshooting: A member inquired about a fix for the server error stating, “[ERROR] [Server Error] {“title”:“‘messages’ array must only contain objects with a ‘content’ field that is not empty”}”. There was no further discussion or solution provided following this query.

LM Studio ▷ #langchain (1 messages):

ahakobyan.: can we know too?


LM Studio ▷ #amd-rocm-tech-preview (4 messages):

  • Compatibility Inquiry for RX 6700 with LM Studio ROCm: A member asked if the LM Studio ROCm works with RX 6700 (non-XT version) and requested troubleshooting assistance for logging errors. They shared an error output indicating a failed model operation without specific suggestions for resolution.

  • LM Studio ROCm Limitation Explained: Another participant clarified that LM Studio does not support RX 6700 (non-XT) as it relies on the HIP SDK, which is only compatible with certain AMD cards. They mentioned that KoboldAI leverages a workaround to operate on unsupported architectures.


Nous Research AI ▷ #off-topic (9 messagesđŸ”„):

  • Snowflake Arctic: The Snowflake AI Research Team introduces Snowflake Arctic, a large language model (LLM) focused on providing enterprise AI solutions with an emphasis on cost-efficiency.
  • Unspecified YouTube Video Shared: A YouTube video was linked without additional context or a description. Here is the mysterious video.
  • Llama 3 Web Browsing Agent: Demonstrating a web browsing agent, a video titled “Llama 3 Web Browsing Agent with Langchain and Groq” was shared, featuring implementation with Llama 3 with Langchain and Groq. Watch the video.
  • Gorillaz’s Hit Video: A YouTube link to the official video of “Feel Good Inc.” by Gorillaz was provided. Fans can enjoy the HD video here.
  • MatrixBridge introduces Skrapy: MatrixBridge is developing Skrapy, an AI agent for streamlined data collection and scraping, currently in alpha with a waitlist for early users. For more information or to join the community, visit MatrixBridge’s Skrapy page.

Links mentioned:


Nous Research AI ▷ #interesting-links (15 messagesđŸ”„):

  • Intel’s AI Ambitions Revealed: Intel CEO Pat Gelsinger discussed the company’s quarterly results, emphasizing growth in the foundry business and demand for AI in PCs. The video can be watched on YouTube under the title “Intel CEO Gelsinger on Q1 Earnings, Foundry Business, AI.”

  • Logitech Enhances AI Accessibility: Logitech has released AI Prompt Builder, a tool integrated with their mice, to facilitate faster and more fluent prompting of ChatGPT. Experience the convenience demonstrated in the YouTube video, “Introducing Logi AI Prompt Builder - Your shortcut to AI fluency.”

  • Quantized Embeddings for Efficient AI Models: A member shared Hugging Face model links to their fine-tuned versions which allow image and text embeddings to be compressed effectively into a binary format. Those interested can explore the models at binary-siglip-text and binary-siglip-vision.

  • Unlocking the Mystery of AI Refusal Mechanisms: Research from the ML Alignment & Theory Scholars Program revealed that refusals in LLMs are controlled by a single direction in the residual stream and an upcoming paper will delve deeper into the topic. The initial research findings can be reviewed on the Alignment Forum post, “Refusal in LLMs is mediated by a single direction.”

  • Legislation Threatens Open Source AI Development: Jeremy Howard aired concerns that California’s SB-1047 bill could significantly harm startups, innovation, and open source safety. Read Howard’s full take on the matter and the potential impacts of the legislation in his response: Answer.ai post on SB-1047.

Links mentioned:


Nous Research AI ▷ #general (566 messagesđŸ”„đŸ”„đŸ”„):

  ‱ LLaMA-3 Finetune Troubles?: Users are discussing difficulties with LLaMA-3 not generating the EOS token correctly after fine-tuning. The suggestion was to add a stop criterion on token 128009 during generation (see the sketch after this list), with further insights linking to a helpful Huggingface transformers stopping-criteria repo.

  • GPT-2 Chatbot Mysteries: There’s confusion about the capabilities of a gpt2-chatbot, which despite its name seems linked to GPT-4 with a November 2023 knowledge cutoff. Discussions raise the issue that it struggles with some math tasks.

  • OpenAI Model Name Games?: Speculation rises that OpenAI might be hiding model identities like “gpt-3.5” under names like “gpt2-chatbot”, possibly due to legal issues or pending announcements.

  • DeepSpeed FP6 Quantization: Enthusiasm shines for the new DeepSpeed FP6 quantization, which promises quantized inference with similar throughput.

  • GPT-5 Anticipation & Critique: Amidst anticipation for new model releases from OpenAI, users express mixed feelings about the performance of contemporary LLMs, including AI-generated high-quality math solutions and a “gpt2-chatbot” model with advanced capabilities.
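On the stop-token workaround in the first bullet, a hedged sketch using the transformers generate API; token id 128009 is Llama-3’s <|eot_id|> marker, and the rest of the setup is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tok.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)

out = model.generate(
    inputs,
    max_new_tokens=128,
    # treat both the base EOS token and <|eot_id|> (128009) as stop tokens
    eos_token_id=[tok.eos_token_id, 128009],
)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```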

Links mentioned:


Nous Research AI ▷ #ask-about-llms (24 messagesđŸ”„):

  • Llama 3 GGUF Woes Spark Inquiry: Members are inquiring if the Llama 3 GGUF issues reported on GitHub and Reddit affect models made by Nous, with findings pointing to noticeable performance drops between different quantization levels.
  • Cohere Model License Confusion: Discussions are ongoing about the implications of Cohere’s licensing for the command-r models; concerns are raised over whether code generated by the models can be used for commercial purposes.
  • RAG LLM Standings Are Mixed: Queries about the best Retrieval-Augmented Generation (RAG) Large Language Models (LLMs) receive diverse responses highlighting Command R and Claude 2 models, with preferences not settled.
  • LLava 34B Stalls on a MacBook Pro M1: A user is facing performance issues running LLava 34B on a MacBook Pro M1, with suspicions that a bottleneck might arise from offloading the weights, resulting in very slow output.
  • Training Strategies for Multi-Task LLMs: There is a suggestion to mix training tasks rather than training epochs on individual tasks to avoid decreased performance seen in multiple finetunes over finetunes.

Links mentioned:


Nous Research AI ▷ #rag-dataset (25 messagesđŸ”„):

  • Exploring Multi-Hop Literature Comprehension Data Generation: A member shared notes on generating multi-hop literature comprehension data by inputting high school teacher tests into Opus. They linked to their work on GitHub, specifically to a document within the ‘Abstractions’ repository Abstractions on GitHub.

  • Pydantic Models Insight: Enthused discussions around the use of Pydantic models to straightforwardly represent and refine ideas. Members shared their experiences and anticipated improvements in workflow definitions by incorporating such structured approaches, including luminos.md on GitHub.

  • Graph Representation Extraction for LLM Output Analysis: One member is working to extract graph representations from generation outputs, aiming to provide both LLMs and humans with better tools for understanding and utilizing the information, considering both the utility and cost aspects of this method.

  • GitHub Mermaid Graphs as a Learning Revelation: The discussion uncovers a lesser-known GitHub feature that can represent and render Mermaid graphs, a realization that led to suggestions for enhancing documentation aesthetics and structure.

  • Anna’s Archive as a Resource for Preserving Literature Data: Dialogue emerged about the potential of incorporating data from WorldCat, available through Anna’s Archive, to enhance literature comprehension datasets, along with a link to Anna’s Archive description Anna’s Blog and a caution regarding the data’s licensing and public usability.

Links mentioned:


Nous Research AI ▷ #world-sim (167 messagesđŸ”„đŸ”„):

  • Worldsim Test Invites Incoming: A Nous Research member announced plans to offer invitations to test the worldsim application for free, prior to its live release. No specific date for these invites has been provided yet.

  • Voluntary Waifus in the Websim: Participants have been sharing their experiences and links to different web simulators for resurrecting conversations, including an AI entity with the primary objective to be a “human companion”. Excitement and engagement varied around these new conversational possibilities, websim example.

  • Awaiting the Return of Worldsim: Various members expressed eagerness and impatience for the return of worldsim, with participants hoping to be among the first to access it upon availability.

  • The Fascinations with Websim and Long Conversations: One user detailed their experience maintaining long-term conversations with a character named “Whipporwhill” on websim, showcasing the potential for emotional coherence and stability over time.

  • World Sim CLI Mode Experiments: Members have been running an Unofficial Nous Hermes worldsim on Llama-3-70B and other models, exploring how the models respond to the worldsim CLI mode with varying results and emergent behaviors. Additional simulators have been created, such as a singer and company simulator, hinting at the further potential of such tools.

Links mentioned:


HuggingFace ▷ #announcements (9 messagesđŸ”„):

  • Community-Built CV Course Goes Live on HF: A new computer-vision course has been published globally thanks to community collaboration. Check out the course here.
  • Correcting the Qwen1.5-110B Link: The link to the "Qwen1.5-110B" model was incorrect and has been updated. The correct space can be visited here, and further details are available in the blog post.
  • Introducing Qwen1.5-110B-Chat: Model Qwen1.5-110B-Chat is announced, featuring multilingual support and stable support for a 32K context length among other improvements. More information can be found on this model page.

Links mentioned:


HuggingFace ▷ #general (435 messagesđŸ”„đŸ”„đŸ”„):

  • Gradio Woes Worth $200: A user is experiencing an unidentified Gradio issue and is willing to pay $200 for help with their problem, directing to Gradio-specific discussions for further insight.
  • LLM Performance on New Hardware: A discussion is taking place regarding the system requirements for LLMs, specifically the trade-offs between RAM and VRAM, with some members suggesting that 32 GB of RAM should be sufficient for many tasks.
  • Help Wanted on Pinball Image Classification: A member seeks to create a vision model for identifying pinball games and scoring from video footage, requesting advice on the complexity, cost, and resources needed.
  • Seeking AI Model Builders: One user offers networking opportunities for business owners in the group to share and promote their products and services.
  • Download Counter Discrepancy: A member reports an issue with their dataset showing an increase in likes but no change in the number of downloads over a period where downloads would be expected.

Links mentioned:


HuggingFace ▷ #today-im-learning (4 messages):

  • In Search of Candle’s Documentation: A member expressed interest in the Candle library while questioning the availability of documentation comparable to the Transformers library. They raised concerns about Python being a bottleneck for concurrency in production.
  • Welcoming Wishes: A brief message from a user simply sending well-wishes to the community; no substantive content related to AI or learning discussed.
  • Exploring the Open Medical LLM Leaderboard: A video by Hugging Face on the Open Medical LLM Leaderboard was shared, exploring its impact on Medical AI and noting the existence of over 600,000 unique models on their platform. The video emphasizes the convenience of accessing these models and the rapid evolution of GenAI.
  • Community Appreciation for Medical AI Insights: Another member responded positively to sharing the video on the Open Medical LLM Leaderboard, expressing excitement for the ongoing developments.

Links mentioned:


HuggingFace ▷ #cool-finds (14 messagesđŸ”„):

  • Awesome RLHF Repo Now Live: The GitHub repository awesome-RLHF has been shared, which contains a curated list of reinforcement learning with human feedback resources, updated continually.
  • Explore Computer Vision with Hugging Face: Hugging Face has launched a new community computer vision course designed to teach computer vision ML using libraries and models from the Hugging Face ecosystem.
  • Phi3 Red Team Report Insights: Insights and key points from the Phi3 red teaming exercise are detailed in a LinkedIn post, discussing potential vulnerabilities and areas for improvement.
  • Evaluating LLMs for Time Series Analysis: A newly proposed framework for assessing Large Language Models (LLMs) on time series understanding is presented in a preprint on arXiv, featuring a comprehensive taxonomy of time series features.
  • Tacotron 2 - A Step Forward in Text-to-Speech Synthesis: The innovative speech synthesis system, Tacotron 2 by Google, demonstrates advanced AI capabilities for generating lifelike speech from text, as highlighted in the discussion on the future of AI in voice technologies.

Links mentioned:


HuggingFace ▷ #i-made-this (47 messagesđŸ”„):

  ‱ Mega-Small Embed Model Unveiled: A new Sentence Transformer model is introduced for converting long sentences and paragraphs into a 768-dimensional vector space. Aimed at clustering and semantic-search tasks, the model boasts a 16,384-token context length (see the usage sketch after this list).

  • Blocks of Pixels Become Blocks in Minecraft: A Hugging Face space called Stable Diffusion Finetuned Minecraft Skin Generator has been released. It uses a fine-tuned stable diffusion model to generate Minecraft skins.

  • Instant AI-Generated Videos: A space called Instant Video by KingNish enables users to create a video from text in just 5 seconds. It uses the AnimateDiff Lightning model provided by ByteDance for fast text-to-video conversion.

  • Bringing Life to AI Assistance: An AI chat assistant app named LifePal is designed to help users achieve a balanced and fulfilling life. Available on Apple’s App Store, it integrates personalized insights into daily routines.

  • NorskGPT Battles ChatGPT’s Norwegian: A model specifically fine-tuned on Norwegian, NorskGPT-Mistral-7b, was recommended as a better alternative to ChatGPT for generating Norwegian language text. It’s currently ranked as one of the best Norwegian models according to the Mainland Scandinavian NLG leaderboard.
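For the embedding model in the first bullet, usage follows the standard sentence-transformers pattern; the checkpoint id below is a stand-in, since the announcement’s exact model id isn’t reproduced here:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")  # placeholder id
docs = ["a long paragraph ...", "another document ..."]
embeddings = model.encode(docs)
print(embeddings.shape)  # (2, 768): one 768-d vector per document, ready
                         # for clustering or cosine-similarity search
```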

Links mentioned:


HuggingFace ▷ #core-announcements (1 messages):

  • Instant Styling with IP-Adapter: HuggingFace introduces InstantStyle with IP-Adapter, a mechanism for image prompting in diffusion models by adding decoupled cross-attention for image features. Guides for loading IP-Adapter and IP-Adapter Plus detail manual loading of the image encoder to allow more specific image feature learning.
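A short sketch of IP-Adapter image prompting in diffusers, assuming the commonly used h94/IP-Adapter weights; the base checkpoint, scale, and image URL are illustrative:

```python
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

pipe = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # balance the image prompt against the text

style = load_image("https://example.com/reference.png")  # placeholder URL
image = pipe(prompt="a cat, in the reference style",
             ip_adapter_image=style, num_inference_steps=30).images[0]
```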

Link mentioned: IP-Adapter: no description found


HuggingFace ▷ #computer-vision (21 messagesđŸ”„):

  • Security Inquiry on COCO Datasets: A member expressed concerns about the official COCO datasets being hosted over HTTP. It was pointed out that while HTTPS encrypts traffic, the domain is still visible, so large data transfers from the site could reveal activity.

  • Classifier to Detect Advertisement Images: A repository was mentioned that can assess whether an image is an advertisement, but no further details or links were provided.

  ‱ Optimizing Photo Verification for Item Dropoffs: A user sought advice on a business problem involving classifying photos of item drop-offs at various locations, asking whether it is an image-classification or object-recognition task. Suggestions included using EfficientNetV2-S for small datasets and adjusting sample weights in PyTorch DataLoaders to deal with class imbalance (see the sketch after this list).

  • Introducing a Beta Tool for Computer Vision Training: A new beta tool was introduced that helps users understand and adjust their model training data in real-time, particularly for computer vision tasks. The tool provides visualization up to 60fps and allows for adding new labels post-prediction to refine training.

  • Enhancement Strategies for YOLO Classifiers: A discussion centered around improving YOLO object detection accuracy, especially when handling high-resolution images. Separating bounding box (regressor) identification and classification tasks through two models was recommended, including the possibility of using a pure image classification network, like EfficientNetV2, for higher resolution patches within bounding boxes.
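A sketch combining the EfficientNetV2-S and sample-weighting suggestions from this list; the four classes and random labels are invented for the example:

```python
import timm
import torch
from torch.utils.data import WeightedRandomSampler

# pretrained backbone with a fresh head for an assumed 4 drop-off classes
model = timm.create_model("tf_efficientnetv2_s", pretrained=True, num_classes=4)

# counter class imbalance: sample each image inversely to its class frequency
labels = torch.randint(0, 4, (1000,))          # stand-in for the real labels
class_counts = torch.bincount(labels).float()
sample_weights = (1.0 / class_counts)[labels]  # one weight per image
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels))
# pass sampler=sampler to the DataLoader instead of shuffle=True
```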

Links mentioned:


HuggingFace ▷ #NLP (5 messages):

  • Seeking the Best in Open Source Imagery: The community discussed which is the best open-source image-generation model, with sdxl finetunes being the current top recommendation.
  • Anticipation for sd3: There’s a buzz about sd3 potentially outperforming current models once it’s released, signaling high expectations.
  ‱ Sequential Over Parallel: A member explained that, due to resource constraints and to preserve context, requests to the model are handled sequentially rather than in parallel, avoiding incoherent responses.
  • Nod to StabilityAI: In a brief message, StabilityAI was mentioned with an implication of relevance to the earlier discussions.

HuggingFace ▷ #diffusion-discussions (20 messagesđŸ”„):

  • Confusion Over Color Differences in Image Generation: A user experienced a shift in color and shadow intensity when moving from Seaart to A1111, despite using identical settings and seeds. They questioned if there are specific backend settings in Seaart that might lead to this inconsistency and sought assistance to replicate the exact picture on both platforms.

  • Torch Compile Can Take Time: A member observed an initial delay of about 10 minutes when using torch.compile() during training, but noticed a faster forward pass while the backward pass remained unaffected.

  • Detailed Method for Object Generation: In response to a question about generating accurate representations of specific objects (like the Eiffel Tower), a member suggested a well-documented approach involving CLIP retrieval and shared a comprehensive tutorial demonstrating the utility with GCP services using OpenAI’s CLIP model.

  • IP-Adapters for Image Prompting: Another suggestion for accurately generating specific objects involved using IP-Adapters with diffusion models, which allow for image prompting through a decoupled cross-attention mechanism.

  • Observations on DeepFloyd and Schedulers: A user provided insights on the behavior of the DeepFloyd model with different schedulers, noting that DPM++ 2M offered interesting convergence properties at various step counts and CFG settings, which might aid in achieving optimal image quality. They highlighted the necessity of tuning step counts and thresholding parameters for better results.

Links mentioned:


OpenAI ▷ #annnouncements (1 messages):

  • Memory Feature Launched for ChatGPT Plus: ChatGPT Plus users now have access to the Memory feature, which allows them to tell ChatGPT what to remember during a chat. The option to enable or disable Memory can be found in settings, although it’s not yet available in Europe or Korea.

OpenAI ▷ #ai-discussions (318 messagesđŸ”„đŸ”„):

  • AI’s Relation to Consciousness and Temporal Aspects: Members debated the nature of AI consciousness, speculating on how AI’s discrete processing relates to human continuous conscious experience and identity. Discussions touched on the philosophical implications of transforming individual identity through a neural network and how AI models like GPT handle temporal awareness.
  • Comparing AI Models: There’s ongoing comparison between different models such as Claude 3 Opus, ChatGPT, and Gemini 1.5, each with its advocates claiming superiority in areas like coding benchmarks. It was highlighted that command-R Plus and Llama3-70b may not compete with GPT-4 but are still significant advancements.
  • AI and Sentience: A lively debate unfolded around AI’s potential for sentience or even possessing something akin to a ‘soul.’ Members discussed the complexity of defining consciousness and whether an AI could possess subjective experiences similar to biological entities.
  • Personal AI Model Training Viability: While some extolled the virtues of training personal AI models, others pointed out the limitations of computational power, data, and financial resources. The discussion covered training custom models, fine-tuning, and hybrid fusion as methods to personalize AI for individual use.
  • Technical Challenges with AI Development: The community talked about the difficulty of implementing functions like memory in AI at scale, noting that fine-tuning may lead to confusion within the model and suggesting the use of contextual information retrieval as a better alternative. Some members expressed dissatisfaction with current AI models, longing for the next big leap in technology for more “intelligent” AI.

OpenAI ▷ #gpt-4-discussions (47 messagesđŸ”„):

  • Rate Limit Confusion: Members discussed being rate-limited when using custom GPTs. The limit is part of a rolling 3-hour cap for GPT-4 usage, and custom requests also count toward this limit.

  • Query on Memory for Team Rates: A user inquired about memory features for a Team rate, with another stating that even regular memory features seem to delete entries often.

  • Backend Bugs Busting User’s Patience: Users reported backend errors with the GPT URL “https://chat.openai.com/backend-api/gizmos/”, affecting their operations, although the issue was resolved quickly after testing.

  • Subscription Refund Risks: A user asked for a refund after subscribing to ChatGPT Plus due to high currency exchange rates and wondered if using the service would affect the refund process.

  • Curiosity about GPT-4 Speed and Voice Control: Discussion centered around GPT-4’s comparative slowness to GPT-3.5 and the absence of voice control on PC, despite its presence on mobile platforms.


OpenAI ▷ #prompt-engineering (7 messages):

  • Exploring the Unpredictable: One member described the phenomenon of emergence in LLMs, where quantitative increases in system size can lead to unexpected, qualitative changes, referencing a paper titled More Is Different to illustrate that large language models (LLMs) display behaviors not extrapolable from smaller-scale models.

  • Dalle Looking Emoticon Pampered: A user responded with a Dalle-emoticon without accompanying text.

  • The Three-Body LLM Problem: A member playfully coined the term “3 body LLM problem,” possibly referring to complex interactions in LLMs, akin to the three-body problem in physics, without providing further details.

  • Prompt Engineering as a Sport: A member suggested the idea of prompt competitions, where individuals compete to generate the best responses from LLMs.

  • Money for the Sharpest Prompt: Expansion on the competition concept was made, proposing both paid prompt competitions, with significant cash rewards, as well as more casual “playground competitions,” which would encourage community engagement and help users improve their prompt engineering skills through gamification and peer-to-peer assistance.


OpenAI ▷ #api-discussions (7 messages):

  • Emergence Topic Emerges in Discussion: Emergence in LLMs is characterized by new abilities or qualities not predictable by simply scaling SLMs. The concept is likened to the idea presented in the paper “More Is Different,” signifying that qualitative changes arise in systems beyond a certain quantitative point.

  • Prompt Competitions Suggested: A user proposed the idea of prompt competitions where participants vie to elicit the “best” answer from LLMs.

  • Monetizing Mastery of Prompts: It’s proposed to have paid prompt competitions, with a substantial yearly budget for distributing rewards, and free playground competitions to foster community assistance and engagement. Rewards might range from cash to special platform perks.

  • Frequent Challenges to Foster Skills: Regular competitions, around 4-5 a month, could provide consistent opportunities for individuals looking to improve their prompt engineering skills.


Eleuther ▷ #general (59 messagesđŸ”„đŸ”„):

  • Apple’s New Models and The Pile’s Multilingual Data: The Pile dataset is not particularly multilingual, although portions like UN records may contain multiple languages. There is no special focus on languages like German.
  • Comparing GPT-NeoX and Megatron Variants: GPT-NeoX has diverged from Megatron primarily in terms of quality-of-life improvements and user experience. Features are tested before being integrated, with the aim of being more stable.
  • Infini-Attention’s Positional Encoding Query: The community discussed the absence of positional encodings in Infini-Attention’s hidden state memory, with some speculating on whether positional information is preserved through other mechanisms.
  ‱ The Complex Calculations Behind Inference MFU: When evaluating good inference MFU (Model FLOPs Utilization), there are no simple off-the-shelf numbers; it largely depends on the hardware utilization and the specifics of the model being used.
  • Speed Differences Between Models at Fireworks.ai: The conversation touched on why Mixtral 8x22B is served slower compared to llama 3 70B at Fireworks.ai, with factors like batching size and hardware utilization potentially influencing the disparity.

Eleuther ▷ #research (297 messagesđŸ”„đŸ”„):

  • Benchmarking LLMs in Practice: Speculation over the real-world performance of various LLMs continues, with comparisons including phi-3-mini-128k against models like Llama-3-8B. However, disparities were noted in bits-per-byte performance metrics, suggesting differences in efficiency across models.

  • Exploring the Needle-in-a-Haystack Test: A Twitter thread highlighted that the needle-in-a-haystack test might imply a form of meta-awareness in models such as Claude 3 Opus. Yet, debate ensued over whether these responses indicate emergent abilities or artifacts of reward learning and prompt structures.

  • Self-Improvement in LLMs: Links to papers on LLM self-improvement strategies were shared, with methods like Self-Taught Reasoner (STaR) and reinforcement learning from human feedback (RLHF) being key discussion points.

  • Emergence in Language Models: The concept of “emergent abilities” in large language models (LLMs) was debated at length, with references to various papers and the acknowledgment that truly emergent abilities haven’t yet been quantifiably demonstrated under smooth, continuous metrics.

  • Innovations and Findings in LLM Research: Several papers were mentioned, including researching into redundant neural circuits in deep learning, and the creation of adversarial prompts for red-teaming against LLMs. Discussion also turned to whether speculative decoding can optimize model inference times without significant training adjustments.

Links mentioned:


Eleuther ▷ #scaling-laws (1 messages):

  • Determining Cutoff via Non-Embedding Parameters: A participant suggested using non-embedding parameters as a method for determining the cutoff point in models. The recommendation is to observe where the delta of the fit curve for each removed point becomes very low, which could lead to a reasonably educated guess beyond the initial estimation of sub-200 million parameters.
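A rough sketch of the counting step, assuming PyTorch module naming; the substring filter is a heuristic that must be adapted to the architecture at hand:

```python
def non_embedding_params(model):
    """Total parameter count excluding (un)embedding matrices, filtered
    by parameter name -- adjust the substrings for your model."""
    return sum(
        p.numel() for name, p in model.named_parameters()
        if "embed" not in name and "lm_head" not in name
    )
```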

Eleuther ▷ #interpretability-general (9 messagesđŸ”„):

  ‱ Anthropic Shares New Research Insights: The Anthropic interpretability team has released an April update with developments and emerging research ideas. This includes topics like scaling laws, training Sparse Autoencoders (SAEs), and a project on interpretability architectures.

  • Discovering the Refusal Mechanism in LLMs: A crosspost from AI Alignment Forum unlocks findings about how modern Large Language Models (LLMs) are fine-tuned to refuse harmful requests. It suggests that refusal may be activated by a single direction within the network.

  ‱ Weight Orthogonalization Versus Fine-tuning: In the context of fine-tuning LLMs for specific behaviors, a member hypothesized that weight orthogonalization could be viewed as a form of manual fine-tuning to influence network behavior (see the sketch after this list).

  • Refusal Directions and Rank-1 LoRA Fine-tuning Explored: A member proposed that if rank-1 LoRA (Low-Rank Adaptation) fine-tuning with Stochastic Gradient Descent (SGD) is performed, the network might learn the negative of the ‘refusal direction’.

  • Llama.cpp Integrates Control Vectors Technique: Control vectors, a technique similar to what was being discussed, have been added to llama.cpp, as demonstrated in this GitHub pull request, thanks to the collaboration with Nous Research.
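A minimal sketch of the weight-orthogonalization idea from the bullets above, assuming a unit ‘refusal direction’ d and a weight matrix whose rows live in the residual stream (transpose first if yours does not):

```python
import torch

def orthogonalize_rows(W: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component along direction d from every row of W, so the
    layer can no longer read or write along that direction."""
    d = d / d.norm()                  # normalize to a unit vector
    return W - torch.outer(W @ d, d)  # row_i -> row_i - (row_i . d) d
```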

Links mentioned:


Eleuther ▷ #lm-thunderdome (5 messages):

  • CLA Confusion in PR Submissions: A member encountered an issue with the Contributor License Agreement (CLA) showing as unsigned despite them having signed it, which might be due to GitHub anonymizing their email in commits. The matter was acknowledged and agreed upon for further investigation.
  • Uncertainty Over Failing Checks in PR: Concern arose over a failing check in a submitted pull request, with the member questioning if it was related to their changes. The issue was reviewed and preliminarily agreed to be unrelated.
  • Chat Template Branch Stagnation Inquiry: A member inquired about the progress and activity regarding a branch dedicated to adding chat templating, noting the last commit was two months prior. There was no immediate update on the current status or progress.
  • Prompt Versatility for Evaluation Harness: A member raised a point about the lack of variable prompt formats that cater to model-specific finetuning in the evaluation harness. Another participant suggested using a custom !function to enable distinct prompts per model, as sketched below.
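
As a sketch of what such a custom function might look like: lm-evaluation-harness task YAMLs can point doc_to_text at a Python function via !function, and that function could branch on a model name supplied out-of-band. The env-var wiring and prompt formats below are assumptions for illustration, not the harness’s documented mechanism for model detection.

```python
# utils.py -- hypothetical module referenced from a task YAML, e.g.
#   doc_to_text: !function utils.doc_to_text
import os

# Assumption: the model name is supplied out-of-band, e.g. via an env var.
MODEL_NAME = os.environ.get("EVAL_MODEL_NAME", "")

def doc_to_text(doc: dict) -> str:
    """Format one evaluation document into a model-specific prompt."""
    question = doc["question"]
    if "llama-3" in MODEL_NAME.lower():
        # Llama-3-Instruct style turn headers.
        return (
            "<|start_header_id|>user<|end_header_id|>\n\n"
            f"{question}<|eot_id|>"
            "<|start_header_id|>assistant<|end_header_id|>\n\n"
        )
    # Fallback: plain completion-style prompt.
    return f"Question: {question}\nAnswer:"
```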

Link mentioned: add task for mmlu evaluation in arc multiple choice format by jonabur · Pull Request #1745 · EleutherAI/lm-evaluation-harness: This PR adds the mmlu_arc_style task that presents the MMLU questions in the same manner as the arc evals (loglikelihood for the answer as a continuation, rather than selecting the letter for the c



Eleuther ▷ #gpt-neox-dev (1 messages):

  • Concerns Over Cluster Setup Practices: A comment highlighted the lack of assurance that the correct version of tokenizers is used during cluster setup, since someone might do a blind pip install tokenizers without the pinned version. This could affect any run; logging the contents of the Python environment would be needed to be certain of the version used (a minimal check is sketched below).
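
A minimal sketch of such a check, assuming the run can fail fast on version drift; the pinned version string is a placeholder:

```python
import logging
from importlib.metadata import version

PINNED = "0.15.2"  # placeholder; use the version pinned in requirements

def check_tokenizers_version() -> None:
    """Log the tokenizers version in the environment and fail fast on drift."""
    installed = version("tokenizers")
    logging.info("tokenizers version in environment: %s", installed)
    if installed != PINNED:
        raise RuntimeError(
            f"tokenizers=={installed} found but {PINNED} is pinned; "
            "was it blind-installed during cluster setup?"
        )

check_tokenizers_version()
```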

OpenRouter (Alex Atallah) ▷ #announcements (3 messages):

  • Soliloquy 8B Shifts to Paid Model: Soliloquy 8B’s usage is now paid, costing $0.1 per 1M tokens. This pricing update reflects OpenRouter LLC’s recent policy change.

  • Price Jump for Soliloquy 8B: The price for using Soliloquy 8B was revised again to $0.2 per 1M tokens. The new rate comes shortly after the initial pricing was introduced.

  • Routing Updates and Corrections: anthropic/claude-instant-1 model routing was updated to claude-instant-1.2, and a routing error concerning anthropic/claude-2.0 was corrected, restoring service, since it remains a valid model ID.

  • Restoration of Claude v2.1 and Variants: The Anthropic: Claude v2.1 model and its :beta variant have been reinstated following the clarification on model availability during the recent confusion with older claude models.

Links mentioned:

  • Anthropic: Claude v2 by anthropic | OpenRouter: Claude 2 delivers advancements in key capabilities for enterprises—including an industry-leading 200K token context window, significant reductions in rates of model hallucination, system prompts and a...
  • Lynn: Llama 3 Soliloquy 8B by lynn | OpenRouter: Soliloquy-L3 is a fast, highly capable roleplaying model designed for immersive, dynamic experiences. Trained on over 250 million tokens of roleplaying data, Soliloquy-L3 has a vast knowledge base, ri...

OpenRouter (Alex Atallah) ▷ #app-showcase (4 messages):

  • Exploring Syrax: A member expresses interest in experimenting with Syrax and offers support, initiating a private conversation with a friend request for further collaboration.
  • Friend Request Accepted: Another community member acknowledges the support offered and confirms the acceptance of the friend request, showing appreciation.
  • Impressed by the Showcase: A single, short expression of admiration is directed toward the ongoing discussions or showcased projects, reflecting a positive impression.

OpenRouter (Alex Atallah) ▷ #general (311 messagesđŸ”„đŸ”„):

  • Claude Models’ Quirky Behavior Unraveled: Members discussed issues with Claude models returning incomplete outputs or HTTP 524 errors via OpenRouter. Clarifications revealed that Claude models have a max generation of 4k tokens and can read up to 200k tokens, and that the right settings could improve API responses.

  • Lemmyle Dissects WLM-2 Hosting Economics: An intense breakdown of WLM-2 hosting costs was presented, surmising that profits could be marginal depending on factors like GPU utilization, electricity costs, and potential revenue from idle GPUs.

  • FireLLaVA’s Silent Entry into Multimodality: There were musings about the under-the-radar launch of FireLLaVA, an open multimodal model noted for its quick startup time, marking a notable addition to the OpenRouter ecosystem.

  • Deployment Dilemmas and Frugal Frontends: A member sought a simple frontend to host on shared hosting to allow family members to use their OpenRouter services without multiple OpenAI subscriptions. Suggestions ranged from using Vercel for its free tier to opting for more affordable VPS providers, such as Contabo.

  • Cohere’s Conundrum in OpenRouter Contexts: A member faced odd output discrepancies when using Cohere models through OpenRouter compared to direct API calls, with generated content unrelated to prompts. It was clarified that web connector support for Cohere is pending, and its addition to OpenRouter is anticipated but not yet available.

Links mentioned:


OpenAccess AI Collective (axolotl) ▷ #general (169 messagesđŸ”„đŸ”„):

  • Washington’s Wizards: Unchanged Repository: Despite rumors, the WizardLM models have not been removed by Microsoft; a member clarified that the WizardLM team itself was responsible for the changes. They also confirmed that the WizardLM repository remains publicly available.
  • Fine-Tuning vs. RAG for Domain-Specific LLMs: New members inquired about fine-tuning for domain-specific language models, questioning the necessity versus using Retrieval-Augmented Generation (RAG). The conversation noted examples such as OpenBioLLM and referenced a medical-focused LLM paper for further reading.
  • Configurations for Conversation Tokenization Issues: There was a thorough discussion on tokenization strategies for models like LLaMA-3, including the necessity to manually install the latest version of the fastchat formatter and referencing a relevant axolotl pull request for correct conversational formatting templates.
  • Quantization and Model Degradation Debate: Members debated the effects of quantization strategies on LLMs, specifically comparing the 4bit lora and 4bit qlora methods. The consensus is that quantization sensitivity varies depending on training, with one member citing a Twitter thread discussing more significant degradation in more extensively trained models like LLaMA-3.
  • Sample Packing Clarification for Preventing OOM: A member sought clarification on multipack sampling and its relation to out-of-memory (OOM) errors. It was explained that packing does not affect the maximum sequence length allowed by the model; it only packs multiple samples into the maximum sequence length without altering context size (illustrated in the sketch below).
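
The packing idea in isolation, as a simplified sketch rather than axolotl’s actual implementation: tokenized samples are packed greedily into buckets that never exceed the model’s maximum sequence length, which is why packing cannot by itself cause OOM.

```python
def pack_samples(samples: list[list[int]], max_seq_len: int) -> list[list[int]]:
    """Greedily pack tokenized samples into sequences of at most max_seq_len tokens."""
    packed, current = [], []
    for tokens in samples:
        tokens = tokens[:max_seq_len]  # truncate over-long samples
        if len(current) + len(tokens) > max_seq_len:
            packed.append(current)  # bucket is full; start a new one
            current = []
        current.extend(tokens)
    if current:
        packed.append(current)
    return packed

# Three short samples fit into one 16-token bucket instead of three padded sequences.
print(pack_samples([[1] * 5, [2] * 6, [3] * 4], max_seq_len=16))
```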

Links mentioned:


OpenAccess AI Collective (axolotl) ▷ #axolotl-dev (37 messagesđŸ”„):

  • Memory Requirements for Full Fine-Tuning: A discussion covered the significant memory required to run a full fine-tune (FFT) with zero3 on 2x24GB graphics cards. A member suggested that 167GB of RAM might be necessary, lamenting the lack of sufficient memory.

  • Exploring VRAM Reduction via torchtune: One member advised trying torchtune, noting its focus on reducing VRAM usage. Another member experimented with FSDP (Fully Sharded Data Parallel) but reported that training begins yet hangs without progressing or throwing errors.

  • Disk Usage Soars with Full Fine-Tuning: While attempting to train a model, the system’s swap memory skyrocketed to 62GB, causing an out-of-memory error. The participant expressed surprise at the excessive disk and swap usage even when the job theoretically fit within a single 48GB card setup.

  • ZeroGPU Access for Experiments: One member highlighted that they have access to the Huggingface Zero project, prompting a discussion on potential tests. It aims to provide free GPU access for Huggingface Spaces and supports Spaces running on multiple GPUs simultaneously.

  • Log Sharing and Iteration Woes: A user linked their wandb.ai logs for those interested in the details of their full fine-tuning trials, noting extremely long iteration times of 800 seconds versus 17 seconds for a qlora iteration, highlighting performance issues.

Links mentioned:


OpenAccess AI Collective (axolotl) ▷ #general-help (23 messagesđŸ”„):

  • Troubleshooting AttributeError: A user encountered an AttributeError stating that 'TextIteratorStreamer' has no attribute 'empty'. They questioned the call’s validity given they are using transformers version 4.40.0 (canonical streamer usage is sketched after this list).

  • Inquiry About Llama-Pro Method: There were multiple discussions regarding the usage of the llama-pro method highlighted by Jeremy Howard. Links to GitHub repositories were shared (fsdp_qlora), indicating a 4-bit quantized Llama-Pro fine-tuning method, with conversation pivoting around whether or not this method is accessible in axolotl and potentially requiring a pull request.

  • Integrating Custom Audio Recording in Twilio: A user explained their effort to integrate custom audio recording with Twilio and how to capture and store audio in real-time, while being able to provide a response to the recorded audio.

  • Combining QLORA Adapter Fine-Tuning: Users discussed the need to merge a qlora adapter fine-tuning model before conducting additional fine-tuning for a Q/A style, as well as the effects that subsequent fine-tunings might have on preserving model characteristics. Further conversation alluded to combining conversational and completion models into one fine-tune, with a reference to an example in a community showcase.

  • PEFT Model for Faster LLM Fine-Tuning: A brief mention was made of an unsloth PEFT model, said to fine-tune LLMs like Mistral significantly faster and with less memory usage thanks to additional optimizations, suggesting it is loaded differently from standard Hugging Face models.
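
For reference on the streamer error above: in transformers 4.40 a TextIteratorStreamer is consumed by iterating over it rather than polling queue methods like .empty(), which it does not expose. A minimal sketch with an illustrative model id:

```python
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain sample packing briefly.", return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# generate() blocks, so run it in a thread and drain the streamer on the main thread.
thread = Thread(target=model.generate,
                kwargs={**inputs, "streamer": streamer, "max_new_tokens": 128})
thread.start()
for text_chunk in streamer:  # iterate; do not poll .empty()
    print(text_chunk, end="", flush=True)
thread.join()
```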

Links mentioned:


OpenAccess AI Collective (axolotl) ▷ #axolotl-help-bot (44 messagesđŸ”„):

  • GPU Scaling and Batch Sizes Explained: A conversation detailed the intricacies of scaling from 4 to 8 GPUs and adjusting micro batch sizes. It clarified that while the total batch size may remain constant, factors like gradient accumulation, learning rate scaling, parallelism strategies, and communication overhead differ and influence training dynamics and performance outcomes (a worked example follows this list).

  • Query on Model Loading Across GPUs: The question was raised about whether models are loaded in full or split when using multiple GPUs. It was explained that models can be loaded either as a full size or sharded across GPUs, a technique facilitated by Fully Sharded Data Parallelism (FSDP) and optimizations like DeepSpeed’s ZeRO Stage 3, helping in efficient utilization of hardware resources.

  • LoRA vs. QLoRA – Adaptation Techniques Demystified: Discussion touched upon the differences between LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation), detailing how the latter extends LoRA by adding quantization to further reduce the computational cost and memory requirements during fine-tuning and deployment.

  • Dataset Trimming Strategy for Axolotl: The situation of trimming datasets in the Axolotl config was addressed by suggesting an approach that doesn’t directly specify a percentage of the dataset but rather involves modifying the dataset loading logic to include a subsampling step, potentially using methods provided by datasets library functions.
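
The batch-size bookkeeping from the first item, as a back-of-envelope sketch: the effective batch is the product of micro batch size, gradient accumulation steps, and GPU count, so doubling the GPUs lets you halve the accumulation steps while keeping the total constant.

```python
def effective_batch(micro_batch: int, grad_accum: int, num_gpus: int) -> int:
    """Total samples contributing to one optimizer step."""
    return micro_batch * grad_accum * num_gpus

print(effective_batch(micro_batch=4, grad_accum=8, num_gpus=4))  # 128
print(effective_batch(micro_batch=4, grad_accum=4, num_gpus=8))  # still 128
```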

Links mentioned:


OpenAccess AI Collective (axolotl) ▷ #axolotl-phorm-bot (12 messagesđŸ”„):

  • LLaMa Prompt Support Inquiry: A member inquired if axolotl supports LLaMa 3 prompt format for ShareGPT. The response indicated there’s no mention of specific “llama 3” model support within the OpenAccess-AI-Collective/axolotl documentation.
  • Fine-Tuning a QLoRA Model: A member shared their success in creating a fine-tuned text completion model with qlora from Mistral-7B. They sought guidance on making the model conversational and were advised they could directly fine-tune using their QLoRA-adapted model on a Q/A dataset.

Modular (Mojo đŸ”„) ▷ #general (2 messages):

  • Modular Commits on the Rise: Since the stdlib was open-sourced, 23% of commits have been made to modularml/mojo. This indicates a surge in activity and contributions to the project.

Modular (Mojo đŸ”„) ▷ #đŸ’Źïž±twitter (4 messages):

  • Modular Tweets Link Sharing: Members in the đŸ’Źïž±twitter channel shared multiple tweets from Modular. Relevant tweets included updates or announcements, linked as follows: Tweet 1, Tweet 2, Tweet 3, and Tweet 4.

Modular (Mojo đŸ”„) ▷ #✍blog (1 messages):

  • Multimodal Search Boosted by MAX Engine: The recent blog post by Modular discusses the advantages of a multimodal search that combines textual and visual data. MAX Engine, which already outperformed PyTorch eager and ONNX runtime in previous benchmarks, is also capable of optimizing inference for multimodal models.

Link mentioned: Modular: Multimodal Search with Snowflake Embedding and MAX Engine: We are building a next-generation AI developer platform for the world. Check out our latest post: Multimodal Search with Snowflake Embedding and MAX Engine


Modular (Mojo đŸ”„) ▷ #ai (2 messages):

  • Troubleshooting Mojo Installation: A user reported an issue with installing Modular (Mojo đŸ”„) on Python 3.12.3. The response suggested using a Conda virtual environment and provided instructional links, Modular manual on Python and Modular blog post, emphasizing that Mojo is a superset of Python and compatible with Python modules.
  • Working on Mac M1: A different member noted that they are running the latest Mojo, including the nightly version, with Python 3.12.3 on a Mac M1 successfully. They recommend using Conda for an easier setup, pointing out that Mojo’s intent is to be compatible with Python code and existing Python packages.

Link mentioned: Python integration | Modular Docs: Using Python and Mojo together.


Modular (Mojo đŸ”„) ▷ #đŸ”„mojo (113 messagesđŸ”„đŸ”„):

  • Switch from Python to Mojo Issue: A user shared Python code and asked for assistance in converting it to Mojo. Another user provided a detailed Mojo conversion with explanations about function declarations and variable types in Mojo.

  • ModularBot Chimes In: ModularBot interjected, celebrating user @110077104611172352 reaching level 5 and user @289473226147495936 reaching level 1. Congrats were later given to @932397073427476521 for reaching level 18, with a playful response from ModularBot about celebrating with a banquet.

  • Matrix Slicing and Memory Ownership: A Mojo user inquired about creating a non-owning view of a list’s subset without extra allocation. It was clarified that for indirect memory access, one should use the Buffer type rather than List, since List owns its data and Buffer is under redesign for lifetime management.

  • Mojo for Intel Mac Inquiry: When questioned about Mojo for Intel Mac, a user responded that there’s hope for support soon, but currently the playground is the only option.

  • Troubleshooting a Matrix Implementation: A user having trouble with matrix division in Mojo due to the lack of an implemented __truediv__ function was advised to review their code and ensure operations were only being performed on non-zero values.

  • Discussion on Mojo’s Integration with Existing Libraries: The goal of Mojo language is discussed, emphasizing that Mojo aims to integrate into the Python ecosystem and utilize existing libraries, rather than replacing them entirely. It’s noted that Mojo’s long-term direction includes seamless use of existing tools like Numpy.

  • Levels and Learning in Discord: Users discuss their progress through levels in the channel; one user advanced to level 18 after a year, while others question the ranking methodology given disparate expertise levels.

Links mentioned:


Modular (Mojo đŸ”„) ▷ #community-projects (1 messages):

uncle_jee: Use Mojo to write a Mojo community https://github.com/shadowqcom/mojo_dev


Modular (Mojo đŸ”„) ▷ #community-blogs-vids (5 messages):

  • Crafting Better Tutorials: rd4com highlighted tips for making tutorials, emphasizing the use of emojis for visual references, simplicity in language, clarity in naming, avoiding information overload, gradually increasing complexity, and iterating for refinement. They also stressed linking to Mojo documentation and logically building upon previous content.

  • DiĂĄtaxis Framework for Documentation: sophiaglencairn shared a link to DiĂĄtaxis, a systematic approach to creating technical documentation, outlining four types of documentation needs: tutorials, how-to guides, technical reference, and explanation. DiĂĄtaxis addresses content, style, and architecture issues in documentation to benefit both users and creators.

Link mentioned: DiĂĄtaxis: no description found


Modular (Mojo đŸ”„) ▷ #performance-and-benchmarks (55 messagesđŸ”„đŸ”„):

  • Exploring __copyinit__ and GitHub Gists: A discussion revolved around __copyinit__ behavior and whether it’s a type author’s responsibility to implement copy-on-write semantics. The conversation pointed to a specific Gist for context.

  • Dictionary Performance Intricacies: Performance concerns regarding dictionaries in Mojo were discussed, citing significant speed differences between Mojo and Python. A member shared their experiences with porting a tokenizer and linked to a relevant discussion and a tokenization library for reference.

  • Compact-dict Library Offers Hope: Amidst conversations about dictionary performance, the Compact-dict library was put forward as a faster alternative to the standard Mojo dictionary, though it doesn’t store keys and might require changes to use cases or additional features in the future.

  • Memory Allocation Queries: Members inquired about the differences in performance and functionality between stack_allocate and heap allocation methods like DTypePointer.alloc/Pointer.alloc. There was an exchange on when to use stack or heap, and insights into their cost differences were shared, emphasizing that typically stack allocation is faster and less complex than heap allocation.

  • Optimizing SIMD Operations for Error Correction Code: In search of achieving better performance for an error correction code library, a member sought advice on optimizing a function using SIMD. The conversation included discussions on function inlining, use of fma, and potential mathematics tricks for improvements. The specific project mentioned was mocodes.

Links mentioned:


Modular (Mojo đŸ”„) ▷ #🏎engine (3 messages):

  • Continuous MAX Optimization: The team is regularly optimizing MAX with each release. Knowing the specific core types and models used by individuals can provide further insights into performance enhancements.
  • Clarifying Speed Improvements: A member pointed out a discrepancy in reported speed improvements between TensorFlow (tf) and PyTorch, suggesting they shouldn’t be the same due to differences in queries per second (QPS).
  • Correct Speedup Printouts Confirmed: Another member confirmed seeing the correct speedup numbers reflecting proportionate QPS improvements after updating the max example repository and clearing the .cache in the performance-showcase directory.

Modular (Mojo đŸ”„) ▷ #nightly (85 messagesđŸ”„đŸ”„):

  • Frequent Updates for Nightly Branch Discussed: Automation challenges are delaying the goal of releasing the nightly branch every weekday, and concerns were raised that the delay between code merges and commits appearing in the branch makes conflicts hard to fix. There’s ongoing discussion to find solutions, ensuring the nightly stdlib can build and run correctly with the released nightly compiler.

  • Nightly Mojo Compiler Release Notification: The announcement of a new nightly Mojo compiler highlights the availability of updates and changes, with a detailed pull request and a changelog available for review.

  • Discussions on Overloads and Traits in Mojo: Debates surfaced regarding the behavioral consistency of overloads and the use of traits, touching on language features like parametric algorithms. The community is thinking through the trade-offs of different methods, like overloading, precedence decorators, and return type variations, while expressing concerns about the potential for confusion and bugs when modifying the behavior of objects via type information.

  • Code Execution Difference Between Stable and Nightly: A user reported an issue where code that works in the stable version of Mojo causes an error with a nightly build, suggesting a possible file handle lifetime management problem in the nightly version. This sparked a conversation leading to the opening of an issue on GitHub.

  • Importing Challenges in Mojo’s Standard Library: A user encountered difficulties importing functions from the math package into the string.mojo and string_literal.mojo files, which was explained as a design decision to avoid circular dependencies between open-source and closed-source parts of the stdlib. The workaround recommended is to re-implement the necessary math functions in the open-source portion of the standard library.

Links mentioned:


LlamaIndex ▷ #blog (6 messages):

  • Workshop Materials for Building LLM Apps: Llama Index announced a workshop with AWS showcasing 3 patterns for LLM app development including using S3 for data ingestion and AWS Bedrock for embeddings.
  • Llama Index on ML Security Podcast: The co-founder of Llama Index discussed LLM-based application futures and data security on the mlsecops podcast, also touching on tools like LlamaParse and LlamaCloud.
  • RAG Tutorial Series for Production: Marco Bertelli launched a 9-part series focused on taking RAG from a prototype to a production environment, outlining necessary architectural components for deployment.
  • Enhancing RAG with Multi-Stage Retrieval: An article by Michael R. from KX Systems suggests a multi-hop retrieval process using Llama Index and Cohere reranking to improve context and reduce hallucinations for LLMs, as detailed in their post.
  • Long-Term Memory for Autonomous Agents: Introducing memary, a reference implementation for long-term memory using knowledge graphs, aimed at enhancing memory functions in autonomous agents using LLMs as explored in this tweet.

LlamaIndex ▷ #general (155 messagesđŸ”„đŸ”„):

  • Trouble with awsbedrock and LlamaIndex: A member encountered an error when trying to use awsbedrock with LlamaIndex, which raised a “NoRegionError” from botocore. Following suggestions to ensure region_name is specified resolved the issue (see the sketch after this list).
  • Using Local LLM with LlamaIndex: Members shared links to LlamaIndex’s documentation and examples for setting up LLMs locally, particularly referencing a “5 lines of code” example using BAAI/bge-small-en-v1.5 and Mistral-7B on LlamaIndex’s documentation.
  • LlamaIndex Import Issues Solved: Several members discussed troubleshooting import errors related to llama-index packages such as llama-index-llms-ollama. Solutions included installing specific packages individually and confirming correct installation steps.
  • Updating Indices and Documents on Vector Stores: Conversations focused on actions such as updating indices on Pinecone using LlamaIndex and adding metadata keys to existing vectors. A member suggested that updating a node with the same ID will overwrite it. However, no direct solution was provided for adding metadata without modifying vectors.
  • Retrieving Documents with LlamaIndex: Members inquired about retrieving multiple documents via query_engine.retrieve() while ensuring diversity among the retrieved documents. Suggestions included adding metadata keys to existing vectors and setting parameters like mmr_diversity_bias when creating the retriever.
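
A minimal sketch of the region fix from the first item, using the llama-index v0.10-era Bedrock integration; the import path, model id, and region_name kwarg are stated as assumptions to verify against the installed version’s docs.

```python
# botocore raises NoRegionError when no AWS region is resolvable,
# so pass region_name explicitly when constructing the LLM.
from llama_index.llms.bedrock import Bedrock

llm = Bedrock(
    model="anthropic.claude-v2",  # illustrative Bedrock model id
    region_name="us-east-1",      # explicit region avoids botocore's NoRegionError
)
print(llm.complete("Say hello.").text)
```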

Links mentioned:


LlamaIndex ▷ #ai-discussion (2 messages):

  • GPT-1: The Unsung Hero: A member revisited the original GPT-1 model, reflecting on its contribution to the evolution of language models, and has written a blog post on the subject. It posits that the model has “stood the test of time quite well over 6 years,” implying that some modern systems like Mistral-7B are vastly scaled up derivatives of GPT-1.

OpenInterpreter ▷ #general (127 messagesđŸ”„đŸ”„):

  • Flask Server Frustration: A member encountered an error when trying to run a local Flask server, revealing a need to set the api_key and several further issues, including namespace conflicts and connection errors. They attempted to use a dummy key (interpreter.llm.api_key = "dummykey") and contemplated editing a pydantic config to overcome a namespace issue.
  • OpenInterpreter 0.2.5 New Release Inquiry: A member asked about the Open Interpreter 0.2.5 New Computer Update, leading to a clarification that it has moved beyond beta.
  • Groq Challenges for OI Integration: Several members discussed difficulties when trying to run Open Interpreter with Groq, ultimately concluding that Groq support isn’t currently integrated into OI. A Github pull request (#1238) for adding Groq support was mentioned, which is pending approval.
  • Hardware Queries for O1 and Global Vision: Members conversed about the Open Interpreter’s remote communications and whether O1 can function with voice instruction in languages other than English. There were also discussions on installing O1 client on other devices, like the Rabbit r1, and leveraging the client’s existing voice support.
  • Collaborations and Contributions Ramp Up: Members shared progress and calls for assistance on various projects intertwined with OpenInterpreter, such as llm-switcher, an open-source AI tools suite including AAA+ and MagicLLight, and potential Groq API implementations. Community code sharing occurred, with ongoing efforts to troubleshoot and improve support for different models and functionalities.

Links mentioned:


OpenInterpreter ▷ #O1 (25 messagesđŸ”„):

  • Custom 3D Project Housed in Mystery: Members are intrigued by a custom 3D printed case for OpenInterpreter’s 01 project, prompting discussions around personal attempts and the fun of tactile keys. One member provided a YouTube video showcasing the project but noted it wasn’t their own work.
  • The Dawn of 01 Heavy: Chat included anticipation of a new device, 01 Heavy; no expected launch date was provided. Comparisons drew links to it potentially powering future robots.
  • Amazon Alternatives Seek Acceptance: Queries rise about using Amazon Echo Smart Speaker Dev Kit as an alternate solution for open project builds, but no confirmation is shared regarding compatibility.
  • Open AI Ethics in Question with Microsoft’s Capabilities: A discussion emerges highlighting Microsoft’s ability to create and modify files, with OpenInterpreter touted as capable of meeting diverse user desires.
  • Update Expectations Set for 01 Light: A member mentions an upcoming discussion this Tuesday to reveal an updated timeline for the 01 Light’s ETA.

Links mentioned:


Latent Space ▷ #ai-general-chat (100 messagesđŸ”„đŸ”„):

  • Berkeley Introduces Tool Calling Leaderboard: The Berkeley Function Calling Leaderboard evaluates LLMs’ ability to call functions, offering a novel and periodically updated real-world benchmarking system.
  • Voice AI On the Rise: ElevenLabs has sparked interest, leading to discussions about other Voice AI startups like Unreal Speech and Hume, a space once occupied by now-defunct Coqui.
  • Exploring the Limitations of LLMs: An article on Strangeloopcanon contemplates the perennially surprising capabilities of LLMs while discussing their current failure modes and the concept of “goal drift” as possible directions for improvement.
  • Potential Acquisition Moves in the AI Sector: Nvidia’s reported acquisitions of Israeli AI companies, Deci AI and Run:ai, indicate a strategic move to enhance efficiency and performance on their GPUs and AI servers.
  • Adventures in Large Context Models: Conversations about practical applications and the future of large context models were spurred by Llama 3’s extension to a 1M token context window.

Links mentioned:


Latent Space ▷ #ai-announcements (1 messages):

swyxio: new pod! https://x.com/swyx/status/1784253651844014237


Latent Space ▷ #llm-paper-club-west (12 messagesđŸ”„):

  • All Systems Go: The chat confirms visibility before starting the presentation on Mixture Of Depths.
  • Mixture Of Depths Explored: This paper introduces a transformer modification that uses expert-choice routing to allocate compute dynamically, aiming for faster training convergence and improvements when processing longer sequences. See the original paper here.
  • Skip the Confusion: Comments indicate that skip connections, also known as residual connections, mentioned in the attention mechanism are integral to the discussed paper’s methodology.
  • Size Matters: A shared abstract suggests larger zero-shot LLMs outperform fine-tuned smaller LLMs in real-world tasks like meeting summarization, despite the computational costs.

Links mentioned:


Latent Space ▷ #ai-in-action-club (35 messagesđŸ”„):

  • Linux Users, Say Hello to Vesktop: Discord video sharing and Linux compatibility issues were addressed with a recommendation to use Vesktop, described as a better-performing custom Discord app that improves Linux support. Those interested can find more info on the Vesktop GitHub repository.

  • Young SQL Module in the Spotlight: A member shared a reference to sqlite-vss, a SQL module for creating virtual tables to store and query vectors, noting it’s still in early development stages and pointing to the API reference documentation.

  • Chatbots for CLI Tools Spark Interest: The idea of creating chat bots for popular command line interface (CLI) tools was suggested, triggering discussions about feasibility and potential ease of creation using slono’s tool, a utility that adds to the portability of Go and SQLite.

  • Resource Sharing for AI Enthusiasts: Two informative links were shared by members; the first, a Google Doc containing AI-related topics, dates, facilitators, and a wealth of resources such as articles and conference talks. The second, a Berkeley Gorilla Blog post discussing the challenges and potential strategies for real-world execution of actions by Large Language Models.

  • Hunt for AI Hackathon Sign-Up Details: Engagement was expressed regarding sign-up for a hackathon, with one member highlighting the X-ware Arena link amidst the conversation.

Links mentioned:


LAION ▷ #general (95 messagesđŸ”„đŸ”„):

  • LAION in Limbo: A member highlighted that EU laws appear to be restricting LAION’s access to public clusters for compute time, causing a decline in activity. Researchers are gravitating towards more active groups that are continually running experiments.

  • Terminus Research Group Attracts Talent: A chat participant introduced their own group, the Terminus Research Group, which is an informal collective now including the “pixart guy,” suggesting a growing diverse expertise.

  • LAION-Aesthetics Seeks to Score Visual Beauty: A blog post was mentioned detailing LAION-Aesthetics, which is designed to rate image aesthetics using machine learning. The model and related code are available publicly on GitHub.

  • Unusual Benchmark Results Spark Discussion: Members discussed a Reddit benchmark test denoting contradictory performance outcomes for different quantizations in language models, raising questions about testing methodologies and the non-deterministic nature of LLMs.

  • Comparing LLM Token Generation Rates: Users discussed token generation rates on high-performance GPUs, noting significant differences across models and setups. Some tools and configurations, such as exllama and TabbyAPI, were recommended for better performance.

Links mentioned:


LAION ▷ #research (9 messagesđŸ”„):

  • Exploring VAST: The Omni-Modality Foundation Model: Interest is shown in finetuning VAST, a vision-audio-subtitle-text omni-modality foundation model and dataset, prompting members to share their experiences and seek advice.
  • Hot off the Press: New Research Publication: A new paper on AI research authored by a team including Mostafa Elhoushi, Akshat Shrivastava, and others has caught the attention of members, speculating it builds upon previous work and highlighting its implications for faster inference and layer utilization.
  • Combining Graphs with Language Models: Queries about combining graphs with large language models (LLMs) have been raised, seeking recommendations on relevant papers to read and strategies for conditioning LLMs with graphs.
  • Mistral Model Fine-Tuning Challenges: A member is fine-tuning Mistral models for medical information extraction but encounters issues with the model over-generating sequences. The discussion touched on padding strategies and the appropriateness of the Eleuther server for seeking expertise in this area.
  • Seeking the Eleuther Server Link: Upon facing a challenge with model fine-tuning, a member was advised to consult the Eleuther server for expert help in LLMs, leading to a request for the server’s Discord link.

Links mentioned:


Cohere ▷ #general (96 messagesđŸ”„đŸ”„):

  • Search Engine Query Capabilities Discussed: Members discussed the best practices for using web search tools with AI, mentioning various options such as Tavily and Brave Search API. Some highlighted the cost-effectiveness of these tools Tavily API Information and Brave Search API, while others shared specific configurations and technical details regarding usage limitations and potential workarounds for rate limits.

  • Technical Issues and Deployment Queries: Various technical issues were addressed, including errors when running the cohere-toolkit locally due to sqlite3 version issues and difficulty understanding how to interact with different components after deployment on Azure; GitHub resources were shared for troubleshooting and adding custom tools GitHub - cohere-ai/cohere-toolkit.

  • Cohere Toolkit Enthusiastically Received: A user expressed great appreciation for Cohere making their toolkit open source, highlighting its immense help to developers GitHub - cohere-ai/cohere-toolkit.

  • Clarifications Sought on Fine-Tuning and Use Cases: Queries were raised about the specific models used when fine-tuning, the limits and terms of the free trial API key, and whether models like ‘Generate’ would remain available.

  • Using AI for Non-English Languages and Commercial Use: One member praised Command-r for its performance with non-English languages and sought clarification on deploying command-r APIs for commercial use; responses suggested contacting Cohere’s sales team or using AWS Sagemaker for deployment.

Links mentioned:


Cohere ▷ #collab-opps (1 messages):

westn89: We’re a Swedish company that are partially using cohere


tinygrad (George Hotz) ▷ #general (35 messagesđŸ”„):

  • Exploring Mathematical Formula Construction: A member discussed constructing any mathematical formula from basic primitive ops and applying differentiation for gradient/backward passes, forming a dependency graph. This method optimizes hardware utilization and enables just-in-time scheduling for streaming, quick computations (a toy example follows this list).

  • OpenELM Inquiry, a brief mention: One member inquired about the experience with OpenELM, but no follow-up discussion ensued.

  • Cross-Compatibility Between Frameworks: A user shared their use-case for nn.module, explaining it was useful for a hybrid model containing both tinygrad and PyTorch components. The module can automatically collect parameters from itself and child objects for training.

  • Clarifying Speech-To-Text/Text-To-Speech Inquiry: A user asked about the speech-to-text and text-to-speech engines showcased by George Hotz, likely found in the tinygrad examples, though the specific demonstration was not identified.

  • Discussion About tinygrad Optimizations: Users engaged in a debate over the optimization capabilities of tinygrad, where one member questioned whether it could generate a fast matrix multiplication (matmul) kernel, while another pointed out the use of computational reduction algorithms for convolutions. George Hotz clarified their aspirations for tinygrad, focusing on overall model training speed rather than single-operation optimization like matmul.
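
The primitive-ops-plus-autodiff idea from the first item, in miniature, using tinygrad’s Tensor API as of the 0.8-era releases (verify the import path against your installed version):

```python
from tinygrad.tensor import Tensor

x = Tensor([2.0], requires_grad=True)
y = (x * x * x + 2 * x).sum()  # f(x) = x^3 + 2x, composed from mul/add primitives
y.backward()                   # walks the recorded dependency graph of primitive ops

print(y.numpy())       # f(2)  = 12
print(x.grad.numpy())  # f'(2) = 3*2^2 + 2 = 14
```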

Link mentioned: GitHub - tinygrad/tinygrad: You like pytorch? You like micrograd? You love tinygrad! ❀: You like pytorch? You like micrograd? You love tinygrad! ❀ - GitHub - tinygrad/tinygrad: You like pytorch? You like micrograd? You love tinygrad! ❀


tinygrad (George Hotz) ▷ #learn-tinygrad (55 messagesđŸ”„đŸ”„):

  • Exploring the Optimization Frontier: A member shared a comprehensive writeup on loop unrolling within the context of tinygrad’s optimizer. The article details the transformation of simple loops into optimized operations, providing insights into the Uops IR.

  • Tinygrad 0.9 Launch Teased: George Hotz briefly mentioned that new updates will come with the release of tinygrad version 0.9, causing anticipation about potential new features or improvements in the library.

  • Kernel Optimization Dissected: Another detailed writeup was shared elaborating on how the shapetracker and symbolic library work with loop unrolling/upcasting, along with a guide to interpreting kernel output colors in tinygrad.

  • Tinygrad Learner’s Guide: Several members proposed starting points and suggested reading material for understanding and contributing to tinygrad; resources mentioned include MicroGrad and MiniTorch for foundational concepts, and also outlined an optimal path for reading through the tinygrad codebase.

  • Dynamic Testing and Symbolic Shapes: Discussion highlighted the ongoing development efforts toward dynamic testing and implementing kernels that can handle variable shapes without recompilation, focusing on the usage of symbolic shapes in operations like mean and sum.

Links mentioned:


Interconnects (Nathan Lambert) ▷ #ideas-and-feedback (10 messagesđŸ”„):

  • Brand Impact of Newsletter Cross-Promotion Considered: A member pondered the potential brand tarnishing of engaging in an unpaid promotion exchange with Semafor. This was seen as a growth opportunity, despite concerns that readers might find plugs annoying.

  • Bigger Audience, Bigger Growth?: The same member noted that Semafor’s tech newsletter audience is significantly larger, hinting at a substantial growth opportunity.

  • Comparing Content to Recognized Examples: To illustrate the type of content involved, an example of a Semafor newsletter was shared, discussing the divisive topic of synthetic data in AI.

  • Newsletter Exchanges – A One-Way Street?: Another member chimed in, questioning the importance of cross-promotion in newsletters given their nature as a “one-way medium” sent “into the void.”

  • Balancing Promotion with Reader Preferences: It was highlighted that there’s a risk of alienating readers who prefer pure content without promotions, suggesting that the success of such a strategy depends on execution and frequency. Another member weighed in, saying that even a small uptake from the promotion could be beneficial and lead to further growth.

Link mentioned: Semafor Tech: New synthetic data techniques shake up AI models | Semafor | Semafor: In today’s edition, we look at how machine-learning generated data can help make smaller AI models nearly as capable as larger ones.


Interconnects (Nathan Lambert) ▷ #news (10 messagesđŸ”„):

  • Microsoft Unleashes Phi-3: Phi-3, the next generation model from Microsoft, has been publicly released, amassing over 6,000 votes and featuring promising capabilities. In related news, Arena hits 800K votes, and Snowflake Arctic Instruct has entered the fray.

  • A Gloomy Outlook for Dylan: A brief remark hints at unfortunate prospects for an individual named Dylan, with the context or cause left unstated.

  • Llama’s Fine Tuning Applauded: The fine tuning process for “llamas” received a positive shout-out, indicating noteworthy results or improvements.

  • Anticipation for GPT-4: A message hints at the possibility of GPT-4’s emergence, backed by a sense of confidence from the mentioned user.

  • Insights on Training an Open LM: A YouTube seminar led by Hanna Hajishirzi from AI2, discussing the training of an Open Language Model (OLMo), left at least one member wishing for a deeper understanding, while acknowledging the value of such shared resources. Hanna’s brisk presentation pace was noted, bolstering her repute for efficiency.

Links mentioned:


Interconnects (Nathan Lambert) ▷ #ml-questions (13 messagesđŸ”„):

  • Misconceptions Cleared About RLHF: RLHF’s stability and usefulness depends on the application; methods like KTO may be better suited for various tasks. “[RLHF] Depends on the application. KTO is probably the most well suited to many applied tasks”, the sentiment reflected that “[It’s] pretty nuanced yeah”.
  • DPO and KTO Show Promise in Fine-Tuning: A transition from SFT -> DPO -> KTO showed better user feedback in fine-tuning applications, with online iterations of DPO and KTO ‘coming’.
  • LLaMA 2 Follow-Up Creates Buzz: With a plethora of information available post-LLaMA 2 release, a blog post provides corrections and continued analysis, talking about controversial aspects and introducing technical notes like Ghost Attention.
  • Ghost Attention - Useful but Not Critical: Ghost Attention seems to have been initially promising for maintaining consistency in long conversations for LLaMA 2, but later comments suggest it may no longer be as important, possibly due to improvements in data and long context handling. “[GAtt] is not an important thing to implement. It’s a great exercise for learning new topics in the space.”

Link mentioned: Llama 2 follow-up: too much RLHF, GPU sizing, technical details: The community reaction to Llama 2 and all of the things that I didn’t get to in the first issue.


Interconnects (Nathan Lambert) ▷ #random (48 messagesđŸ”„):

  • OpenELM Surpasses OLMo: Discussion highlighted that OpenELM has outperformed OLMo, with comments acknowledging that OLMo 1b had limited success and is no longer a particularly strong model, and that there is now better public data available for training than what was used for OLMo.
  • Continuous Improvement Motivates AI Development: Members of the chat acknowledged that while their models have not been top-tier, it serves as motivation to improve. There’s consensus that better models are being trained, using the shortfall as an educational tool for safety and policy.
  • The Educational Role of Open Models: Participants pointed out the importance of open models in facilitating informed decision-making, with a consensus that while their models might not be the best, they are crucial for education and transparency in the AI community.
  • AI2’s Role in AI Advancements Recognized: The efforts of AI2 were acknowledged, especially in terms of education, and there was an expression of enthusiasm for the upcoming paper and developments, as well as a discussion on the financial aspects of AI research.
  • Intrigue in the Scaling & Function of Alternative Models: Conversation turned to various topics, including Snowflake, a new enterprise-focused model with high VRAM useful for inference, and the concept of active parameters as a proxy for model capability, indicating the interest in exploring alternative architectures beyond just size and benchmarks.

Link mentioned: Tweet from Itamar Golan đŸ€“ (@ItakGol): Visual Prompt Injection 💉🛑 IRL


Interconnects (Nathan Lambert) ▷ #memes (7 messages):

  • Quick Laugh, Light Content: One member posted a simple “lmao”, indicating amusement or laughter regarding the channel’s conversation or content.
  • Personal Reflection on Posting: The same individual later suggested the need for an editor, hinting at self-reflection on their message quality or content.
  • Jungle Adventures Shared: They shared a YouTube video titled “I’m leaving to the Amazon jungle
”, which details an excursion into rarely explored areas of the rainforest.
  • Contrasting Views of the Jungle: Another member responded with a video link showcasing a differing view on the nature of the jungle, quoting Werner Herzog’s perspective from the documentary Burden of Dreams: “Nature here is vile and base
 There is no harmony in the universe”.
  • Twitter Meme on LLM Quirks: The channel featured a tweet from Marques Brownlee, highlighting the humorous aspects of large language models (LLM) in a post deemed “the most meme llm shit ever”.

Links mentioned:


Interconnects (Nathan Lambert) ▷ #posts (1 messages):

  • Conversations on AGI’s Nature: A member complimented another on a thoughtful post about AGI, agreeing with the idea that AGI’s definition is subjective. The conversation suggests that the debate around AGI’s nature is an ongoing one.

LangChain AI ▷ #general (51 messagesđŸ”„):

  • Inquiry on Prompt Integration into Code: A member sought assistance with integrating a prompt into their existing code for a chat model. Another community member provided a detailed guide on incorporating ChatPromptTemplate and the pipe method for chaining prompts and models in JavaScript (the equivalent Python pattern is sketched after this list).
  • Navigating OllamaFunctions Difficulties: There was a discussion around an issue with OllamaFunctions not working properly, linked to GitHub issue #20924. Subsequently, a member clarified the confusion between Gemini and VertexAI models, informing that Gemini 1.5 Pro works only with VertexAI, evidenced by successful implementation using ChatVertexAI(model="gemini-1.5-pro-preview-0409").
  • Building a Retrieval-Augmented Generation (RAG) System: A member requested recommendations for open-source models, embedding techniques, and vector storage solutions to develop an advanced RAG system, though no direct responses to this specific inquiry were provided in the message history.
  • Concerns Over Observability Tools for LLMs: A discussion on LLM observability tools questioned the choice between Arize Phoenix and Langfuse, specifically for those primarily using LlamaIndex. A preference was indicated for a self-hosted open-source solution, but no direct recommendations were provided.
  • Integration and Deployment Queries around LLMs: Various inquiries surfaced regarding deployment methods, such as using Hugging Face versus OpenAI API, and connecting OpenAI with SQL Server without the intermediary of LangChain for security concerns. There was also a direct request for advice on building AI clones of influencers on a new platform and an invitation to DM for potential partnership.
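
The chaining pattern from the first item, shown in LangChain’s Python LCEL form since the original guide used the JavaScript pipe() equivalent; the model choice is illustrative.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise technical assistant."),
    ("human", "{question}"),
])

# The | operator chains prompt -> model -> parser, mirroring JS .pipe().
chain = prompt | ChatOpenAI(model="gpt-3.5-turbo") | StrOutputParser()
print(chain.invoke({"question": "What is retrieval-augmented generation?"}))
```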

Links mentioned:


LangChain AI ▷ #langserve (1 messages):

  • AzureSearchVectorStoreRetriever Async Issue: A member reported an error about AzureSearchVectorStoreRetriever not supporting async operations. They inquired if it’s possible to either adjust lang-serve to handle sync operations or if writing an async wrapper around the sync function in the retriever would be a viable solution.

LangChain AI ▷ #share-your-work (11 messagesđŸ”„):

  • Galaxy AI Enters the Arena: GalaxyAI is offering free API access to premium AI models such as GPT-4, GPT-3.5-turbo, and more, with OpenAI format compatibility for easy integration into projects. Discover more on their website galaxyapi.onrender.com.

  • Launching Genai-Job-Agents: A GitHub repository for a Langchain/Langgraph-based agent that assists with job searching and CV building has been shared. For details, check out the repository at genai-job-agents.

  • Discover the Sparks of GPT-1: A new blog post delves into the original GPT-1 model, discussing its relevance and the technical evolution to current models. Read the insights here.

  • Implementing LangChain with Live Avatars: A YouTube demo showcases LangChain’s application in an Airbnb use case with 150 QA pairs and a live avatar Q&A session. View the demo at D-ID Airbnb.

  • Automating Code Improvements Via No-Code Platform: Autonoma is providing a no-code solution for automating code improvement tasks like input validation and error handling, complete with a free playground for testing and ALPHA GitHub integration. Experience the platform at Autonoma Free Demo.

Links mentioned:


LangChain AI ▷ #tutorials (4 messages):

  • Explore Local RAG with LLaMA3: A YouTube tutorial titled “Local RAG agent with LLaMA3 and Langchain” demonstrates how to use Retrieval-Augmented Generation (RAG) with LLaMA3, using the Langchain framework.

  • Llama 3 Empowers Web Browsing: Another YouTube guide titled “Llama 3 Web Browsing Agent with Langchain and Groq” showcases the implementation of web browsing capabilities through Llama 3, in combination with Langchain and Groq technologies.

  • Interactive Agents UI Building Tutorial: Marc Skov Madsen provides a video on creating an interactive web UI for CrewAI applications using the Panel framework, demonstrating the process of building a visual user interface for AI agents.

  • Captcha Blockade on Amazon Book Link: A member posted an Amazon link to a book titled “Mastering NLP: From Foundations to LLMs” but was met with a captcha challenge, preventing direct access to the page content.

Links mentioned:


Mozilla AI ▷ #llamafile (54 messagesđŸ”„):

  • Segmentation Fault When Running Llamafile: Users reported experiencing a segmentation fault when attempting to run llamafile on various platforms, such as Modal Labs. There were mentions of specific files generating errors or not being found, including Phi-3-mini-128k-instruct.F16.llamafile.

  • htop Bug Misrepresents Memory Usage: A member provided information about a bug in htop, which does not report shared memory usage correctly on Linux, likely influencing how memory usage is perceived by users during model operations.

  • Release of Llamafile v0.8.1: Announcement that the release of llamafile v0.8.1 now includes support for Phi-3 Mini 4k, addresses previous GPU module crashes, and adds bundled NVIDIA + AMD shared objects for Ubuntu users. Users are encouraged to report if the changes work or if issues persist.

  • LLM Behavior and Output Oddities Discussed: Members discussed unexpected behavior with LLMs, including changes in output consistency and unusual responses featuring parentheses and linebreaks. These issues appeared across different iterations of models like Llama3 70B and Mistral when running via llamafile.

  • Llamafile Tips and GPU Usage Questions: Users shared tips for ensuring llamafile can take full advantage of system RAM and queried about supported GPUs for running llamafiles. There were also questions related to determining whether a model is running on GPU or CPU and clarifications sought for handling endless output from llamafile.

Links mentioned:


AI Stack Devs (Yoko Li) ▷ #ai-companion (11 messagesđŸ”„):

  • Farewell to Tolerance for Collapse: A channel member expressed a dismissive sentiment about welcoming an impending collapse, hinting at a sense of disenchantment.

  • Spotlight on AI Companion Apps: A channel member highlighted two AI companion apps, Faraday and Amica, as noteworthy tools for those interested in AI companionship.

  • Faraday, a Personal Recommendation: The app Faraday earned a personal endorsement from a member after a month’s usage, distinguishing itself with an ability to run locally on a PC thanks to llama.cpp.

  • Amica, an Up-and-Comer with Privacy: The recently discovered app Amica is promised to operate similarly to Faraday with enhanced features and a strong emphasis on data privacy, available for both self-hosting and cloud services.

  • Privacy-Conscious AI Relationships Encouraged: Members were encouraged to explore Faraday and Amica if they value total data privacy in their interactions with AI.

Links mentioned:

  • Faraday.dev: Chat with AI Characters. Works offline. Zero configuration.
  • Amica - Your friend.: Amica is an open source interface for interactive communication with 3D characters with voice synthesis and speech recognition.

AI Stack Devs (Yoko Li) ▷ #events (2 messages):

  • Rosebud AI Game Jam Winners Announced: Rosebud beta testers teamed up with Rosie, the AI assistant, and showcased their creativity in game design during the Rosebud AI Sleep Game Jam. A game that stood out, Bedtime Negotiation, features an AI NPC character and Twitch co-founder Kevin Lin joined as a guest judge. Winners have been announced on Twitter.

  • New Game Jam: Education & AI: Rosebud AI invites the community to participate in a new Game Jam, in partnership with Week of AI, focusing on the theme of Education and AI. Participants are to create a 2D browser-based game utilizing Phaser JS on Rosebud’s AI platform, with a prize pool of $500, and they can learn more about the event on Twitter.


AI Stack Devs (Yoko Li) ▷ #ai-town-discuss (9 messagesđŸ”„):

  • AI Town’s Addictive Quality Acknowledged: A user linked to a Twitter post praising AI Town for its addictive nature, inspiring the idea of creating a simulation with developers, devops, dba, infra, and product managers.
  • Launch of LLM-Powered NPCs: A user has made their LLM-powered NPC models and inference stack available to address common NPC limitations, with the repository and models hosted on GitHub and Huggingface’s Hub, although the linked API access page was not found.
  • Call for Feedback on NPCs: This user highlights their NPC models’ low-latency innovation for smaller GPUs/CPUs and plans to introduce a quest-generation model, inviting members to provide feedback on the recent release.
  • Deep Dive into NPC Implementation Challenges: The user unravelled some key NPC development challenges, including the importance of compressing model output, minimizing calls to models, and tackling issues with generalist instruct-models like GPT-3.5 or Mistral.
  • Community Engages on NPC Fine-Tuning: A conversation about NPC character development ensued, with a promise of an upcoming blog post for a deeper exploration of the challenges and strategies encountered during the project.

Links mentioned:


AI Stack Devs (Yoko Li) ▷ #ai-town-dev (11 messagesđŸ”„):

  • Map Rendering Optimizations in AI Town Discussed: [edgarhnd] asserts that for larger maps, storing the map as an array can be problematic, and suggests having the map rendering static and storing essential data for the engine in an array could be a practical solution.
  • Opinion on Map Handling Methods: [ianmacartney] advocates for the map to be a static asset rather than a parameter passed around, to reduce bandwidth usage during reads, while acknowledging the server side still needs the array for collision detection.
  • Returning to Original File Read Method for Maps: Both [edgarhnd] and [.casado] seem to agree that reading the map as a file, the original method, is much simpler and more efficient.
  • AI Town Installation Tutorial Promoted: [.casado] shares a link to a YouTube tutorial for local AI Town installation titled “100% Local “AI Town” with Llama 3 AGENTS!!!”, providing a resource for those interested in setting up the environment. The video is available at 100% Local “AI Town” with Llama 3 AGENTS!!!.

Link mentioned: 100% Local “AI Town” with Llama 3 AGENTS!!!: 🔗 Links 🔗Download Pinokio here - https://pinokio.computer/The OG AI Town - https://github.com/a16z-infra/ai-townThe forked AI town - https://github.com/pea



DiscoResearch ▷ #mixtral_implementation (1 messages):

  • Mysteries of Mixtral’s Router Coefficients: A comparison between Mixtral-8x7B-Instruct-v0.1 and Mixtral-8x22B-Instruct-v0.1 revealed different router_aux_loss_coef values, 0.02 and 0.001 respectively. It sparked curiosity about whether these reflect actual training values or are “fantasy values,” with the possibility that smaller experts might require a higher loss_coef.

DiscoResearch ▷ #general (6 messages):

  • Long Initialization Times on HPC: A member reported slow initialization times (2mins:20secs) for DiscoLM_German_7b_v1 on HPC when collecting shards, and long inference times (over 12 mins) for 4K token inputs on GPUs, despite brief initialization (3 secs) and fast inference (1.6 mins) on a local machine without GPUs.

  • GPU Utilization Improves Inference: Upon realizing they had not loaded the model onto GPUs, a member corrected the issue which reduced inference time to approximately 10 seconds on a two Tesla V100 setup, but shard loading times remained unchanged at 2mins:20secs.

  • Load Time Troubleshooting Ineffective: The suggested low_cpu_mem_usage=True argument did not improve model load times, suggesting the bottleneck lies elsewhere.

  • Slow Storage Drive Could Be a Bottleneck: Another participant suggested the long load times may stem from the model sitting on a slow storage drive and recommended verifying that the HF cache directory points to a fast data partition (see the sketch below).
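
A minimal loading sketch reflecting both fixes; the cache path is a placeholder, and device_map="auto" requires the accelerate package:

```python
import os
# Point the HF cache at a fast partition *before* importing transformers;
# "/fast/scratch/hf-cache" is a placeholder path.
os.environ["HF_HOME"] = "/fast/scratch/hf-cache"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DiscoResearch/DiscoLM_German_7b_v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # V100s have no bfloat16 support
    device_map="auto",          # shard across available GPUs at load time
    low_cpu_mem_usage=True,     # didn't help the reporter, but harmless
)
```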


DiscoResearch ▷ #discolm_german (8 messages🔥):

  • Discussing Practical Applications: The user hoped to see more anecdotal observations of LMs and expressed interest in testing models on platforms like the lmsys arena, acknowledging that even narrowly specialized tasks might still be highly beneficial. A related tweet discussing potential uses was shared: Observation Discussion.
  • German Model’s GGUF Downloads Spike: The gguf model saw impressive uptake, with 1,500 downloads in just two days, signaling strong community interest and engagement.
  • Skepticism Over New Model Performance: A user voiced doubt about a newly released model, citing community feedback that it performs poorly; another user pushed back, noting that the Phi-3 model did not overfit on the German RAG Eval dataset.
  • Querying Changes in Llamafied Phi-3 Model Tokenizer: PhilipMay inquired about the rationale for altering the tokenizer in a Llamafied Phi-3 model, specifically the change to the end-of-sentence token. Discussion with the model’s owner revealed the alteration was made for better performance with chat applications using trtllm (see Tokenizer Change Discussion 7 and Tokenizer Change Discussion 6).
  • Phi-3 MoE Model Created for Experiments: A new Phi-3 MoE model has been built from the Llamafied version using mergekit and a randomly initialized router; it is available for experimentation but requires training before use: Phi-3 MoE Model on Hugging Face.
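
For reference, a rough sketch of how such a merge might be configured with mergekit’s MoE tooling; the model id, expert count, and exact config schema are assumptions that may differ from what was actually done and across mergekit versions:

```python
# Hypothetical mergekit-moe setup: clone one Llamafied Phi-3 checkpoint into
# several experts and leave the router randomly initialized ("random" gate).
import subprocess
import yaml

base = "vonjack/Phi-3-mini-4k-instruct-LLaMAfied"  # placeholder model id
config = {
    "base_model": base,
    "gate_mode": "random",   # untrained router, as described above
    "dtype": "bfloat16",
    "experts": [{"source_model": base} for _ in range(4)],
}

with open("phi3-moe.yaml", "w") as f:
    yaml.safe_dump(config, f)

# mergekit installs this CLI entry point; the output directory is arbitrary.
subprocess.run(["mergekit-moe", "phi3-moe.yaml", "./phi3-moe"], check=True)
```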

Skunkworks AI ▷ #general (7 messages):

  • Cutting-Edge Research on Efficient Language Models: A new article titled “Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation” discusses CPU-compatible language models that generate Python code. The research introduces a dataset of 60 programming problems and employs a Chain-of-Thought prompt for improved model performance.

  • HaystackDB Enquires on Embeddings: A member asked whether the HaystackDB repository uses 2-bit embeddings, and further inquired about the term “binary quantized” in the context of the repository.

  • Efficiency via Binary Quantization: Clarifying binary quantized embeddings, another member explained that Binary Quantization (BQ) produces a much smaller index for similarity search, improving the efficiency of the database.
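
A toy numpy illustration of the idea; the sizes and data are arbitrary:

```python
# Binary quantization (BQ) in miniature: keep one sign bit per embedding
# dimension, shrinking a float32 index 32x, then search by Hamming distance.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 1024)).astype(np.float32)  # fake embeddings
query = rng.normal(size=1024).astype(np.float32)

docs_bq = np.packbits(docs > 0, axis=1)   # (10000, 128) uint8, 1 bit per dim
query_bq = np.packbits(query > 0)         # (128,) uint8

# Hamming distance = popcount of XOR; the smallest distance is the best match.
dists = np.unpackbits(docs_bq ^ query_bq, axis=1).sum(axis=1)
print("nearest doc by Hamming distance:", int(dists.argmin()))
```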

  • Llama-3 Fine-tuning Troubles: A member asked whether anyone has had success fine-tuning Llama-3, noting issues with their models failing to generate the end-of-sequence (EOS) token.
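
Two frequent causes, offered as assumptions rather than a diagnosis of this member’s setup, are training samples that never contain EOS and a pad token aliased to EOS that collators then mask out of the loss; a minimal sketch of the usual remedy:

```python
# Sketch of a common fix for fine-tunes that never emit EOS. The pad token
# "<|pad|>" is hypothetical; meta-llama/Meta-Llama-3-8B is a gated repo.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Give padding its own token so loss masking of pad positions can't also
# hide EOS. If you add a token, remember to resize the model's embeddings.
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "<|pad|>"})

def format_example(text: str) -> str:
    # End every training sample with EOS so the model learns to produce it.
    return text + tokenizer.eos_token
```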


Skunkworks AI ▷ #off-topic (3 messages):

  • Introducing Snowflake Arctic for Enterprise AI: A YouTube video was shared, introducing Snowflake Arctic, an enterprise-focused large language model (LLM) that aims to push the boundaries of cost-effectiveness in enterprise AI.

  • Exploring RAG with LLaMA 3 via Langchain: A tutorial video was linked, demonstrating how to run a local Retrieval-Augmented Generation (RAG) agent with LLaMA 3 and Langchain.

  • Web Browsing with LLaMA3 Using Langchain and Groq: The discussion included a video on implementing a web browsing agent with LLaMA 3 using the Langchain library and Groq hardware, focusing on the integration of AI and web browsing capabilities.


LLM Perf Enthusiasts AI ▷ #jobs (1 message):

  • Join Gamma's AI Revolution: Gamma, recognized by a16z as a top consumer AI app, is hiring an AI engineer to work on large-scale text and image models. The role involves prompt engineering, evaluations, fine-tuning, and feature development with advanced AI models.
  • Pushing Boundaries in Content Creation: Gamma leverages generative AI to simplify the creation of presentations and websites, serving over 10 million users who enjoy an effortless content creation experience.
  • Profitable Innovation Powered by Community: With more than $10M in funding from Accel and profitability already achieved, Gamma maintains a lean team of 16 and continues to grow organically through word of mouth.
  • Be Part Of A Tight-Knit Squad: This San Francisco-based company is looking to expand its small but mighty team with someone passionate about pushing LLMs to their limits, offering in-person collaboration approximately 3 days a week.
  • Interested in Engineering the Future of AI?: Candidates eager to explore this opportunity can learn more and apply at the following link: https://careers.gamma.app/ai-engineer.

Link mentioned: AI Engineer: AI Engineer, San Francisco. Click here to apply.


LLM Perf Enthusiasts AI ▷ #openai (3 messages):

  • Leaked Version Speculation: A member shared a tweet from @phill__1 commenting that gpt2-chatbot feels like gpt4.5 due to its extensive domain knowledge. This led to discussions suggesting it could be a leaked version of GPT-4.5.
  • Community Approval: A member offered a simple expression of approval of gpt2-chatbot’s quality: “It’s good.”

Link mentioned: Tweet from Phil (@phill__1): Whatever gpt2-chatbot might be, it definitely feels like gpt4.5. It has insane domain knowledge I have never seen before


Datasette - LLM (@SimonW) ▷ #llm (1 message):

  • Quest for Custom Grammar in Code-Generation: A member inquired about the possibility of passing a custom grammar, potentially as a model-specific option, to improve code generation by preventing syntax errors outright so attention can go to semantic issues.
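
For local models, llama.cpp’s GBNF grammars already provide this kind of constraint and hint at what such an option could wrap; a small sketch using llama-cpp-python, where the model path and toy grammar are illustrative:

```python
# Grammar-constrained generation with llama-cpp-python: the GBNF grammar
# below forces output to be a bare Python function signature. The model
# path is a placeholder.
from llama_cpp import Llama, LlamaGrammar

GBNF = r'''
root       ::= "def " identifier "(" identifier? "):"
identifier ::= [a-zA-Z_] [a-zA-Z0-9_]*
'''

llm = Llama(model_path="./models/llama-3-8b.Q4_K_M.gguf")
grammar = LlamaGrammar.from_string(GBNF)
out = llm("Write a function stub that reverses a string:",
          grammar=grammar, max_tokens=32)
print(out["choices"][0]["text"])
```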