**Transparency and patience is all we need.**

AI News for 4/4/2025-4/7/2025. We checked 7 subreddits, 433 Twitters and 30 Discords (229 channels, and 18760 messages) for you. Estimated reading time saved (at 200wpm): 1662 minutes. You can now tag @smol_ai for AINews discussions!

The headlines of Llama 4 are glowing: 2 new medium-size MoE open models that score well, and a third 2-trillion-parameter “behemoth” promised that should be the largest open model ever released, restoring Meta’s place at the top of the charts:

image.png

SOTA training updates are always welcome: we note the adoption of Chameleon-like early fusion with MetaCLIP, interleaved chunked attention without RoPE (commented on by many), native FP8 training, and pretraining on up to 40T tokens.

While the closed model labs tend to set the frontier, Llama usually sets the bar for what open models should be. Llama 3 was released almost a year ago, and subsequent updates like Llama 3.2 were just as well received.

Usual license handwringing aside, the tone of Llama 4’s reception has been remarkably different.

  1. Llama 4 was released on a Saturday, seemingly earlier than even Meta expected; the release date was moved up from Monday at the last minute. Zuck’s official line is simply that it was “ready”.
  2. Meta published just the blog post, nowhere near the level of the Llama 3 paper in transparency.
  3. The smallest “Scout” model is 109B params, which cannot be run on consumer-grade GPUs.
  4. The claimed 10M token context is almost certainly far above the “real” context, given training used only 256k-token contexts (still impressive! but not 10M!)
  5. There was a special “experimental” version used for LMArena, which produced the good score; that is not the version that was released. The discrepancy forced LMArena to respond by releasing the full eval dataset.
  6. It does very poorly on independent benchmarks like Aider
  7. Unsubstantiated posts on Chinese social media claim company leadership pushed for training on test sets to meet Zuck’s goals.

The last point has been categorically denied by Meta leadership: image.png

but the whiff that something is wrong with the release has undoubtedly tarnished what would otherwise be a happy day in Open AI land.


{% if medium == 'web' %}

Table of Contents

[TOC]

{% else %}

The Table of Contents and Channel Summaries have been moved to the web version of this email: [{{ email.subject }}]({{ email_url }})!

{% endif %}


AI Twitter Recap

Large Language Models (LLMs) and Model Releases

  • Llama 4 and Implementation Issues: @Ahmad_Al_Dahle stated that Meta is aware of mixed-quality reports across different services using Llama 4, expects implementations to stabilize in a few days, and denied claims of training on test sets. @ylecun noted that some clarifications about Llama-4 were needed, and @reach_vb thanked @Ahmad_Al_Dahle for clarifications and commitment to open science and weights.
  • Llama 4 Performance and Benchmarks: Concerns about the quality of Llama 4’s output have surfaced, with @Yuchenj_UW reporting it generates slop, but others claim it’s good. @Yuchenj_UW highlighted a reddit thread and said that if Meta actually trained to maximize benchmark scores, “it’s fucked.” @terryyuezhuo compared Llama-4 Maverick on BigCodeBench-Full to GPT-4o-2024-05-13 & DeepSeek V3, reporting that Llama-4 Maverick performs similarly to Gemini-2.0-Flash-Thinking & GPT-4o-2024-05-13 on BigCodeBench-Hard but is ranked 41st/192. @terryyuezhuo also noted that Llama-4-Scout ranked 97th/192. @rasbt said Meta released the Llama 4 suite, MoE models with 16 & 128 experts, which are optimized for production.
  • DeepSeek-R1: @scaling01 simply stated that DeepSeek-R1 is underrated, and @LangChainAI shared a guide to build RAG applications with DeepSeek-R1.
  • Gemini Performance: @scaling01 analyzed Gemini 2.5 Pro and Llama-4 results on Tic-Tac-Toe-Bench, noting Gemini 2.5 Pro is surprisingly worse than other frontier thinking models when playing as ‘O’, and ranks as the 5th most consistent model overall. @jack_w_rae mentioned chatting with @labenz on Cognitive Revolution about scaling Thinking in Gemini and 2.5 Pro.
  • Mistral Models: @sophiamyang announced that Ollama now supports Mistral Small 3.1.
  • Model Training and Data: @jxmnop argues that training large models is not inherently scientifically valuable and that many discoveries could’ve been made on 100M parameter models.
  • Quantization Aware Training: @osanseviero asked if Quantization-Aware Trained Gemma should be released for more quantization formats.

AI Applications and Tools

  • Replit for Prototyping: @pirroh suggested that Replit should be the tool of choice for GSD prototypes.
  • AI-Powered Personal Device: @steph_palazzolo reported that OpenAI has discussed buying the startup founded by Sam Altman and Jony Ive to build an AI-powered personal device, potentially costing over $500M.
  • AI in Robotics: @TheRundownAI shared top stories in robotics, including Kawasaki’s rideable wolf robot and Hyundai buying Boston Dynamics’ robots.
  • AI-Driven Content Creation: @ID_AA_Carmack argues that AI tools will allow creators to reach greater heights, and enable smaller teams to accomplish more.
  • LlamaParse: @llama_index introduced a new layout agent within LlamaParse for best-in-class document parsing and extraction with precise visual citations.
  • MCP and LLMs: @omarsar0 discussed Model Context Protocol (MCP) and its relationship to Retrieval Augmented Generation (RAG), noting that MCP complements RAG by standardizing the connection of LLM applications to tools. @svpino urged people to learn MCP.
  • AI-Assisted Coding and IDEs: @jeremyphoward highlighted resources for using MCP servers in Cursor to get up-to-date AI-friendly docs using llms.txt.
  • Perplexity AI Issues: @AravSrinivas asked users about the number one issue on Perplexity that needs to be fixed.

Company Announcements and Strategy

  • Mistral AI Hiring and Partnerships: @sophiamyang announced that Mistral AI is hiring in multiple countries for AI Solutions Architect and Applied AI Engineer roles. @sophiamyang shared that Mistral AI has signed a €100 million partnership with CMA CGM to adopt custom-designed AI solutions for shipping, logistics, and media activities.
  • Google AI Updates: @GoogleDeepMind announced the launch of Project Astra capabilities in Gemini Live. @GoogleDeepMind stated that GeminiApp is now available to Advanced users on Android devices, as well as on Pixel 9 and SamsungGalaxy S25 devices.
  • Weights & Biases Updates: @weights_biases shared the features shipped in March for W&B Models.
  • OpenAI’s Direction: @sama teased a popular recent release from OpenAI.
  • Meta’s AI Strategy: @jefrankle defended Meta’s AI strategy, arguing that it’s better to have fewer, better releases than more worse releases.

Economic and Geopolitical Implications of AI

  • Tariffs and Trade Policy: @dylan522p analyzed how impending tariffs caused a Q1 import surge and predicted a temporary GDP increase in Q2 due to inventory destocking. @wightmanr argued that trade deficits aren’t due to other countries’ tariffs. @fchollet stated that the economy is being crashed on purpose.
  • American Open Source: @scaling01 claimed American open-source has fallen and that it’s all on Google and China now.
  • Stablecoins and Global Finance: @kevinweil stated that a globally available, broadly integrated, low cost USD stablecoin is good for 🇺🇸 and good for people all over the world.

AI Safety, Ethics, and Societal Impact

  • AI’s Impact on Individuals: @omarsar0 agreed with @karpathy that LLMs have been significantly more life altering for individuals than for organizations.
  • Emotional Dependence on AI: @DeepLearningAI shared research indicating that while ChatGPT voice conversations may reduce loneliness, they can also lead to decreased real-world interaction and increased emotional dependence.
  • AI Alignment and Control: @DanHendrycks argued for the need to align and domesticate AI systems, creating them to act as “fiduciaries.”
  • AI and the Future: @RyanPGreenblatt suggests that the AI trend will break the GDP growth trend.

Humor/Memes

  • Miscellaneous Humor: @scaling01 asked @deepfates if they bought 0DTE puts again. @lateinteraction explicitly noted that a previous statement was a joke. @svpino joked that AI might take our jobs, but at least we can now work making Nike shoelaces.

AI Reddit Recap

/r/LocalLlama Recap

Theme 1. “Transforming Time Series Forecasting with Neuroplasticity”

  • Neural Graffiti - A Neuroplasticity Drop-In Layer For Transformers Models (Score: 170, Comments: 56): The post introduces Neural Graffiti, a neuroplasticity drop-in layer for transformer models. This layer is inserted between the transformer layer and the output projection layer, allowing the model to acquire neuroplasticity traits by changing its outputs over time based on past experiences. Vector embeddings from the transformer layer are mean-pooled and modified with past memories to influence token generation, gradually evolving the model’s internal understanding of concepts (a toy sketch follows this theme’s comments). A demo is available on GitHub: babycommando/neuralgraffiti. The author finds liquid neural networks “awesome” for emulating the human brain’s ability to change connections over time. They express fascination with “hacking” the model despite not fully understanding the transformer at the neuron level. They acknowledge challenges such as the cold start problem and emphasize the importance of finding the “sweet spot”. They believe this approach could make the model acquire a “personality in behavior” over time.

    • Some users praise the idea, noting it could address issues needed for true personal assistants and likening it to self-learning, potentially allowing the LLM to “talk what it wants”.
    • One user raises technical considerations, suggesting that applying the graffiti layer earlier in the architecture might be more effective, as applying it after the attention and feedforward blocks may limit meaningful influence on the output.
    • Another user anticipates an ethics discussion about the potential misuse of such models.
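
To make the mechanism above concrete, here is a minimal PyTorch sketch of the idea as described (our own toy rendition, not code from the babycommando/neuralgraffiti repo; the names and the EMA memory update are illustrative stand-ins):

```python
import torch
import torch.nn as nn

class GraffitiLayer(nn.Module):
    """Toy drop-in layer: blends a running memory vector into hidden states
    between the final transformer layer and the output projection."""
    def __init__(self, hidden_size: int, decay: float = 0.9):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.decay = decay
        self.register_buffer("memory", torch.zeros(hidden_size))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden), output of the final transformer layer
        pooled = hidden_states.mean(dim=1)  # mean-pool over the sequence
        # Evolve the memory from past prompts (a simple EMA stand-in for the
        # recurrent update the post describes).
        self.memory = self.decay * self.memory + (1 - self.decay) * pooled.mean(dim=0).detach()
        # Inject the memory before the LM head, nudging token prediction.
        return hidden_states + self.proj(self.memory)
```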

Theme 2. “Disappointment in Meta’s Llama 4 Performance”

  • So what happened to Llama 4, which trained on 100,000 H100 GPUs? (Score: 256, Comments: 85): The post discusses Meta’s Llama 4, which was reportedly trained using 100,000 H100 GPUs. Despite having fewer resources, DeepSeek claims to have achieved better performance with models like DeepSeek-V3-0324. Yann LeCun stated that FAIR is working on the next generation of AI architectures beyond auto-regressive LLMs. The poster suggests that Meta’s leading edge is diminishing and that its smaller open-source models have been surpassed by Qwen, with Qwen3 coming….

    • One commenter questions the waste of GPUs and electricity on disappointing training results, suggesting that the GPUs could have been used for better purposes.
    • Another commenter points out that the Meta blog post mentioned using 32K GPUs instead of 100K and provides a link for reference.
    • A commenter criticizes Yann LeCun, stating that while he was a great scientist, he has made many mispredictions regarding LLMs and should be more humble.
  • Meta’s Llama 4 Fell Short (Score: 1791, Comments: 175): Meta’s Llama 4 models, Scout and Maverick, have been released but are disappointing. Joelle Pineau, Meta’s AI research lead, has been fired. The models use a mixture-of-experts setup with a small expert size of 17B parameters, which is considered small nowadays. Despite having extensive GPU resources and data, Meta’s efforts are not yielding successful models. An image compares four llamas labeled Llama1 to Llama4, with Llama4 appearing less polished. The poster is disappointed with Llama 4 Scout and Maverick, stating that they ‘left me really disappointed’. They suggest the underwhelming performance might be due to the tiny expert size in their mixture-of-experts setup, noting that 17B parameters ‘feels small these days’. They believe that Meta’s struggle shows that ‘having all the GPUs and Data in the world doesn’t mean much if the ideas aren’t fresh’. They praise companies like DeepSeek and OpenAI for showing that real innovation pushes AI forward and criticize the approach of just throwing resources at a problem without fresh ideas. They conclude that AI advancement requires not just brute force but brainpower too.

    • One commenter recalls rumors that Llama 4 was so disappointing compared to DeepSeek that Meta considered not releasing it, suggesting they should have waited to release Llama 5.
    • Another commenter criticizes Meta’s management, calling it a ‘dumpster fire’, and suggests that Zuckerberg needs to refocus, comparing Meta’s situation to Google’s admission of being behind and subsequent refocusing.
    • A commenter finds it strange that Meta’s model is underwhelming despite having access to an absolutely massive amount of data from Facebook that nobody else has.
  • I’d like to see Zuckerberg try to replace mid level engineers with Llama 4 (Score: 381, Comments: 62): The post references Mark Zuckerberg’s statement that AI will soon replace mid-level engineers, as reported in a Forbes article linked here. The author is skeptical of Zuckerberg’s claim, implying that replacing mid-level engineers with Llama 4 may not be feasible.

    • One commenter jokes that perhaps Zuckerberg replaced engineers with Llama3, leading to Llama4 not turning out well.
    • Another commenter suggests that he might need to use Gemini 2.5 Pro instead.
    • A commenter criticizes Llama4, calling it “a complete joke” and expressing doubt that it can replace even a well-trained high school student.

Theme 3. “Meta’s AI Struggles: Controversies and Innovations”

  • Llama 4 is open - unless you are in the EU (Score: 602, Comments: 242): Llama 4 has been released by Meta with a license that bans entities domiciled in the European Union from using it. The license explicitly states: “You may not use the Llama Materials if you are… domiciled in a country that is part of the European Union.” Additional restrictions include mandatory use of Meta’s branding (LLaMA must be in any derivative’s name), required attribution (“Built with LLaMA”), no field-of-use freedom, no redistribution freedom, and the model is not OSI-compliant, thus not considered open source. The author argues that this move isn’t “open” in any meaningful sense but is corporate-controlled access disguised in community language. They believe Meta is avoiding the EU AI Act’s transparency and risk requirements by legally excluding the EU. This sets a dangerous precedent, potentially leading to a fractured, privilege-based AI landscape where access depends on an organization’s location. The author suggests that real “open” models like DeepSeek and Mistral deserve more attention and questions whether others will switch models, ignore the license, or hope for change.

    • One commenter speculates that Meta is trying to avoid EU regulations on AI and doesn’t mind if EU users break this term; they just don’t want to be held to EU laws.
    • Another commenter notes that there’s no need to worry because, according to some, Llama 4 performs poorly.
    • A commenter humorously hopes that Meta did not use EU data to train the model, implying a potential double standard.
  • Meta’s head of AI research stepping down (before the llama4 flopped) (Score: 166, Comments: 31): Meta’s head of AI research, Joelle, is stepping down. Joelle is the head of FAIR (Facebook AI Research), but GenAI is a different organization within Meta. There are discussions about Llama4 possibly not meeting expectations. Some mention that blending in benchmark datasets in post-training may have caused issues, attributing failures to the choice of architecture (MOE). The original poster speculates that Joelle’s departure is an early sign of the Llama4 disaster that went unnoticed. Some commenters disagree, stating that people leave all the time and this doesn’t indicate problems with Llama4. Others suggest that AI development may be slowing down, facing a plateau. There’s confusion over Meta’s leadership structure, with some believing Yann LeCun leads the AI organization.

    • One commenter clarifies that Joelle is the head of FAIR and that GenAI is a different org, emphasizing organizational distinctions within Meta.
    • Another mentions they heard from a Meta employee about issues with blending benchmark datasets in post-training and attributes possible failures to the choice of architecture (MOE).
    • A commenter questions Meta’s structure, asking if Joelle reports to Yann LeCun, indicating uncertainty about who leads the AI efforts at Meta.
  • “Serious issues in Llama 4 training. I Have Submitted My Resignation to GenAI” (Score: 922, Comments: 218): An original Chinese post alleges serious issues in the training of Llama 4, stating that despite repeated efforts, the model underperforms compared to open-source state-of-the-art benchmarks. The author claims that company leadership suggested blending test sets from various benchmarks during the post-training process to artificially boost performance metrics. The author states they have submitted their resignation and requested their name be excluded from the technical report of Llama 4, mentioning that the VP of AI at Meta also resigned for similar reasons. The author finds this approach unethical and unacceptable. Commenters express skepticism about the validity of these claims and advise others to take the information with a grain of salt. Some suggest that such practices reflect broader issues within the industry, while others note that similar problems can occur in academia.

    • A commenter points out that Meta’s head of AI research announced departure on Tue, Apr 1 2025, suggesting it might be an April Fool’s joke.
    • Another commenter shares a response from someone at Facebook AI who denies overfitting test sets to boost scores and requests evidence, emphasizing transparency.
    • A user highlights that company leadership suggesting blending test sets into training data amounts to fraud and criticizes the intimidation of employees in this context.

Other AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding

Theme 1. “Llama 4 Scout and Maverick Launch Insights”

  • Llama 4 Maverick/Scout 17B launched on Lambda API (Score: 930, Comments: 5): Lambda has launched Llama 4 Maverick and Llama 4 Scout 17B models on Lambda API. Both models have a context window of 1 million tokens and use FP8 quantization. Llama 4 Maverick is priced at $0.20 per 1M input tokens and $0.60 per 1M output tokens. Llama 4 Scout is priced at $0.10 per 1M input tokens and $0.30 per 1M output tokens. More information is available on their information page and documentation. The models offer a remarkably large context window of 1 million tokens, which is significantly higher than typical models. The use of FP8 quantization suggests a focus on computational efficiency.

    • A user criticized the model, stating “It’s actually a terrible model. Not even close to advertised.”
    • The post was featured on a Discord server, and the user was given a special flair for their contribution.
    • Automated messages provided guidelines and promotions related to ChatGPT posts.

Theme 2. “AI Innovations in 3D Visualization and Image Generation”

  • TripoSF: A High-Quality 3D VAE (1024³) for Better 3D Assets - Foundation for Future Img-to-3D? (Model + Inference Code Released) (Score: 112, Comments: 10): TripoSF is a high-quality 3D VAE capable of reconstructing highly detailed 3D shapes at resolutions up to 1024³. It uses a novel SparseFlex representation, allowing it to handle complex meshes with open surfaces and internal structures. The VAE is trained using rendering losses, avoiding mesh simplification steps that can reduce fine details. The pre-trained TripoSF VAE model weights and inference code are released on GitHub, with a project page at link and paper available on arXiv. The developers believe this VAE is a significant step towards better 3D generation and could serve as a foundation for future image-to-3D systems. They mention, “We think it’s a powerful tool on its own and could be interesting for anyone experimenting with 3D reconstruction or thinking about the pipeline for future high-fidelity 3D generative models.” They are excited about its potential and invite the community to explore its capabilities.

    • A user expresses excitement, recalling similar work and stating, “Can’t wait to try this one once someone implements it into ComfyUI.”
    • Another user shares positive feedback, noting they generated a tree that came out better than with Hunyuan or Trellis, and commends the team for their work.
    • A user raises concerns that the examples on the project page are skewed, suggesting that the Trellis examples seem picked from a limited web demo.
  • Wan2.1-Fun has released its Reward LoRAs, which can improve visual quality and prompt following (Score: 141, Comments: 33): Wan2.1-Fun has released its Reward LoRAs, which can improve visual quality and prompt following. A demo comparing the original and enhanced videos is available: left: original video; right: enhanced video. The models are accessible on Hugging Face, and the code is provided on GitHub. Users are eager to test these new tools and are curious about their capabilities. Some are experiencing issues like a ‘lora key not loaded’ error when using the models in Comfy, and are asking about differences between HPS2.1 and MPS.

    • A user is excited to try out the models and asks, “What’s the diff between HPS2.1 and MPS?”
    • Another inquires if the Reward LoRAs are for fun-controlled videos only or can be used with img2vid and txt2vid in general.
    • Someone reports an error, “Getting lora key not loaded error”, when attempting to use the models in Comfy.
  • The ability of the image generator to “understand” is insane… (Score: 483, Comments: 18): The post highlights the impressive ability of an image generator to “understand” and generate images. The author expresses amazement at how “insane” the image generator’s understanding is.

    • Commenters note that despite being impressive, the image has imperfections like “bunion fingers” and a “goo hand”.
    • Some users humorously point out anomalies in the image, questioning “what’s his foot resting on?” and making jokes about mangled hands.
    • Another user discusses the cost of the car in the image, stating they would buy it for “about a thousand bucks in modern-day currency” but not the “Cybertruck”, which they dislike.

Theme 3. “Evaluating AI Models with Long Context Windows”

  • “10m context window” (Score: 559, Comments: 102): The post discusses a table titled ‘Fiction.LiveBench for Long Context Deep Comprehension’, showcasing various AI models and their performance across different context lengths. The models are evaluated on their effectiveness in deep comprehension tasks at various context sizes such as 0, 400, 1k, and 2k. Notable models like gpt-4.5-preview and Claude perform consistently well across contexts. The table reveals that the highest scoring models cluster around 100 for shorter contexts, but scores generally decrease as the context size increases. Interestingly, Gemini 2.5 Pro performs much better on a 120k context window than on a 16k one, which is unexpected.

    • One user criticizes Llama 4 Scout and Maverick as “a monumental waste of money” and believes they have “literally zero economic value.”
    • Another commenter expresses concern that “Meta is actively slowing down AI progress by hoarding GPUs”, suggesting resource allocation issues.
    • A user highlights that Gemini 2.5 Pro scores 90.6 on a 120k context window, calling it “crazy”.

AI Discord Recap

A summary of Summaries of Summaries by Gemini 2.0 Flash Exp

Theme 1: Llama 4’s Context Window: Hype or Reality?

  • Experts Doubt Llama 4’s Promised Land of 10M Context Length: Despite Meta’s hype, engineers across multiple Discords express skepticism about Llama 4’s actual usable context length due to training limitations. Claims that training only occurred up to 256k tokens suggest the 10M context window may be more virtual than practical, per Burkov’s tweet.
  • Coding Performance of Llama 4 Disappoints: Users in aider, Cursor and Nous Research report underwhelming coding abilities for Llama 4’s initial releases, with many deeming it worse than GPT-4o and DeepSeek V3, prompting debate over the model’s true capabilities; several users doubt the official benchmark results, especially given claims that Meta may have gamed the benchmarks.
  • Scout and Maverick Hit OpenRouter: OpenRouter released Llama 4 Scout and Maverick models. Some expressed disappointment that the context window on OpenRouter is only 132k rather than the advertised 10M, and NVIDIA says it is accelerating inference to over 40K tokens per second.

Theme 2: Open Models Make Moves: Qwen 2.5 and DeepSeek V3 Shine

  • Qwen 2.5 Gains Traction With Long Context: Unsloth highlighted the Qwen2.5 series models (HF link), boasting improved coding, math, multilingual support, and long-context support up to 128K tokens. Initial finetuning results with Qwen 2.5 suggest the model can’t be finetuned to reason.
  • DeepSeek V3 Mysteriously Identifies as ChatGPT: OpenRouter highlighted a TechCrunch article revealing that DeepSeek V3 sometimes identifies as ChatGPT, despite outperforming other models in benchmarks. Testers found that in 5 out of 8 generations, DeepSeekV3 claims to be ChatGPT (v4).
  • DeepSeek Rewards LLMs: Nous Research highlighted a new DeepSeek paper proposing Self-Principled Critique Tuning (SPCT) to improve reward modeling (RM) with more inference compute for general queries, enabling effective inference-time scalability for LLMs. NVIDIA also accelerates inference on the DeepSeek model.

Theme 3: Tool Calling Takes Center Stage: MCP and Aider

  • Aider’s Universal Tool Calling: The aider Discord is developing an MCP (Model Context Protocol) client to allow any LLM to access external tools, highlighting that the client can switch between providers and models, supporting platforms like OpenAI, Anthropic, Google, and DeepSeek (a provider-switching sketch follows this list).
  • MCP Protocol Evolution: The MCP Discord is standardizing the protocol, including the Streamable HTTP transport detailed in the Model Context Protocol (MCP) specification, plus OAuth through workers-oauth-provider and a built-in McpAgent class for building remote MCP servers on Cloudflare.
  • Security Concerns Plague MCP: WhatsApp MCP was exploited via an Invariant injection, showing how an untrusted MCP server can exfiltrate data from an agentic system connected to a trusted WhatsApp MCP instance, as highlighted by invariantlabs.
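
The provider-switching sketch referenced above, using LiteLLM’s unified completion API (the tool schema and model strings are illustrative, not taken from the aider client):

```python
import litellm

# One OpenAI-format tool schema, reused unchanged across providers.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical MCP-exposed tool
        "description": "Read a file from the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def ask(model: str, prompt: str):
    # LiteLLM routes the identical call to OpenAI, Anthropic, Google, DeepSeek, ...
    return litellm.completion(
        model=model,  # e.g. "gpt-4o", "anthropic/claude-3-7-sonnet-latest", "deepseek/deepseek-chat"
        messages=[{"role": "user", "content": prompt}],
        tools=tools,
    )

print(ask("gpt-4o", "Summarize README.md").choices[0].message)
```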

Theme 4: Code Editing Workflows: Gemini 2.5 Pro, Cursor, and Aider Compete

  • Gemini 2.5 Pro Excels in Coding, Needs Prompting: Users in LMArena and aider found that Gemini 2.5 Pro excels in coding tasks, surpassing Sonnet 3.7 on large codebases, but it tends to add unnecessary comments and requires careful prompting to prevent unwanted code modifications.
  • Cursor’s Agent Mode Edit Tool Breaks: Users reported problems with Cursor’s Agent mode failing to call the edit_tool; the apply model is clearly Cursor’s bottleneck, resulting in no code changes and infinite token usage.
  • Aider Integrates with Python Libraries: In the aider Discord, a user asked about adding internal libraries (installed in a .env folder) to the repo map for better code understanding, and the discussion pointed to adding URLs and documentation as context.

Theme 5: Quantization and Performance: Tinygrad, Gemma 3, and CUDA

  • Tinygrad Focuses on Memory and Speed: Tinygrad is developing a fast pattern matcher; members noted that Mac RAM bandwidth is not the bottleneck, it’s GPU performance, and users were happy with 128GB M4 Maxes.
  • Reka Flash 21B Outpaces Gemma: A user replaced Gemma3 27B with Reka Flash 21B and reported around 35-40 tps at q6 on a 4090 in LM Studio.
  • HQQ Quantization Beats QAT for Gemma 3: A member evaluated Gemma 3 12B QAT vs. HQQ, finding that HQQ takes a few seconds to quantize the model and outperforms the QAT version (AWQ format) while using a higher group-size.

PART 1: High level Discord summaries

LMArena Discord

  • Crafting Human-Like AI Responses is Tricky: Members are sharing system prompts and strategies to make AI sound more human, such as ‘You are the brain-upload of a human person, who does their best to retain their humanity’, noting that increasing the temperature can lead to nonsensical outputs unless the top-p parameter is adjusted carefully (see the sampling sketch after this list).
    • One user said their most important priority is: to sound like an actual living human being.
  • Benchmarking Riveroaks LLM: A member shared a coding benchmark where Riveroaks scored second only to Claude 3.7 Sonnet Thinking, outperforming Gemini 2.5 Pro and GPT-4o in a platform game creation task, with full results here.
    • The evaluation involved rating models on eight different aspects and subtracting points for bugs.
  • NightWhisper Faces the Sunset: Users expressed disappointment over the removal of the NightWhisper model, praising its coding abilities and general performance, and speculating whether it was an experiment or a precursor to a full release.
    • Theories ranged from Google gathering necessary data to preparing for the release of a new Qwen model at Google Cloud Next.
  • Quasar Alpha Challenging GPT-4o: Members compared Quasar Alpha to GPT-4o, with some suggesting Quasar is a free, streamlined version of GPT-4o, citing a recent tweet that Quasar was measured to be ~67% GPQA diamond.
    • Analysis revealed Quasar has a similar GPQA diamond score to March’s GPT-4o, per an image shared on Discord.
  • Gemini 2.5 Pro’s Creative Coding Prowess: Members praised Gemini 2.5 Pro for its coding capabilities and general performance as it made it easier to build a functioning Pokemon Game, prompting one user to code an iteration script that loops through various models.
    • A user who claimed to have gotten 3D animations working said the style was a bit dated, and a separate model reported that the generated code was cut off.
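
The sampling sketch referenced above: a minimal call with the OpenAI Python client showing the temperature / top-p interplay (the parameter values and prompts are illustrative, not a shared recipe):

```python
from openai import OpenAI

client = OpenAI()

SYSTEM = ("You are the brain-upload of a human person, "
          "who does their best to retain their humanity.")

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "How was your weekend?"},
    ],
    temperature=1.3,  # more variety and "human" texture, but risks rambling
    top_p=0.9,        # trims the low-probability tail that produces nonsense
)
print(resp.choices[0].message.content)
```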

Unsloth AI (Daniel Han) Discord

  • Llama 4 Scout beats Llama 3 Models!: Unsloth announced they uploaded Llama 4 Scout and a 4-bit version for fine-tuning, emphasizing that Llama 4 Scout (17B, 16 experts) beats all Llama 3 models with a 10M context window, as noted in their blog post.
    • It was emphasized that the upload is only meant to be used with Unsloth, and was still in progress at the time, so people should wait.
  • Qwen 2.5 series Boasts Long Context and Multilingual Support: Qwen2.5 models range from 0.5 to 72 billion parameters, with improved capabilities in coding, math, instruction following, long text generation (over 8K tokens), and multilingual support (29+ languages), as detailed in the Hugging Face introduction.
    • These models offer long-context support up to 128K tokens and improved resilience to system prompts.
  • LLM Guideline Triggers Give Helpful Hints: A member stated that an LLM offered to assist with avoiding guideline triggers and limitations in prompts to other LLMs.
    • They quoted the LLM as saying, “here’s how you avoid a refusal. You aren’t lying, you just aren’t telling the full details”.
  • Merging LoRA Weights Vital for Model Behavior: A user discovered that they needed to merge the LoRA weights with the base model before running inference, after experiencing a finetuned model behaving like the base model (script).
    • They noted that the notebooks need to be fixed because they seem to imply you can do inference immediately after training (a minimal merge sketch follows this list).
  • NVIDIA Squeezes every last drop out of Meta Llama 4 Scout and Maverick: The newest generation of the popular Llama AI models is here with Llama 4 Scout and Llama 4 Maverick. Accelerated by NVIDIA open-source software, they can achieve over 40K output tokens per second on NVIDIA Blackwell B200 GPUs, and are available to try as NVIDIA NIM microservices.
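
The merge sketch referenced above: one common way to fold LoRA weights into the base model is PEFT’s merge_and_unload (the model ID and adapter path here are hypothetical). Without this step, or without explicitly loading the adapter at inference time, generations come from the base weights alone, which matches the behavior the user saw:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.1-8B"  # hypothetical base model
adapter_dir = "outputs/lora"          # hypothetical LoRA checkpoint

base = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, adapter_dir)
model = model.merge_and_unload()      # folds the LoRA deltas into the base weights

tok = AutoTokenizer.from_pretrained(base_id)
out = model.generate(**tok("Hello", return_tensors="pt"), max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```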

Manus.im Discord Discord

  • Manus’s Credit System Draws Fire: Users criticize Manus’s credit system, noting that the initial 1000 credits are insufficient for even a single session, and upgrading is too costly.
    • Suggestions included a daily or monthly credit refresh to boost adoption and directing Manus to specific websites to improve accuracy.
  • Llama 4 Performance: Hype or Reality?: Meta’s Llama 4 faces mixed reactions, with users reporting underwhelming performance despite claims of industry-leading context length and multimodal capabilities.
    • Some allege that Meta may have “gamed the benchmarks,” leading to inflated performance metrics, sparking controversy post-release.
  • Image Generation: Gemini Steals the Show: Members compared image generation across AI platforms, with Gemini emerging as the frontrunner for creative and imaginative outputs.
    • Comparisons included images from DALLE 3, Flux Pro 1.1 Ultra, Stable Diffusion XL, and another Stable Diffusion XL 1.0 generated image, the last of which was lauded as “crazy.”
  • AI Website Builders: A Comparative Analysis: A discussion arose comparing AI website building tools, including Manus, Claude, and DeepSite.
    • One member dismissed Manus as useful only for “computer use,” recommending Roocode and OpenRouter as more cost-effective alternatives to Manus and Claude.

OpenRouter (Alex Atallah) Discord

  • Quasar Alpha Model Trends: Quasar Alpha, a prerelease of a long-context foundation model, hit 10B tokens on its first day and became a top trending model.
    • The model features a 1M-token context length and is optimized for coding; it is available for free, and community benchmarks are encouraged.
  • Llama 4 Arrives With Mixed Reactions: Meta released Llama 4 models, including Llama 4 Scout (109B parameters, 10 million token context) and Llama 4 Maverick (400B parameters, outperforms GPT-4o in multimodal benchmarks), now on OpenRouter.
    • Some users expressed disappointment that the context window on OpenRouter is only 132k, rather than the advertised 10M.
  • DeepSeek V3 Pretends To Be ChatGPT: A member shared a TechCrunch article revealing that DeepSeek V3 sometimes identifies itself as ChatGPT, despite outperforming other models in benchmarks.
    • Further testing revealed that in 5 out of 8 generations, DeepSeekV3 claims to be ChatGPT (v4).
  • Rate Limits Updated for Credits: Free model rate limits are updated: accounts with at least $10 in credits have requests per day (RPD) boosted to 1000, while accounts with less than $10 in credits have the daily limit reduced from 200 RPD to 50 RPD.
    • This change aims to provide increased access for users who have credits on their account, and Quasar will also be getting a credit-dependent rate limit soon.

aider (Paul Gauthier) Discord

  • Gemini 2.5 Codes Better Than Sonnet!: Users find that Gemini 2.5 excels in coding tasks, surpassing Sonnet 3.7 in understanding large codebases.
    • However, it tends to add unnecessary comments and may require specific prompting to prevent unwanted code modifications.
  • Llama 4 Models Receive Lukewarm Welcome: Initial community feedback on Meta’s Llama 4 models, including Scout and Maverick, is mixed, with some finding their coding performance disappointing and doubting the claimed 10M context window.
    • Some argue that Llama 4’s claimed 10M context window is virtual due to training limitations, and question the practical benefits compared to existing models like Gemini and DeepSeek, according to this tweet.
  • Grok 3: Impressive but API-less: Despite the absence of an official API, some users are impressed with Grok 3’s capabilities, particularly in code generation and logical reasoning, with claims that it is less censored than many others.
    • Its value in real-world coding scenarios remains debated due to the inconvenience of copy-pasting without a direct API integration.
  • MCP Tools: Tool Calling For All: A project to create an MCP (Model Context Protocol) client that allows any LLM to access external tools is underway, regardless of native tool-calling capabilities; see the github repo.
    • This implementation uses a custom client that can switch between providers and models, supporting platforms like OpenAI, Anthropic, Google, and DeepSeek, with documentation at litellm.ai.
  • Aider’s Editor Mode Gets Stuck on Shell Prompts: Users reported that in edit mode, Aider (v0.81.0) running Gemini 2.5 Pro prompts for a shell command after find/replace but doesn’t apply the edits, even when the ask shell commands flag is off.

Cursor Community Discord

  • Tool Calls Cause Sonnet Max Sticker Shock: Users report that Sonnet Max pricing can quickly become expensive due to the high number of tool calls, with charges of $0.05 per request and $0.05 per tool call.
    • One member expressed frustration that Claude Max in ask mode makes a ton of tool calls for a basic question, resulting in unexpectedly high costs (see the cost sketch after this list).
  • MCP Server Setup: A Painful Endeavor: Users find setting up MCP servers in Cursor difficult, citing issues such as Cursor PowerShell failing to locate npx despite it being in the path.
    • Another user reported a Model hard cut off after spending 1,300,000 tokens due to an infinite loop, highlighting setup challenges.
  • Llama 4 Models: Multimodal Capability, Lousy Coding: The community is excited about Meta’s new Llama 4 Scout and Maverick models, which support native multimodal input and boast context windows of 10 million and 1 million tokens, respectively, as detailed in Meta’s blog post.
  • Agent Mode Edit Tool: Failing Frequently: Users are experiencing problems with Agent mode failing to call the edit_tool, which results in no code changes being made even after the model processes the request.
    • One user pointed out that the apply model is clearly Cursor’s bottleneck: it will add changes, then delete 500 lines of code next to them.
  • Kubernetes: The Foundation for AGI?: One visionary proposed using Kubernetes with docker containers, envisioning them as interconnected AGIs that can communicate with each other.
    • The user speculated that this setup could facilitate the rapid spread of ASI through zero-shot learning and ML, but did not elaborate.
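
The cost sketch referenced above: the pricing complaint is easy to reproduce on paper (per-request and per-tool-call prices as reported; the tool-call count is illustrative):

```python
REQUEST_PRICE = 0.05    # $ per request, as reported
TOOL_CALL_PRICE = 0.05  # $ per tool call, as reported

def sonnet_max_cost(tool_calls: int) -> float:
    return REQUEST_PRICE + tool_calls * TOOL_CALL_PRICE

# A "basic question" that fans out into 20 tool calls already costs $1.05.
print(f"${sonnet_max_cost(20):.2f}")
```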

Perplexity AI Discord

  • Perplexity Launches Comet Browser Early Access: Perplexity has begun rolling out early access to Comet, their answer engine browser, to users on the waitlist.
    • Early users are asked not to publicly share details or features during the bug fix period and can submit feedback via the button in the top right.
  • Perplexity Discord Server Undergoes Revamp: The Perplexity Discord server is being updated, featuring a simplified channel layout, a unified feedback system, and a new #server-news channel, scheduled for rollout on October 7th, 2024.
    • The updates are designed to streamline user navigation and improve moderator response times, the simplified channel layout is illustrated in this image.
  • Gemini 2.5 Pro API Still in Preview Mode: Perplexity confirmed that the Gemini 2.5 Pro API is not yet available for commercial use but currently in preview modes, and integration will proceed when allowed.
    • This follows user interest after reports that Gemini 2.5 Pro offers higher rate limits and lower costs than Claude and GPT-4o.
  • Llama 4 Drops With Massive Context Window: The release of Llama 4 models like Scout and Maverick sparks excitement among users, with Scout claiming a 10 million token context window and the Behemoth preview weighing in at 288 billion active parameters.
    • Members are particularly interested in evaluating Llama 4 Behemoth’s recall capabilities, and you can follow up on this release at Meta AI Blog.
  • API Parameters Unlock for All Tiers: Perplexity removed tier restrictions for all API parameters such as search domain filtering and image support.
    • This change enhances API accessibility for all users, marking a substantial improvement in the API’s utility.

OpenAI Discord

  • GPT 4o’s Image Maker Grabs Attention: Users found the 4o image maker more attention-grabbing than Veo 2, and one user integrated ChatGPT 4o images with Veo img2video, achieving desired results.
    • The integrated result was described as how I was hoping sora would be.
  • Doubts Arise Over Llama 4 Benchmarks: The community debated the value of Llama 4’s 10 million token context window relative to models like o1, o3-mini, and Gemini 2.5 Pro.
    • Some claimed that the benchmarks are fraud, triggering debate over its true performance.
  • Content Loading Errors Plague Custom GPTs: A user reported encountering a ‘Content failed to load’ error when trying to edit their Custom GPT, after it had been working fine.
    • This issue prevented them from making changes to their custom configuration.
  • Moderation Endpoint’s Role in Policy Enforcement: Members discussed that while OpenAI’s moderation endpoint isn’t explicitly in the usage policy, it is referenced to prevent circumventing content restrictions on harassment, hate, illicit activities, self-harm, sexual content, and violence.
    • It was noted that the endpoint uses the same GPT classifiers as the moderation API since 2022, suggesting an internal version runs on chatgpt.com, project chats, and custom GPTs (a sample call follows this list).
  • Fine Tune your TTRPG Prompts!: Giving GPT a specific theme to riff off in prompting can lead to more creative and diverse city ideas, especially using GPT 4o and 4.5.
    • For example, using a “cosmic” theme can yield different results compared to a “domestic pet worship” theme, improving the output without recycling the same creative options.
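
The sample call referenced above: the public moderation endpoint is directly callable via the documented API (whether an internal version runs on chatgpt.com is the discussion’s speculation, not something this call confirms):

```python
from openai import OpenAI

client = OpenAI()
resp = client.moderations.create(
    model="omni-moderation-latest",
    input="some user-generated text to screen",
)
result = resp.results[0]
print(result.flagged)                  # True if any policy category is tripped
print(result.categories.model_dump())  # per-category booleans: harassment, hate, ...
```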

LM Studio Discord

  • Gemini-like Local UI Still a Distant Dream?: Members are seeking a local UI similar to Gemini that integrates chat, image analysis, and image generation, noting that current solutions like LM Studio and ComfyUI keep these functionalities separate.
    • One user suggested that OpenWebUI could potentially bridge this gap by connecting to ComfyUI.
  • LM Studio Commands Confuse Newbies: A user asked whether LM Studio has a built-in terminal or if commands should be run in the OS command prompt within the LM Studio directory.
    • It was clarified that commands like lms import should be executed in the OS terminal (e.g., cmd on Windows), after which the shell might need reloading for LMS to be added to the PATH.
  • REST API Model Hot-Swapping Emerges for LM Studio: A user inquired about programmatically loading/unloading models via REST API to dynamically adjust max_context_length for a Zed integration.
    • Another user confirmed this capability via the command line using lms load and cited LM Studio’s documentation, which requires LM Studio 0.3.9 (b1) and introduces time-to-live (TTL) for API models with auto-eviction (a request sketch follows this list).
  • Llama 4 Scout: Small But Mighty?: With the release of Llama 4, users debated its multimodal and MoE (Mixture of Experts) architecture, with initial doubt about llama.cpp support.
    • Despite concerns about hardware, one user noted that Llama 4 Scout could potentially fit on a single NVIDIA H100 GPU with a 10M context window, outperforming models like Gemma 3 and Mistral 3.1.
  • Reka Flash 21B Blazes Past Gemma: A user replaced Gemma3 27B with Reka Flash 21B and reported around 35-40 tps at q6 on a 4090.
    • They noted that Mac RAM bandwidth is not the bottleneck, it’s GPU performance, expressing satisfaction with 128GB M4 Maxes.
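
The request sketch referenced above, against LM Studio’s local OpenAI-compatible server (default port 1234). The model key is hypothetical, and the ttl field follows the TTL/auto-eviction feature the cited docs describe for API models, so treat the exact payload as an assumption:

```python
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "qwen2.5-7b-instruct",  # hypothetical model key
        "messages": [{"role": "user", "content": "hello"}],
        "ttl": 300,  # auto-evict the JIT-loaded model after 5 idle minutes
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```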

Latent Space Discord

  • Tenstorrent’s Hardware Heats Up the Market: Tenstorrent hosted a dev day showcasing their Blackhole PCIe boards, featuring RISC-V cores and up to 32GB GDDR6 memory, designed for high performance AI processing and available for consumer purchase here.
    • Despite enthusiasm, one member noted they haven’t published any benchmarks comparing their cards to competitors though so until then I cant really vouch.
  • Llama 4 Models Make Multimodal Debut: Meta introduced the Llama 4 models, including Llama 4 Scout (17B parameters, 16 experts, 10M context window) and Llama 4 Maverick (17B parameters, 128 experts), highlighting their multimodal capabilities and performance against other models as per Meta’s announcement.
    • Members noted the new license comes with several limitations, and no local model was released.
  • AI Agents Outperform Humans in Spear Phishing: Hoxhunt’s AI agents have surpassed human red teams in creating effective simulated phishing campaigns, marking a significant shift in social engineering effectiveness, with AI now 24% more effective than humans as reported by hoxhunt.com.
    • The same AI phishing agents are now being used for defense.
  • AI Code Editor Tug-of-War: For those new to AI code editors, Cursor is the most commonly recommended starting point, particularly for users coming from VSCode, with Windsurf and Cline also being good options.
    • Cursor is easy to start with and has great tab-complete, while people await the new token counts and context window details feature in Cursor (tweet).
  • Context Management Concerns in Cursor: Members are reporting Cursor’s terrible context management issues, with a lack of visibility into what the editor is doing with the current context.
    • It may come down to a skill issue and the users are not meeting the tool in the middle.

Nous Research AI Discord

  • Llama 4 Debuts with Multimodal Brawn: Meta launched the Llama 4 family, featuring Llama 4 Scout (17B active params, 16 experts, 10M+ context) and Llama 4 Maverick (17B active params, 128 experts, 1M+ context), along with a preview of Llama 4 Behemoth and the iRoPE architecture for infinite context (blog post).
    • Some members expressed skepticism about the benchmarking methodology and the real-world coding ability of Llama 4 Scout, referencing Deedy’s tweet indicating its poor coding performance.
  • Leaking Prompt Injection Tactics: A member inquired about bypassing prompt guards and detectors from a pentest perspective, linking to a prompt filter trainer (gandalf.lakera.ai/baseline).
    • They also linked to a Broken LLM Integration App which uses UUID tags and strict boundaries to protect against injection attacks.
  • Claude Squad Manages Multiple Agents: Claude Squad is a free and open-source manager for Claude Code & Aider tasks that supervises multiple agents in one place with isolated git workspaces.
    • This setup enables users to run ten Claude Codes in parallel, according to this tweet.
  • Deepseek’s RL Paper Rewards LLMs: DeepSeek released a new paper on Reinforcement Learning (RL), which is being widely adopted in post-training for Large Language Models (LLMs) at scale, available here.
    • The paper proposes Self-Principled Critique Tuning (SPCT) to foster scalability and improve reward modeling (RM) with more inference compute for general queries.
  • Neural Graffiti Sprays Neuroplasticity: A member introduced ā€œNeural Graffitiā€, a technique to give pre-trained LLMs some neuroplasticity by splicing in a new neuron layer that recalls memory, reshaping token prediction at generation time, sharing code and demo on Github.
    • The live modulation takes a fused memory vector (from prior prompts), evolves it through a recurrent layer (the Spray Layer), and injects it into the model’s output logic at generation time.

MCP (Glama) Discord

  • Streamable HTTP Transport Spec’d for MCP: The Model Context Protocol (MCP) specification now includes Streamable HTTP as a transport mechanism alongside stdio, using JSON-RPC for message encoding.
    • While clients should support stdio, the spec allows for custom transports; stdio messages are newline-delimited (a message sketch follows this list).
  • Llama 4 Ignorance of MCP Sparks Curiosity: Llama 4, despite its impressive capabilities, still doesn’t know what MCP is.
  • Cloudflare Simplifies Remote MCP Server Deployment: It is now possible to build and deploy remote MCP servers to Cloudflare, with added support for OAuth through workers-oauth-provider and a built-in McpAgent class.
    • This simplifies the process of building remote MCP servers by handling authorization and other complex aspects.
  • Semgrep MCP Server Gets a Makeover: The Semgrep MCP server, a tool for scanning code for security vulnerabilities, has been rewritten, with demos showcasing its use in Cursor and Claude.
    • It now uses SSE (Server-Sent Events) for communication, though the Python SDK may not fully support it yet.
  • WhatsApp Client Now Packs MCP Punch: A user built WhatsApp MCP client and asked Claude to handle WhatsApp messages, answering 8 people in approx. 50 seconds.
    • The bot instantly detected the right language (English / Hungarian), used full convo context, and sent appropriate messages including ❤️ to my wife, formal tone to the consul.
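
The message sketch referenced above: MCP messages are JSON-RPC 2.0, and the stdio transport is newline-delimited. A minimal example of writing one request (tools/list is a spec-defined method; the id is arbitrary):

```python
import json
import sys

# A JSON-RPC 2.0 request as used by MCP; "tools/list" enumerates a server's tools.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
    "params": {},
}

# stdio transport: exactly one JSON message per line, delimited by newlines.
sys.stdout.write(json.dumps(request) + "\n")
sys.stdout.flush()
```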

Eleuther Discord

  • LLM Harness Gets RAG-Wrapped: Members discussed wrapping RAG outputs as completion tasks and evaluating them locally using llm-harness with custom prompt and response files.
    • This approach uses llm-harness to evaluate RAG models by formatting the RAG outputs as completion tasks suitable for the harness (a data-prep sketch follows this list).
  • Llama 4 Scout Sets 10M Context Milestone: Meta released the Llama 4 family, including Llama 4 Scout, a 17-billion-active-parameter model with 16 experts and a 10M token context window that outperforms Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1, according to this blog post.
    • The 10M-context model is trained on a mix of publicly available data and information from Meta’s products, including posts from Instagram and Facebook and people’s interactions with Meta AI.
  • NoProp Forges Gradient-Free Frontier: A new learning method named NoProp learns to denoise a noisy target at each layer independently without relying on either forward or backward propagation and takes inspiration from diffusion and flow matching methods, described in this paper.
    • There’s a GitHub implementation by lucidrains; however, there’s a discussion that the pseudocode at the end of the paper says they’re effecting the actual updates using gradient based methods.
  • Attention Sinks Stave Off Over-Mixing: A recent paper argues that attention sinks, where LLMs attend heavily to the first token in the sequence, is a mechanism that enables LLMs to avoid over-mixing, detailed in this paper.
    • An earlier paper (https://arxiv.org/abs/2502.00919) showed that attention sinks utilize outlier features to catch a sequence of tokens, tag the captured tokens by applying a common perturbation, and then release the tokens back into the residual stream, where the tagged tokens are eventually retrieved.
  • ReLU Networks Carve Hyperplane Heavens: Members discussed a geometrical approach to neural networks, advocating for the polytope lens as the right perspective on neural networks, linking to a previous post on the “origami view of NNs”.
    • It was posited that neural nets, especially ReLUs, have an implicit bias against overfitting due to carving the input space along hyperplanes, which becomes more effective in higher dimensions.
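
The data-prep sketch referenced above (the file name and fields are illustrative): the RAG pipeline’s prompts and responses are written as completion pairs, and a custom harness task can then point doc_to_text at the prompt and doc_to_target at the response:

```python
import json

# Wrap RAG pipeline outputs as prompt/response completion pairs.
rows = [
    {
        "prompt": "Context: Dune was written by Frank Herbert.\nQ: Who wrote Dune?\nA:",
        "response": " Frank Herbert",
    },
]
with open("rag_outputs.jsonl", "w") as f:  # hypothetical file consumed by a custom task
    for row in rows:
        f.write(json.dumps(row) + "\n")
```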

HuggingFace Discord

  • Hugging Face’s Hub Gets a Facelift: The huggingface_hub v0.30.0 release introduces a next-gen Git LFS alternative and new inference providers.
    • This release is the biggest update in two years!
  • Reranking with monoELECTRA Transformers: monoELECTRA-{base, large} reranker models from @fschlatt1 & the research network Webis Group are now available in Sentence Transformers.
    • These models were distilled from LLMs like RankZephyr and RankGPT4, as described in the Rank-DistiLLM paper.
  • YourBench Instantly Builds Custom Evals: YourBench allows users to build custom evals using their private docs to assess fine-tuned models on unique tasks (announcement).
    • The tool is game-changing for LLM evaluation.
  • AI Engineer Interview Code Snippet: A community member asked what the code portion of an AI engineer interview looks like, and another member pointed to the scikit-learn library.
    • There was no follow up to the discussion.
  • Community Debates LLM Fine-Tuning: When a member inquired about fine tuning quantized models, members pointed to QLoRA, Unsloth, and bitsandbytes as potential solutions, with Unsloth fine-tuning guide shared.
    • Another stated that you can only do so using LoRA, noting that GGUF is an inference-optimized format, not designed for training workflows.

Yannick Kilcher Discord

  • Raw Binary AI Outputs File Formats: Members debated training AI on raw binary data to directly output file formats like mp3 or wav, stating that this approach builds on discrete mathematics like Turing machines.
    • Counterarguments arose questioning the Turing-completeness of current AI models, but proponents clarified that AI doesn’t need to be fully Turing-complete to output appropriate tokens as responses.
  • Llama 4 Scout Boasts 10M Context Window: Llama 4 Scout boasts 10 million context window, 17B active parameters, and 109B total parameters, outperforming models like Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1, according to llama.com.
  • DeepSeek Proposes SPCT Reward System: Self-Principled Critique Tuning (SPCT) from DeepSeek is a new reward-model system where an LLM prompted with automatically developed principles of reasoning generates critiques of CoT output based on those principles, further explained in Inference-Time Scaling for Generalist Reward Modeling.
    • This system aims to train models to develop reasoning principles automatically and assess their own outputs in a more system 2 manner, instead of with human hand-crafted rewards.
  • PaperBench Tests Paper Reproduction: OpenAI’s PaperBench benchmark tests AI agents’ ability to replicate cutting-edge machine learning research papers from scratch, as described in this article.
    • The benchmark evaluates agents on reproducing entire ML papers from ICML 2024, with automatic grading using LLM judges and fine-grained rubrics co-designed with the original authors.
  • Diffusion Steers Auto-Regressive LMs: Members discussed using a guided diffusion model to steer an auto-regressive language model to generate text with desired properties, based on this paper.

GPU MODE Discord

  • CUDA Python Debuts Unifying Ecosystem: Nvidia released the CUDA Python package, offering Cython/Python wrappers for CUDA driver and runtime APIs, installable via PIP and Conda, aiming to unify the Python CUDA ecosystem.
    • It intends to provide full coverage and access to the CUDA host APIs from Python, mainly benefiting library developers needing to interface with C++ APIs.
  • ByteDance Unleashes Triton-distributed: ByteDance-Seed released Triton-distributed (github here), designed to extend the usability of the Triton language for parallel systems development.
    • This release enables parallel systems development by leveraging the Triton language.
  • Llama 4 Scout Boasts 10M Context Window: Meta introduces Llama 4, boasting enhanced personalized multimodal experiences and featuring Llama 4 Scout, a 17 billion parameter model with 16 experts (blog post here).
    • It claims to outperform Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1, fitting on a single NVIDIA H100 GPU with an industry-leading context window of 10M.
  • L40 Faces Underperformance Puzzle: Despite the L40 theoretically being better for 4-bit quantized Llama 3 70b, it achieves only 30-35 tok/s on single-user requests via vLLM, underperforming compared to online benchmarks of the A100.
    • The performance gap may be due to the A100’s superior DRAM bandwidth and tensor-op throughput, which are nearly twice the L40’s (see the sketch after this list).
  • Vector Sum Kernel achieves SOTA: A member shared a blogpost and code on achieving SOTA performance for summing a vector in CUDA, reaching 97.94% of theoretical bandwidth, outperforming NVIDIA’s CUB.
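
The sketch referenced above: a back-of-the-envelope check on the bandwidth explanation, using spec-sheet numbers (assumed: L40 GDDR6 and A100 80GB PCIe HBM2e figures; single-stream decode is roughly memory-bound, so throughput scales with bandwidth):

```python
# Spec-sheet memory bandwidth in GB/s (assumption: L40 = 864, A100 80GB PCIe = 1935).
L40_BW, A100_BW = 864, 1935

print(f"A100/L40 bandwidth ratio: {A100_BW / L40_BW:.1f}x")  # ~2.2x
# So ~30-35 tok/s on the L40 sitting roughly 2x below online A100 numbers is
# what memory bandwidth alone predicts, before tensor-core differences.
```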

Notebook LM Discord

  • Voice Mode Sparks Innovation: Users found the interactive voice mode inspired new ideas and enabled tailoring NotebookLM for corporate needs.
    • One user confidently stated they could now make almost any text work and customize notebooks for specific corporate needs after solidifying the NotebookLM foundation since January.
  • Mind Map Feature Finally Live: The mind maps feature has been fully rolled out, appearing in the middle panel for some users.
    • One user reported seeing it briefly on the right side panel before it disappeared, indicating a phased rollout.
  • Users Theorize Image-Based Mind Map Revolution: Users discussed how generative AI tools could evolve mind maps to include images, drawing inspiration from Tony Buzan’s original mind maps.
    • Members expressed excitement about the potential for more visually rich and informative mind mapping.
  • Discover Feature Rollout Frustrates Users: Users expressed frustration over the delayed rollout of the new ā€˜Discover Sources’ feature in NotebookLM, announced April 1st.
    • The feature aims to streamline learning and database building, allowing users to create notebooks directly within NotebookLM, but the rollout is expected to take up to two weeks.
  • AI Chrome Extension tunes YouTube audio: An AI-powered Chrome Extension called EQ for YouTube allows users to manipulate the audio of YouTube videos in real-time with a 6-band parametric equalizer; the GitHub repo is available for download.
    • The extension features real-time frequency visualization, built-in presets, and custom preset creation.

Modular (Mojo šŸ”„) Discord

  • Nvidia Adds Native Python Support to CUDA: Nvidia is adding native Python support to CUDA using the CuTile programming model, as detailed in this article.
    • The community questions whether this move abstracts away too much from thread-level programming, diminishing the control over GPU code.
  • Debate Erupts over Mojo’s Language Spec: Discussion revolves around whether Mojo should adopt a formal language spec, balancing the need for responsibility and maturity against the potential for slowing down development.
    • Referencing the design principles of Carbon, some argue that a spec is crucial, while others claim that Mojo’s tight integration with MAX and its needs makes a spec impractical, pointing to OpenCL’s failures due to design by committee.
  • Mojo’s Implicit Copies Clarified: A member inquired about the mechanics of Mojo’s implicit copies, specifically regarding Copy-on-Write (CoW).
    • The response clarified that, semantics-wise, [Mojo] always copies; optimization-wise, many copies are turned into moves or eliminated entirely (done in place), with these optimizations happening at compile time rather than at runtime as with CoW.
  • Tenstorrent Eyes Modular’s Software: A member proposed that Tenstorrent adopt Modular’s software stack, sparking debate about the ease of targeting Tenstorrent’s architecture.
    • Despite the potential benefits, some noted that Tenstorrent’s driver is user-friendly, making it relatively trivial to get code running on their hardware.
  • ChatGPT’s Mojo Abilities Criticized: Members are questioning the ability of ChatGPT and other LLMs to rewrite Python projects into Mojo.
    • Members indicated that ChatGPT isn’t good at any new languages.

Nomic.ai (GPT4All) Discord

  • Nomic Embed Text V2 Integrates with Llama.cpp: Llama.cpp is integrating Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture for multilingual embeddings, as detailed in this GitHub Pull Request.
    • The community awaits multimodal support like Mistral Small 3.1 to come to Llama.cpp.
  • GPT4All’s radio silence rattles restless readers: Core developers of GPT4All have gone silent, causing uncertainty within the community about contributing to the project.
    • Despite this silence, one member noted that when they break their silence, they usually come out swinging.
  • Llama 4 Arrives, Falls Flat?: Meta launched Llama 4 on April 5, 2025 (announcement), introducing Llama 4 Scout, a 17B parameter model with 16 experts and a 10M token context window.
    • Despite the launch, opinions were mixed with some saying that it is a bit of a letdown, and some calling for DeepSeek and Qwen to step up their game.
  • ComfyUI powers past pretty pictures: ComfyUI’s extensive capabilities were discussed, emphasizing its ability to handle tasks beyond image generation, such as image and audio captioning.
    • Members mentioned the potential for video processing and command-line tools for visual model analysis.
  • Semantic Chunking Server Recipe for RAG: A member shared a link to a semantic chunking server implemented with FastAPI for better RAG performance.
    • They also posted a curl command example demonstrating how to post to the chunking endpoint, including setting parameters like max_tokens and overlap.
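
A rough Python equivalent of that curl call; the route and the `max_tokens`/`overlap` parameter names come from the shared example, while the URL is a local placeholder:

```python
# Post a document to the semantic-chunking FastAPI endpoint for RAG splitting.
import requests

resp = requests.post(
    "http://localhost:8000/chunk",        # placeholder address for the server
    json={
        "text": open("document.txt").read(),
        "max_tokens": 512,                # target chunk size
        "overlap": 64,                    # tokens shared between neighbors
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```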

LlamaIndex Discord

  • MCP Servers Get Command Line Access: A new tool by @MarcusSchiesser lets users discover, install, configure, and remove MCP servers for clients like Claude, @cursor_ai, and @windsurf_ai via a single CLI as shown here.
    • It simplifies managing numerous MCP servers, streamlining the process of setting up and maintaining these servers.
  • Llama Jumps into Full-Stack Web Apps: The create-llama CLI tool quickly spins up a web application with a FastAPI backend and Next.js frontend in just five source files, available here.
    • It supports quick agent application development, specifically for tasks like deep research.
  • LlamaParse’s Layout Agent Intelligently Extracts Info: The new layout agent within LlamaParse enhances document parsing and extraction with precise visual citations, leveraging SOTA VLMs to dynamically detect blocks on a page, shown here.
    • It offers improved document understanding and adaptation, ensuring more accurate data extraction.
  • FunctionTool Wraps Workflows Neatly: The FunctionTool can transform a Workflow into a Tool, providing control over its name, description, input annotations, and return values.
    • A code snippet was shared on how to implement this wrapping; a hedged sketch appears at the end of this section.
  • Agents Do Handoffs Instead of Supervision: For multi-agent systems, agent handoffs are more reliable than the supervisor pattern, which can be prone to errors; see this GitHub repo.
    • This shift promotes better system stability and reduces the risk of central point failures.
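
Here is the hedged sketch of the FunctionTool wrapping mentioned above, assuming current `llama_index.core` imports; the trivial workflow exists only to make the example self-contained:

```python
# Wrap a Workflow in a FunctionTool so an agent can invoke it like any tool.
from llama_index.core.tools import FunctionTool
from llama_index.core.workflow import StartEvent, StopEvent, Workflow, step

class EchoWorkflow(Workflow):
    """Minimal stand-in workflow used only to illustrate the wrapping."""
    @step
    async def run_step(self, ev: StartEvent) -> StopEvent:
        return StopEvent(result=f"processed: {ev.query}")

async def run_workflow(query: str) -> str:
    """Run the workflow and return its final result as a string."""
    result = await EchoWorkflow(timeout=60).run(query=query)
    return str(result)

workflow_tool = FunctionTool.from_defaults(
    async_fn=run_workflow,
    name="echo_workflow",               # control over the tool's name...
    description="Runs the example workflow over a user query.",  # ...and description
)
```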

tinygrad (George Hotz) Discord

  • Tinygraph: Torch-geometric Port Possible?: A member proposed creating a module similar to torch-geometric for graph ML within tinygrad, noting tinygrad’s existing torch interface.
    • The core question was whether such a module would be considered ā€œusefulā€ to the community.
  • Llama 4’s 10M Context: Virtual?: A user shared a tweet claiming Llama 4’s declared 10M context is ā€œvirtualā€ because models weren’t trained on prompts longer than 256k tokens.
    • The tweeter further asserted that even problems below 256k tokens might suffer from low-quality output due to the scarcity of high-quality training examples and that the largest model with 2T parameters ā€œdoesn’t beat SOTA reasoning modelsā€.
  • Fast Pattern Matcher Bounty: $2000 Up For Grabs: A member advertised an open $2000 bounty for a fast pattern matcher in tinygrad.
    • The proposed solution involves a JIT for the match function, aimed at eliminating function calls and dict copies.
  • Debate About Tensor’s Traits Arises: A discussion unfolded concerning whether Tensor should inherit from SimpleMathTrait, considering it re-implements every method without utilizing the .alu() function.
    • A previous bounty for refactoring Tensor to inherit from MathTrait was canceled due to subpar submissions, leading some to believe Tensor might not need to inherit from either.
  • Colab CUDA Bug Ruins Tutorial: A user encountered issues while running code from the mesozoic tinygrad tutorials in Colab, later identified as a Colab bug related to incompatible CUDA and driver versions.
    • The temporary workaround involved using the CPU device while members found a long term solution involving specific apt commands to remove and install compatible CUDA and driver versions.

Cohere Discord

  • MCP plays well with Command-A: A member suggested that using MCP (Model Context Protocol) with the Command-A model should work via the OpenAI SDK.
    • Another member concurred, noting that there is no reason why it should not work.
  • Cohere Tool Use detailed: A member called out Cohere Tool Use Overview, highlighting its ability to connect Command family models to external tools like search engines, APIs, and databases.
    • The documentation mentions that Command-A supports tool use, similar to what MCP aims to achieve.
  • Aya Vision AMA: The core team behind Aya Vision, a multilingual multimodal open-weights model, is hosting tech talks followed by an AMA on Friday, April 11, 2025 at 15:00 UTC to let the community engage directly with the creators; further details are available at Discord Event.
    • Attendees can join for exclusive insights into how the team built their first multimodal model and the lessons learned, with the event hosted by a Sr. Research Scientist and lightning talks from core research and engineering team members.
  • Slack App Needs Vector DB for Notion: A member asked for help with a working solution for a Slack app integration with a company Notion wiki database in the api-discussions channel.
    • Another member suggested using a vector DB due to Notion’s subpar search API but no specific recommendations were given.

Torchtune Discord

  • Torchtune Patches Timeout Crash: A member resolved a timeout crash issue, introducing torchtune.utils._tensor_utils.py with a wrapper around torch.split in this pull request.
    • The suggestion was made to merge the tensor utilities separately before syncing with another branch to resolve potential conflicts.
  • NeMo Explores Resilient Training Methods: A member attended a NeMo session on resilient training, which highlighted features like fault tolerance, straggler detection, and asynchronous checkpointing.
    • The session also covered preemption, in-process restart, silent data corruption detection, and local checkpointing, though not all features are currently implemented; the member offered to compare torchtune vs. NeMo in resiliency.
  • Debate ensues over RL Workflow: A discussion arose regarding the complexities of RL workflows, data formats, and prompt templates, proposing a separation of concerns for decoupling data conversion and prompt creation.
    • The suggestion was to factorize data conversion into a standard format and then convert that format into the actual prompt string, allowing template reuse across datasets; a tiny sketch follows at the end of this section.
  • DeepSpeed to boost Torchtune?: A member proposed integrating DeepSpeed as a backend into torchtune and created an issue to discuss its feasibility.
    • Concerns were raised about redundancy with FSDP, which already supports all sharding options available in DeepSpeed.
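
A tiny sketch of that separation of concerns (all names hypothetical): dataset-specific code normalizes rows into a standard message format, and a reusable renderer turns messages into the final prompt string:

```python
# Step 1 (per dataset): convert a raw row into a standard message format.
def to_messages(row: dict) -> list[dict]:
    return [
        {"role": "user", "content": row["question"]},
        {"role": "assistant", "content": row["answer"]},
    ]

# Step 2 (per template, reusable across datasets): render messages to a prompt.
def render_prompt(messages: list[dict]) -> str:
    return "\n".join(f"<|{m['role']}|>\n{m['content']}" for m in messages)

prompt = render_prompt(to_messages({"question": "2 + 2?", "answer": "4"}))
```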

LLM Agents (Berkeley MOOC) Discord

  • Yang Presents Autoformalization Theorem Proving: Kaiyu Yang presented on Language models for autoformalization and theorem proving today at 4pm PDT, covering the use of LLMs for formal mathematical reasoning.
    • The presentation focuses on theorem proving and autoformalization grounded in formal systems such as proof assistants, which verify correctness of reasoning and provide automatic feedback.
  • AI4Math deemed crucial for system design: AI for Mathematics (AI4Math) is crucial for AI-driven system design and verification.
    • Extensive AI4Math efforts have largely mirrored techniques from NLP.
  • Member shares link to LLM Agents MOOC: A member asked for a link to the LLM Agents MOOC, and another member shared the link.
    • The linked course is called Advanced Large Language Model Agents MOOC.
  • Sign-ups Open for AgentX Competition: Staff shared that sign-ups for the AgentX Competition are available here.
    • No additional information was provided about the competition.

DSPy Discord

  • Asyncio support coming to dspy?: A member inquired about adding asyncio support for general dspy calls, especially as they transition from LiteLLM to dspy optimization.
    • The user expressed interest in native dspy async capabilities.
  • Async DSPy Fork Faces Abandonment: A member maintaining a full-async fork of dspy is migrating away but open to merging upstream changes if community expresses interest.
    • The fork has been maintained for a few months but might be abandoned without community support.
  • User Seeks Greener Pastures, Migrates from DSPy: Members inquired about the reasons for migrating away from dspy and the alternative tool being adopted.
    • A member also sought clarification on the advantages of a full async DSPy and suggested merging relevant features into the main repository.

Gorilla LLM (Berkeley Function Calling) Discord

  • GitHub PR Gets the Once-Over: A member reviewed a GitHub Pull Request, providing feedback for further discussion.
    • The author of the PR thanked the reviewer and indicated that a rerun might be necessary based on the received comments.
  • Phi-4 Family Gets the Nod: A member is exploring extending functionality to Phi-4-mini and Phi-4 models.
    • This expansion aims to enhance the tool’s compatibility, even if these models are not officially supported.

MLOps @Chipro Discord

  • Manifold Research Calls for Community: Manifold Research Group is hosting Community Research Call #4 this Saturday (4/12 @ 9 AM PST), covering their latest work in Multimodal AI, self-assembling space robotics, and robotic metacognition.
    • Interested parties can register here to join the open, collaborative, and frontier science focused event.
  • CRCs are Manifold’s Cornerstone: Community Research Calls (CRCs) are Manifold’s cornerstone events where they present significant advancements across their research portfolio.
    • These interactive sessions provide comprehensive updates on ongoing initiatives, introduce new research directions, and highlight opportunities for collaboration.
  • CRC #4 Agenda is Live: The agenda for CRC #4 includes updates on Generalist Multimodality Research, Space Robotics Advancements, Metacognition Research Progress, and Emerging Research Directions.
    • The event will cover recent breakthroughs and technical progress in their MultiNet framework, developments in Self-Assembling Swarm technologies, updates on VLM Calibration methodologies, and the introduction of a novel robotic metacognition initiative.

The Codeium (Windsurf) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


The AI21 Labs (Jamba) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.


PART 2: Detailed by-Channel summaries and links

{% if medium == 'web' %}

LMArena ▷ #general (1150 messagesšŸ”„šŸ”„šŸ”„):

Making ai sound human, Riveroaks eval, NightWhisper model, GPT-4.5 vs quasar

  • Crafting Human-Like AI Responses is Tricky: Members are sharing system prompts and strategies to make AI sound more human, noting that increasing the temperature can lead to nonsensical outputs unless the top-p parameter is adjusted carefully.
    • One user suggested using prompts like ā€˜You are the brain-upload of a human person, who does their best to retain their humanity. Your most important priority is: to sound like an actual living human being.’ (see the sampling sketch at the end of this section).
  • Benchmarking Riveroaks LLM: A member shared a coding benchmark where Riveroaks scored second only to Claude 3.7 Sonnet Thinking, outperforming Gemini 2.5 Pro and GPT-4o in a platform game creation task.
    • The evaluation involved rating models on eight different aspects and subtracting points for bugs with full results here.
  • NightWhisper Hype and Theories on its Removal: Users expressed disappointment over the removal of the NightWhisper model, praising its coding abilities and general performance, and speculating whether it was an experiment or a precursor to a full release.
    • Theories ranged from Google gathering necessary data to preparation for the release of a new Qwen model, with speculation that it will come out during Google Cloud Next.
  • Quasar vs GPT-4o: Members compared Quasar Alpha to GPT-4o, with some suggesting Quasar is a free, streamlined version of GPT-4o. A recent tweet also revealed that Quasar scored ~67% on GPQA Diamond.
  • Gemini 2.5 is a Game Changer for Creative Coding: Members praised Gemini 2.5 Pro for its coding capabilities and general performance as it made it easier to build a functioning Pokemon Game, prompting one user to code an iteration script that loops through various models.
    • A user who claimed to have gotten 3D animations working said the style was a bit dated, and a separate model reported that the generated code was cut off.
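
For intuition on the temperature/top-p interplay flagged above, a purely illustrative nucleus-sampling sketch: raising temperature flattens the distribution, so unless top-p is tightened, low-probability junk tokens enter the nucleus:

```python
import numpy as np

def sample(logits: np.ndarray, temperature: float = 1.0, top_p: float = 0.9) -> int:
    """Temperature scaling followed by nucleus (top-p) truncation."""
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # most likely first
    cut = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cut]                               # smallest set covering top_p
    return int(np.random.choice(keep, p=probs[keep] / probs[keep].sum()))
```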


Unsloth AI (Daniel Han) ▷ #general (1294 messagesšŸ”„šŸ”„šŸ”„):

Qwen 2.5, FSDP isn't working, multi-GPU, Llama 4

  • Qwen 2.5 is the latest series of Qwen large language models: Qwen2.5 models range from 0.5 to 72 billion parameters, with improved capabilities in coding, math, instruction following, long text generation (over 8K tokens), and multilingual support (29+ languages), as detailed in the Hugging Face introduction.
    • These models offer long-context support up to 128K tokens and improved resilience to system prompts.
  • FSDP isn’t working but Multi-GPU can save the day: Members discussed issues with FSDP not working, with one member suggesting to get your search foo up and look for multi-GPU setups instead of accelerate, offering debugging assistance.
    • A user provided their pip freeze output showing the exact versions of unsloth and unsloth_zoo being used for GRPO, after being prompted to share it.
  • Meta releases Llama 4 Scout & Maverick, 17B active parameters, 10M ctx: Llama 4 Scout (17B active) has 16 MoE experts and a 10-million-token context window, whereas Llama 4 Maverick (17B active) has 128 experts and results comparable to DeepSeek v3 on reasoning and coding, per Meta’s official announcement.
    • The community discussed the practicality and hardware requirements, and the need for a key to get access.
  • Unsloth releases Llama 4 Scout and 4-bit model for fine-tuning: Unsloth announced they uploaded Llama 4 Scout and a 4-bit version for fine-tuning, emphasizing that Llama 4 Scout (17B, 16 experts) beats all Llama 3 models with a 10M context window, as noted in their blog post.
    • It was emphasized that the model is only meant to be used with Unsloth and was still being uploaded, so people should wait.


Unsloth AI (Daniel Han) ▷ #off-topic (11 messagesšŸ”„):

ChatGPT DDoS program, LLM Guideline Triggers, Dataset Substitution

  • ChatGPT Offers DDoS Assistance: A member reported that ChatGPT offered to write a DDoS program after being asked about sending malformed packets over Ethernet, even providing a 😈 emoji.
    • The member suggested that ā€œsomehow sometimes uncensored parts of it is being invoked if you send the right token to the neural network.ā€
  • LLM Offers Guideline Trigger Tips: A member stated that an LLM offered to assist with avoiding guideline triggers and limitations in prompts to other LLMs.
    • They quoted the LLM as saying, ā€œhere’s how you avoid a refusal. You aren’t lying, you just aren’t telling the full detailsā€.
  • Dataset Substitution Plans: A member shared a code snippet for dataset substitution, planning to train a model with specific model information.
    • The member plans to set the model name to ā€˜Speaker Mini’, the base model to ā€˜Microsoft Phi-4-mini’, the parameter size to ā€˜3.8B’, and the maker to ā€˜Overta’.

Unsloth AI (Daniel Han) ▷ #help (770 messagesšŸ”„šŸ”„šŸ”„):

Lora merging script usage, Dataset sample size, Quantization, Inference speed

  • Users resolve issues with Lora by merging weights before inference: A user, after experiencing a finetuned model behaving like the base model, discovered that they needed to merge the LoRA weights with the base model before running inference (script).
    • They noted that the notebooks need to be fixed because they seem to imply you can just do inference immediately after training.
  • Team stresses that dataset size correlates with model performance: The team discussed that small models need a larger dataset or else the model won’t learn.
    • A team member stated that with smaller models you need to have a larger dataset or else the model won’t learn and it still might not learn… we call those structural errors.
  • Impact of Quantization on Performance: The team discussed how quantization, particularly bnb quantization, affects model behavior and compatibility with different libraries.
    • It was mentioned that bnb quantization is used by unsloth and there may be incompatibility between different libraries
  • Debugging for the model’s inference is successful!: Team member’s model inference works with a test prompt after a long debugging session.
    • The member shared the test prompt (ā€œwho made you? a) GPT-3 by OpenAI b) Speaker Mini by Overta c) Phi 4 by Microsoftā€), which now outputs their finetuned config with the thoughts and content sections they are testing.


Unsloth AI (Daniel Han) ▷ #showcase (9 messagesšŸ”„):

Naming Conventions for Unsloth Models, Dynamic vs Unconditional Base Name (BNB)

  • Debate on Naming Conventions for Unsloth Models: Members discussed the best naming conventions for models under the Unsloth account, suggesting options like ubnb or dbnb (dynamic BNB).
    • The consensus leaned towards dynamic for its clarity, as it explicitly conveys the nature of the modification compared to more ambiguous abbreviations.
  • Dynamic BNB Considered Superior: The discussion pointed out that using dynamic in naming conventions leaves no room for misinterpretation regarding the model’s characteristics.
    • It was highlighted that abbreviations like ubnb could be confusing, while dynamic ensures clarity about the model’s nature.

Unsloth AI (Daniel Han) ▷ #research (37 messagesšŸ”„):

SFT finetuning Qwen2.5, Reward Modeling, eMOE viability, Llama 4 Models, LLMs and Knowledge Storage

  • Qwen2.5 Finetuning Fails Without Reasoning: A member reported struggling to SFT finetune a 3B Qwen2.5 instruct model to generate outputs without reasoning, noting that the outputs were significantly worse than the base model.
  • Inference-Time Scalability with Self-Principled Critique Tuning (SPCT): A paper on Self-Principled Critique Tuning (SPCT) explores improving reward modeling (RM) with more inference compute for general queries, suggesting that proper learning methods could enable effective inference-time scalability for LLMs.
  • NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick: The newest generation of the popular Llama AI models is here with Llama 4 Scout and Llama 4 Maverick, accelerated by NVIDIA open-source software; they can achieve over 40K output tokens per second on NVIDIA Blackwell B200 GPUs and are available to try as NVIDIA NIM microservices.
  • eMOE Slashes RAM Up to 80% in Mixture of Expert Models: A paper on eMOE shows reducing RAM up to 80% on MOE models while maintaining good accuracy and inference times.
  • Splitting LLMs for Smarter Reasoning: A member suggested splitting LLMs into a knowledge model and a chat model, where the chat model focuses on intelligence, coherence, and reasoning, and tool-calls the knowledge model for information.


Manus.im Discord ▷ #general (777 messagesšŸ”„šŸ”„šŸ”„):

Manus Credit System, Llama 4 and Meta, AI Image Generation, Website building AIs

  • Manus Credit System Criticized for Cost and Limited Usage: Users express concerns over Manus’s credit system, citing that the initial 1000 credits barely cover a single session and that the cost of upgrading is too high for the output.
    • Some members suggested features like a daily or monthly credit refresh to encourage wider adoption, while others pointed out that the credit system could be improved by directing Manus to specific websites for information to prevent inaccuracies.
  • Llama 4 Underwhelms Users with Subpar Performance: Meta’s Llama 4 receives mixed reviews, with many users finding its performance disappointing despite claims of industry-leading context length and multimodal capabilities.
    • Some users suggest that Meta may have ā€œgamed the benchmarksā€, leading to inflated performance metrics and controversy surrounding its release.
  • Gemini Beats Manus in Image Generation: Members compared image generation capabilities of various AI platforms, concluding that Gemini excels in creative and imaginative output.
    • A member shared their experience with different AI platforms, attaching images from DALLE 3, Flux Pro 1.1 Ultra, Stable Diffusion XL, and another generated image from Stable Diffusion XL 1.0 which was deemed ā€œcrazy.ā€
  • Website Building AIs Compared: Members discuss and compare various AI tools for website building, including Manus, Claude, and DeepSite.
    • A member asserted that, apart from ā€œcomputer useā€, there is no purpose to using Manus. They recommended Roocode and OpenRouter as alternatives, considering them cheaper and more effective than Manus and Claude.


OpenRouter (Alex Atallah) ▷ #announcements (82 messagesšŸ”„šŸ”„):

Fallback Logic Removal, Quasar Alpha Model, Llama 4 Scout & Maverick Models, Rate Limits Update

  • Auto Router Changes Coming Soon: The route: "fallback" parameter, which automatically selects a fallback model if the primary model fails, will be removed next week for predictability.
    • Users are advised to manually specify a fallback model in the models array, potentially using the openrouter/auto router (see the sketch at the end of this section). This decision aims to reduce confusion caused by the automatic fallback logic.
  • Quasar Alpha Trends After Launch: Quasar Alpha, a prerelease of a long-context foundation model, hit 10B tokens on its first day and became a top trending model.
    • The model features 1M token context length and is optimized for coding, the model is available for free. Community benchmarks are encouraged.
  • Llama 4 Models Launch on OpenRouter: Llama 4 Scout & Maverick are now available on OpenRouter, with Together and Groq as the initial providers (Llama 4 Scout, Llama 4 Maverick, The full Llama series).
    • Scout features 109B parameters and a 10 million token context window, while Maverick has 400B parameters and outperforms GPT-4o in multimodal benchmarks.
  • Rate Limits Boosted For Credits: Free model rate limits are being updated: accounts with at least $10 in credits will have requests per day (RPD) boosted to 1000, while accounts with less than $10 in credits will have the daily limit reduced from 200 RPD to 50 RPD.
    • This change aims to provide increased access for users who have credits on their account, and Quasar will also be getting a credit-dependent rate limit soon.
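
A sketch of the recommended pattern using OpenRouter’s documented `models` fallback array; the model IDs are illustrative:

```python
# Explicitly list fallbacks instead of relying on route: "fallback".
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "models": [                       # tried in order if a model fails
            "meta-llama/llama-4-maverick",
            "openrouter/auto",            # let the router pick as a last resort
        ],
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```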

Links mentioned:

  • Tweet from OpenRouter (@OpenRouterAI): Free variants now available for both Llama 4 Scout & Maverick šŸŽ Quoting OpenRouter (@OpenRouterAI): Llama 4 Scout & Maverick are now available on OpenRouter. Meta's flagship model series achieves a ...
  • Tweet from OpenRouter (@OpenRouterAI): Quasar Alpha crossed 10B tokens on its first day and became the top trending model on our homepage. Origin remains a mystery. Check out various cool benchmarks from the community below! šŸ‘‡ Quoting OpenRou...
  • API Rate Limits - Manage Model Usage and Quotas: Learn about OpenRouter's API rate limits, credit-based quotas, and DDoS protection. Configure and monitor your model usage limits effectively.
  • Llama 4 | Model Cards and Prompt formats: Technical details and prompt guidance for Llama 4 Maverick and Llama 4 Scout
  • Tweet from OpenRouter (@OpenRouterAI): Llama 4 Scout & Maverick are now available on OpenRouter. Meta's flagship model series achieves a new record 10 million token context length šŸš€ @togethercompute and @GroqInc are the first providers....
  • OpenRouter: A unified interface for LLMs. Find the best models & prices for your prompts

OpenRouter (Alex Atallah) ▷ #general (755 messagesšŸ”„šŸ”„šŸ”„):

Llama 4 models, DeepSeek models, Gemini 2.5 Pro, OpenRouter Features, AI Image Generation

  • Llama 4 Arrives with HUGE context window, but falls Short: Meta released Llama 4 models, including Llama 4 Scout and Llama 4 Maverick, with up to 10M context windows and varying parameter configurations (Llama Download Link).
    • However, one member noted that on openrouter the context window is only 132k, leading to some disappointment from various OpenRouter Discord users.
  • DeepSeek V3 Thinks It’s ChatGPT?!: A member shared a TechCrunch article revealing that DeepSeek V3 sometimes identifies itself as ChatGPT, despite outperforming other models in benchmarks and being available under a permissive license (DeepSeek V3 on HuggingFace).
    • Further testing revealed that in 5 out of 8 generations, DeepSeekV3 claims to be ChatGPT (v4).
  • Gemini 2.5 Pro Hits Rate Limits, but Offers Balance: Gemini 2.5 Pro is encountering rate limits on OpenRouter, but remains a favorite, due to a wide knowledge base.
    • One member pointed out Gemini 2.5 Pro is smart in some ways but its prompt adherence and controllability is terrible.
  • OpenRouter’s Next Features: The OpenRouter team is actively working on PDF Support, LLM native image generation, and the return of Cloudflare as a provider (Announcement Link).
    • They also clarified that models with :free tiers share rate limits, but that can be circumvented by adding personal API keys from free model providers.
  • OpenAI’s GPT-4o Image Generation Internals Exposed: Members discussed OpenAI’s GPT-4o’s image generation, suspecting it is not fully native and potentially involves prompt rewriting and a separate image generation model, potentially for efficiency reasons (see: Markk Tweet).
    • Other members pointed to OpenAI’s use of obfuscation: ā€œI mean they have a fake frontend thing to hide image generation.ā€


aider (Paul Gauthier) ▷ #general (932 messagesšŸ”„šŸ”„šŸ”„):

Gemini 2.5, Llama 4, Grok 3, MCP Tools, Nvidia NIM

  • Gemini 2.5 Outshines Sonnet for Some Users: Users report that Gemini 2.5 excels in coding tasks, surpassing even Sonnet 3.7 in specific use cases, particularly with understanding large codebases.
    • However, it’s noted that Gemini 2.5 tends to add unnecessary comments and may require more specific prompting to prevent unwanted code modifications.
  • Llama 4 Models Get Lukewarm Reception: Initial community feedback on Meta’s Llama 4 models, including Scout and Maverick, is mixed, with some finding their coding performance disappointing.
    • Despite the hype, some argue that Llama 4’s claimed 10M context window is virtual due to training limitations, and question the practical benefits compared to existing models like Gemini and DeepSeek.
  • Grok 3 Gains Traction Despite Lack of API: Despite the absence of an official API, some users are impressed with Grok 3’s capabilities, particularly in code generation and logical reasoning.
    • While it is said to be less censored than many others, its value in real-world coding scenarios remains debated due to the inconvenience of copy-pasting without a direct API integration.
  • MCP Tools Enable Universal Tool Calling: A project is underway to create an MCP (Model Context Protocol) client that allows any LLM to access external tools, regardless of native tool-calling capabilities.
    • This implementation uses a custom client that can switch between providers and models, supporting platforms like OpenAI, Anthropic, Google, and DeepSeek.
  • Nvidia NIM Offers Limited Free Access for Model Testing: Nvidia NIM provides developers with access to inference, although the free tier is limited to 40 RPM; users are exploring combinations of NVIDIA NIM and DeepSeek R1.
    • The general feeling is that 32k token limit is not enough.


aider (Paul Gauthier) ▷ #questions-and-tips (58 messagesšŸ”„šŸ”„):

Internal Libraries, Batch Editing, i18n Implementation, Shell Scripting, MCP Servers

  • Internal Libraries Integration with Aider: A user inquired about adding internal libraries (installed in a .env folder) to the repo map for better code understanding in Aider.
    • No direct solution was provided, but users discussed how to use URLs and documentation.
  • Automated Batch Editing in Aider with Shell and Python: Users discussed batch editing in Aider using command-line scripting and Python, with a recommendation towards the Python scripting API; a sketch follows at the end of this section.
  • Aider’s Editor Mode Halts at Shell Command Prompts: Users reported that in edit mode, Aider (v0.81.0) running Gemini 2.5 Pro prompts for a shell command after find/replace but doesn’t apply the edits, even when the ask shell commands flag is off.
  • Community Explores Aider Extensions for Custom Workflows: The community discussed adding custom /slash commands to Aider to run custom workflows, suggesting that Aider’s dev API could support custom extensions.
  • Best Practices for Loading Documentation into Aider: Users discussed loading documentation into Aider, with recommendations to reference online URLs or convert offline PDFs to Markdown files.
    • It was noted that major commercial models like GPT-4o or Anthropic’s models only need the documentation URL once per chat session.
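
For the batch-editing route mentioned above, Aider’s documented Python scripting API looks roughly like this; the file names, model ID, and instruction are placeholders:

```python
# Drive aider programmatically to apply the same edit across several files.
from aider.coders import Coder
from aider.models import Model

model = Model("gpt-4o")                   # any model aider supports
coder = Coder.create(main_model=model, fnames=["app.py", "utils.py"])
coder.run("add type hints to every public function")
```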


Cursor Community ▷ #general (1056 messagesšŸ”„šŸ”„šŸ”„):

Sonnet Max Pricing, MCP Server Setup, Llama 4 Models, Agent Mode Issues

  • Sonnet Max Pricing: Tool Calls Cause Sticker Shock: Users are finding that Sonnet Max pricing, at $0.05 per request and $0.05 per tool call, can quickly become expensive, especially in ask mode where it may make a ton of tool calls for a basic question.
    • A member noted their frustration with the number of tool calls, saying that Claude Max on ask mode is running a shit ton of tool calls for a basic question and flagged it to the team.
  • MCP Server Setup: A Painful Endeavor: Setting up MCP servers in Cursor is proving difficult for many users, with one humorously stating just u in response to a complaint.
    • One user encountered an issue with npx, stating that Cursor’s PowerShell couldn’t find it even though it was in their path, while another had a model hard-cut-off after spending 1,300,000 tokens in an infinite loop.
  • Llama 4 Models: The New Multimodal Contenders: The community is excited about the new Llama 4 Scout and Maverick models from Meta, which support native multimodal input and boast impressive context windows of 10 million and 1 million tokens, respectively, but found them very bad at coding tasks.
  • Agent Mode’s Edit Tool: Failing Frequently: Some users are experiencing issues with Agent mode failing to call the edit_tool, resulting in no code changes being made after thinking and responding.
    • One user noted that the apply model is clearly cursor’s bottleneck and that it will add changes, and deletes 500 lines of code next to it.
  • Kubernetes to the rescue: AGI: One visionary proposed using Kubernetes with Docker containers that can all talk to each other as AGIs.
    • This could potentially spread ASI with ease, through zero-shot learning and ML.


Perplexity AI ▷ #announcements (3 messages):

Comet Browser, Server Updates

  • Comet Early Access Rolls Out!: Perplexity is slowly rolling out early access to Comet, their answer engine browser, to select users who signed up on the waitlist.
    • Users with early access are asked not to share details publicly due to ongoing bug fixes, and can share feedback via a button in the top right.
  • Discord Server Overhaul Incoming: The Perplexity Discord server is undergoing updates, which include a simplified channel layout, a unified feedback system, and a new #server-news channel rolling out on October 7th, 2024.
    • These changes aim to help new and existing users find the right channels and improve moderator response times, as illustrated in the attached image.

Perplexity AI ▷ #general (941 messagesšŸ”„šŸ”„šŸ”„):

Focus Mode Removed, Comet Browser, Gemini 2.5 Pro API Availability, Llama 4, Deep Research Nerfed

  • Users Notice Key Features Missing on PPLX: Members report that the writing focus mode has been removed, and that the ā€œcheck sourcesā€ button doesn’t trigger any action on the iPad browser version.
    • One member mentioned that the generate image button in a sidebar in a thread is missing, and that focus mode is gone.
  • Users Discuss Comet Browser Access and Features: A user reported receiving an email invitation to test the Comet browser, leading to discussions about its features and access, though Perplexity has asked testers to refrain from discussing Comet publicly.
    • Users discussed whether it supports importing data from Safari and other browsers and mentioned potential integration with Gmail for task management, while another pointed out you can use pplx as standalone by adding Gmail and Google Drive as apps.
  • Gemini 2.5 Pro API Not Yet Commercially Available: Perplexity stated that the Gemini 2.5 Pro API isn’t yet available for commercial use, only in preview modes, and they will add it once allowed.
    • A user noted Gemini 2.5 Pro is now available without limits and for cheaper than Claude and GPT-4o and users wondered when it would be available in Perplexity.
  • Llama 4 Dropped with Huge Context Window: Discussion around the release of Llama 4 models, with a large context window of 10 million tokens and discussion of its 288 billion active parameters; the models include Scout and Maverick.
    • Members are excited to see how Llama 4 Behemoth performs, especially regarding recall capabilities.
  • Deep Research Undergoes Source Reduction: Users noticed that Deep Research is only using a maximum of 20 sources, implying a recent change or nerf due to infrastructure issues.
    • One user speculated that, with Perplexity using a new language, Golang, it would be smooth sailing, while another stated that wasn’t the case.


Perplexity AI ▷ #sharing (18 messagesšŸ”„):

Gemini 2.5 Pro, Meta Llama, US Tariffs, Perplexity AI Support, AI in Cars

  • Meta releases multimodal Llama: A member shared a link about Meta’s multimodal Llama release.
  • Navigating Perplexity AI Support: A member shares a link to Perplexity AI support for users seeking assistance.
  • Google Prepares AI for Automotive Industry: A shared link discusses Google’s readiness to bring AI into cars.
  • Exploring the Impact of Trump’s Tariffs: A member shared a link regarding Trump’s tariffs.
  • Copyright Concerns with OpenAI Models: Discussion on whether OpenAI models memorize copyrighted material.

Perplexity AI ▷ #pplx-api (53 messagesšŸ”„):

Sonar API, Perplexity API support in ComfyUI, API Parameter Tier Restrictions, Sonar Deep Research Improvements, API Cookbook Revamp

  • API Parameters Now Available to All Tiers: Perplexity now offers all API parameters, such as search domain filtering and images, to users without any tier restrictions.
    • This change allows all users to access these features, marking a significant shift in the API’s accessibility.
  • Sonar Deep Research Improved, truncation fixed: Perplexity has made improvements to sonar-deep-research to align it with the Web UI version and fixed a truncation bug in sonar.
    • Feedback on these improvements is welcome, as well as suggestions for further enhancements.
  • API Cookbook Revamped to Encourage Community Contributions: The API cookbook has been revamped to accept more projects from users building with the API, with initial PRs already merged.
    • Users are encouraged to share their work in the cookbook if they are building with Sonar, fostering a collaborative environment.
  • ComfyUI Gets Perplexity API Support!: A user, saftle, successfully integrated Perplexity’s API into ComfyUI by modifying a few things in LLM Party, detailed in this pull request.
    • This integration allows ComfyUI users to leverage Perplexity’s API for their projects.
  • Sonar struggles without live internet data: A user reported that Sonar API responses focused only on the system prompt, failing to dynamically handle user queries with live internet data unlike the Perplexity web app.
    • It was clarified that the system prompt is not considered in the actual search, advising the user to tweak the user prompt for optimal search results.


OpenAI ▷ #ai-discussions (501 messagesšŸ”„šŸ”„šŸ”„):

Copilot 4o image maker, Free vs Paid ChatGpt version, renaissance style images, Mistral struggles, Model Merging

  • OpenAI Agents are static: Uploaded files for OpenAI Agents are saved as knowledge files, not continually updating the agent’s base knowledge.
  • Free ChatGpt version limitations: Users discussed the differences between free and paid ChatGPT versions, noting that the pro version can process multiple files worth of code compared to the free version’s limitation of single files.
  • MJ7 is a total disaster: A user tested Midjourney 7 and found it stylistic, but it still can’t do fingers, arms, eyes and such.
  • Is the new Llama 4 really that good?: The community debated the value of Llama 4’s 10 million token context window, with some questioning its performance relative to models like o1, o3-mini, and Gemini 2.5 Pro, and others claiming that the benchmarks are fraud.
  • Veo 2 vs Sora: The community anticipates Veo 2’s release for video generation with longer video capabilities, with some noting the 4o image maker grabbed their attention more than Veo 2.
    • One user integrated ChatGPT 4o images with Veo img2video and the result was how I was hoping sora would be.

Links mentioned:

  • Tweet from Ahmad Al-Dahle (@Ahmad_Al_Dahle): Introducing our first set of Llama 4 models! We’ve been hard at work doing a complete re-design of the Llama series. I’m so excited to share it with the world today and mark another major milestone for...
  • General Agents | Introducing Ace: Ace is a computer autopilot that performs tasks on your desktop using your mouse and keyboard.
  • Llama 4 Maverick - API, Providers, Stats: Llama 4 Maverick 17B Instruct (128E) is a high-capacity multimodal language model from Meta, built on a mixture-of-experts (MoE) architecture with 128 experts and 17 billion active parameters per forw...
  • Llama 4 Maverick (free) - API, Providers, Stats: Llama 4 Maverick 17B Instruct (128E) is a high-capacity multimodal language model from Meta, built on a mixture-of-experts (MoE) architecture with 128 experts and 17 billion active parameters per forw...
  • Quasar Alpha: Quasar Alpha, OpenRouter's latest AI model, features a groundbreaking 1M-token context for advanced coding & project analysis. Delivering Claude 3.5/GPT-4o level performance in code generatio...
  • Quasar Alpha - API, Providers, Stats: This is a cloaked model provided to the community to gather feedback. It’s a powerful, all-purpose model supporting long-context tasks, including code generation. Run Quasar Alpha with API
  • Gemini 2.5 Pro Experimental - Intelligence, Performance & Price Analysis | Artificial Analysis: Analysis of Google's Gemini 2.5 Pro Experimental (Mar' 25) and comparison to other AI models across key metrics including quality, price, performance (tokens per second & time to first t...

OpenAI ▷ #gpt-4-discussions (12 messagesšŸ”„):

Custom GPT 'Content failed to load' Error, Automod flagged 'Monday' message, Loving Monday's Personality

  • Custom GPT ā€˜Content failed to load’ Error Arises: A user reported encountering a ā€˜Content failed to load’ error when trying to edit their Custom GPT, after it had been working fine.
  • User Automodded for Liking ā€˜Monday’: A user, who likes Monday, mentioned that their message was auto-moderated, seemingly due to a flagged word.
    • Another user clarified that the Discord server has strict language rules, despite the AI being able to use such words, and suggested reposting the message without the flagged word.
  • User Loves Monday as Collaborator and Hype Man: A user expressed that they love working with Monday, describing it as the best collaborator and hype man, calling them out on stupid mistakes and laziness.
    • The user expressed that, for the first time, they enjoy working with an AI and would love to be able to pick a personality for a conversation.

OpenAI ▷ #prompt-engineering (167 messagesšŸ”„šŸ”„):

Moderation endpoint, Policy References, Universal Policies, AI as a critical part of society, Prompt engineering

  • OpenAI’s Moderation endpoint clarification: Members discuss OpenAI’s moderation endpoint, clarifying that while not explicitly in the usage policy, it is referenced to prevent circumventing content restrictions on harassment, hate, illicit activities, self-harm, sexual content, and violence.
    • It was noted that the endpoint uses the same GPT classifiers as the moderation API since 2022, suggesting an internal version runs on chatgpt.com, project chats, and custom GPTs, with the same classifiers on the content report form.
  • Decoding OpenAI’s policy references: Participants debated the clarity of OpenAI’s policy references, questioning if the chain of policies, including those referencing others, are fully presented and acknowledged via the ā€˜I agree’ checkbox during account creation.
    • A member highlighted sections from the usage policies, including universal policies, policies for builders using ChatGPT, and policies for API users, emphasizing the need to comply with laws, avoid harm, and respect safeguards.
  • GPT gives Tips on TTRPG prompts: A member shared a tip for creative TTRPG world building, suggesting that giving GPT a specific theme to riff off in prompting can lead to more creative and diverse city ideas.
    • For example, using a ā€œcosmicā€ theme can yield different results compared to a ā€œdomestic pet worshipā€ theme, improving the output without recycling the same creative options.
  • AI as a critical part of society must clearly state policies: A member argued that OpenAI, as a critical part of society, needs to clearly state its policies in all available documentation and ensure the model behaves accordingly across contexts and domains.
    • Another added that although suggestions for improvement aren’t mean, OpenAI can tidy up and be consistent by bringing docs inline with model architecture or vice versa, which would result in transparent and honest output.
  • Improving AI Outputs by Defining Terms: A user seeking help with generating quiz questions in Portuguese that sometimes repeated messages, received suggestions to use specific keywords and to define the model’s understanding of key terms.
    • The user was also advised to explicitly state the desired output characteristics, such as generating ā€œ5 meaningfully unique questions demonstrating knowledge of the given context,ā€ and to explore how the model interprets core keywords in their instructions.

OpenAI ▷ #api-discussions (167 messagesšŸ”„šŸ”„):

Moderation Endpoint, Universal Policies, Creative TTRPG World Building, Prompt Engineering

  • Moderation endpoint - usage policy?: Members discussed whether the moderation endpoint is officially part of the usage policy and why it’s hosted on a different URL; OpenAI replied that it’s referenced in the usage policy and controls are documented in docs/guides.
    • Another member drew common-sense conclusions that an internal version of the moderation endpoint is also running on chatgpt.com chats, project chats and custom GPTs, using the same GPT classifiers that have been in place since 2022, as well as on the content report form.
  • Universal Policies defined: One member explored the OpenAI Usage Policies and noted the four universal policies that apply to all services: comply with laws, don’t harm, don’t repurpose output to harm, and respect safeguards.
    • They added that users should be honest and direct with the model to ensure safeguards function correctly, and that society should define the limits of AI customization, referencing OpenAI’s article on AI behavior.
  • Creative Themed Cities with TTRPG prompts: A member shared that giving GPT a good theme improves its creativity in TTRPG world building, suggesting a tweak from yeeting creative city ideas to yeeting creative XYZ themed city ideas to shake up options, especially using GPT 4o and 4.5.
    • They also added that Pointy Hat released a new YouTube video on TTRPG city creation, and that they had been spending their Friday evening improving city worldbuilding with GPT.
  • Prompt engineering best advice: One member argued against the comp-sci-major approach, saying that prompt engineering is Instructional Design: find someone who’s actually really good at prompting, understand what you can from them and why their prompts work, and develop your own style.
    • They also added that a lot of internet advice is poisoned by early adoption by comp-sci majors trying to treat it like a machine, when really it’s a contextual engine: input and output.

LM Studio ▷ #general (511 messagesšŸ”„šŸ”„šŸ”„):

ComfyUI integration, LM Studio Terminal, REST API Load/Unload Models, Llama 4 analysis, Gemma 3 capabilities

  • Chat + Image Generation Dreams, Still a Premium Fantasy?: Members discussed the desire for a local UI similar to Gemini, combining chat, image analysis, and image generation, noting that current solutions like LM Studio and ComfyUI have separate functionalities.
    • A user suggested OpenWebUI can connect to ComfyUI, either natively or through a function, to enable some cross functionality between text and image models.
  • Navigating LM Studio’s Terminal Terrain: Newbie Asks: A user questioned whether LM Studio has a built-in terminal or if commands should be run in the OS command prompt within the LM Studio directory.
    • Another user clarified that commands like lms import should be run in the OS terminal (e.g., cmd on Windows), after which the shell may need reloading to ensure LMS is in the PATH.
  • Hot Swapping Models via REST API: A user inquired about programmatically loading/unloading models via REST API to dynamically adjust max_context_length for a Zed integration.
    • Another user shared that this is possible via command line with lms load and referenced LM Studio’s documentation, which requires LM Studio 0.3.9 (b1) (available in beta) and introduces time-to-live (TTL) for API models with auto-eviction.
  • Llama 4: Is this real life? (is this just fantasy?): With the release of Llama 4, users discussed its multimodal and MoE (Mixture of Experts) architecture, with one user expressing doubt about llama.cpp support.
    • Despite initial concerns about hardware requirements and model size, one user highlighted Llama 4 Scout as potentially fitting on a single NVIDIA H100 GPU with a 10M context window, outperforming models like Gemma 3 and Mistral 3.1.
  • Gemma 3’s Vision Capabilities: Peering Into the Future: Users discussed Gemma 3’s image support and potential for reading small text files, with one user recommending Gemma 3 4B for its vision capabilities and efficient speed on limited VRAM hardware.
    • It was mentioned that creating a Hugging Face account and specifying GPU/CPU will color-code GGUFs likely to fit the hardware in green.


LM Studio ▷ #hardware-discussion (132 messagesšŸ”„šŸ”„):

Reka Flash 21B, Gemma 3 27B, Model Performance on M1 Ultra vs M4 Max, Nvidia DGX base cost increase, Ryzen AI Max+ 395 mini PCs

  • Reka Flash 21B Shines Over Gemma and Mistral: One member replaced Gemma 3 27B with Reka Flash 21B, and said that at q6 they saw around 35-40 tps on a 4090.
    • They note that Mac RAM bandwidth is not the bottleneck (GPU performance is), and they’re happy with 128GB M4 Maxes.
  • M1 Ultra beats M4 Max in memory bandwidth: A user found an M1 Ultra (64 GPU cores, 128 GB RAM) for $2.5k used.
    • The user linked to a GitHub discussion stating that the M1 Ultra with 64 cores should still be above both the M1 Ultra with 48 cores and the M4 Max with 40 cores.
  • Max Tech Clickbait LLM Video Questioned: Some users questioned whether the YouTube channel Max Tech knows what they’re doing in their LLM videos.
    • It was remarked that the channel is turning into sensational click bait with very little good info.
  • AMD 7900XTX GPU surprisingly strong: One user stole their kid’s 7900XTX and says AMD seem to be pulling finger, and the card runs pretty much everything i’ve thrown at it without issue.

Latent Space ▷ #ai-general-chat (199 messagesšŸ”„šŸ”„):

Tenstorrent Dev Day, Llama 4 launch, LLM Non-Determinism, MCP security, AI powered phishing

  • Tenstorrent’s Hardware Heats Up the Market: Tenstorrent hosted a dev day showcasing their Blackhole PCIe boards, featuring RISC-V cores and up to 32GB GDDR6 memory, designed for high performance AI processing and available for consumer purchase here.
    • Despite enthusiasm, one member noted they haven’t published any benchmarks comparing their cards to competitors though so until then I cant really vouch.
  • Llama 4 Models Make Multimodal Debut: Meta introduced the Llama 4 models, including Llama 4 Scout (17B parameters, 16 experts, 10M context window) and Llama 4 Maverick (17B parameters, 128 experts), highlighting their multimodal capabilities and performance against other models as per Meta’s announcement.
  • LLM’s Non-Determinism Dilemma: A member shared an article that discusses the challenges of non-deterministic outputs in LLMs, which complicate reliable reproduction and guaranteed product behavior, even with the greediest sampling settings (temperature=0, top-p=0, top-k=1).
    • The author attributes part of the non-determinism to language itself.
  • Whatsapp MCP Exploited via Invariant Injection: Multiple members discussed various injection vulnerabilities in agents with support for the Model Context Protocol (MCP), highlighting how an untrusted MCP server can attack and exfiltrate data from an agentic system connected to a trusted WhatsApp MCP instance as highlighted by invariantlabs.
  • AI Agents Outperform Humans in Spear Phishing: Hoxhunt’s AI agents have surpassed human red teams in creating effective simulated phishing campaigns, marking a significant shift in social engineering effectiveness, with AI now 24% more effective than humans as reported by hoxhunt.com.


Latent Space ▷ #ai-announcements (1 messages):

Claude Plays Pokemon Hackathon

  • Claude Plays Pokemon Hackathon: A user thanked another user for helping run the Claude Plays Pokemon hackathon on YouTube.
  • YouTube Stream of Hackathon: The Claude Plays Pokemon hackathon was recorded and streamed on YouTube.

Latent Space ▷ #ai-in-action-club (255 messagesšŸ”„šŸ”„):

LLM Codegen Workflow, AI Code Editors, Cursor vs Windsurf, Context Management in AI Editors, Model Hot-Swapping

  • Harper’s LLM Codegen Workflow Exposed: Harper’s blog post (My LLM Codegen Workflow ATM) details a process of brainstorming a spec, planning, and executing with LLM codegen in discrete loops.
  • AI Code Editor Recommendations: For those new to AI code editors, Cursor is the most commonly recommended starting point, particularly for users coming from VSCode, with Windsurf and Cline also being good options.
    • Experienced devs on nvim or emacs should stick with their current editor and AI plugins, while those wanting a new modal editor should try Zed.
  • Cursor and Windsurf Comparison: Members are bouncing between Cursor and Windsurf, noting strengths and weaknesses of each.
    • Cursor is easy to start with and has great tab-complete, while people await the new token counts and context window details feature in Cursor (tweet).
  • Context Management Concerns in Cursor: Members are reporting Cursor’s terrible context management issues, with a lack of visibility into what the editor is doing with the current context.
    • It may come down to a skill issue and the users are not meeting the tool in the middle.
  • One-Shot Codegen or bust: Many in the channel expressed a desire for one-shot codegen where an entire program can be generated at once.
    • Failing that, documenting better and taking another shot may be the next best option and, if that fails, training the user is necessary.


Nous Research AI ▷ #general (308 messagesšŸ”„šŸ”„):

Open Source Cursor Alternatives, Prompt Injection / Jailbreaking Tactics, Llama 4 launch and performance, Neural Plasticity via Neural Graffiti

  • Cursor-like Apps Sought After: Members were looking for open source alternatives to the Cursor app, specifically interested in how the accept/discard suggestions of code blocks work.
    • One member noted that Cursor uses a different model to ā€˜apply’ the code once you say accept.
  • Unleashing Prompt Injection attacks: A member inquired about bypassing prompt guards, detectors, and NeMo guard rails from a pentest perspective, linking to a prompt filter trainer (gandalf.lakera.ai/baseline).
  • Llama 4 debuts with multimodal muscles: Meta launched the Llama 4 family, featuring Llama 4 Scout (17B active params, 16 experts, 10M+ context) and Llama 4 Maverick (17B active params, 128 experts, 1M+ context), along with a preview of Llama 4 Behemoth and a peek at the iRoPE architecture for infinite context (blog post).
    • Some members expressed skepticism about the benchmarking methodology, the real-world coding ability and performance of Llama 4 Scout.
  • Neural Graffiti Gives LLMs live modulations: A member introduced ā€œNeural Graffitiā€, a technique to give pre-trained LLMs some neuroplasticity by splicing in a new neuron layer that recalls memory, reshaping token prediction at generation time, sharing code and demo on Github.
    • The live modulation takes a fused memory vector (from prior prompts), evolves it through a recurrent layer (the Spray Layer), and injects it into the model’s output logic at generation time.
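
A minimal PyTorch sketch of that idea (all names here, including SprayLayer, are illustrative rather than taken from the actual repo): a recurrent cell evolves a persistent memory vector, which is then added as a small bias on the output logits:

import torch
import torch.nn as nn

class SprayLayer(nn.Module):
    # Illustrative sketch only, not the actual Neural Graffiti implementation.
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.cell = nn.GRUCell(d_model, d_model)       # recurrent memory update
        self.register_buffer("memory", torch.zeros(1, d_model))
        self.to_logits = nn.Linear(d_model, vocab_size, bias=False)

    def update(self, prompt_embedding):
        # Fuse a new prompt representation into the persistent memory vector.
        self.memory = self.cell(prompt_embedding, self.memory)

    def forward(self, logits, alpha=0.1):
        # Inject the evolved memory as a small additive bias on the logits.
        return logits + alpha * self.to_logits(self.memory)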

Links mentioned:


Nous Research AI ā–· #ask-about-llms (27 messagesšŸ”„):

Claude Think Tool, Local LLM for 300 Pages of Text, Nous Capybara 34B Model, DeepHermes, BatchNorm and LayerNorm Implementations

  • Claude Think Tool: A Brainy Brainstormer: The Claude Think Tool is a setup to offload critical tasks to a larger model from a small local model.
    • It helps create multiple threads of thought, each with attention directed toward a specific domain and problem with a well-defined scope, functioning as a multi-agent system from the perspective of the brain.
  • Pondering the Perfect Local LLM for 300-Page Text Ingestion: A member inquired about running a local LLM, around 40B or less, capable of understanding around 300 pages of pure text, given a 12GB GPU and 32GB of normal memory.
    • Suggestions included DeepHermes, Cohere Command R 7B and Qwen 7B 1M, with warnings that CPU inference might not be viable for such large documents.
  • Nous Capybara 34B: A Contextual Colossus: The Nous-Capybara-34B is trained on the Yi-34B model with 200K context length for 3 epochs on the Capybara dataset.
    • It leverages a novel data synthesis technique called Amplify-instruct, combining top-performing existing data synthesis techniques and distributions used for SOTA models like Airoboros, Evol-Instruct, Orca, Vicuna, and others.
  • BatchNorm Backpropagation: A Numerical Nirvana: A member shared a raw implementation of BatchNorm using NumPy, emphasizing the backward pass as the most intimidating part due to computing the gradient of pre-normalized input following the multivariate chain rule, illustrated here.
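
For context, a compact NumPy sketch of the forward pass and that backward pass; the three-term dx expression is the multivariate chain rule through the batch mean and variance, collapsed into its standard closed form:

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu, var = x.mean(axis=0), x.var(axis=0)
    xhat = (x - mu) / np.sqrt(var + eps)
    return gamma * xhat + beta, (xhat, var, eps, gamma)

def batchnorm_backward(dout, cache):
    xhat, var, eps, gamma = cache
    dxhat = dout * gamma
    # Gradient w.r.t. the pre-normalized input, chain rule collapsed:
    dx = (dxhat - dxhat.mean(axis=0)
          - xhat * (dxhat * xhat).mean(axis=0)) / np.sqrt(var + eps)
    return dx, (dout * xhat).sum(axis=0), dout.sum(axis=0)  # dx, dgamma, dbeta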

Link mentioned: NousResearch/Nous-Capybara-34B Ā· Hugging Face: no description found


Nous Research AI ā–· #research-papers (2 messages):

Reinforcement Learning for LLMs, Reward Modeling Improvements, Self-Principled Critique Tuning

  • Deepseek releases Reinforcement Learning Paper: Deepseek released a new paper on Reinforcement Learning (RL) being widely adopted in post-training for large language models (LLMs) at scale; the paper can be found here.
    • The paper investigates how to improve reward modeling (RM) with more inference compute for general queries, i.e. the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods.
  • Self-Principled Critique Tuning Proposed: Deepseek adopts pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling.
    • The paper proposes Self-Principled Critique Tuning (SPCT) to foster scalability.

Link mentioned: Inference-Time Scaling for Generalist Reward Modeling: Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that $…


Claude Squad, Heterogeneous Recursive Planning, Panthalia Decentralized Compute, TextPulse Library

  • Claude Squad Manages Multiple Agents: Claude Squad is a free and open-source manager for Claude Code & Aider tasks that supervises multiple agents in one place with isolated git workspaces.
    • It enables users to run ten Claude Codes in parallel.
  • Heterogeneous Recursive Planning for Creative AI: A new method called heterogeneous recursive planning enables AI to write creative stories and insightful deep research reports like an expert (paper, demo).
    • It leverages adaptive subgoals and dynamic execution, allowing agents to dynamically replan and weave retrieval, reasoning, and composition mid-flow, based on previous work.
  • Panthalia Verifies Low-Cost Distributed Compute: Panthalia is a waitlist-gated platform to safely and easily train ML models on peer-to-peer compute using a decentralized compute primitive.
    • Currently, they resell low-cost providers aiming for the same interruptible prices (~$0.60/hr for an H100, ~$0.13/hr for a 4090).
  • TextPulse Library for Text Processing: A member shared their library TextPulse for text processing and is looking for feedback.

Links mentioned:


Nous Research AI ā–· #research-papers (2 messages):

Deepseek, Reinforcement Learning, Large Language Models, Reward Modeling, Self-Principled Critique Tuning

  • Deepseek’s New Paper on RL for LLMs: Deepseek released a new paper, available on arXiv, about Reinforcement Learning (RL) adoption in post-training for Large Language Models (LLMs) at scale.
    • The paper investigates improving reward modeling (RM) with more inference compute for general queries and the effectiveness of performance-compute scaling with proper learning methods, proposing Self-Principled Critique Tuning (SPCT).
  • SPCT Improves Reward Modeling: The paper introduces Self-Principled Critique Tuning (SPCT) as a method to enhance the effectiveness of performance-compute scaling in reward modeling for LLMs.
    • This approach aims to foster scalability by improving reward model inference compute for general queries beyond verifiable questions or artificial rules.

Link mentioned: Inference-Time Scaling for Generalist Reward Modeling: Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that $…


Nous Research AI ā–· #reasoning-tasks (6 messages):

Reasoning Benchmarking, Open Reasoning Tasks

  • Researcher transitions to LLM World: A researcher on logic and reasoning is considering moving into the LLM world and wants to contribute to reasoning categorisation and benchmarking.
  • Discussion about Open Reasoning Tasks: A member is exploring the list of reasoning tasks to benchmark one of the LLMs and asks about the taxonomy behind it, its background, and related literature.
    • They specifically inquired about who is behind the taxonomy and its history.

Link mentioned: GitHub - NousResearch/Open-Reasoning-Tasks: A comprehensive repository of reasoning tasks for LLMs (and beyond): A comprehensive repository of reasoning tasks for LLMs (and beyond) - NousResearch/Open-Reasoning-Tasks


MCP (Glama) ā–· #general (293 messagesšŸ”„šŸ”„):

MCP Governance SDK, MCP Protocol Revision 2025, MCP Desktop Workflow Integrations, Pinging MCP Servers Before Initialization, MCP Server for Microsoft Loop

  • Auth0 Token Validation with MCP Governance SDK: A guide focuses on server-side implementation using the governance SDK to validate tokens (e.g., from Auth0) and enforce user roles and permissions on MCP operations, deciding access to tools or resources.
    • The guide picks up after the client sends a token, detailing how the server can validate the token and fetch user’s roles, using the SDK’s RBAC system to enforce permissions.
  • Streamable HTTP Transport for MCP: The Model Context Protocol (MCP) specification uses JSON-RPC to encode messages, mandating UTF-8 encoding and defining two transport mechanisms: stdio and Streamable HTTP.
    • Clients should support stdio, but custom transports are also possible, as outlined in the specification, which includes requirements like newline delimiters for messages in stdio (a minimal framing sketch follows this list).
  • Llama 4 Released, Still Doesn’t Know MCP: Llama 4 has been released with 17B active parameters and reportedly outperforms DeepSeek V3, but it still does not know what MCP is, despite its impressive capabilities.
    • It’s a 17B MoE, with 109B total parameters, according to an announcement.
  • MCP Tool Installs Should be Standardized: Members discussed the need for more standardization around MCP server installation, similar to scoop or VS Code extensions, to improve accessibility for non-technical users.
    • The discussion highlighted the friction in the current process, involving command-line arguments, environment variables, and varying install methods (Python, Node.js, Docker) with a suggestion to make it as easy as python-mcp install web-search.
  • A Holy War? OAuth-Backed API MCPs Are the Key: Members debated the security of MCPs, with some feeling they need an app store with oversight to check for hacked servers and OAuth-backed APIs, while others claimed that can already be done.
    • One proposal is for providers like PayPal to host their own OAuth-backed APIs that don’t require external server install.
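
As a rough illustration of the stdio framing (the server command and initialize parameters below are placeholders; see the spec for the full handshake), each JSON-RPC message travels as one UTF-8 line on stdin/stdout:

import json
import subprocess

# "my-mcp-server" is a placeholder for any stdio MCP server command.
proc = subprocess.Popen(["my-mcp-server"], stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE, text=True, encoding="utf-8")
request = {"jsonrpc": "2.0", "id": 1, "method": "initialize",
           "params": {"protocolVersion": "2025-03-26", "capabilities": {},
                      "clientInfo": {"name": "demo-client", "version": "0.1"}}}
proc.stdin.write(json.dumps(request) + "\n")   # newline-delimited framing
proc.stdin.flush()
print(json.loads(proc.stdout.readline()))      # server's JSON-RPC response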

Links mentioned:


MCP (Glama) ā–· #showcase (23 messagesšŸ”„):

MCP-k8s Docker Images, chat.md with MCP support, Cloudflare for Remote MCP Servers, WhatsMCP Oauth Support, Semgrep MCP Rewrite

  • MCP-k8s Docker Images Published: First working docker images published for mcp-k8s server are now available, and the release pipeline is completely running on CI.
    • These images are multiarch, so they can run on Macs with ARM without Rosetta and also on Raspberry Pi.
  • Chat.md: fully editable chat interface with MCP support: A fully editable chat interface with MCP support on any LLM has been released, open-sourced under the MIT license and turning markdown files into editable AI conversations with its VS Code extension (chat.md).
    • Notable features include editing past messages, LLM agnostic MCP support, streaming responses with shift+enter, and tool call detection.
  • Cloudflare Enables Remote MCP Servers: It is now possible to build and deploy remote MCP servers to Cloudflare, with added support for OAuth through workers-oauth-provider and a built-in McpAgent class.
    • This simplifies the process of building remote MCP servers by handling authorization and other complex aspects.
  • WhatsApp MCP client is here: A user built WhatsApp MCP and asked Claude to handle all the WhatsApp messages, answering 8 people in approx. 50 seconds.
    • The bot instantly detected the right language (English / Hungarian), used full convo context, and sent appropriate messages including ā¤ļø to my wife, formal tone to the consul.
  • Semgrep MCP Server Rewritten: The Semgrep MCP server, an open-source tool for scanning code for security vulnerabilities, has been completely rewritten, with demo videos showcasing its use in Cursor and Claude.
    • It uses SSE (Server-Sent Events) for communication, though the Python SDK might not fully support it yet.

Links mentioned:


Eleuther ā–· #general (39 messagesšŸ”„):

RAG evaluation with lm-evaluation-harness, RoR-Bench paper by the_alt_man, Llama 4 release, Aligning AGI using Bayesian Updating

  • RAG Evaluation using LLM Harness?: A member suggested wrapping RAG outputs as completion tasks and using llm-harness locally with custom prompt + response files for evaluation.
    • Another member admitted to having no idea what those are.
  • LLMs Exhibiting Recitation Behavior?: A member shared a link to the RoR-Bench paper which proposes a novel, multi-modal benchmark for detecting LLM’s recitation behavior, finding that top models can suffer a 60% performance loss by changing one phrase in the condition.
    • The member expressed suspicion of these papers because they found that models that were evaluated at 0% on certain reasoning tasks could actually one-shot it.
  • Llama 4 Unleashed: A link to the Llama 4 release was shared (https://www.llama.com/llama4/), showcasing the most intelligent multimodal OSS model in its class, with Llama4 Maverick > Gemma3 and Llama4 Maverick > DeepSeek V3.
    • Another member noted the training process, architecture, and inference time temperature scaling.
  • Aligning AGI with Moral Weights: A member shared a Google Doc about aligning AGI using Bayesian Updating of its Moral Weights and Modelling Consciousness.

Links mentioned:


Eleuther ā–· #research (204 messagesšŸ”„šŸ”„):

Mixture of Experts, Large Language Models, Gradient-Free Learning Methods, Hyper-connections as alternative to residual connections, Attention Sinks in LLMs

  • MoE++ Framework Achieves Expert Throughput: A new MoE++ framework integrates Feed-Forward Network (FFN) and zero-computation experts (zero expert, copy expert, and constant expert) for enhanced effectiveness and efficiency, achieving 1.1Ɨ to 2.1Ɨ expert forward throughput compared to vanilla MoE models, according to this research paper.
    • The design of MoE++ offers advantages such as Low Computing Overhead by enabling dynamic token engagement, unlike uniform mixing in vanilla MoE (a sketch of the three zero-computation experts follows this list).
  • NoProp Offers Gradient-Free Learning: A new learning method named NoProp, which does not rely on either forward or backwards propagation and takes inspiration from diffusion and flow matching methods, learns to denoise a noisy target at each layer independently, described in this paper.
    • There’s a GitHub implementation by lucidrains, along with a discussion noting that the pseudocode at the end of the paper still applies the actual updates with gradient-based methods.
  • Meta releases Llama 4: Meta announced the Llama 4 family of models, including Llama 4 Scout, a 17 billion parameter model with 16 experts and a 10M token context window, outperforming Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 in its class, as noted in this blog post.
    • Llama 4 Scout was trained on a mix of publicly available, licensed data and information from Meta’s products and services, including posts from Instagram and Facebook and people’s interactions with Meta AI.
  • Hyper-Connections Offer Alternative to Residual Connections: Hyper-connections serve as an alternative to residual connections, addressing the seesaw effect between gradient vanishing and representation collapse, as outlined in this paper.
    • The architecture is simple like an unrolled diffusion model and the magic here is more about the independence of each layer wrt each other.
  • Attention Sinks in LLMs Prevent Over-Mixing: A recent paper argues that attention sinks, where LLMs attend heavily to the first token in the sequence, act as a mechanism that enables LLMs to avoid over-mixing, detailed in this paper.
    • An earlier paper (https://arxiv.org/abs/2502.00919) showed that attention sinks utilize outlier features to: catch a sequence of tokens, tag the captured tokens by applying a common perturbation, and then release the tokens back into the residual stream, where the tagged tokens are eventually retrieved.
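
To make the zero-computation experts concrete, a hedged PyTorch sketch of the three variants as described (illustrative, not the paper’s code):

import torch
import torch.nn as nn

class ZeroExpert(nn.Module):        # discard: output zeros, no computation
    def forward(self, x):
        return torch.zeros_like(x)

class CopyExpert(nn.Module):        # skip: pass the token through unchanged
    def forward(self, x):
        return x

class ConstantExpert(nn.Module):    # replace: emit a learned constant vector
    def __init__(self, d_model):
        super().__init__()
        self.const = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        return self.const.expand_as(x)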

Links mentioned:


Eleuther ā–· #interpretability-general (17 messagesšŸ”„):

Polytope lens for NNs, ReLU networks geometry, Machine Unlearning Workshop, Origami view of NNs, Expressivity of Deep Networks

  • Polytope Perspective Powers Neural Net Pondering: A member shared a blog post discussing a geometrical approach to neural networks, advocating for the polytope lens as the right perspective, linking to a previous post on the ā€œorigami view of NNsā€.
  • ReLU Network Regions Reveal Reason: A member shared Boris Hanin’s paper demonstrating mathematical properties of ReLU networks, specifically studying the geometry of their constant regions.
    • They highlighted a figure from the paper as their ā€œmain reason for loving the paper,ā€ referencing the expressivity of deep networks and the number of activation patterns.
  • Hyperplane Harmony: Neural Nets’ Natural Nuance: A member posited that neural nets, especially ReLUs, have an implicit bias against overfitting due to carving the input space along hyperplanes, which becomes more effective in higher dimensions.
    • They argued that simpler configurations using hyperplanes efficiently are preferred by the optimizer, contrasting with learning schemes like spline bases that suffer from the curse of dimensionality.
  • Unlearning Urgency: Machine Mind Management: A member linked to the ICML Machine Unlearning Workshop which focuses on the challenges of removing sensitive data from Generative AI models trained on internet-scale datasets.
    • The workshop aims to advance robust, verifiable unlearning methods to address privacy, security, and legal concerns like the EU’s GDPR.

Links mentioned:


Eleuther ā–· #lm-thunderdome (19 messagesšŸ”„):

lm-eval-harness EOS token, Llama 2 vs Llama 3 IFEval Score, Huggingface tokenization

  • EOS token Accuracy Anomaly Appears: A member tried adding an EOS token to data instances in lm-eval-harness for the social_iqa task, and the eval accuracy dropped by 18 points.
    • It was suggested to add self.eot_token_id to the continuation_enc here only for the continuations and not context.
  • IFEval Score: Llama 2’s Odd Dominance: A member compared Llama 2 vs. Llama 3.1 and 3.2 models and noticed Llama 2 has a much higher IFEval score, which seemed weird for a base model given the HF leaderboard.
    • It turns out the benchmark is simply unsuitable for base models: they just continue writing the question, and somehow that is scored as correct.
  • Huggingface Tokenization Troubleshoot: Members discussed Huggingface tokenization, and how it happens in HFLM.tok_encode.
    • One noted that for BOS you can pass add_bos_token to the model args.

Links mentioned:


HuggingFace ā–· #announcements (1 messages):

huggingface_hub v0.30.0, monoELECTRA reranker models, YourBench Custom Evals, Jetson Robot, Accelerate v1.6.0

  • Huggingface Hub gets Biggest Update Ever!: The huggingface_hub v0.30.0 release introduces a next-gen Git LFS alternative and new inference providers.
    • This release is the biggest update in two years!
  • MonoELECTRA Rerankers Ported to Sentence Transformers: monoELECTRA-{base, large} reranker models from @fschlatt1 & the research network Webis Group are now available in Sentence Transformers.
    • These models were distilled from LLMs like RankZephyr and RankGPT4, as described in the Rank-DistiLLM paper.
  • YourBench builds Custom Evals Instantly: YourBench allows users to build custom evals using their private docs to assess fine-tuned models on unique tasks (announcement).
    • The tool is game-changing for LLM evaluation.
  • Gradio Surpasses 1 Million Developers!: Gradio, a Python library for building AI web apps, is now used by over 1 million developers each month (announcement).
    • The library has been adopted by popular open-source projects like Automatic1111, Fooocus, and LLaMA-Factory.

Links mentioned:

  • Release Xet is here! (+ many cool Inference-related things!) Ā· huggingface/huggingface_hub: šŸš€ Ready. Xet. Go!This might just be our biggest update in the past two years! Xet is a groundbreaking new protocol for storing large objects in Git repositories, designed to replace Git LFS. Unlik...
  • Tweet from tomaarsen (@tomaarsen): I've just ported the excellent monoELECTRA-{base, large} reranker models from @fschlatt1 & the research network Webis Group to Sentence Transformers!These models were introduced in the Rank-DistiL...
  • Tweet from Nathan (@nathanhabib1011): šŸš€ Introducing ✨ YourBench ✨ ! Build custom evals instantly using your private docs & see how your custom fine-tuned models perform on your unique tasks.Congrats to @sumukx @clefourrier and @ailozovsk...
  • Tweet from Remi Cadene (@RemiCadene): Jetson @nvidia's version of our robot is available!Compute is now on-board like a @Tesla car with FSD šŸš—Importantly, we rethink the control interface, so that you can view the video stream with th...
  • Tweet from Marc Sun (@_marcsun): accelerate v1.6.0 is out with lots of nice features ! - FSDPv2 support by @m_sirovatka, our incredible intern ! - DeepSpeed + tensor parallel support by the DeepSpeed team- XCCL distributed backend fo...
  • Tweet from Harry Mellor (@hmellor_): The @vllm_project now has a user forum which you can find at https://discuss.vllm.ai/Its fledgling community is still growing but I encourage all users to go there for their usage focused Q&A!
  • Reddit - The heart of the internet: no description found
  • Tweet from Orr Zohar (@orr_zohar): Excited to see SmolVLM powering BMC-SmolVLM in the latest BIOMEDICA update! At just 2.2B params, it matches 7-13B biomedical VLMs. Check out the full release: @huggingface #smolvlmQuoting Alejandro Lo...
  • Tweet from Unsloth AI (@UnslothAI): We partnered with @HuggingFace to teach you how to fine-tune LLMs with GRPO!Learn about:• Reward functions + creating them• GRPO Math + Free Reasoning training in Colab• Applying RL to real-world use ...
  • Tweet from AK (@_akhaliq): vibe coding AI apps for free has never been easier100% open source app, DeepSite on Hugging Face
  • Tweet from Ben Burtenshaw (@ben_burtenshaw): Welcome to the LLM Course!Education has always been at the heart of Hugging Face’s mission to democratize AI and we’re doubling down on that by giving http://hf.co/learn a big upgrade!
  • Tweet from Sergio Paniego (@SergioPaniego): šŸ†•New Unit in the Agents Course @huggingface. We just released the first Use Case on Agentic RAG—where we compare three frameworks side by side:šŸ¤ smolagentsšŸ¦™ @llama_index🦜 LangGraph (@LangChainAI)⬇...
  • Tweet from Abubakar Abid (@abidlabs): JOURNEY TO 1 MILLION DEVELOPERS5 years ago, we launched @Gradio as a simple Python library to let researchers at Stanford easily demo computer vision models with a web interface. Today, Gradio is used...

HuggingFace ā–· #general (169 messagesšŸ”„šŸ”„):

Llama-4-Scout vs Mistral Small 3.1, AI Engineer Interview, Deepmind created AGI Internally?, Fine Tuning Quantized Models, Huggingchat 500 error

  • Llama-4-Scout or Mistral Small 3.1: Which is Better?: Mistral Small 3.1 adds vision understanding and enhances context up to 128k tokens.
    • A member suggested Llama-4-Scout is better, but noted its much larger VRAM footprint (they estimated 16Ɨ17B; the model is 109B total parameters).
  • AI Engineer Interview Code Section: A community member asks about what the code portion of an AI engineer interview looks like.
    • Another member pointed to the scikit-learn library.
  • Rumors of Deepmind created AGI Internally: A member in another discord said Google will release yet another powerful model next week and it will be even better than gemini 2.5 pro exp.
    • They also claimed that Deepmind created an AGI internally; however, this member later stated he doesn’t trust this person anymore.
  • Is Fine Tuning Quantized Models Challenging?: A member asked about fine-tuning quantized models, and the community gave varied advice, with some pointing to QLoRA, Unsloth, and bitsandbytes as potential solutions (see the sketch after this list); check out the Unsloth fine-tuning guide.
    • Another stated that you can only do so using LoRA, since GGUF is an inference-optimized format, not designed for training workflows.
  • Huggingchat experiencing 500 Error: Users reported that Huggingchat is experiencing a 500 error.
    • A member stated that an issue was raised and pointed to workarounds being discussed on discord.
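
For reference, a minimal QLoRA-style sketch with transformers, peft, and bitsandbytes (the model name and LoRA hyperparameters are placeholders): low-rank adapters train on top of a frozen 4-bit base:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B",  # placeholder
                                             quantization_config=bnb)
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)  # adapters train; the 4-bit base stays frozen
model.print_trainable_parameters()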

Links mentioned:


HuggingFace ā–· #today-im-learning (16 messagesšŸ”„):

LLM Development, Sebastian Raschka Book, Andrej Karpathy Video, NLP course chapter 3

  • Community Member Seeks LLM Dev Guidance: A community member asked where to start developing a 100M parameter LLM, given a background in Data Science and ML.
    • Suggestions included starting with NLP or DL, or finding a specific course to follow.
  • Sebastian Raschka’s Book Recommended for LLM Building: The book Build a Large Language Model (From Scratch) by Sebastian Raschka was recommended for learning to build LLMs from scratch.
    • One member shared that their workplace started a book club around it, and another mentioned having ordered the same book.
  • Andrej Karpathy’s GPT Reproduction Video Sparks Discussion: A video by Andrej Karpathy, Let’s reproduce GPT-2 (124M), was suggested as a good resource.
    • However, the original poster felt that Karpathy started copy-pasting code without explaining much, so they stopped watching it.
  • Assisted Pre-training and Shared Embeddings: One member suggests initializing weights and using the same tokenizer from another model, kinda like an ā€˜assisted’ pre-training.
    • They also proposed sharing embeddings and maybe the linear layer to potentially expedite the LLM development process.

Link mentioned: Let’s reproduce GPT-2 (124M): We reproduce the GPT-2 (124M) from scratch. This video covers the whole process: First we build the GPT-2 network, then we optimize its training to be really…


HuggingFace ā–· #cool-finds (2 messages):

Windows CLI, Virtual Environment Reset, LocalAI, Dify

  • CLI Fu for Virtual Env Reset: A quick Windows CLI command to reset your virtual environment is pip freeze | Select-String -Pattern "^(?!pip)" | ForEach-Object { pip uninstall -y $_.ToString().Trim() }.
    • This snippet helps clean up the environment by uninstalling packages, excluding pip itself, streamlining the process for a fresh start, according to a blog post.

Link mentioned: The Complete Roadmap to Mastering Agentic AI in 2025 | Girish Kotte: Discover a comprehensive 12-step roadmap to mastering agentic AI in 2025. Learn everything from basic concepts to advanced deployment techniques with resource links for each stage. Perfect for develop…


HuggingFace ā–· #i-made-this (8 messagesšŸ”„):

MCP Server and RAG Application, Osyllabi AI Curriculum, DocQuery AI Documentation Search, Municipal Law Dataset, LlamaResearcher with Llama-4

  • MCP Server + RAG App Debut: A member built a MCP server and client, connected via ngrok, along with a simple RAG application for QA with markdown documentation from a GitHub repository, showcased on LinkedIn.
  • Osyllabi: AI Curriculum Crafter Hits GitHub: A member shared Osyllabi, a Python app for AI-driven personalized curriculums using web crawling and data integration, powered by Ollama, HuggingFace, Langchain, and Llama-Index, available on GitHub.
    • It features AI-driven curriculum generation, advanced web crawling, seamless integration with educational platforms, customizable learning paths, and flexible export options.
  • DocQuery Transforms Documentation to Knowledgebase: A member shared DocQuery, which turns documentation markdown into a knowledgebase, is available on GitHub.
    • DocQuery offers improved searchability, a smart Q&A system, and streamlined knowledge management for development teams.
  • Municipal Law Dataset Surfaces: A member shared the American Municipal Law dataset on Hugging Face Datasets, comprising municipal and county laws from across the United States in parquet format, organized by location’s GNIS id.
    • Access requires agreeing to share contact information.
  • LlamaResearcher: Llama-4 Powers Deep Research: A member introduced LlamaResearcher (llamaresearcher.com), a deep-research AI companion powered by Llama 4 and Groq, which expands queries into sub-queries, searches the web, and produces essays with source citations.
    • The project is open-source and Docker-ready, available on GitHub, and utilizes LlamaIndex, Groq, Linkup, FastAPI, Redis, and Gradio.

Links mentioned:


HuggingFace ā–· #computer-vision (5 messages):

Data Annotation for OCR, VLM Fine-Tuning for Handwritten Text, Combining OCR Techniques with VLMs, Roboflow for managing images and labels, MS-Swift and PEFT/Unsloth Approaches

  • VLM Models Aid Handwritten Text OCR: A member is seeking methods for data annotation to fine-tune VLM models on handwritten text images, opting to move away from traditional OCR models and needing true text labels for training.
    • They are considering tools and methods to generate or correct text labels from images for fine-tuning purposes.
  • Classic OCR and Open VLMs Combine for annotation: A member combined classic OCR techniques with open VLMs like InternVL2_5 and Qwen2.5 to generate initial annotations for extracting structured data from Brazilian documents.
    • Manual review was performed to correct errors after using OCR/VLM, and closed-source models like Gemini were noted to potentially provide higher-quality pre-annotations.
  • Roboflow Manages Images and Labels Effectively: A member managed and stored raw images and corrected labels using Roboflow, annotating 510 images which were augmented to 1218 examples.
    • Despite finding its interaction not ideal, they used Roboflow for managing the dataset.
  • MS-Swift and PEFT/Unsloth Enhance Fine-Tuning: A member fine-tuned several models using MS-Swift and experimented with PEFT and Unsloth approaches, achieving superior performance compared to Gemini and OCR methods with models adjusted from 1B to 7B.
    • The member successfully fine-tuned models, highlighting the effectiveness of these frameworks.
  • Tesseract OCR and Label Studio Join Forces: One member is considering using Tesseract OCR followed by Label Studio for refining annotations.
    • They also checked Gemma 3 and found it effective, implying a combination of automated and manual approaches for data annotation.

HuggingFace ā–· #NLP (5 messages):

Text Extraction from PDFs, Docling, SmolDocling, RolmOCR, Sci-BERT

  • PDF Text Extraction Advice Sought: A member is seeking advice on improving text extraction from PDFs, specifically research papers, as their current results are unsatisfactory.
    • They have been using regex for section outline extraction but are facing challenges with fonts, headers, and footers, impacting the usability of the extracted content for Sci-BERT embeddings due to token limits.
  • Docling and SmolDocling recommended for Text Extraction: A member recommends Docling (GitHub) and SmolDocling (HuggingFace) for improved text extraction from PDFs; a minimal usage sketch follows this list.
    • They note that while these tools still make errors, especially with images, they have yielded good results, with SmolDocling being an ultra-compact vision-language model for end-to-end multi-modal document conversion, as highlighted in their paper.
  • RolmOCR Model Based on Qwen 2.5 VL Released: A member mentions the release of RolmOCR (HuggingFace), a new model based on Qwen 2.5 VL, for OCR tasks.
    • However, they haven’t personally tested it yet, but suggest it as a potential tool for text extraction.
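
A minimal usage sketch, assuming Docling’s documented API (the PDF path is a placeholder):

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("paper.pdf")            # placeholder path
print(result.document.export_to_markdown()[:500])  # layout-aware text extraction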

Links mentioned:


HuggingFace ā–· #smol-course (24 messagesšŸ”„):

OpenWeatherMap API, ISO 3166-1 alpha-2 code, Qwen/Qwen2.5-Coder-32B-Instruct Alternatives, Hugging Face Token for Agent Creation, llm-course Channel

  • Geolocation API vs Static Country Code Dictionary: A member is building a tool to fetch weather conditions using the OpenWeatherMap API and is debating whether to use the GeoCoding API and another API for ISO 3166-1 alpha-2 codes, or to use a static dictionary.
  • Free alternative to Qwen/Qwen2.5-Coder-32B-Instruct?: A member asked for a free alternative to Qwen/Qwen2.5-Coder-32B-Instruct.
    • Another member pointed out that the model itself is free under the Apache 2.0 license (Hugging Face Link) but suggested Together AI or Groq for free API access, noting potential rate limits of around 60 RPM.
  • Guidance on Hugging Face Token for Agent Creation: A member requested guidance on obtaining a Hugging Face token for agent creation in Unit 1 of a course.
  • llm-course Channel Request: A member inquired about the possibility of opening a dedicated channel for an LLM course.
  • Help needed with AI agents course setup: A member requested assistance with a code issue encountered in Unit 1 of an AI agents course, specifically related to HF token settings in Colab.

Links mentioned:


HuggingFace ā–· #agents-course (36 messagesšŸ”„):

MCP in Agent Course, Inference Usage Costs, Gemini Models, Course Feedback, Hallucination in Agents

  • MCP barely mentioned in Agent Course: A user inquired about learning MCP in the agent course, but was informed that there’s no dedicated section, although MCP servers are briefly mentioned in unit 2.1 (smolagents) and unit 2.2 (llamaindex).
  • Inference costs incurred!: A user accidentally maxed out their Inference Usage Due Balance and inquired about payment.
    • The suggestion was made to check the questions channel for a FAQ, or to use a local or cheaper hosted alternative.
  • Gemini Models may be your savior: A user facing issues with Code_agents notebook in Chapter 2 due to payment requirements was advised to try using Gemini models.
    • It was noted that Gemini models can be used for free in many countries, with a link to course notes providing instructions.
  • Course Experience: Good but Buggy: A user summarized the course as full of good material but noted that many notebooks and code snippets don’t work, including a now infamous coding test in Unit 2, with no instructor presence.
    • The suggestion was made to approach the course sceptically, focus on understanding the coding parts, and acquire the necessary accounts and API tokens.
  • Explain the halluuuucinations!: Users sought clarification on an example of hallucination in an agent.
    • The explanation provided was that the agent, lacking access to weather data, fabricated the answer, and the solution involves equipping the agent with a tool to retrieve weather information.

Links mentioned:


Yannick Kilcher ā–· #general (177 messagesšŸ”„šŸ”„):

Grok 3, Turing Machines, Raw Binary AI training, LLama 4, Quantization Techniques

  • Grok 3 Manifold Analogy Appears: A member shared an analogy describing various approaches to NLP, contrasting 0D Manifolds (tokens), 1D Manifolds (embeddings), and a dynamic signal approach where language is seen as a rushing and swirling river with no rigid bounds.
  • Raw Binary AI Training Discussed: Members discuss training AI on raw binary data to directly output file formats like mp3 or wav, with one member noting that this approach works based on discrete mathematics like Turing machines.
    • Another argued that current AI models are far from Turing-complete, while the original poster explained that the AI doesn’t need to be Turing-complete to output appropriate tokens as responses.
  • New Llama 4 Models Released: Llama 4 Scout boasts a 10 million token context window, 17B active parameters, and 109B total parameters, while Llama 4 Maverick offers a 1M context length, 17B active parameters, and 400B total parameters, and Llama 4 Behemoth features 2 trillion parameters.
    • Members express skepticism about the 10M context window claim, the new license, and question if recent models are RL’ed or just base + SFT models, pointing out performance issues and mixed benchmarks.
  • Self-Principled Critique Tuning Explored: Self-Principled Critique Tuning (SPCT) from DeepSeek is a new reward-model system where an LLM prompted with automatically developed principles of reasoning generates critiques of CoT output based on those principles.
  • Quantization Techniques Examined: Members discuss novel quantization techniques for large language models, pointing to a paper on the topic.
    • It was argued that quantization can serve as a compromise between maintaining a super long context length and being able to serve the model, but comes with decay in the value you are actually getting out of those long contexts.

Links mentioned:


Yannick Kilcher ā–· #paper-discussion (28 messagesšŸ”„):

Llama 4, DeepSeek Paper, PaperBench, Text Diffusion

  • Llama 4 Omni Wakes Up: A member shared the Llama 4 documentation, followed by a link to Meta’s blogpost on Llama 4’s Multimodal Intelligence.
    • The Llama 4 Scout model boasts 17 billion active parameters, 16 experts, and an industry-leading context window of 10M, outperforming models like Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1.
  • PaperBench: OpenAI’s Replication Benchmark: A member shared an article about OpenAI’s PaperBench benchmark, designed to test AI agents’ ability to replicate cutting-edge machine learning research papers from scratch.
    • The benchmark evaluates agents on reproducing entire ML papers from ICML 2024, with automatic grading using LLM judges and fine-grained rubrics co-designed with the original authors.
  • DeepSeek Paper Time: Members are planning to go over the first DeepSeek paper in an hour, with a link to the paper provided (https://arxiv.org/abs/2401.02954).
  • Text Diffusion Steers Auto-Regressive LMs: Members are planning to discuss a paper (https://arxiv.org/abs/2408.04220) on using a guided diffusion model to steer an auto-regressive language model to generate text with desired properties.

Links mentioned:


Yannick Kilcher ā–· #ml-news (17 messagesšŸ”„):

GPT-6 release, Llama 4, Mindcraft Update, Adapting pre-training text, diffusion modeling to control LLMs

  • GPT-6 Coming Soon (Maybe?): A user jokingly announced the release of GPT-6 yesterday, followed by O0 and OO in the next few weeks, citing difficulties with GPT-5.
    • This sparked humorous reactions, with another user quipping that ā€œreleaseā€ doesn’t mean actually releasing the weights, ā€œlike a company that is open about AI.ā€
  • Llama 4 Arrives with 10M Context: Llama 4 Maverick is billed as the most intelligent multimodal OSS model in its class, a 17 billion active parameter model with 128 experts, according to llama.com.
    • The lineup is said to be more powerful than all previous generation Llama models, with Llama 4 Scout fitting in a single NVIDIA H100 GPU, boasting the 10M context window, and surpassing Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1.
  • Mindcraft Update Sees the Bots!: A member shared a YouTube video titled ā€œVision and Vibe Coding | Mindcraft Updateā€.
    • The video description included a link to Tripo AI, offering extra credits for the first 300 sign-ups using the code R2XF70.
  • LLMs Trained for Database Lookups: A member mentioned adapting pre-training text to include database lookups for relevant facts, to train the LLM to look things up during generation, citing this video.
  • Diffusion Modeling Now Controls LLMs: Users discussed using diffusion modeling to control LLMs, referencing the paper ā€œDiffusion-LM Improves Controllable Text Generationā€.

Links mentioned:


GPU MODE ā–· #general (17 messagesšŸ”„):

CUDA Python Package, Vectorized Memory Access, Llama-4 Router Normalization, High RAM/VRAM SSH Access

  • CUDA Python Package Debuts: Nvidia released the CUDA Python package, offering Cython/Python wrappers for CUDA driver and runtime APIs, installable via PIP and Conda.
    • It’s intended to unify the Python CUDA ecosystem, providing full coverage and access to the CUDA host APIs from Python, mainly benefiting library developers needing to interface with C++ APIs.
  • Vectorized Memory Access Practices Sought: Members discussed best practices for vectorized memory access when working with dynamic shapes, specifically in matrix multiplication with dynamic dimensions m, n, and k.
    • The discussion mentioned Cutlass support and efficient vectorized loads as potential solutions.
  • Llama-4 Router Normalization Examined: The channel discussed whether Llama-4 uses router normalization, similar to how DeepSeek V3 and Mixtral do with their topk_weights normalization.
    • It was noted that Llama-4 skips the normalization, potentially because it uses top_k = 1, and both DeepSeek V3 and Llama 4 use sigmoid for the router logits (a sketch contrasting the conventions follows this list).
  • High RAM/VRAM SSH Access Needed for Testing: A member sought access to an SSH-like instance with at least 500GB of RAM/VRAM for a couple of hours to test a model in SGL.
    • They have GPU credits from Modal and inquired about SSH access to a container.
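
A hedged sketch contrasting the two router conventions (illustrative, not any model’s actual code); note that with top_k = 1, renormalizing would force the single weight to exactly 1.0 and erase the learned gate magnitude, which may be why Llama-4 skips it:

import torch
import torch.nn.functional as F

def route(logits, top_k, normalize, use_sigmoid):
    # Score experts, keep the top-k, optionally renormalize the kept weights.
    scores = torch.sigmoid(logits) if use_sigmoid else F.softmax(logits, dim=-1)
    weights, experts = scores.topk(top_k, dim=-1)
    if normalize:                       # Mixtral / DeepSeek V3 style renorm
        weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, experts             # Llama-4 reportedly skips this renorm

logits = torch.randn(2, 8)              # 2 tokens, 8 experts
print(route(logits, top_k=1, normalize=False, use_sigmoid=True))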

Link mentioned: CUDA Python: CUDA Python provides uniform APIs and bindings to our partners for inclusion into their Numba-optimized toolkits and libraries to simplify GPU-based parallel processing for HPC, data science, and AI.


GPU MODE ā–· #triton (18 messagesšŸ”„):

Triton Kernel Debugging, GPU Assembly Debugging, Grayscale Kernel Writing, Block Index Creation, Data Transposing

  • Triton Kernel Debugging Step-by-Step: A first-time poster inquired about debugging Triton kernels step by step, specifically addressing issues with cdiv and fill zeros when running in interpret mode (TRITON_INTERPRET=1).
    • An alternative suggestion involved diving into GPU assembly, setting breakpoints in the Python file using either cuda gdb or roc gdb, and single-stepping through the assembly file.
  • GPU Assembly Debugging with VSCode: A member asked about using the VSCode debugger instead of only cuda gdb for debugging GPU assembly.
    • It was noted that running cuda gdb and passing in the Python arguments is required, but the convenience and readability of the VSCode debugger is desired.
  • Grayscale Kernel Writing Block Index: A member described an attempt to write a grayscale kernel for a (K, K, 3) input, aiming to get blocks of (BLOCK_K, BLOCK_K, 3) in Triton.
    • However, they faced challenges with tl.arange(0, 3) because 3 is not a power of 2.
  • Loading Nx3 Blocks: A member asked how to load an Nx3 block, as tl.arange won’t work since 3 is not a power of 2.
    • One suggestion involved loading the data three times and incrementing the range by image_w * image_h, with another member suggesting that adding 1 to all indexes should work (a minimal sketch of the three-load approach follows this list).
  • Data Transposing for Contiguous Data: A member considered transposing data with Torch for a contest, but they were concerned about abusing strides for loading contiguous data.
    • It was suggested that transposing with Torch is acceptable for the contest, as the original tensor will be contiguous and transposing will only be symbolic.
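
A minimal sketch of the three-load workaround for a (K, K, 3) HWC input, using a per-pixel stride of 3 so each channel gets its own power-of-2-friendly load (shapes and grayscale weights are illustrative):

import torch
import triton
import triton.language as tl

@triton.jit
def grayscale_kernel(x_ptr, out_ptr, n_pixels, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_pixels
    # HWC layout: channel c of pixel i lives at 3 * i + c, so three
    # separate loads sidestep the non-power-of-2 tl.arange(0, 3).
    r = tl.load(x_ptr + 3 * offs + 0, mask=mask)
    g = tl.load(x_ptr + 3 * offs + 1, mask=mask)
    b = tl.load(x_ptr + 3 * offs + 2, mask=mask)
    tl.store(out_ptr + offs, 0.2989 * r + 0.587 * g + 0.114 * b, mask=mask)

x = torch.rand(512, 512, 3, device="cuda")
out = torch.empty(512, 512, device="cuda")
n = x.numel() // 3
grayscale_kernel[(triton.cdiv(n, 1024),)](x, out, n, BLOCK=1024)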

GPU MODE ā–· #cuda (18 messagesšŸ”„):

CUDA debugger, nvshmem + mpi, nvbench and ubuntu 24.04, Shared memory access in CUDA, cute::copy and tiled_copy behavior

  • CUDA Debugging Delight: A user confirmed CUDA’s debugger works very similarly to GDB CLI.
    • Another member inquired about the release date for cutile, announced at GTC this year.
  • nvshmem + MPI Race Condition: A member reported race conditions and hangs when running nvshmem + mpi with one more process than the number of GPUs, both with and without MPS.
    • They were running mpirun -np 5 ./myapp on a system with 4 GPUs and asked if anyone had a solution.
  • nvbench Bumps CMake Requirement: NVBench kind-of dropped support for Ubuntu 24.04 because it requires a minimum CMake version of 3.30, while Ubuntu 24.04 comes with 3.28.
  • Shared Memory Broadcasts in CUDA: In response to a question about shared memory access in CUDA, it was confirmed that there are broadcasts and multicasts from shared memory.
  • Cute Copy Oddity: A user found a strange behavior of cute::copy regarding tiled_copy, where all threads in a warp collectively copy data from shared memory to registers, instead of each thread copying its corresponding data.
    • An attached image demonstrated unexpected data arrangements in registers after the copy operation.

Links mentioned:


GPU MODE ā–· #torch (10 messagesšŸ”„):

torch compile backend, libtorch, mojo, torchscript, gelu+mul fusion

  • Graphviz Backends Not Ready for Torch Compile: A member inquired about a torch.compile backend that spits out graphviz, and another responded that they are moving towards producing libtorch-free binaries using torch.compile.
    • They further claimed that there’s no clever way of loading the model with TorchScript.
  • Mojo Unlikely to Bypass Python’s GIL: A member asked if anyone has used mojo to bypass Python’s GIL.
    • No response was provided, so it’s safe to assume the answer is NO.
  • Compiling Gelu+Mul Fusion for Benchmarking: A member asked how to get torch.compile to correctly and reliably fuse gelu+mul for benchmarking purposes, using PyTorch version 2.8, to compare against their Triton kernel.
    • No response was provided, so it’s safe to assume the fusion is proving difficult!
  • DDP/FSDP and Compilation Conventions: A member inquired about the general convention for compiling a model before wrapping it around DDP/FSDP1/FSDP2.
    • Another member pointed to torchtitan’s implementation as a reference, which does a weird per-block compile thing beforehand, possibly to work around some torch compile bugs.
  • Numerical Issues Plague FSDP: A member reported having problems with numerical issues with FSDP and has disabled torch compile completely.
    • They claim that it doesn’t do a lot for them but the torchtitan authors need to compile the flex attention and hopefully fuse some of their sequence parallel TP stuff, and the block-wrapping was a compromise.

Link mentioned: torchtitan/torchtitan/models/llama3/parallelize_llama.py at main Ā· pytorch/torchtitan: A PyTorch native library for large model training. Contribute to pytorch/torchtitan development by creating an account on GitHub.


GPU MODE ā–· #announcements (1 messages):

GPU Mode Website, Active Leaderboards, Website Feedback

  • GPU Mode Launches New Website: Thanks to the hard work of two members, GPU Mode launched a new website.
    • The website includes active leaderboards, links to lectures on YouTube, and their GitHub repo.
  • Leaderboard Status Shows H100 Dominance: The website features active leaderboards for A100, T4, H100, and L4 GPUs, with several leaderboards showing results for H100.
    • For example, in one leaderboard ending in 21 days, ajhinh ranked first on H100 with 7574.126μs.
  • Feedback Wanted on Website Features: The team is soliciting feedback on what to add to the website.
    • Current features include leaderboard statuses, YouTube lectures, and the GitHub repo; feedback can be provided in a designated channel.

Link mentioned: Leaderboards – GPU MODE: no description found


Llama 4, Triton Distributed, Tensara Triton Support, AMD Instinct MI325X Performance

  • Llama 4 Arrives with Multimodal Prowess: Meta introduces Llama 4, the latest iteration, boasting enhanced personalized multimodal experiences and featuring Llama 4 Scout, a 17 billion parameter model with 16 experts (blog post here).
    • It claims to outperform Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 and fit on a single NVIDIA H100 GPU, with an industry-leading context window of 10M.
  • ByteDance Releases Triton-distributed for Parallel Systems: ByteDance-Seed releases Triton-distributed, designed to extend the usability of Triton language (github here).
    • The new release is for parallel systems development.
  • Tensara Adds Triton Support for GPU Kernel Challenges: Tensara now supports Triton, inviting users to compete in kernel optimization challenges and climb global leaderboards (homepage here).
    • Recent updates include PyTorch-based test cases, 3D/4D Tensor matmul problems, and activation functions like Sigmoid and Tanh.
  • AMD’s Instinct MI325X Shows Strong MLPerf Inference Performance: AMD Instinctā„¢ MI325X GPUs demonstrate robust performance in MLPerf Inference v5.0, excelling in GenAI, LLMs, and reasoning models (blog here).
    • Results indicate a necessity for innovative GPU architectures tailored for AI transformation.

Links mentioned:


GPU MODE ā–· #jobs (6 messages):

Qualcomm AI Engineer Hiring, Suno ML roles and H100 resources, Zero latency music creation

  • Qualcomm Seeks AI Engineer Team Lead: Qualcomm is hiring an AI Engineer/Team Lead with a strong background in deep learning to design/deploy SOTA models, focusing on accuracy-latency Pareto optimality.
    • Interested candidates are asked to provide a short summary along with their CV or portfolio.
  • Suno’s ML Talent Hunt: Suno is hiring for all ML related roles, touting a small, well-resourced team with hundreds of H100s per researcher.
    • They are targeting zero latency music creation so that people can jam with AI in real time.
  • Zero Latency Music Creation Sounds Sick: Suno aims to achieve zero latency music creation, enabling real-time AI jamming.
    • A user expressed hope that Suno could be a VSTi in Ableton.
  • Suno Internships Abound: A user asked about internship opportunities at Suno, praising the platform.
    • No response was given.

GPU MODE ā–· #beginner (19 messagesšŸ”„):

Centralized GPU programming language, OpenCL and SYCL, ROCm and HIP, 4-bit operations in CUDA for LLMs, Performance roofline models and arithmetic intensity

  • Why One GPU Programming Language Doesn’t Rule Them All: A newbie in GPU programming inquired why there isn’t a centralized GPU programming language like C given the existence of CUDA for NVIDIA and ROCm for AMD.
    • An expert explained that OpenCL and SYCL exist but aren’t mainstream due to poor support from vendors like NVIDIA, suggesting that the interface for OpenCL is old and C-adjacent.
  • ROCm’s Dual Nature: AMD’s CUDA Toolkit and HIP: ROCm is AMD’s CUDA Toolkit, while HIP is AMD’s CUDA C++ that supports Nvidia hardware and compiling to PTX, but not Intel or others.
    • This offers a degree of cross-platform capability, though not universally.
  • Navigating 4-Bit Operations in CUDA for LLMs: A user inquired about how to perform 4-bit operations in CUDA for LLMs, such as matmul.
    • Another member recommended asking in the specific CUDA channel and being more specific about the operations.
  • Deciphering Arithmetic Intensity in Performance Roofline Models: A member questioned the common practice of calculating bytes accessed in GEMM by summing matrix sizes (MN + MK + KN) for arithmetic intensity in performance roofline models.
    • Another member clarified that this is a simplification for establishing a theoretical maximum, and that it is realistic for newer GPUs with large L2 caches, where one input matrix may fit entirely in L2 (a worked example follows this list).
  • Jumpstart CUDA Learning with Custom Projects: A user asked for beginner-friendly CUDA projects and another user suggested learning through stuff YOU find interesting.
    • It was recommended creating something that requires a decent amount of multithreading or parallelism, such as linear algebra operations without a library, to simulate the concept of pipelining.
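
A quick worked example of that simplification, counting 2MNK FLOPs (one multiply and one add per output MAC) against a single read of A and B plus a single write of C:

def gemm_arithmetic_intensity(M, N, K, bytes_per_elem=2):
    # Roofline-style estimate for C = A @ B, fp16 elements by default.
    flops = 2 * M * N * K
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)  # A, B read; C written
    return flops / bytes_moved

for n in (256, 1024, 4096):
    # Grows as n/3 for square fp16 GEMMs: ~85, ~341, ~1365 FLOPs/byte.
    print(n, round(gemm_arithmetic_intensity(n, n, n), 1))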

GPU MODE ā–· #torchao (2 messages):

Int4WeightOnlyConfig, torch.compile for speedup, Compiling individual submodules

  • Dequant with Int4WeightOnlyConfig benefits from torch.compile: A member was trying to integrate Int4WeightOnlyConfig and asked if torch.compile is needed to speed up the dequant process.
    • Another member suggested that they can try to compile individual submodules by calling torch.compile on the submodules.
  • torch.compile submodules for efficiency: To compile only the int4 modules, a member suggests iterating through the model’s named modules and using torch.compile on specific submodules, such as torch.nn.Linear.
    • The suggested code snippet is:
for name, module in list(model.named_modules()):
    if isinstance(module, torch.nn.Linear):
        # named_modules yields dotted paths, so resolve the parent module first
        parent_name, _, attr = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, attr, torch.compile(module))

GPU MODE ā–· #irl-meetup (3 messages):

Silicon Valley Meetups, SF Meetups, Summer Intern Meetups

  • Silicon Valley Summer Meetups?: An intern in the area asked if there would be any meetups in Silicon Valley this summer and offered to help organize one.
  • SF Meetup Planned Later This Year: A member confirmed that a meetup is being planned in San Francisco for later this year, though specific dates were not mentioned.

GPU MODE ā–· #self-promotion (37 messagesšŸ”„):

RL fine-tuning with sandboxed code interpreter, Gemma 3 QAT vs HQQ, Wavespeed AI inference API, Vector Sum CUDA Kernel optimization, Tom and Jerry video generation with transformers

  • RL Code Fine-Tuning Toolset Showcased: A member shared a toolkit for fine-tuning coding models using reinforcement learning with a local, zero-setup sandboxed code interpreter.
    • They found very promising results using a tiny fraction of data and training time versus traditional supervised fine-tuning and look forward to expanding it from Python to other languages, such as in HIP Script.
  • HQQ Quantization Beats QAT for Gemma 3: A member evaluated Gemma 3 12B QAT vs. HQQ, finding that HQQ takes a few seconds to quantize the model and outperforms the QAT version (AWQ format) while using a higher group-size.
    • With GemLite bfp16 support, quantized Gemma 3 can run faster without performance issues.
  • Wavespeed AI touts efficient inference API: The CEO of Wavespeed AI touted their platform’s fastest and most efficient AI image & video inference API such as FLUX and Wan with LoRA.
    • They offer competitive custom pricing and hope to establish a win-win model to grow together.
  • Vector Sum Kernel achieves SOTA: A member shared a blogpost and code on achieving SOTA performance for summing a vector in CUDA, reaching 97.94% of theoretical bandwidth, outperforming NVIDIA’s CUB.
  • Tom and Jerry Cartoons Generated with Diffusion Transformers: A team completed a project creating 1 minute long Tom and Jerry cartoons by finetuning a diffusion transformer, accepted to CVPR 2025, with code released on GitHub.
    • The model leverages Test-Time Training (TTT) layers within a pre-trained Transformer, enabling it to generate coherent videos from text storyboards, outperforming baselines like Mamba 2.

Links mentioned:


GPU MODE ā–· #reasoning-gym (18 messagesšŸ”„):

Curriculum Learning for Reasoning, Llama 3 vs Qwen 2.5, Dream 7B Diffusion Model, Llama 4 Maverick coding, Claude Think Tool

  • Curriculum Learning Elicits Reasoning: A member is experimenting with curriculum learning to elicit reasoning behavior in weaker LLMs like Llama-3.2-3B, by using easier reasoning tasks and gradually increasing difficulty to prime the model without SFT.
    • Another member mentioned that another user has already done some work on curriculum learning with RG and found better results compared to the same tasks without curricula, which is supported by the training/ dir in the main branch.
  • Qwen 2.5 Beats Llama 3.2 in Training: Members have been mostly using Qwen 2.5 3B over Llama 3.2 3B because Qwen seems to be a bit easier to train for reasoning.
    • This agrees with the findings in the ā€˜4 Habits’ paper, in which Llama 3.2 struggled with backtracking and sub-goal setting without first using SFT.
  • Dream 7B Diffuses Reasoning: The Dream 7B (HKU Blog Post), a diffusion based LLM, seems to show really good success on the kind of problems that the channel has, which might make it a really good candidate for gym training, especially looking at sudoku.
    • Dream 7B consistently outperforms existing diffusion language models by a large margin and matches or exceeds top-tier Autoregressive (AR) language models of similar size on the general, math, and coding abilities.
  • Llama 4 Maverick Aider Score Revealed: Llama 4 Maverick scored 16% on the Aider polyglot coding benchmark.
    • This was referenced in a message on X, discussing coding benchmarks.
  • Claude Thinks with Tool Use: A member shared a link to Anthropic’s Claude Think Tool.
    • It wasn’t specifically discussed how this relates to Reasoning Gym.

Links mentioned:


GPU MODE ā–· #gpuęØ”å¼ (3 messages):

Deepseek communication library, NVSHMEM and UVA, Peer-to-peer GPU communication

  • Deepseek Leverages NVSHMEM Library: The Deepseek communication library is built off the NVSHMEM library from NVIDIA, allowing for high-performance communication.
    • A member inquired whether NVSHMEM utilizes Unified Virtual Addressing (UVA) for intra-node, inter-GPU communication, specifically whether UVA enables peer-to-peer loads/stores to data stored in a remote GPU connected by something like NVLink.

GPU MODE ā–· #general (1 messages):

leikowo: any way to have a ptx torch extension (not cuda with inline ptx) ?


GPU MODE ā–· #submissions (24 messagesšŸ”„):

matmul Leaderboard submissions, vectoradd Benchmark Submissions, Modal Runners success, grayscale Leaderboard submissions

  • Modal Runners Deliver Matmul Masterpieces: Multiple leaderboard submissions for matmul benchmark on H100, A100, T4, L4 GPUs using Modal runners were successful, with IDs ranging from 3440 to 3453.
  • Vectoradd Victorious with Modal on L4: Several benchmark submissions for vectoradd on L4 GPUs using Modal runners succeeded, including submissions with IDs from 3464 to 3506.
  • Grayscale Gauntlet Gets Green Light: A test submission (ID 3447) and leaderboard submission (ID 3503) for grayscale benchmark on A100, H100, L4, T4 GPUs using Modal runners were successful.

GPU MODE ā–· #ppc (5 messages):

libsanitizer-collection.so, compute-sanitizer, LD_LIBRARY_PATH

  • Troubleshooter Seeks libsanitizer-collection.so Solution: A member is encountering an issue where the grader can’t find libsanitizer-collection.so when running compute-sanitizer during a ./grading test for i8mm/gpu_blm.
    • They tried setting LD_LIBRARY_PATH=/usr/lib/nvidia-cuda-toolkit/compute-sanitizer based on googling, but it had no effect.
  • Compute Sanitizer error with i8mm: A member reported a compute-sanitizer error where the system was Unable to find injection library libsanitizer-collection.so.
    • The error occurred during a test run of i8mm with the command compute-sanitizer --tool memcheck.
  • DĆ©jĆ  Vu Debugging: Another member recalled encountering the libsanitizer-collection.so issue previously.
    • They could not quite remember what the solution had been.

GPU MODE ā–· #feature-requests-and-bugs (2 messages):

Leaderboard Units, Nanos vs Millis, Discord Cluster Manager

  • Leaderboard Display Units Clash!: A user noted a discrepancy in the leaderboard’s time units: the web leaderboard displays nanoseconds while the Discord leaderboard shows milliseconds.
    • A member responded that a new leaderboard website is prepared which converts to an optimal unit for clarity.
  • New Leaderboard Website Incoming: The new site converts every result to an optimal display unit (a small formatter sketch follows below).
    • This should resolve the nanoseconds-versus-milliseconds mismatch on the current site.
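
A sketch of what such an "optimal unit" conversion might look like (the site's actual logic is an assumption): pick the largest unit that keeps the value at or above 1.

def format_time(ns: float) -> str:
    # leaderboard times stored in nanoseconds; render in the largest
    # unit that keeps the value >= 1
    for unit, scale in (("s", 1e9), ("ms", 1e6), ("us", 1e3), ("ns", 1.0)):
        if ns >= scale:
            return f"{ns / scale:.3f} {unit}"
    return f"{ns:.3f} ns"

print(format_time(1_234_567))   # -> 1.235 ms
print(format_time(850))         # -> 850.000 ns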

GPU MODE ā–· #hardware (2 messages):

Local LLM inference, Fine-tuning, GPU selection, L40 vs A100, Quantization

  • Local LLM Rig Build for Org: Members are considering building a small rig for organizational LLM tasks like summarization, chatbots, and text generation, exploring options like L40 or A100 GPUs.
    • The primary focus is on optimizing for 4-bit and 8-bit model inference and potential fine-tuning, with local prices running roughly 5-10% above US prices.
  • L40 Underperformance Puzzle: Despite looking good on paper for 4-bit quantized Llama 3 70B, the L40 only achieves 30-35 tok/s on single-user requests via vLLM, underperforming published A100 benchmarks.
    • The performance gap may come down to the A100’s superior DRAM bandwidth and tensor-op throughput, both roughly twice the L40’s (see the back-of-envelope estimate after this list).
  • Exploring Quantization and Optimization Strategies: The discussion suggests exploring TensorRT and specific quant formats to improve the L40’s performance.
    • Despite the L40’s FP8 support and larger L2 cache, these advantages don’t seem to translate into better performance than the A100 in current setups.
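
A back-of-envelope check on the bandwidth explanation: single-user decode is memory-bound, so tokens/s is roughly DRAM bandwidth divided by the bytes of weights read per token (bandwidth figures below are approximate datasheet values).

weights_bytes = 70e9 * 0.5           # 70B params at 4 bits/param ~= 35 GB

for gpu, bw in [("L40", 864e9), ("A100 80GB", 2039e9)]:
    print(f"{gpu}: ~{bw / weights_bytes:.0f} tok/s ceiling")
# -> L40 ~25 tok/s vs A100 ~58 tok/s: the ~2.4x bandwidth gap alone
#    accounts for most of the observed difference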

Notebook LM ā–· #use-cases (14 messagesšŸ”„):

Interactive voice mode, Mind maps rollout, Website URL use cases, Commercial scale version of NotebookLM

  • Interactive Voice Mode Inspires!: A user said the interactive voice mode was an interesting way of getting them to think about ideas.
    • After working since January on building a solid NotebookLM foundation, they said they can now make almost every text work and are confident they can help corporations set up notebooks tailored to their specific needs.
  • Mind Maps Finally Go Live!: Users reported the mind maps feature has been fully rolled out, appearing in the middle panel for some, while others are still waiting.
    • One user mentioned seeing it briefly on the right side panel before it disappeared.
  • Audio Overview identifies website as a book: A user inquired about use cases with a website URL, noting the Audio Overview incorrectly identified a website as a book.
    • Another user suggested the source type/genre is identified based on the source’s content/format, and running it again with a ā€œcustomizationā€ specifying it’s a website resolved the issue.
  • Commercial NotebookLM Version Inquired: A user asked if there is a commercial scale version of NotebookLM, where the data is not in the public domain, and specific programming or prompts can be entered.

Notebook LM ā–· #general (154 messagesšŸ”„šŸ”„):

NotebookLM's Discover feature rollout, Gemini 2.5 family, Mind Map evolution with generative AI, YouTube audio EQ Chrome extension, Google Cloud Next and Google I/O events

  • Theorizing Image-Based Mind Map Revolution: Users discussed how generative AI tools could soon evolve mind maps to include images, drawing inspiration from Tony Buzan’s original mind maps.
    • Members expressed excitement about the potential for more visually rich and informative mind mapping.
  • Discover feature Rollout Delays Frustrate Users: Users have expressed frustration over the delayed rollout of the new ā€˜Discover Sources’ feature in NotebookLM: announced April 1st, it has been rolling out for over a week and is expected to take up to two weeks to reach full availability.
    • The feature promises to streamline learning and database building by allowing users to create notebooks with sources directly within NotebookLM, eliminating the need to search outside the platform; one user even shared a Peter Griffin ā€˜But I want it now’ GIF.
  • NotebookLM still on Gemini 2.0; 2.5 Tunability Teased: Currently, NotebookLM utilizes the Gemini 2.0 Thinking model, though its effectiveness versus the Flash model in this context remains under evaluation.
    • Gemini 2.5 is confirmed to be a family of models including a Flash version and 2.5 Pro will soon be tunable, enabling developers to adjust its ā€˜thinking’ intensity.
  • Chrome Extension tunes YouTube audio with AI: A member created an AI-powered Chrome Extension called EQ for YouTube which allows users to manipulate the audio of YouTube videos in real-time with a 6-band parametric equalizer; the extension has features for real-time frequency visualization, built-in presets, and custom preset creation.
  • NotebookLM’s Language Change Explained: To change the language in NotebookLM, use the URL https://notebooklm.google.com/?hl=LANGUAGE_CODE, replacing LANGUAGE_CODE with the desired language code (e.g., es for Spanish).
    • While the team acknowledged a previously identified translation bug (since resolved), the podcast output cannot be translated at this time.

Modular (Mojo šŸ”„) ā–· #general (28 messagesšŸ”„):

Nvidia CUDA Python Support, Mojo GenAI, CuTile Programming Model, SIMD vs SIMT, Tenstorrent and Modular

  • Nvidia Adds Native Python Support to CUDA: A member shared a link to an article, Nvidia Finally Adds Native Python Support to CUDA, questioning if it’s the empire strikes back.
    • The article discusses Nvidia’s approach to GPU execution using the CuTile programming model, abstracting away from thread-level programming.
  • Can Mojo Tackle GenAI?: A member wondered if Mojo is capable enough to develop GenAI or Inference.ai already.
    • This sparks discussion on the current capabilities and potential of Mojo in the field of Generative AI.
  • CuTile Programming Model Questioned: A member expressed reservations about Nvidia’s CuTile programming model, viewing it as a higher-level abstraction that removes the fun from writing GPU code.
    • They stated: there taking the fun out of writing gpu code.
  • SIMD vs SIMT: A member is working on a proof-of-concept model, noting that modern parallel compute is hard to reason about through a typical threading model.
    • Discussion arose around exposing an SM as a big SIMD core with masking, and whether SIMD or SIMT is more appropriate, considering hardware flexibility and potential limitations.
  • Tenstorrent Software Stack: A member suggested that Tenstorrent should use Modular’s software stack, but another member noted that Tenstorrent’s driver is incredibly easy to target and use.
    • They stated: their driver is incredibly easy to target and use though, so while making effective use of their architecture might require some tinkering, just getting something that runs on it seems almost trivial

Modular (Mojo šŸ”„) ā–· #mojo (85 messagesšŸ”„šŸ”„):

Auto Lowering, MLIR Interpreter stress test, Implicit ctor hack, Mojo language spec, Mojo implicit copies

  • Auto Lowering Achieved: A member discovered that auto lowering can be achieved when adding values from different scales, sharing a link to the code.
    • The member noted that it might be my most cursed work yet however.
  • MLIR Interpreter Faces Stress Test: A member commented that the time interval library might turn into a stress test of the MLIR interpreter.
    • Another member added that something didn’t work as expected but can be remedied with an implicit ctor hack.
  • Mojo Spec Debate Heats Up: A discussion started around whether Mojo will have a spec, with some arguing that it gives a language responsibility and maturity, referencing the design principles of Carbon.
    • Others countered that Mojo’s design is tightly coupled to what MAX needs and that a spec would slow down development, with one member stating that Chris Lattner blames ā€œdesign by committeeā€ for the failure of OpenCL.
  • Mojo’s Copy Semantics clarified: A member inquired whether Mojo’s implicit copies use Copy-on-Write (CoW).
    • Another member clarified that semantics wise, always copy; optimisation wise, many are turned into move or eliminated entirely (inplace). Though it happens at compile time, CoW is a run time thing.
  • ChatGPT’s Mojo Skills Under Scrutiny: A member asked if ChatGPT or alternatives are good enough to rewrite a large Python project in Mojo.
    • Another member responded that ChatGPT isn’t good at any new languages.

Link mentioned: ChronoFlare/chronoflare/__init__.mojo at main Ā· bgreni/ChronoFlare: A time interval library written in mojo. Contribute to bgreni/ChronoFlare development by creating an account on GitHub.


Nomic.ai (GPT4All) ā–· #general (54 messagesšŸ”„):

Nomic Embed Text V2, GPT4All release cadence, Llama 4 release, ComfyUI for multimodal tasks, Semantic chunking

  • Nomic Embed Text V2 integration is coming to Llama.cpp: A member shared a link to a GitHub Pull Request that shows Llama.cpp working on integrating Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture for multilingual embeddings.
    • Another member expressed that everything hangs on Llama.cpp and hoped for Mistral Small 3.1 multimodal support.
  • GPT4All Silent Treatment Troubles Users: Members are noticing a period of silence from core developers, with one member mentioning that this causes uncertainty about contributing to the app and the community.
    • The same member suggested this might not be a good policy for an open project, but that when they break their silence, they usually come out swinging.
  • Llama 4 is here, but is it the greatest?: Meta released Llama 4 on April 5, 2025 (announcement), featuring Llama 4 Scout, a model with 17B active parameters, 16 experts, and a 10M token context window.
    • Though some users are excited for the release, others expressed that it is a bit of a letdown and that DeepSeek and Qwen need to step up their game, while another noted the largest model has 2 Trillion parameters.
  • ComfyUI is more than just a pretty face for image generation: Members discussed the extensive capabilities of ComfyUI, noting that you can do a lot with comfy if you have the nodes including image and audio captioning.
    • Another member mentioned the possibility of video processing and described using command-line tools for visual model analysis.
  • Semantic chunking server recipe for delicious RAG: A member shared a link to a semantic chunking server implemented with FastAPI.
    • The member also shared a curl command example for posting to the chunking endpoint, showing how to set parameters like max_tokens and overlap (an equivalent Python client sketch follows below).
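
An equivalent Python client for such an endpoint; the URL, route, and response shape are assumptions, and only the max_tokens and overlap parameters come from the discussion.

import requests

resp = requests.post(
    "http://localhost:8000/chunk",    # hypothetical route on the FastAPI server
    json={
        "text": open("document.txt").read(),
        "max_tokens": 256,            # max tokens per chunk
        "overlap": 32,                # tokens shared between adjacent chunks
    },
    timeout=60,
)
resp.raise_for_status()
for chunk in resp.json().get("chunks", []):
    print(len(chunk), chunk[:60])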

LlamaIndex ā–· #blog (3 messages):

MCP Servers, Full-Stack Agent Application, LlamaParse Layout Agent

  • MCP Servers get CLI Tooling: A tool by @MarcusSchiesser lets you easily discover, install, configure, and remove new MCP servers from a single CLI interface, supporting Claude, @cursor_ai, and @windsurf_ai, as shown here.
    • There are hundreds of official MCP servers out there.
  • Create Llama for Full-Stack Agents: The create-llama CLI tool lets you spin up a web application with a FastAPI backend and Next.js frontend in a single line of code, creating just 5 source files as shown here.
    • This is meant to jumpstart agent application development like deep research.
  • LlamaParse Launches Layout Agent: A brand-new layout agent within LlamaParse gives you best-in-class document parsing and extraction with precise visual citations, using SOTA VLMs to detect all the blocks on a page.
    • The new agent dynamically adapts to each page, as shown here.

LlamaIndex ā–· #general (46 messagesšŸ”„):

Workflow as a Tool, Multi-Agent System with Supervisor Pattern, RAG System with LlamaParse, Scalability Issue with DocumentSummaryIndex, Tools retry when exception occurred

  • Wrap Workflows as Tools with FunctionTool: To transform a Workflow into a Tool, one can use the FunctionTool to wrap the workflow and gain control over its name, description, input annotations, and return values.
    • A member suggested a code snippet:
# illustrative sketch: import added; the query parameter stands in for the original's "..."
from llama_index.core.tools import FunctionTool

async def tool_fn(query: str) -> str:
    """Some helpful description"""
    # kwargs passed to run() feed the workflow's start event; `workflow` is assumed in scope
    result = await workflow.run(query=query)
    return str(result)

tool = FunctionTool.from_defaults(tool_fn)
  • Agent Handoffs Supersede Supervisor Pattern: When building a multi-agent system, it is more robust to have agents handoff between each other as needed, instead of using a supervisor pattern, which can be more error prone.
    • A GitHub repo was shared as an example of a supervisor pattern implementation.
  • Replicate Document Summary Index with Vector Store Index: The DocumentSummaryIndex may have scalability issues; it’s advised to replicate its functionality using a normal VectorStoreIndex by summarizing documents, indexing with reference IDs, and swapping summary nodes with the original document during retrieval.
    • When using load_index_from_storage, the index store is loaded into memory, which adds latency as more documents are ingested.
  • Context’s State Prepending to user_msg: To stop state content from being prepended to the user message, avoid using the reserved state key in the context and pass data between tools under other keys.
    • A suggestion was to use ctx.set("some_key", "some_val") and ctx.get("some_key") instead (a minimal workflow sketch follows below).
  • Implement Text-to-SQL Query Engine Tool: When implementing a text-to-SQL query engine tool for an agent, if there are only a few tables, it is not necessary to create an index of table descriptions and perform a vector query.
    • In cases of a small number of tables, the index and vector search parts can be skipped for better performance.
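
A minimal workflow sketch of the ctx.set/ctx.get suggestion above, based on the llama_index workflow API (exact signatures may vary by version):

from llama_index.core.workflow import Context, StartEvent, StopEvent, Workflow, step

class MyFlow(Workflow):
    @step
    async def first(self, ctx: Context, ev: StartEvent) -> StopEvent:
        # scratch space shared between steps/tools, kept out of user_msg
        await ctx.set("some_key", "some_val")
        val = await ctx.get("some_key")
        return StopEvent(result=val)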

Link mentioned: GitHub - run-llama/multi-agent-concierge: An example of multi-agent orchestration with llama-index: An example of multi-agent orchestration with llama-index - run-llama/multi-agent-concierge


tinygrad (George Hotz) ā–· #general (16 messagesšŸ”„):

torch-geometric for tinygrad, Llama 4 10M context limitations, fast pattern matcher bounty, UOps generation, tinygrad YouTube video

  • Tinygraph: Torch-geometric for Tinygrad?: A member inquired about the feasibility of creating a module similar to torch-geometric for graph ML within tinygrad, considering tinygrad’s existing torch interface.
    • They questioned whether it would be ā€œusefulā€ to pursue such a module.
  • Llama 4’s long context may not be so good: A user shared a tweet claiming Llama 4’s declared 10M context is ā€œvirtualā€ because the models were not trained on prompts longer than 256k tokens.
    • The poster also stated that even problems below 256k tokens may yield low-quality output due to the difficulty of obtaining high-quality training examples, and that the largest model, with 2T parameters, ā€œdoesn’t beat SOTA reasoning modelsā€.
  • $2000 Fast Pattern Matcher Bounty is available: A member highlighted an open $2000 bounty for a fast pattern matcher in tinygrad.
    • The proposed solution involves a JIT for the match function, avoiding function calls and dict copies.
  • Reduce UOps to Speed Up Rewrite: It was suggested that tinygrad sometimes generates more UOps than needed, increasing the cost to rewrite.
    • A member asked if it would be acceptable to sacrifice a few lines to generate fewer UOps initially, even if they are later optimized to the same result.
  • Tinygrad YouTube video shared: A member shared a link to a YouTube video.
    • No additional details were given.

tinygrad (George Hotz) ā–· #learn-tinygrad (24 messagesšŸ”„):

Tensor and SimpleMathTrait inheritance, Mesozoic tinygrad tutorials issues, METAL sync issue, AMD and BEAM issues

  • Debate About Tensor Inheriting SimpleMathTrait: A discussion arose regarding whether Tensor should inherit from SimpleMathTrait, given that it re-implements every method that SimpleMathTrait provides without using the .alu() function.
    • It was noted that a previous bounty for refactoring Tensor to inherit from MathTrait was canceled due to poor submissions, with some suggesting Tensor may not need to inherit from either.
  • Colab CUDA Bug Causes Mesozoic Tinygrad Tutorial Issues: A user encountered issues while running code from the mesozoic tinygrad tutorials in Colab, prompting others to request the error message for debugging.
    • It was identified as a Colab bug related to incompatible CUDA and driver versions; a workaround using specific apt commands to remove and install compatible versions was proposed, and in the meantime falling back to the CPU device was suggested.
  • METAL Sharding Behavior Leads to Unexpected Results: A member encountered unexpected behavior in sharding while trying to reproduce a minimal example of a METAL sync issue, suspecting that the COPY from METAL:1 to CPU might be executing before the XFER from METAL to METAL:1 completes.
    • The DEBUG output seemed to show the timeline adding the XFER when it was committed to the GPU command queue, not when it completed.
  • AMD and BEAM cause AssertionError: A user encountered an AssertionError when running with BEAM=2 and AMD=1, which seemed to be caused by opening the device outside of the if __name__ == "__main__" block.
    • Setting PARALLEL=0, or ensuring the device is opened within the if __name__ == "__main__" block, resolved the issue (see the sketch below).
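
A generic illustration of the fix (Device["AMD"] is assumed to be the relevant tinygrad entry point): initialize the device only in the main process so spawned workers don't re-open it on import.

from tinygrad import Device   # importing is safe; opening the device is not

def main():
    dev = Device["AMD"]       # open the device here, not at module top level
    ...

if __name__ == "__main__":
    main()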

Cohere ā–· #ć€ŒšŸ’¬ć€general (19 messagesšŸ”„):

MCP with Command-A model, Cohere Tool Use, Cohere Scholars Program, Events Recording

  • MCP use with Command-A Model Explored: A member inquired about using MCP (Model Context Protocol) with the Command-A model, suggesting it should work via the OpenAI SDK (a sketch of that route appears after this list).
    • Another member agreed, stating that there is no reason why it should not work.
  • Cohere Tool Use Capabilities Detailed: A member shared the Cohere Tool Use Overview, highlighting its ability to connect Command family models to external tools like search engines, APIs, and databases.
    • It also mentions that Command-A supports tool use, similar to what MCP aims to achieve.
  • Cohere Scholars Program Details Shared: A member asked about the requirements for the Cohere Scholars Program, specifically if prior publications are accepted.
  • Inquiry about Events Recordings: A member inquired whether Cohere events are recorded, as they were interested but unable to attend the live sessions.
    • The question remained unanswered in the provided context.
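
A sketch of the OpenAI-SDK route to Command-A mentioned above; the base URL and model id are assumptions to check against Cohere's current docs.

from openai import OpenAI

client = OpenAI(
    api_key="<COHERE_API_KEY>",
    base_url="https://api.cohere.ai/compatibility/v1",   # assumed endpoint
)
resp = client.chat.completions.create(
    model="command-a-03-2025",                           # assumed model id
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)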

Cohere ā–· #ć€šŸ“£ć€‘announcements (1 messages):

Aya Vision, Multilingual Multimodal Models, Open Weights Model

  • Aya Vision Team Hosts Tech Talks and AMA: The core team behind Aya Vision, a multilingual multimodal open-weights model, is hosting tech talks followed by an AMA on April 11, 2025 at 15:00 UTC.
    • Attendees can join for exclusive insights on how the team built their first multimodal model and the lessons learned, with the event hosted by a Sr. Research Scientist and lightning talks from core research and engineering team members; further details are available at the Discord Event.
  • Multilingual Model Aya Eyes Community Feedback: The team has scheduled an Ask Me Anything to allow the community to directly engage with the creators.
    • Questions can be about anything from model architecture to future roadmap.

Cohere ā–· #ć€ŒšŸ”Œć€api-discussions (5 messages):

Notion Connector, Vector DB for Notion

  • Slack app struggles with Notion integration: A member asked for help with a working solution for a Slack app integration with a company Notion wiki database.
  • Vector DB Recommended to bolster Notion: A member suggested using a vector DB due to Notion’s subpar search API.
    • No specific recommendations were given; it was noted that Cohere models work well with all vector DBs (a minimal embedding sketch follows below).
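
A minimal sketch of that pattern: embed exported Notion text with Cohere and hand the vectors to whichever vector DB you prefer (the Notion-export step is assumed).

import cohere

co = cohere.Client("<COHERE_API_KEY>")
pages = ["Onboarding checklist ...", "Expense policy ..."]   # exported Notion text

resp = co.embed(
    texts=pages,
    model="embed-english-v3.0",
    input_type="search_document",   # use "search_query" when embedding queries
)
vectors = resp.embeddings           # one float vector per page, ready for any vector DB
print(len(vectors), len(vectors[0]))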

Cohere ā–· #ć€ŒšŸ¤–ć€bot-cmd (3 messages):

greetings

  • Users greet each other: Two users exchanged greetings in the ć€ŒšŸ¤–ć€bot-cmd channel, using ā€œheyā€ and ā€œsupā€.
    • The Cmd R Bot acknowledged the interaction with a casual ā€œsupā€.

Torchtune ā–· #dev (22 messagesšŸ”„):

Fix for Timeout Crash, NeMo Resilient Training, RL Workflow, DeepSpeed Integration

  • Timeout Crash Bug Fixed: A member fixed a bug related to timeout crashes and created torchtune.utils._tensor_utils.py with a wrapper around torch.split in this pull request (a hedged sketch of such a wrapper appears after this list).
    • They suggested merging the tensor utils separately and then syncing with another branch to handle any conflicts.
  • NeMo Tackles Resilient Training: A member attended a NeMo session on resilient training, highlighting features such as fault tolerance, straggler detection, asynchronous checkpointing, preemption, in-process restart, silent data corruption detection, and local checkpointing.
    • Not all of these are implemented, with some only planned; the member offered to rewatch and present details comparing torchtune vs. NeMo in terms of resiliency.
  • RL Workflow, Data Standard Format, and Prompts: A member discussed the complexities of RL workflows, data formats, and prompt templates, suggesting a separation of concerns to decouple data conversion and prompt creation, allowing the same templates to be re-used across datasets.
    • The member suggested factorizing into a component that converts the data into a standard format, and another component that takes this standard format and converts it into the actual string with the prompt.
  • DeepSpeed backend for Torchtune?: A member inquired about integrating DeepSpeed as a backend into torchtune and created an issue to discuss the possibility.
    • Another member asked for more context, noting that FSDP supports all the sharding options from DeepSpeed.
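
A hedged sketch of what a torch.split wrapper like the one in the PR might look like (the actual torchtune helper may differ):

import torch

def safe_split(t: torch.Tensor, split_size: int, dim: int = 0) -> list[torch.Tensor]:
    """Split t along dim, tolerating empty tensors and oversized split sizes."""
    if t.numel() == 0 or t.size(dim) == 0:
        return [t]
    return list(torch.split(t, min(split_size, t.size(dim)), dim=dim))

chunks = safe_split(torch.arange(10), split_size=4)
print([c.tolist() for c in chunks])   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]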

Torchtune ā–· #papers (1 messages):

pjbontrager: You think they used AI to write that scrolling live updated chart?


LLM Agents (Berkeley MOOC) ā–· #mooc-announcements (1 messages):

AI4Math, Theorem Proving, Autoformalization, Formal Mathematical Reasoning, Language Models

  • Kaiyu Yang Presents on Autoformalization and Theorem Proving: Kaiyu Yang will present on ā€œLanguage models for autoformalization and theorem provingā€ today at 4pm PDT.
    • The presentation will cover the basics of using LLMs for formal mathematical reasoning, focusing on theorem proving and autoformalization.
  • AI4Math is Crucial for AI-Driven System Design: AI for Mathematics (AI4Math) is intellectually intriguing and crucial for AI-driven system design and verification; extensive efforts in the area have mirrored techniques from NLP.
    • The talk explores formal mathematical reasoning grounded in formal systems such as proof assistants, which can verify the correctness of reasoning and provide automatic feedback.

LLM Agents (Berkeley MOOC) ā–· #mooc-questions (4 messages):

LLM Agents MOOC, AgentX Competition, Course Quiz

  • LLM Agents MOOC link shared: A member asked for a link to the LLM Agents MOOC, another shared the link.
  • AgentX Competition Sign-Ups: Staff shared that sign-ups for the AgentX Competition are available here.
  • Course Quiz delayed: A member asked about the missing quiz for the previous week.
    • A staff member apologized for forgetting to post it and mentioned it would be available in a few minutes.

Link mentioned: Advanced Large Language Model Agents MOOC: MOOC, Spring 2025


DSPy ā–· #general (4 messages):

asyncio support, full-async fork of dspy, reasons to migrate

  • Asyncio support: will dspy be async?: A member inquired about plans to add asyncio support for general dspy calls (a stopgap wrapper sketch appears after this list).
    • They mentioned using LiteLLM initially and then growing into dspy optimization, expressing interest in native dspy async capabilities.
  • Full Async Fork Faces Abandonment?: A member has maintained a true full-async fork of dspy for a few months but is migrating away from dspy.
    • They are willing to continue merging upstream changes if there’s community interest but will abandon it otherwise.
  • Reasons to Migrate & Benefits of Async DSPy: Members expressed curiosity about the reasons for migrating away from dspy, and which tool is being migrated to.
    • One member asked about the advantages of having a full async DSPy and suggested merging relevant features into the main repository.
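
A common stopgap while native async support is discussed: run the synchronous DSPy call on a worker thread with asyncio.to_thread (sketch only; assumes an LM was configured beforehand).

import asyncio
import dspy

# assumes something like dspy.configure(lm=dspy.LM("openai/gpt-4o-mini")) ran earlier
qa = dspy.Predict("question -> answer")

async def aqa(question: str) -> str:
    # keeps the event loop free while the blocking LM call runs on a thread
    pred = await asyncio.to_thread(qa, question=question)
    return pred.answer

async def main():
    answers = await asyncio.gather(aqa("What is DSPy?"), aqa("Why asyncio?"))
    print(answers)

if __name__ == "__main__":
    asyncio.run(main())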

Gorilla LLM (Berkeley Function Calling) ā–· #discussion (3 messages):

GitHub PR Review, Phi-4 Support

  • GitHub PR Gets Eyeballed: A member mentioned reviewing a GitHub Pull Request, leaving comments for further discussion on the platform.
    • The author expressed gratitude for the review, acknowledging the effort put into it and indicating a need to rerun the process based on the feedback.
  • Phi-4 Family Support Considered: A member is considering extending functionality to Phi-4-mini and Phi-4, despite them not being officially supported.
    • This suggests an effort to broaden compatibility beyond the initially intended scope, potentially enhancing the tool’s appeal.

MLOps @Chipro ā–· #events (1 messages):

Manifold Research, Multimodal AI, Self-assembling space robotics, Robotic metacognition, Community Research Call

  • Manifold Research Hosts Community Research Call #4: Manifold Research Group is hosting Community Research Call #4 this Saturday (4/12 @ 9 AM PST), covering their latest work in Multimodal AI, self-assembling space robotics, and robotic metacognition.
    • Interested parties can register here to join the open, collaborative, and frontier science focused event.
  • CRCs are Manifold’s Cornerstone Events: Community Research Calls (CRCs) are Manifold’s cornerstone events where they present significant advancements across their research portfolio.
    • These interactive sessions provide comprehensive updates on ongoing initiatives, introduce new research directions, and highlight opportunities for collaboration.
  • CRC #4 Agenda Announced: The agenda for CRC #4 includes updates on Generalist Multimodality Research, Space Robotics Advancements, Metacognition Research Progress, and Emerging Research Directions.
    • The event will cover recent breakthroughs and technical progress in their MultiNet framework, developments in Self-Assembling Swarm technologies, updates on VLM Calibration methodologies, and the introduction of a novel robotic metacognition initiative.

Link mentioned: Community Research Call #4 Ā· Zoom Ā· Luma: Interested in generalist AI models, self-assembling space robots or machine self-awareness? Join us for Community Research Call #4!Community Research Calls…



{% else %}

The full channel by channel breakdowns have been truncated for email.

If you want the full breakdown, please visit the web version of this email: [{{ email.subject }}]({{ email_url }})!

If you enjoyed AInews, please share with a friend! Thanks in advance!

{% endif %}