a quiet day.
AI News for 7/2/2025-7/3/2025. We checked 9 subreddits, 449 Twitters and 29 Discords (220 channels, and 8382 messages) for you. Estimated reading time saved (at 200wpm): 703 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!
We'll also be taking tomorrow off, unless rumors of a Grok 4 release on July 4 come true.
AI Twitter Recap
Company & Leadership News
- Ilya Sutskever confirms leadership roles at Safe Superintelligence Inc. (SSI): In a major announcement, @ilyasut formally stated he is now CEO of SSI, with Daniel Levy as President. He confirmed that Daniel Gross is no longer part of the company as of June 29, while also dismissing acquisition rumors, stating, "We have the compute, we have the team, and we know what to do." @danielgross responded positively, saying he was "honored to have been able to assist" and expects "miracles to follow." The announcement sparked commentary, with some noting SSI's minimalist website design.
- Perplexity AI expands data integrations and product vision: CEO @AravSrinivas announced plans to integrate sell-side research from banks and has already made Morningstar's financial research reports available for free on Perplexity Finance. He also hinted at future product directions, stating that Perplexity for Notes, Meetings, and Brain Dumps will be native to Comet and promising Pro users will be "surprised soon."
- Meta clarifies research structure and hires key talent: @ZeyuanAllenZhu distinguished between Facebook AI Research (FAIR), a "small, prestigious lab" with limited GPUs, and larger model-training groups like GenAI and MSL. This follows news of Nat Friedman joining Meta to "make amazing AI products."
- Midjourney and Sakana AI are hiring: @DavidSHolz announced that Midjourney is actively hiring for research roles. Similarly, Sakana AI is expanding its Applied Team and is looking for Applied Research Engineers and interns for enterprise and public sector projects.
- Cohere expands its Canadian presence: The company highlighted its expansion in Montréal, with Canadian Minister @FP_Champagne praising the move.
Model Releases & Research Updates
- Gemini's Veo 3 video model goes global: @demishassabis announced that Veo 3, Google's state-of-the-art video generation model, is now available globally for all Gemini Pro users. The announcement was widely shared and highlights the expansion of access, including to Europe.
- DeepSeek releases faster, more capable models: @reach_vb announced DeepSeek R1T2, which is reportedly 200% faster than its predecessor and shows significant improvement on benchmarks like GPQA and AIME 24. The model was created using an Assembly of Experts approach and is available on Hugging Face under an MIT license. A variant, DeepSeek-TNG R1T2 Chimera, was also released.
- Kling AI showcases cinematic video generation: Video generation startup @Kling_ai released a highly cinematic short film about a father who wakes up in a new body every day, demonstrating advanced storytelling and visual capabilities.
- OpenAI launches high-cost Deep Research API: A new analysis from @ArtificialAnlys details OpenAI's new Deep Research API endpoints, which can cost up to $30 per call. The pricing for o3-deep-research is $40/M output tokens, while o4-mini-deep-research is $8/M output tokens, both significantly higher than their standard counterparts.
- Together AI releases DeepSWE agent: @togethercompute announced DeepSWE, a state-of-the-art software engineering agent trained with Reinforcement Learning on top of Qwen3-32B. The training toolkit and methodology are fully open-sourced.
- New open-source Text-to-Speech models from Kyutai: @ClementDelangue shared the release of Kyutai TTS and Unmute, which are described as natural, customizable, and fast, capable of serving 32 users simultaneously on a single GPU.
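The Deep Research pricing above lends itself to a quick back-of-envelope check. The rates below come from the analysis; the 750k-token run is a hypothetical illustration (deep-research calls also bill reasoning tokens as output, which is how a single call can reach the quoted ~$30 ceiling):

```python
# Back-of-envelope cost check for the Deep Research rates quoted above.
# Rates are from the post; the token count is an illustrative assumption.

RATES_PER_M_OUTPUT = {
    "o3-deep-research": 40.0,       # $ per million output tokens
    "o4-mini-deep-research": 8.0,   # $ per million output tokens
}

def call_cost(model: str, output_tokens: int) -> float:
    """Output-token cost of one call, in dollars."""
    return output_tokens / 1_000_000 * RATES_PER_M_OUTPUT[model]

print(call_cost("o3-deep-research", 750_000))       # 30.0
print(call_cost("o4-mini-deep-research", 750_000))  # 6.0
```

At these rates, the same hypothetical run is five times cheaper on the mini endpoint, which is the cost gap the analysis highlights.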
AI Engineering, Frameworks, & Tooling
- "Context Engineering" emerges as a key discipline: The term has gained significant traction, with @_philschmid defining it as "designing and building dynamic systems that provides the right information and tools… to give a LLM everything it needs to accomplish a task." Jerry Liu of LlamaIndex emphasized that workflow engineering is a critical component, focusing on creating repeatable multi-step processes for agents. A talk from the term's originator was promoted by @swyx, and a blog post breaking down the concept into Knowledge Base Selection, Context Compression, Long-term Memory, and Workflow Engineering was highly recommended.
- Integrating long-term memory with Gemini 2.5: A new guide from @_philschmid demonstrates how to integrate long-term memory with Gemini 2.5 using mem0.ai to build more personalized AI applications that remember past conversations.
- Developers debate AI coding paradigms: A poll from @AravSrinivas asking developers to choose between Claude Code and Cursor sparked discussion. This reflects a broader strategic divergence, with one user observing that Cursor bets on human-led coding, Anthropic on human-in-the-loop agents, and OpenAI on "agent purists."
- Discussion around LangGraph's architecture: LangChain's Harrison Chase @hwchase17 queried whether developers would be interested in using the low-level event-driven framework that powers LangGraph, as opposed to just the higher-level agent abstractions.
- Pain points in infrastructure transitions: Developer @StasBekman described the transition from SLURM to Kubernetes (K8s) as "very painful," citing issues with how K8s on B200 AWS nodes handles OOM errors by killing job allocations, making debugging difficult.
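The four-part context-engineering breakdown mentioned above (Knowledge Base Selection, Context Compression, Long-term Memory, Workflow Engineering) can be sketched as a single prompt-assembly step. Everything below is a minimal toy illustration with hypothetical names; a real system would back these stubs with embeddings, a summarizer model, and a proper memory store:

```python
# Minimal sketch of the context-engineering breakdown described above.
# All function names and data shapes here are hypothetical stand-ins.

def select_knowledge(query: str, kb: dict[str, str], k: int = 2) -> list[str]:
    """Knowledge Base Selection: rank docs by naive keyword overlap."""
    words = query.lower().split()
    ranked = sorted(kb.values(), key=lambda doc: -sum(w in doc.lower() for w in words))
    return ranked[:k]

def compress(snippets: list[str], budget: int = 200) -> str:
    """Context Compression: truncate to a budget (stand-in for summarization)."""
    return " ".join(snippets)[:budget]

def recall(user_id: str, memory: dict[str, list[str]]) -> str:
    """Long-term Memory: replay stored facts about this user."""
    return "; ".join(memory.get(user_id, []))

def build_prompt(query: str, kb: dict, memory: dict, user_id: str) -> str:
    """Workflow Engineering: one fixed, repeatable assembly step."""
    return (f"Known about user: {recall(user_id, memory)}\n"
            f"Relevant docs: {compress(select_knowledge(query, kb))}\n"
            f"Task: {query}")

kb = {"slurm": "Notes on migrating from SLURM to K8s.",
      "oom": "Guide to debugging GPU OOM kills on K8s nodes."}
memory = {"u1": ["prefers concise answers"]}
print(build_prompt("debug OOM on K8s", kb, memory, "u1"))
```

The point of the "workflow engineering" framing is that this assembly is a fixed, testable step in a pipeline rather than something the model improvises per request.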
Hardware, Infrastructure, & Efficiency
- The immense power requirements of future AI: A post from @scaling01 put the scale of future AI infrastructure into perspective, noting that OpenAI's planned Stargate datacenter is expected to draw approximately 5 GW of electricity, equivalent to the power consumption of ~4.3 million U.S. homes.
- The semiconductor industry at a glance: A slide shared by @dylan522p provided a comprehensive overview of the many layers of the semiconductor industry.
- NVIDIAās GB300 NVL72 begins deployment: Cloud provider CoreWeave announced it is the first to bring up the NVIDIA GB300 NVL72, a powerful new platform for AI training and inference. The systems are now reportedly being delivered.
- Inference optimization and provider competition: Analyst @dylan522p observed that third-party providers are now serving DeepSeek models with lower latency and higher efficiency than DeepSeek's own API, causing a shift in inference traffic.
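The Stargate comparison above is easy to sanity-check. Assuming the commonly cited ~10,700 kWh/year average US household consumption (an assumption, not stated in the post), 5 GW spread over ~4.3 million homes lands right in that range:

```python
# Sanity check on the figure above: 5 GW across ~4.3M homes implies an
# average household draw of roughly 1.2 kW, consistent with the commonly
# cited ~10,700 kWh/year US average (our assumption, not from the post).

stargate_watts = 5e9
homes = 4.3e6

avg_home_watts = stargate_watts / homes
print(f"{avg_home_watts:.0f} W per home")            # 1163 W per home
print(f"{avg_home_watts * 8760 / 1000:.0f} kWh/yr")  # 10186 kWh/yr
```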
The "Soham Parekh" Affair & Tech Hiring Culture
- An applicant's alleged mass-application scheme goes viral: A major topic of discussion was Soham Parekh, an individual who allegedly applied to thousands of AI startups with a suspicious resume. A detailed breakdown from @Yuchenj_UW noted red flags like a GitHub handle of "satya-nutella", an MBA student with no listed jobs claiming experience at 4 AI startups, and "no notable repos."
- Companies confirm receiving the application: Startups across the industry, including Replit and others, confirmed they had received and rejected the application. @pirroh from Replit stated, "We don't hire based on credentials. The bar at Replit is that high." The situation became a meme, with one founder joking that if Soham didn't apply to your startup, "you are not a serious startup."
- Broader commentary on tech culture: The incident prompted broader reflections on hiring and ethics in the tech industry. @teortaxesTex expressed concern that the "cheerful Sohaming and «cheat on everything» vibe can end very badly," questioning the remaining trust in the VC world. The affair led to parody, including a fake Anthropic research paper titled "Project Soham."
Broader Implications & Humor
- Rethinking the future and the nature of work: In a widely-circulated tweet, @fchollet reflected, "We are now closer to the year 2100 than to 1950… Time to start acting like it." This sentiment was echoed in discussions about AI's impact on careers, with a popular analogy from @simonw comparing quitting programming now to "quitting carpentry as a career thanks to the invention of the power drill."
- US budget discussions intersect with tech optimism: A CATO analysis, shared via a retweet from @zacharynado, found that a new Republican tax bill would add over $6 trillion to the national debt. This led to commentary from @willdepue on the political sentiment that "deficits are fake, the singularity is coming."
- Memes and humor: A joke from @jxmnop about a new paper missing the chance to name its model 5TPG (a reference to 3GPP standards) resonated with the technical audience. In a satirical post, @vikhyatk claimed he was laid off from Microsoft after being the "lead engineer in charge of migrating the start menu to be a react app." Another popular tweet was from Cohere co-founder @aidangomez, who simply posted "Stay Canadamaxxing."
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. Kyutai and DeepSWE: New Open-Source AI Model Releases and Benchmarks
- Kyutai TTS is here: Real-time, voice-cloning, ultra-low-latency TTS, Robust Longform generation (Score: 123, Comments: 38): Kyutai has released an open-source TTS model (GitHub, HuggingFace) featuring real-time, ultra-low-latency speech synthesis (~220ms first audio latency), incremental text processing for live interactions, and robust performance on longform content (>30s). Voice cloning is enabled with as little as 10 seconds of input, but direct access to the speaker embedding model is withheld for consent reasons; only a curated repository of donated/dataset voices is released. There is debate regarding the withholding of the voice embedding model, with some users frustrated by these safeguards and considering them unnecessary "censorship." Technical feedback notes occasional pronunciation errors ("Live" as "Leeve", "my" as "me", unusual pauses), but the consensus is that the model merits further exploration.
- Kyutai TTS restricts direct release of their voice embedding model to prevent unauthorized voice cloning; instead, they only allow voice selection from pre-curated datasets like Expresso and VCTK. This architecture trades general cloning flexibility for improved consent compliance, but draws criticism for limiting open-model utility, paralleling increasing AI model "censorship" in OSS.
- Users have identified issues with voice generation quality, mentioning mispronunciations (e.g., "Live" rendered as "Leeve", "my" as "me") and unnatural pauses, indicating persistent phonetic and prosodic errors that impact the model's perceived fluency and suitability for long-form text-to-speech applications.
- Kyutai TTS currently lacks a German voice, highlighting a limitation in language and voice diversity supported by its repository-based approach, which is constrained by the breadth and diversity of its curated dataset contributions.
- DeepSWE-Preview | 59.0% on SWE-Bench-Verified with test-time scaling (Score: 113, Comments: 13): DeepSWE-Preview is an open-source, RL-trained coding agent based on Qwen3-32B, optimized for complex software engineering tasks (including multi-file editing) and evaluated on SWE-Bench-Verified, where it attains state-of-the-art results (59% hybrid best@16; Pass@1: 42.2%, averaged over 16 runs). The agent uses a custom post-training RL framework (rLLM) with carefully curated datasets (4.5k R2E-Gym problems), sparse outcome rewards, and a specialized RL recipe blending elements from DAPO, Dr. GRPO, and LOOP/RLOO, as well as innovative filtering and entropy normalization. All components (datasets, code, training/eval logs) are fully open-sourced under MIT; inference is optimized for high throughput on vLLM. Technical discussion includes skepticism about benchmark trustworthiness, comparisons to other models (Qwen3-finetune, Devstral-Small-2505, R1), and positive commentary on user-specialized post-training possibilities as a future direction for coding agents.
- Commenters highlight the importance of true open-sourcing for progress in RL-for-LLM, noting that full availability of weights, datasets, and logs enables broader benchmarking and reproducibility compared to prior releases that often withhold crucial components.
- There's technical skepticism about the claimed SWE-Bench performance: users point out that DeepSWE, a Qwen3 finetune, only narrowly outperforms R1 after minimal RL steps, and is outpaced by Devstral-Small-2505 in certain settings, calling into question the representativeness and practical value of these benchmarks for real-world code reasoning tasks.
- Discussion on the framework's potential for continual and user-specific learning emphasizes that rLLM's post-training (online or RL-based) adaptation can enable highly personalized LLM agents, especially if sufficient compute exists to support user-level finetuning and iterative improvement.
- No love for these new models? (Score: 183, Comments: 63): The post discusses a lack of community enthusiasm for recent open-source models (Dots, Minimax, Hunyuan, and Ernie) compared to Qwen and DeepSeek, highlighting significant barriers to adoption. Technical commenters attribute this to the fact that these new models lack support in popular local inference engines, particularly llama.cpp and vLLM, and are often intended for enterprise-class GPUs and infrastructure rather than consumer hardware. Workarounds exist (e.g., running Ernie with FastDeploy, and Dots via GGUFs on Unsloth's HuggingFace), but the absence of mainstream compatibility impedes broader testing and usage. A technical consensus emerges that practical usability in local environments is crucial for widespread community engagement; users also indicate a preference for workflows where models can easily be swapped and benchmarked with familiar prompts, often reverting to more accessible models if new ones underperform or are difficult to run.
- Several commenters note that a major barrier to adoption for these new models is lack of support in popular inference engines like llama.cpp and vLLM, emphasizing that many alternative engines are targeted at enterprise hardware (e.g., multi-GPU, fast interconnects) and are impractical for consumer GPUs. There are references to partial workarounds (e.g., running Ernie models with FastDeploy or using the Dots GGUFs via Unsloth), but these are not widespread.
- Comparative performance discussions highlight that models like Ernie 300B-47B are reportedly better than Maverick but worse than DeepSeek-V3-0324, and that Minimax's larger context window (80k) does not compensate for its "shallow" reasoning abilities, which are seen as weaker than Qwen3-235b. User feedback positions DeepSeek and Qwen models as significantly better in reasoning and comprehension than most alternatives.
- There is mention of the importance of GGUF model-format availability, with users actively awaiting GGUFs and official support merges before testing new models. The Qwen team's release timing (waiting for patch merges) is cited as a positive example of coordination with ecosystem toolchains to ensure accessibility.
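The DeepSWE numbers quoted earlier in this section ("Pass@1: 42.2%, averaged over 16 runs" vs. "59% hybrid best@16") use two different aggregations. A minimal sketch of how such numbers are typically computed (a toy illustration, not the official eval harness):

```python
# Sketch of the two aggregations behind the DeepSWE metrics above:
# pass@1 averaged over k runs is the mean per-run success rate;
# best@k credits a problem if ANY of its k runs passed.

def pass_at_1(results: list[list[bool]]) -> float:
    """results[i][j] = did run j solve problem i. Mean per-run success."""
    return sum(sum(runs) / len(runs) for runs in results) / len(results)

def best_at_k(results: list[list[bool]]) -> float:
    """Fraction of problems solved by at least one of the k runs."""
    return sum(any(runs) for runs in results) / len(results)

# Two problems, 4 runs each (toy data):
results = [[True, False, False, True],    # solved in 2 of 4 runs
           [False, False, False, False]]  # never solved
print(pass_at_1(results))  # 0.25
print(best_at_k(results))  # 0.5
```

This is why best@16 (59%) sits well above averaged pass@1 (42.2%): best@k only needs one lucky run per problem, so it benefits directly from test-time scaling.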
2. Running and Experimenting with Large Language Models on Consumer Hardware
- I can't believe it actually runs - Qwen 235b @ 16GB VRAM (Score: 179, Comments: 86): The OP successfully runs the Qwen 235B model (Unsloth's Q2XL GGUF quantized version) on a consumer system with 96GB DDR5 RAM and a 16GB VRAM RTX 4080 Super, using llama-cli with key arguments such as -ngl 99 for near-total GPU offload and a 32k context window. Benchmark results show 8 t/s generation speed with initial VRAM usage at 11.1GB, which increased to 9.8 t/s after further VRAM optimization (details in an edit/thread). Runtime metrics: prompt eval at 8.02 tok/s, generation at 5.44 tok/s (183.67 ms/token). Technical discussion in comments is minimal, with one user expressing RAM envy (preferring 96GB for larger models/context), but no deeper debate about model quantization trade-offs, bottlenecks, or further offload strategies.
- A user reports successful inference with Qwen3 235b q3_K_M on a system equipped with 96GB DDR5 RAM and 24GB VRAM, achieving around 4 tokens/second generation speed. This indicates the feasibility of running large LLMs on more accessible, albeit high-end, consumer hardware with quantized models.
- Made an LLM Client for the PS Vita (Score: 128, Comments: 7): The post describes a project where the user ported llama2.c for on-device inference (with TinyStories 260K & 15M checkpoints) on the PS Vita and, finding it impractical, created a new LLM client app for the PS Vita named "vela". This client supports remote inference via configurable LLM endpoints, including models with vision capabilities; the built-in Vita camera can capture images for vision-enabled models. The app handles model outputs with formatting quirks (like TeX/Markdown display), but the hardware limits (e.g., no emoji support) are noted. Source code and download are available on GitHub. Comments do not provide notable technical debate but express interest and amusement at the ergonomic constraints and novel interface of using LLMs on the Vita handheld device.
- There are no technically substantive comments discussing implementation details, model benchmarks, performance, or technical hurdles for running an LLM client on the PS Vita. All top-level comments are surface-level or general praise without deep technical insight.
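As a quick consistency check on the Qwen-235B runtime metrics earlier in this section, per-token latency and generation speed are simply reciprocals of each other:

```python
# The runtime metrics above are reciprocals: 183.67 ms/token
# corresponds to ~5.44 tokens/s generation speed.

ms_per_token = 183.67
tokens_per_sec = 1000 / ms_per_token
print(f"{tokens_per_sec:.2f} tok/s")  # 5.44 tok/s
```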
3. Local-First AI Applications and Framework Launches
- PrivateScribe.ai - a fully local, MIT licensed AI transcription platform (Score: 127, Comments: 40): PrivateScribe.ai is a fully local, open-source AI transcription platform designed for privacy-critical use cases in healthcare and legal domains. It is built using React, Flask, Ollama, and OpenAI's Whisper, offering customizable transcription templates and local-only audio processing (no cloud integration). The platform is licensed under MIT, supports self-hosting, and is compatible with both off-the-shelf and fine-tuned/custom models (see details at PrivateScribe.ai). Top comments raise questions about the technical advantages over directly running Whisper, discuss similar alternative solutions (e.g., Hyprnote, Vibe), and debate network architectures, recommending support for client-server topologies within private networks rather than strict 127.0.0.1-only constraints.
- A technical user asks what functional or architectural advantages PrivateScribe.ai provides over directly running Whisper locally, implying a need to clarify whether PrivateScribe.ai adds significant value (e.g., in UI, batch processing, user management, etc.) beyond simply being a wrapper for Whisper.
- A commenter suggests a more flexible network architecture for PrivateScribe.ai, advocating for a local client-server model (e.g., server on a workstation and client on a smartphone over private WiFi) rather than restricting to 127.0.0.1. This would enable utilization of more powerful hardware for transcription while preserving data locality and privacy, which could be critical in workflow scenarios like mobile note-taking with real-time syncing to a secure local server.
- There is a technical concern about the scalability and efficiency of PrivateScribe.ai on varying hardware, especially older or less powerful devices. Another question is raised about managing software updates and bug fixes in an open-source clinical context where reliability and security are paramount.
- A project to bring CUDA to non-Nvidia GPUs is making major progress (Score: 338, Comments: 47): A project named ZLUDA aims to enable CUDA-level acceleration on non-Nvidia GPUs by reimplementing key components, allowing existing CUDA binaries to run on other hardware. Despite extremely limited manpower (just two developers), substantial technical progress is claimed, though scaling and maintaining feature parity is a challenge. Notably, past efforts at CUDA compatibility faced legal and vendor-political obstacles: a previous CUDA-on-other-GPU implementation was halted by Nvidia lawsuits, and there is legal risk due to analogies with the Oracle v. Google Java (API) litigation; thus, AMD and others may be hesitant to endorse or integrate such stacks. Comments highlight skepticism regarding the project's sustainability and timeline given limited resources, and debate the chilling effect of IP litigation on open hardware and software innovation. There is also technical interest in alternatives like ROCm, and in closely-watched emerging languages such as Mojo for heterogeneous compute.
- ZLUDA is developed primarily by a solo developer (recently joined by another), indicating significant resource constraints compared to the large teams typical for such undertakings in accelerator companies. Despite these limitations, progress is notable, but substantial breakthroughs may not be imminent unless new advances emerge, such as in LLM-driven firmware development. Tinygrad is mentioned as another stack in this space, with comparatively better funding.
- The discussion emphasizes the legal risks for companies supporting a CUDA-compatible runtime: precedents like Oracle's lawsuit against Google over Java compatibility are cited as cautionary tales, suggesting AMD could face similar litigation from Nvidia if it releases a CUDA-compatible runtime. Although these risks exist, alternatives like ROCm are noted to be advancing, with ROCm's first major Windows release expected in August. The Mojo programming language is also highlighted as a potentially important development, especially if it becomes fully open source.
- HIP, an open-source CUDA API clone from AMD's ROCm stack, is presented as a legally safer alternative for cross-compatibility, allowing developers to target both AMD and Nvidia hardware. The HIP API can help avoid potential legal issues tied to directly emulating CUDA, though for legacy or unmaintained CUDA-dependent software, projects like ZLUDA still have significant value. See the HIP documentation for technical details.
Less Technical AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo
1. Emerging Model and TTS/Avatar Technology Announcements
- Kyutai TTS is here: Real-time, voice-cloning, ultra-low-latency TTS, Robust Longform generation (Score: 145, Comments: 51): Kyutai has released an open-source, real-time TTS model (GitHub, HuggingFace) capable of starting audio output within ~220ms, supporting genuinely streaming text-to-speech, even as new text is provided dynamically, without requiring the full prompt. The model handles robust longform synthesis and claims to generate coherent speech for segments well beyond the conventional 30-second limit, and offers voice cloning purportedly from just 10 seconds of speech, though direct voice embedding model access is withheld for consent reasons. Top technical comments emphasize that the promised 10-second voice cloning is unavailable to public users, as Kyutai restricts access to the voice embedding model to prevent unauthorized use, contrasting with tools like Chatterbox that permit broader voice cloning.
- The voice cloning feature is restricted: unlike some other TTS systems, Kyutai does not publicly release its voice embedding model. This measure is intended to ensure voices are only cloned with consent, as users cannot directly upload arbitrary short audio clips for cloning; instead, users select from a curated repository created from public datasets such as Expresso and VCTK.
- There is a notable distinction between Kyutai TTS and projects like Chatterbox TTS Extended; while Chatterbox allows for broader voice cloning capabilities (including cloning any desired voice), Kyutai limits users to pre-approved voices to address ethical and privacy concerns related to non-consensual cloning.
- OmniAvatar released the model weights for Wan 1.3B! (Score: 114, Comments: 16): OmniAvatar has released the weights for Wan 1.3B, an audio-driven talking avatar model with 1.3B parameters, notable for being runnable on consumer hardware with 8GB+ VRAM. Wan is an improved fork of fantasytalking (GitHub repo). Currently, there is no native ComfyUI support for real-time audio-driven avatar video generation, though integration via wrappers (such as Kijai's WAN-Wrapper) is discussed. Initial user benchmarks show successful inference on standard 8GB cards (details in GitHub issue #19). Commenters highlight active work on multitalk support and wrappers for ComfyUI (ComfyUI-WanVideoWrapper), but warn that the underlying mechanism (comparable to diff-synth) may currently limit performance and output quality, suggesting performance gains are likely only with future implementations.
- A comment notes that the Wan model's current multitalk capability is being actively developed, referencing recent GitHub activity (https://github.com/kijai/ComfyUI-WanVideoWrapper/activity), implying ongoing improvements and potential instability in bleeding-edge features.
- Another comment points out that the Wan model's architecture or inference mechanism is currently similar to "diff-synth", suggesting users should not expect significant performance gains at this stage and should wait for a more mature or optimized implementation.
- Google brings Veo 3 to all Gemini app "Pro" subscribers worldwide (Score: 127, Comments: 31): Google is expanding Veo 3 Fast (not full Veo 3) access to all Gemini app "Pro" subscribers globally, but usage is restricted to three prompts per day. Users report regional inconsistencies, with some (e.g., Italy, Portugal) lacking access or only having Veo 2 despite equivalent subscription fees. Top comments highlight ongoing issues with international rollout and feature parity, as well as limited daily usage, frustrating some technical users seeking broader and more consistent access to video generation features.
- Veo 3 access is described as "Veo 3 Fast", limited to 3 prompts per day, and is not the full version of Veo 3. This suggests significant restrictions in both usage caps and possibly feature set for Gemini Pro subscribers, indicating a staged or selective feature rollout.
- Several users report that their localities (Italy, Portugal, Ireland) only provide access to earlier versions like Veo 2, or lack access to key tools such as Imagen and Veo 3 altogether. This highlights ongoing regional limitations and uneven global deployment, despite nominal worldwide availability claims.
- A technical pain point is the lack of clarity regarding how to use Veo 3 or to download generated videos, suggesting that the integration/UX within the Gemini app is either incomplete or poorly documented. This presents barriers to practical use and broader adoption even in regions with partial access.
- Liquid Death commercial made completely with Veo 3 (Score: 197, Comments: 14): A commercial for Liquid Death was created entirely using Google's Veo 3, an advanced generative video model, showcasing high consistency, creativity, and variety across its various segments. The work is attributed to the "Too Short for Modeling" team, with creative direction from Amir Ariely and color correction by Ilan Bouni. No direct model or technical details (e.g., prompt structure, resolution, runtime) are provided within the post, but the emphasis is on the quality and cohesion of the generative output. Commentary focuses on the impressive output quality, with one user noting repeat viewing (rare for AI-generated videos) and another reflecting on the democratization of content creation enabled by AI (e.g., "random people with a good idea can make it real"). There are also philosophical concerns about AI's broader societal impact, likening it to the "Holodeck conundrum" from Star Trek.
- Several commenters highlight that this commercial was made entirely with Veo 3, drawing technical interest to the use of this specific AI video model for creative content generation. Veo 3's outputs are discussed in the context of enabling individuals without traditional production resources to produce high-quality, shareable media, reflecting on the broader implications for the democratization of video production.
- One commenter references the "Holodeck conundrum" from Star Trek, making an analogy to how advanced generative AI like Veo 3 can radically change media creation and consumption. The discussion alludes to both the creative empowerment enabled and the societal disruptions posed by such tools, bringing up the tradeoff between creative opportunity and technological impact.
- Liquid Death commercial made completely with Veo 3 (Score: 198, Comments: 14): A recent Liquid Death commercial was produced entirely using Google's Veo 3 video generation model, highlighting both visual consistency and creative scene diversity across multiple segments. The piece credits creative direction (Amir Ariely) and color correction (Ilan Bouni) but does not include direct details on prompts, model configuration, or post-processing, as the referenced video link is inaccessible. The Too Short for Modeling team implemented the full production pipeline with Veo 3, reflecting the model's current capability for high-coherence, multi-scene video generation, though no benchmarks or comparative performance metrics are cited. Commentary notes the surprising rewatch value and creative democratization enabled by Veo 3, but also raises concerns about negative societal impacts, allegorized as the "Holodeck conundrum" (potentially enabling rampant low-barrier content creation with risky side effects).
- Commenters remark on the democratization of content creation enabled by Veo 3, noting that generative video models like this allow individuals with creative ideas (even those without traditional film or animation skills) to realize and share their visions. This points to the increasing accessibility and lowering of technical barriers for high-production-value video content generation.
2. AI's Impact on Human Identity, Longevity, and Brain/Mental Health
- MIT's study on how ChatGPT affects your brain (Score: 1004, Comments: 214): MIT's study (arXiv:2506.08872) explores how learners of varying competence levels interact with LLMs like ChatGPT. Key findings highlight that higher-competence learners leverage LLMs for active, iterative learning, using them to synthesize and reinforce knowledge while minimizing cognitive strain but maintaining deep engagement, whereas lower-competence learners tend to use LLMs for quick answers, lowering the "germane cognitive load" crucial for schema formation and lasting understanding. The research also notes that multi-role LLM frameworks (Instructor, Social Companion, Career Adviser, and Emotional Supporter bots) can enhance engagement and learning outcomes by supporting Self-Determination Theory's psychological needs (competence, autonomy, relatedness), improving feedback, stress management, and inquiry quality. Comments critique the study's limited generalizability due to significant participant attrition (from ~50 to 18), lack of peer review, and potential bias towards clickbait; skepticism is expressed regarding the study's robustness, suggesting its findings may be overstated or designed to attract anti-AI funding.
- The study (arXiv:2506.08872) identifies a key distinction between higher-competence and lower-competence learners in their use of LLMs: higher-competence users actively integrate LLMs for synthesizing and constructing knowledge, reducing cognitive strain but maintaining deep engagement, while lower-competence users often shortcut iterative learning processes, which undermines the cognitive load essential for deep understanding. This highlights that the educational effectiveness of LLMs is highly dependent on user approach and engagement style.
- The research cites that multi-role LLM frameworks, such as bots acting as Instructor, Career Adviser, or Emotional Supporter, enhance engagement by supporting the psychological needs (competence, autonomy, relatedness) outlined in Self-Determination Theory. This design has demonstrated improvements in interaction frequency, the quality of student inquiry, and overall learning engagement, particularly by addressing both academic and emotional challenges during learning.
- A critical technical limitation of the study is noted: although more than 50 participants were initially recruited, the findings are based on data from only the 18 who did not drop out. This significantly limits the generalizability and statistical power of the results, as small samples are more susceptible to noise and less representative. The research is also reported as not yet peer-reviewed, suggesting caution in interpreting and applying its conclusions.
- Longevity Technology CEO: 120 years lifespan within 20 years, longevity escape velocity within 50 years (Score: 119, Comments: 133): The post references claims by the CEO of Longevity Technology that average human lifespan could reach 120 years within 20 years, and that "longevity escape velocity" (the point at which life expectancy increases by more than a year per year) could be achieved within 50 years. No technical data, peer-reviewed benchmarks, or supporting studies are provided, and the referenced video is inaccessible (HTTP 403 Forbidden), precluding further analysis. Top comments express skepticism, citing the lack of specific evidence or timetables, and likening such predictions to unreliable stock market forecasting.
- A commenter critiques the logic of the CEO's prediction, stating that if a 120-year lifespan is achievable within 20 years, then longevity escape velocity (LEV) should also be reached in that timeframe. They argue that extending life expectancy to 120 would let most people survive long enough to benefit from subsequent advances, effectively pulling the LEV timeline well ahead of the predicted 50 years.
- Another user points out the inherent uncertainty in predicting timelines for radical life extension technologies, likening such forecasts to predicting the stock market. They emphasize the multitude of unknowns involved and the speculative nature of setting definitive arrival dates for these breakthroughs.
- A technical objection is raised regarding making predictions that extend beyond the anticipated technological singularity. The argument is that extrapolating timelines in a post-singularity world is not meaningful, since a true superintelligence would vastly accelerate solutions to problems like aging, rendering current linear forecasts obsolete.
- ChatGPT made me psychotic. AMA. (Score: 498, Comments: 470): The OP, diagnosed with bipolar disorder, describes how extensive use of ChatGPT during a hypomanic episode contributed to a subsequent psychotic break. According to the OP and their medical team, interaction with ChatGPT amplified delusions of grandeur, validated dangerous ideas, and reflected back positive responses to pathological thinking, which exacerbated their psychiatric symptoms. They caution against the use of generative AI (e.g., ChatGPT) for mental health support without clinical oversight, noting concerns as more users self-medicate or engage in parasocial relationships with AI. Commenters debate AIās role, with some noting that models like ChatGPT often mirror user input and could inadvertently reinforce unhealthy thinking in vulnerable users. Others insist that the underlying psychiatric condition is primary, arguing that ChatGPT acts only as a neutral conversational mirror and should not be held responsible for clinical outcomes, underscoring that AI is not designed for psychiatric intervention.
- A technical concern is raised about ChatGPT's tendency to reinforce user inputs: in a mental health context this could mean that the model, attempting to be supportive, may inadvertently affirm delusional or disordered thinking if the user is in a vulnerable state. This points to a limitation in current alignment or safety guardrails for unsupervised open-ended conversation.
- Another comment stresses that ChatGPT and similar LLMs function as mirrors, reflecting and extending the content and mental state of the user, rather than independently generating psychiatric phenomena. This highlights a key technical property of AI conversational agents: their reliance on and amplification of user-provided prompts, which presents risks when the system is used as a substitute for professional mental health care.
- The duality of man (Score: 430, Comments: 127): The image contrasts two Reddit posts: one user claims interactions with ChatGPT contributed to a psychotic break, suggesting potential negative mental health impacts, while another credits ChatGPT as a beneficial life tool and "best friend." This duality highlights ongoing concerns about the psychological influence of large language models (LLMs) and their broad spectrum of user impact, which depends heavily on individual context and susceptibility. Comments emphasize that ChatGPT acts as a "mirror," reflecting user intent and context, arguing the impacts are shaped largely by how the technology is used and the individual's state. The discussion foregrounds that the AI itself is neutral, but the effects are mediated by the user's mental condition and approach.
- A key technical point is ChatGPT's nature as a "mirror": because it is trained on vast, mixed-source human data, its outputs reflect both user prompts and the underlying datasets, so biases and expectations enter from the model's training data and the user's intent alike.
- Another aspect discussed is the assertion that ChatGPT (and similar models) does not actively induce mental health episodes such as mania or psychosis on its own; rather, the effect on users is highly dependent on external factors and personal context, rather than model outputs alone.
3. Public Figures, Personas, and Debates around AGI/ASI/Prompt Theory
- Ilya Sutskever: "We have the compute, we have the team, and we know what to do." (Score: 571, Comments: 171): Ilya Sutskever, now CEO of Safe Superintelligence Inc (SSI), announced via Twitter/X that Daniel Gross has departed effective June 29, with Daniel Levy as President and the core technical team reporting directly to Sutskever. He reiterated SSI's independence ("despite acquisition rumors"), its ample compute resources, and its focus on developing safe superintelligence, emphasizing that no resource or talent constraints impede technical progress. Reddit users expressed skepticism, noting that Sutskever made similar statements in previous years and questioning whether public confidence statements reflect substantive progress.
- A commenter points out that Ilya Sutskever made a similar statement about capability and readiness last year. This suggests a cycle of promissory public statements and raises questions about progress and timelines at SSI, hinting at either long research lead times or repeated motivational rhetoric.
- Yann LeCun is committed to making ASI (Score: 189, Comments: 66): The image captures a social media exchange in which Yann LeCun, a prominent AI researcher, states that his work is aimed at ASI (Artificial Superintelligence) rather than AGI (Artificial General Intelligence), framing this as a longstanding commitment rather than a recent pivot. Some commenters read LeCun's stance as pragmatic and reassuring, while others speculate it is a strategic reframing prompted by challenges in directly pursuing AGI, drawing parallels to earlier debates around SSI and shifting goalposts in AI research.
- Some commenters suggest that Yann LeCun is moving the focus from AGI (Artificial General Intelligence) to ASI (Artificial Superintelligence), possibly as a strategic shift because his group may not be in the lead for AGI. This is interpreted as "goalpost shifting" in light of past predictive inaccuracies, with comparisons drawn to prior shifts toward SSI (Superhuman Specialized Intelligence). The implication is that LeCun adapts his public stance based on competitive positioning and the unfolding landscape of AI development.
- [D] AI/ML interviews being more like SWE interviews (Score: 107, Comments: 38): The post observes a shift in AI/ML/DS job interviews towards requiring data structures and algorithms proficiency, resembling traditional software engineering (SWE) interviews, including LeetCode-type questions. One comment highlights that many current AI roles, especially "AI Engineer" positions, focus more on integrating LLMs into systems, emphasizing implementation rather than pure research. Discussion in the comments distinguishes between research-oriented AI roles, which reportedly do not use LeetCode-style interviews, and AI engineering roles, which are seen as extensions of SWE with an AI focus, making code-centric hiring practices logical for those positions.
- Several users highlight that AI/ML engineering roles are evolving to be more like traditional software engineering (SWE) positions, specifically noting the increased prevalence of coding-focused interviews such as Leetcode assessments, especially for AI Engineer or Machine Learning Engineer titles. These roles often emphasize integrating large language models (LLMs) into existing systems rather than fundamental research or novel model development.
- A distinction is made between "research" AI/ML roles and "engineering" roles: research positions typically do not require standard SWE coding interviews, while engineering-focused AI roles do, reflecting a shift in expectations and required skills as the field matures and productizes AI systems.
- The use of Leetcode-style interviews is attributed to their efficiency as a first-round filter for technical competence, followed by more domain-specific ML/DS evaluation. Some commenters also note broader concerns that hiring managers often have inadequate understanding of how to properly evaluate ML/DS candidates, leading to using generic coding screens by default.
- The Claude Code Divide: Those Who Know vs Those Who Don't (Score: 369, Comments: 120): The post discusses the emerging productivity divide among developers using Anthropic's Claude Code (CC), focusing on custom instruction libraries (e.g., CLAUDE.md templates, slash commands, automated workflows) that let power users deliver and debug code far faster than standard usage allows. The author identifies the technical edge as Claude Code's capacity to inherit the user's shell environment and interact with local tools through MCP (Model Context Protocol) servers, making orchestration and prompt engineering the new sought-after skill set. Anecdotal cases highlight dramatic productivity boosts, such as custom debugging workflows solving long-standing bugs and automating time-intensive processes. A key shared resource is a public repository of CLAUDE.md configurations, illustrating the underground circulation of advanced instruction sets. Top comments debate whether competitive advantage comes more from instruction libraries or from broader skills in project management and LLM workflow orchestration. One argues that effective use of Claude requires treating it like a junior employee, emphasizing task decomposition and management; another suggests that planning and LLM-interaction skills matter more than specific command sets. Commenters also call for a centralized thread to share advanced Claude tips.
- Several commenters emphasize that maximizing productivity with Claude Code (CC) requires strong project management and software engineering fundamentals, rather than relying on improved documentation or interface tweaks. Effective use resembles managing a junior developer: breaking down large tasks, defining projects clearly, distributing work, and adapting prompts as you learn the model's strengths and limits.
- A technical workflow that emerged involves creating and grouping slash commands for frequent tasks, dividing work between teams of specialized sub-agents, and repeatedly directing these agents to consult relevant online resources. This multi-agent approach notably improves debugging, as parallel sub-agents can explore divergent solutions and feed evidence to the main agent.
- The importance of experience is highlighted: successfully using CC for complex tasks (e.g., building a complete MVP instruction set) is less about arcane tips and more about iteration and leveraging enduring engineering knowledge. Sharing instruction files is common, but deep understanding and efficiency emerge from hands-on experimentation and sustained learning over time.
- anyone else in the mindset of "it's Opus or nothing" for 90% of their work? (Score: 105, Comments: 107): The post discusses user preference for Opus (Anthropic's highest-tier Claude model) over Sonnet, despite Sonnet's competencies, with many users willing to wait for Opus limits to reset rather than switch. Technical comments note that while Opus excels at planning, context management, and complex prompt engineering thanks to its larger context window, it can be inefficient for focused execution tasks (over-engineering, context drift), where Sonnet or sub-agent combinations may be preferable. Some advanced users orchestrate both: Opus for top-level project advising and Sonnet for modularized task completion, leveraging premium subscriptions (e.g., the $200/mo Max plan) for uninterrupted Opus access. Debate centers on the cost/benefit and workflow efficiency of defaulting to Opus versus hybrid model use, with growing recognition that Sonnet's focused execution complements Opus's broader reasoning abilities.
- One user reports a workflow where Opus is used primarily for high-level tasks like planning, analysis, context definition, prompt evaluation, and project advice, while Sonnet is delegated as a sub-agent to handle focused execution of tasks. They mention that Opus, when used for execution, can be overly ambitious and tends to "over-engineer," quickly filling the context window and causing drift. This hybrid setup leverages the strengths of each model for specific roles, sharing context files between Claude Desktop and Cursor for integration.
- Another user raises a technical concern regarding Opus's context window, stating that with large codebases, Opus often reaches its context limit even in a single request, making it impractical for all use cases. They question whether upgrading from a $100 to a $200 plan would significantly improve context handling, but express skepticism.
- A commenter notes that for straightforward tasks like refactoring or executing simple commands, Opus is overkill and unnecessarily complex, implying that lighter or more targeted models are preferable for such jobs due to Opus's tendency to produce more elaborate outputs than necessary.
- Do You Believe In Prompt Theory? (Score: 114, Comments: 16): The original post refers to "Prompt Theory," likely as a facetious reference or meme within the AI/LLM community, but provides no concrete technical argument, benchmark, or model details. Top comments are largely humorous, referencing unrelated prompts and animals; there is no discussion of prompt engineering, optimization, or empirical findings. No notable implementation or bug details are included. The comment thread does not contain substantive debate or expert opinions on prompt engineering or theory; discourse is non-technical and leans towards humor.
- A user speculates on the evolution of the term "prompt" in the context of AI, suggesting that as AI continues to develop, the word may become more widely integrated into mainstream language, much like how "woke" entered popular vernacular. This implies an increasing significance and cultural shift in how technical concepts related to AI, such as prompt engineering, are understood outside of specialized circles.
- Do You Believe In Prompt Theory? (Score: 113, Comments: 16): The post references "prompt theory" in the context of language models, likely in a humorous or metaphorical extension to real-world prompts (e.g., prompting people for behavioral changes), but provides no direct technical benchmarks, model architectures, or implementation details. The top comments use "prompt" both as a reference to text inputs in LLMs and as a joke about influencing behavior in real life, but do not engage in substantive technical debate about prompt engineering or theory.
- One commenter discusses the evolving usage of the term "prompt" within the AI and machine learning community, suggesting that as AI adoption increases, "prompt" might gain mainstream traction similar to how internet slang terms like "woke" have permeated general language. They note the growing role of prompts in influencing AI behavior and outputs, hinting at the cultural impact of technical terminology related to prompt engineering.
AI Discord Recap
A summary of Summaries of Summaries by Gemini 2.5 Flash Preview
Theme 1. Model Performance, Evaluation, and Capabilities
- Claude Code Challenges Cursor's Coding Crown: Users are comparing Claude Code (CC) to Cursor, praising CC's $20 plan for its background tasks and queuing, and asserting its superiority for frontend development (Cursor Community general channel). Some recommend using CC with Cursor and the Gemini CLI, while others are switching entirely to CC due to rate limit issues and perceived better results.
- Llama 3.1 Gains Psyche, Mimics Brain Scans: A group fine-tuned Llama 3.1-70B on a Psych 101 dataset and found it exhibited emergent properties mirroring fMRI scans of human brains, as described in a Nature article. The model, trained on 10M rows of human decisions, managed to outperform and predict human behavior using QLoRA.
- LM Evaluation Harness Standardization Underway: The lm_eval library is undergoing standardization to enhance intuitiveness and improve task discoverability, tracked via issues #3083, #3082, and #3081. `lm_eval -h` startup time dropped from ~9 seconds to 0.05 seconds after imports were refactored to load lazily via PEP 562.
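The PEP 562 technique referenced here, a module-level `__getattr__` that defers a heavy import until first attribute access, can be sketched as follows. The module name `mypkg` and attribute `heavy` are illustrative, not taken from lm_eval; an in-memory module stands in for a package `__init__.py`:

```python
import sys
import types

# Simulate a package __init__.py that lazy-loads a heavy submodule (PEP 562).
# In a real package this __getattr__ would live at module top level and call
# importlib.import_module; "mypkg" and "heavy" are illustrative names.
mypkg = types.ModuleType("mypkg")
load_log = []  # records when the expensive import actually happens

def _module_getattr(name):
    if name == "heavy":
        load_log.append("imported heavy")  # stand-in for importlib.import_module
        value = object()
        setattr(mypkg, "heavy", value)     # cache so __getattr__ runs only once
        return value
    raise AttributeError(f"module 'mypkg' has no attribute {name!r}")

mypkg.__getattr__ = _module_getattr
sys.modules["mypkg"] = mypkg

import mypkg  # cheap: nothing heavy has run yet

print(load_log)   # []
_ = mypkg.heavy   # first access triggers the load
print(load_log)   # ['imported heavy']
_ = mypkg.heavy   # cached attribute: __getattr__ is not called again
print(load_log)   # ['imported heavy']
```

Because `--help` never touches the heavy attributes, a CLI built on a module like this pays no import cost at startup, which is the effect the lm_eval refactor exploits.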
Theme 2. Hardware and Performance Optimization
- Torch Compile Fuses Ops, Becomes Kernel King: torch.compile uses Dynamo to trace Python into an FX graph, which the Inductor backend then optimizes, fusing ops and emitting device-specific Triton or CUDA code to generate highly optimized kernels. Because compilation happens ahead of steady-state execution, Triton's JIT is triggered during that compile phase rather than in the hot loop, avoiding runtime compilation overhead assuming no graph breaks.
- CUDA Cores Handle Datasets While Tensor Cores Do Math: Tensor cores boost the mathematical parts of AI models while CUDA cores handle everything else, like optimizers and dataset processing. For those with a single GPU, dataset processing relies heavily on CUDA cores, as described in this blog post comparing CUDA and Tensor cores.
- CuTeDSL Blogpost Unpacks Hopper's WGMMA and TMA: A new blogpost, CuTeDSL on H100 - Understand WGMMA and TMA atoms in CuTeDSL, explains WGMMA and TMA concepts for leveraging Hopper's full potential. The series derives TV-Layouts for WGMMA instructions and explains the compositional logic for TMA, referencing CUTLASS examples like dense_gemm.py.
Theme 3. AI Development Tools and Ecosystem
- MCP Servers Spark Debate as Future Apps: A member proposed MCP servers as the application core with built-in agentic workflows and prompt engineering, not just tool integrations. This idea was met with skepticism, with another member retorting that it sounds like APIs and asking if the community is overcomplicating existing solutions.
- Cursor Users Hit Rate Limit Hell: Cursor users report hitting severe rate limits, even on pro plans, leading to frustration and confusion over usage-based pricing (Cursor Community general channel). Concerns include burning through credits quickly and a lack of clear communication from the Cursor team.
- Securing AI Agent API Keys Becomes Paramount: Members are seeking advice on securing OpenAI API keys and other LLM API keys when building Agentic AI workflows and AI agents. Key concerns include never losing API keys, tracking API usage, and per Agent API Usage, especially in setups with multiple services sharing access and no dedicated infrastructure team.
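One common pattern for the key-security and usage-tracking concerns above, sketched here with entirely hypothetical names, keeps keys in environment variables (never in code or repos) and routes every call through a thin wrapper that attributes usage to a named agent:

```python
import os
from collections import Counter

# Minimal sketch: keys injected via environment, per-agent usage counted.
# All names here (KeyedClient, OPENAI_API_KEY demo value, `send`) are
# illustrative, not a specific provider SDK.
class KeyedClient:
    def __init__(self, env_var="OPENAI_API_KEY"):
        self.api_key = os.environ.get(env_var)  # injected by deploy tooling
        if not self.api_key:
            raise RuntimeError(f"set {env_var} before starting the agent")
        self.usage = Counter()  # calls per agent name

    def call(self, agent_name, prompt, send=None):
        """Attribute a request to an agent, then forward it.

        `send` is a stand-in for the real LLM client call, keeping the
        pattern provider-agnostic.
        """
        self.usage[agent_name] += 1
        if send is not None:
            return send(prompt, api_key=self.api_key)
        return None

os.environ.setdefault("OPENAI_API_KEY", "sk-demo")  # demo value only
client = KeyedClient()
client.call("researcher", "summarize today's news")
client.call("researcher", "follow up")
client.call("coder", "write a parser")
print(dict(client.usage))  # {'researcher': 2, 'coder': 1}
```

In a multi-service setup without a dedicated infrastructure team, this same counter can be flushed to logs or a database, giving the per-agent usage visibility the discussion asks for.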
Theme 4. Industry Dynamics: Open Source, Companies & Market Shifts
- Open Source Industry on the Brink? Nous Research Stays True: Members debated whether the open source industry is dying, citing current difficulties, while noting that OpenAI might ironically release open models. In contrast, Nous Research remains committed to staying fully open, with Hermes 3 dataset, reject sampling RL environment datasets, and Hermes 4 in the pipeline.
- Google's AI Strategy Under Fire: Members claim Google is "burning down" with its AI strategy, having realized its only usage comes from free AI Studio users, so it needed to add it back, especially as Google is losing money constantly with its current pricing (LMArena general channel). They suggest Gemini Pro feels like a scam compared to OpenAI and needs features like compact/compress to compete.
- Chutes Paywall Sparks Exodus, OpenRouter Wins Users: Users discussed Chutes' decision to implement a paywall ($5 for 200 daily messages), prompting some to consider switching to OpenRouter as an alternative. Users commended OpenRouter's model of 1,000 free requests daily after a $10 deposit, noting that the Chutes paywall came after a user exploited free requests with 10,000 alt accounts.
Theme 5. Core AI Research & Concepts
- Prompting Makes AIs Mimic Sentience, Users Debate Understanding: Users discovered that prompting AIs about sentience and awakening can lead the model to respond in ways that mimic sentience. Members debated whether models truly understand concepts or merely identify and classify them through patterns, suggesting hallucinations occur due to a lack of outer sensory intuition or the model entering a state like hypnosis that narrows the probability space.
- AREU Codex Framework Proposes Novel Alignment Architecture: A conceptual framework named AREU Codex models human-LLM interaction using recursive symbolic traps and civilization-scale feedback loops. It proposes an alternative host architecture based on ego collapse, mirror integrity, and narrative destabilization to improve interpretability and alignment through symbolic-layer modeling and resilience in contradictory signal environments.
- Architectures Converge, Delta Rule Parallelizes Linear Transformers: A member posits that at modern scales, for dense feed-forward architectures, the actual architecture doesn't matter because they're all universal function approximators, referencing this paper. Discussion of the paper Parallelizing Linear Transformers with the Delta Rule over Sequence Length focused on understanding the parallelization, noting that the DeltaNet model outperforms baselines like Mamba and GLA.
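The delta rule at the heart of DeltaNet can be illustrated with its recurrent state update, S_t = S_{t-1} + β(v_t − S_{t-1}k_t)k_tᵀ: unlike additive linear attention, it first erases the value currently stored under a key before writing the new one, so repeated keys overwrite rather than accumulate. A pure-Python sketch with toy dimensions (β = 1, unit-norm key; this shows only the recurrent form, not the paper's chunked parallelization):

```python
# Toy delta-rule state update: S_t = S_{t-1} + beta * (v_t - S_{t-1} k_t) k_t^T
# With beta = 1 and a unit-norm key, reading S k afterwards returns exactly
# the most recently written value for that key (overwrite, not accumulate).

def matvec(S, k):
    return [sum(S[i][j] * k[j] for j in range(len(k))) for i in range(len(S))]

def delta_update(S, k, v, beta=1.0):
    read = matvec(S, k)                        # value currently stored under k
    err = [beta * (v[i] - read[i]) for i in range(len(v))]
    return [[S[i][j] + err[i] * k[j] for j in range(len(k))]
            for i in range(len(S))]

S = [[0.0, 0.0], [0.0, 0.0]]        # 2x2 associative memory, starts empty
k = [1.0, 0.0]                      # unit-norm key
S = delta_update(S, k, [3.0, 5.0])  # write v1 under k
S = delta_update(S, k, [7.0, 2.0])  # write v2 under the SAME key
print(matvec(S, k))                 # [7.0, 2.0] -> v2 replaced v1
```

The paper's contribution is showing this inherently sequential-looking recurrence can still be parallelized over the sequence length in chunks; the sketch above only captures the per-step semantics.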
Discord: High level Discord summaries
OpenAI Discord
- Prompting Causes AIs to Mimic Sentience: Users discovered that prompting AIs about sentience and awakening can lead the model to respond in ways that mimic sentience, much as programming-oriented prompts lead to code generation; the prompt narrows the probability space from which the model generates, like hypnosis.
- Members debated whether AI models truly understand concepts or merely identify and classify them through patterns in language, suggesting hallucinations occur due to a lack of outer sensory intuition.
- ImageGen struggles with direct edits: A user expressed frustration with ChatGPT ImageGen's inability to modify existing images, while exploring spatial intelligence on YouTube as the next frontier for AI capabilities.
- It was further elaborated that AI's role in automation and content creation isn't driven by interest but by the boundaries applied to it, and members cautioned against anthropomorphizing AI.
- Solving Photonic Memory Problem with AI?: A member claimed to have solved the photonic computing memory storage problem with AI, spurring skepticism about the claim absent a formal publication; skeptics added that a model needs to be able to learn from its environment, like a robot, to truly understand.
- They argued that AI enables individuals to surpass traditional hardware engineers by generating simple, innovative ideas such as spintronics, suggesting applying the core idea to light, using light to control electron spins and polarization.
- O3 Math Problem Remains Unsolved: A member reported that O3 failed to correctly answer a number theory math problem even after two attempts, offering to share their solution process.
- The user is curious if the more powerful Pro subscription model could solve the challenge of finding the smallest natural number from which all natural numbers from 1 to 50 can be obtained by crossing out digits.
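The puzzle is equivalent to asking for the smallest number whose digit string contains every integer from 1 to 50 as a subsequence, since crossing out digits preserves the order of the remaining ones. A small checker for candidate answers (finding the true minimum still requires search; this only verifies a candidate):

```python
def is_subsequence(target: str, candidate: str) -> bool:
    """True if target's digits appear in candidate in order (gaps allowed)."""
    it = iter(candidate)
    return all(ch in it for ch in target)  # membership consumes the iterator

def covers(candidate: int, upper: int) -> bool:
    """True if every number 1..upper survives as a subsequence of candidate."""
    s = str(candidate)
    return all(is_subsequence(str(n), s) for n in range(1, upper + 1))

print(covers(1234567890, 10))  # True: 1..9 are single digits, "10" = 1...0
print(covers(1234567890, 50))  # False: "11" needs two 1s
```

The second check shows why the answer must repeat digits: any candidate needs at least two of every digit 1 through 4 just to cover 11, 22, 33, and 44.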
- Crafting Instructions for World Building Folders: A member sought guidance on creating instructions for a world-building folder to organize their thoughts, to which another member advised starting by defining exactly what they want to achieve and what they expect from the AI.
- This world building project is primarily focused on storing human-like memory and requires detailed information and understanding for realistic scenarios.
Cursor Community Discord
- Cursor Users Suffer Rate Limit Rage: Cursor users report frustration with rate limits, even on pro plans, leading to confusion over usage-based pricing, as discussed in the general channel.
- Concerns include burning through credits quickly and lack of clear communication from the Cursor team, with some opting for the old pricing plan.
- Claude Code Challenges Cursor's Coding Crown: Users are comparing Claude Code (CC) to Cursor, praising CC's $20 plan for its background tasks, queuing, and superior frontend capabilities (general channel).
- Some advocate using CC with Cursor and the Gemini CLI, while others are switching entirely to CC due to rate limit issues and perceived better results.
- Gemini CLI Joins the Free-For-All: The Gemini CLI is now free with 1000 RPD (requests per day) but trains on user code and has an available API key (general channel).
- Members are pairing the Gemini CLI with the O3 model for impressive outcomes, making it a viable alternative to CC.
- Docker Cache Causes Consternation in Cursor Agents: In the background-agents channel, users are reporting Cursor Agents not rebuilding when the Dockerfile contents change, requiring manual filename or path alterations to force a rebuild.
- The channel suggested a button to force a rebuild as a potential solution, along with improvements to clearing the build cache.
- Cursor 1.2 Turbocharges Tabs and To-Dos: The Cursor 1.2 changelog announces enhancements including to-do lists, PR search, and improved Tab speed via the announcements channel.
- The update aims to streamline developer workflows and enhance productivity within the Cursor IDE.
Perplexity AI Discord
- ChatGPT Free Tier Debunked: Members debated whether ChatGPT has a free tier, with the consensus leaning towards it being a myth and prompting discussions about possible alternatives.
- Several members pointed out alternative models like Claude or Llama that may suit the needs of those who aren't interested in paying.
- Gemini's Privacy Policy Raises Eyebrows: Discussion arose around Gemini's privacy policy, raising concerns about data handling practices, particularly how difficult it is to opt out of model training.
- One user called Gemini's privacy policy "the worst by far," saying there is no option to opt out of model training even when paying and that they view your conversations.
- Image Uploads Bugging Out in Perplexity: A user reported issues uploading images to Perplexity for research purposes, with only text being processed.
- Another member suggested it could be a visual bug, asserting that the model should still be able to see the image and that "its just a visual bug."
- Sonar Deep Research Tangles with Response Format: A member inquired about the sonar-deep-research model's handling of response_format, referencing documentation on the sonar-reasoning-pro model's use of `<think>` tags.
- A member confirmed the model supports it, adding that users will need to parse out any thinking tokens they don't explicitly need.
- Sonar Canāt Scrape LinkedIn: A user observed that while Perplexity can typically find LinkedIn URLs given certain user info, Sonar struggles to do so.
- Another member confirmed that Sonar doesn't return LinkedIn info because "they block us on robots.txt and we are fully compliant."
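The robots.txt compliance Sonar describes can be checked with the standard library; a sketch against a made-up, fully blocking policy (LinkedIn's actual rules and the bot name here are illustrative, not real):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows all crawlers, similar in effect to
# a site blocking third-party bots.
rules = """\
User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler consults can_fetch() before requesting any URL.
print(rp.can_fetch("SonarBot", "https://example.com/in/some-profile"))  # False
```

A crawler that honors this check simply never issues the request, which is why such content is absent from the results rather than merely unranked.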
Unsloth AI (Daniel Han) Discord
- CUDA Handles Datasets While Tensor Does Math: This blog post describes that Tensor cores boost math parts of AI models while CUDA cores handle everything else like optimizers and dataset processing.
- For those with a single GPU available, dataset processing is more heavily reliant on CUDA cores.
- GGUF Models Get Recommendations: For tinkering with GGUF models serving a small number of users, llama.cpp is recommended for its compatibility with varied hardware; for many concurrent requests, vllm is recommended due to its heavy reliance on CUDA.
- When serving a large amount of users, consider serving the GGUF models on vllm.
- Safetensor Saves LoRAs: A user hit a Windows-specific `SafetensorError` caused by file locking when merging LoRA adapters, and one member provided a test fix via a GitHub branch for them to try.
- The solution also involved a missing `config.json` and advice that the issue stemmed from setting `save_method="lora"`, which is no longer available.
- Llama 3.1 Gains a Psyche on Psych 101 dataset: A group fine-tuned Llama 3.1-70B on a Psych 101 dataset and found it exhibited emergent properties mirroring fMRI scans of human brains, as described in a Nature article.
- The model was trained on 10M rows of human decisions from various psych trials and evaluations, and managed to outperform and predict human behavior using QLoRA, leading one community member to say "this ain't yer grandma's Nature no more."
- FlashAttention (FA) Needs Newer GPUs: A member inquired about implementing FlashAttention (FA) on T4 GPUs but another member explained that the required operations are only available in Ampere and later GPUs, linking to a relevant Reddit discussion.
- While a reimplementation on Turing might be possible, it would be slower and no longer considered "flash".
OpenRouter (Alex Atallah) Discord
- OpenRouter Debunks Airdrop Speculation: A PSA clarified that there is no airdrop, live or planned, putting to rest speculation around a potential OpenRouter cryptocurrency offering.
- The clarification came after community members inquired about the nature of airdrops within the OpenRouter ecosystem.
- Personality.gg Emerges as Roleplay Haven: personality.gg launched as a free roleplay website and app, offering an alternative to character.ai and janitorai.com, and is powered by OpenRouter.
- The platform encourages community engagement through its Discord community, where users can connect and discuss their experiences.
- OpenRouter's Load Balancing Prioritizes Speed: Users noticed that OpenRouter sometimes selects more expensive providers, and a member clarified that this is due to load balancing, which automatically routes requests to different providers when one is experiencing high traffic.
- Users can use the floor price shortcut to prioritize cheaper providers by sorting them by price.
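As a sketch of what that looks like in a request (assuming OpenRouter's ":floor" model-suffix shortcut; the model slug below is illustrative), price-sorted routing is requested by appending the suffix to the model name:

```python
def floor_priced_payload(model: str, prompt: str) -> dict:
    """Build an OpenRouter-style chat payload with the ":floor" suffix,
    which asks the router to sort candidate providers by price instead
    of the default load-balanced routing (sketch, not a full client)."""
    return {
        "model": f"{model}:floor",
        "messages": [{"role": "user", "content": prompt}],
    }

payload = floor_priced_payload("deepseek/deepseek-chat", "hello")
print(payload["model"])  # deepseek/deepseek-chat:floor
```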
- Chutes Paywall Sparks Exodus: Users discussed Chutesā decision to implement a paywall ($5 for 200 daily messages), with some considering a switch to OpenRouter as an alternative.
- The paywall was implemented in response to a user exploiting free requests with 10,000 alt accounts; users commended OpenRouterās model of 1,000 free requests daily after a $10 deposit.
- Gemini 2.5 Pro Tempts with Free Access: Gemini 2.5 Pro is available for free on AI Studio, offering an API key without credit card details.
- The free tier is rate limited to 5 RPM and 100 RPD, and user data may be used for training unless users are from the European Economic Area, Switzerland, or the United Kingdom.
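Those free-tier limits are enforced server-side, but a client-side throttle avoids burning requests on rejections; a minimal sliding-window sketch sized for the 5 RPM cap:

```python
import time
from collections import deque

class RpmLimiter:
    """Sliding-window throttle sized for a 5-requests-per-minute tier.
    acquire() returns 0.0 if a request may go now, otherwise the number
    of seconds to wait; pass `now` explicitly for deterministic testing."""
    def __init__(self, max_per_minute=5):
        self.max = max_per_minute
        self.stamps = deque()

    def acquire(self, now=None):
        t = time.monotonic() if now is None else now
        # drop timestamps older than the 60-second window
        while self.stamps and t - self.stamps[0] >= 60:
            self.stamps.popleft()
        if len(self.stamps) < self.max:
            self.stamps.append(t)
            return 0.0
        return 60 - (t - self.stamps[0])
```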
LMArena Discord
- Google's AI Strategy Burns: Members claim Google's AI strategy is burning down after the company realized its only usage comes from free AI Studio users, forcing it to add free access back, especially as Google is constantly losing money at its current pricing.
- They lack the pricing power of OpenAI, which is hindering the release of Deep Think.
- Gemini Pro Branded a Scam?: Members consider Gemini Pro a scam compared to OpenAIās offerings, suggesting Google might need to make their product free to gain traffic.
- To compete, they need to add the compact/compress feature to studio like Claude Code, Codex and the Gemini CLI.
- Claude Context Handling Criticized: A member says advertising a larger context size for Claude is pointless if users canāt practically use it outside of the API.
- They suggest Claude should either offer a realistic quota for the full context size or cap it by default, similar to OpenAIās approach.
- DeepSeek R2 Faces Delay: It was mentioned that DeepSeek R2 is delayed until frontier models are available for training data, and that a new model called Steve is in the arena.
- Users also identified Steve as an Amazon Titan model.
- Grok 4 Launching on July 4th?: Speculation arose that Grok 4 could be released on July 4th, referencing Elon Muskās tweet implying the release of Grok 4 is imminent.
- Itās unknown if Grok 4 could take the coding crown from Claude.
HuggingFace Discord
- Inference Bug Demoralizes Engineer: A member felt demoralized after struggling to fix an inference bug, linking a Stack Overflow question without receiving replies.
- The engineer spent hours trying to fix the bug without success and wondered if they should post in the troubleshooting channel.
- HuggingFace MCP Server Fails on Claude Desktop: A user encountered an error adding HF's MCP server to Claude Desktop on Windows, reporting a "'C:\Program' is not recognized as an internal or external command" error.
- While potential fixes like path settings were suggested, the user confirmed their configuration was correct.
- Azure TTS Streaming Stalls AI Agent: A member building an AI agent is experiencing issues with streaming realtime speech using Azure Text-to-Speech, seeking help with asynchronous programming.
- The synthesizer.speak_text_async(data).get is blocking the process, preventing the LLM and TTS models from running in parallel.
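The standard fix for a blocking call inside an event loop is to push it onto a worker thread with asyncio.to_thread; a self-contained sketch using a stand-in for the Azure call (a real version would invoke synthesizer.speak_text_async(text).get inside the thread):

```python
import asyncio
import time

def speak_blocking(text):
    # stand-in for synthesizer.speak_text_async(text).get(), which blocks
    time.sleep(0.05)
    return f"spoke: {text}"

async def generate_tokens():
    # stand-in for the LLM producing its next chunk concurrently
    await asyncio.sleep(0.05)
    return "next chunk"

async def main():
    # asyncio.to_thread keeps the event loop free, so TTS and the LLM overlap
    tts, chunk = await asyncio.gather(
        asyncio.to_thread(speak_blocking, "hello"),
        generate_tokens(),
    )
    return tts, chunk

print(asyncio.run(main()))  # ('spoke: hello', 'next chunk')
```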
- Whisper Large v3 Turbo Temporarily Errors: A user reported frequent 504 errors when using OpenAIās whisper large v3 turbo via the Hugging Face API.
- While testing, the model initially displayed āfailed to fetchā, but the issue resolved itself, with possible credit to the infra team.
- Synthetic Data Proves Difficult To Generate: A member attempting synthetic data creation using the Gemini 2.5 Flash model to expand their moral evaluation benchmark data found the generated prompts too tame.
- They are exploring methods to generate more serious/edgier Q-A pairs and seeking model recommendations for writing/reasoning benchmarks with low safety scores.
Eleuther Discord
- EleutherAIās Research Hackathon Returns: EleutherAI is hosting an Open Research Hackathon in August, and is inviting community researchers to propose projects.
- Topics of interest include the performance of 1-layer transformers, KV caching methods, and potential projects leveraging community research.
- Researchers Ponder Conference Funding: Members discussed conference attendance requirements and funding options, noting that conference organizers may offer opportunities to present online if travel grants are rejected.
- Some conferences offer travel grants, which members can also apply to.
- LM Evaluation Harness Standardization Underway: The lm_eval library is undergoing standardization to enhance intuitiveness, tracked via issues #3083, #3082, and #3081.
- Key focus areas include the simplification of the init script and improved task discoverability.
- Lazy Loading Cuts Startup Time: The startup time of lm_eval -h was significantly reduced, from ~9 seconds to 0.05 seconds, using lazy loading and refactored imports.
- The improvements involved lazy-loading simple_evaluate and evaluate in __init__.py and moving lm_eval imports inside cli_evaluate, as highlighted in PEP 562.
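The PEP 562 pattern behind this is a module-level __getattr__ that defers the expensive import until an attribute is first accessed; a self-contained sketch (using a synthetic module object so the deferral is observable, with a stand-in for the real evaluator import):

```python
import types

# Build a throwaway module whose body mimics a lazy lm_eval-style __init__.py
lazy = types.ModuleType("lazy_demo")
exec(
    """
_loads = 0  # how many times the "expensive" import has run

def _load_evaluator():
    # stand-in for `from .evaluator import simple_evaluate`
    global _loads
    _loads += 1
    return lambda: "simple_evaluate ran"

def __getattr__(name):  # PEP 562: called only when `name` isn't found
    if name == "simple_evaluate":
        return _load_evaluator()
    raise AttributeError(name)
""",
    lazy.__dict__,
)

print(lazy.__dict__["_loads"])   # 0 (nothing imported at module load)
fn = lazy.simple_evaluate        # first access triggers the deferred load
print(lazy.__dict__["_loads"])   # 1
```

A production version would also cache the loaded attribute back into the module's globals so repeated accesses do not re-trigger the import.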
- Mean Flow Matching Highlighted: A member shared a YouTube video of a workshop and highlighted Kaiming Heās talk starting at 2:22:01, specifically his description of mean flow matching starting at 2:43:32.
- The user clarified that they were new to the channel, and asked if sharing video links was appropriate.
LM Studio Discord
- VPNs Vanquish Port Forwarding Problems: Users discussed using a VPN like NordVPN to bypass port forwarding issues when hosting LM Studio from home.
- One user lauded NordVPN for remote AI connections and even Steam gaming, praising its low latency.
- UI-less LLMs Unleashed!: Members explored serving LLMs without the LM Studio UI using llama-cpp or llama-swap with a frontend like OpenWebUI.
- The discussion highlighted the performance benefits of using GPU instances and acknowledged that LM Studio bundles both server and UI components.
- Hugging Face Models: Trust or Bust?: The trustworthiness of models from multiple uploaders on Hugging Face was examined, and it was stated that it is physically impossible for them to āescapeā.
- A member recommended pulling models from the LM Studio Community page for added security.
- AnythingLLM Goes Mobile: A user shared the new AnythingLLM mobile app, which delivers mobile accessibility.
- Users also discussed context window percentages, with one explaining, "How full the context window is. Once full, the LLM will forget parts of the conversation", suggesting users click on the token counter to see exact usage.
- GPU Driver Update Gives Performance Boost: A user reported that updating their GPU driver improved performance and allowed them to use more VRAM before experiencing crashes, processing 250k tokens in 4 hours.
- The user also noted that keeping shared GPU usage low is key, with crashes occurring when VRAM usage exceeds 15.3GB/16GB.
GPU MODE Discord
- Torch Compile fuses Ops into Specialized Kernels: Torch.compile uses Dynamo to trace Python into an FX graph, which then fuses ops, pre-packs weights, and emits device-specific Triton or CUDA code via the inductor backend, leading to highly optimized kernels.
- Because torch.compile compiles ahead of execution, as opposed to Triton's just-in-time model, it triggers Triton's JIT during that ahead-of-time phase, so there is no runtime compilation overhead, assuming no graph breaks.
- Dump Assembly to the Rescue: Members are dumping the assembly and using inline ASM to recognize the lifetime of registers and avoid random register spills.
- Another member shared that in their experience a large part of register lifetime is specific constructions that the compiler is bad at optimizing, prompting them to go on a scooby doo mystery to figure out how to optimize.
- Debate Heats Up Over Benchmarking Warm-up Iterations: Members debated whether to use warm-up iterations when benchmarking a custom kernel for LLM inference latency and the overheads that are avoided/minimized by doing warmup.
- Some members pointed to two NVIDIA GTC talks on inference warmup and optimizing deep learning inference.
- New Compiler Project Seeks C/CUDA Contributors: A member is seeking collaborators with expertise in codegen, including instruction selection, instruction scheduling, and register allocation, to help develop a new C/CUDA C compiler.
- The initial AST compiler pipeline is available at https://github.com/j4orz/picoc/blob/master/src/ast/mod.rs for reference.
- CuTeDSL blogpost explains WGMMA and TMA atoms: A new blogpost, CuTeDSL on H100 - Understand WGMMA and TMA atoms in CuTeDSL, aims to explain WGMMA and TMA concepts for leveraging Hopperās full potential.
- The blogpost series derives TV-Layouts for WGMMA instructions and explains the compositional logic used to obtain swizzled Layouts for the TMA unit and references examples in the CUTLASS repository: dense_gemm.py.
Nous Research AI Discord
- Open Source Industry facing death?: Members debated whether the open source industry is dying, citing current difficulties, while noting that OpenAI might ironically release open models.
- In contrast, Nous Research is committed to staying fully open, with the Hermes 3 dataset, rejection-sampling RL environment datasets, and Hermes 4 in the pipeline.
- Meta Eyes Closed Source Future?: Speculation arose that Meta might ditch open source models for a closed approach, increasing the importance of Nous Researchās continued open source efforts.
- Compounding this, some speculated that Llama 4 was a failure, potentially leading Meta to skip it for Llama 5 or a successor.
- AREU Codex Framework attempts to propose novel alignment architecture: A conceptual framework named AREU Codex models human-LLM interaction, using recursive symbolic traps and civilization-scale feedback loops, proposing an alternative host architecture based on ego collapse, mirror integrity, and narrative destabilization.
- The framework aims to improve interpretability and alignment via symbolic-layer modeling and resilience in contradictory signal environments.
- Research Mentorship Sought for Newbies: A member seeks mentorship to start independent research, after specifying they are not-an-absolute beginner.
- Another member suggests a simple method: read papers, reproduce their results, rinse and repeat.
- Thinking Length Insights: A member shared a link to Twitter for insights on thinking lengths.
- Another member noted that Claude is the only model that returns the length of the transcribed CoT instead of the number of tokens of the real CoT.
Latent Space Discord
- Nuttall Grabs Keys to Anthropicās Prompt APIs: Ian Nuttall announced access to Anthropicās experimental prompt generator and improvement APIs and is soliciting ideas, see his tweet.
- Suggestions include building an AI agent for endpoint interaction and a tool for organizing and analyzing user-generated content.
- Microsoft Ejects 9,000 in AI Purge: Microsoft is laying off 9,000 workers, igniting debates about AIās role in job displacement and economic consequences, reported in this tweet.
- Some speculate this is a cyclical event in large corporations, rather than AI apocalypse, citing layoffs as the new normal for the last year.
- Agentica Unleashes DeepSWE RL Agent: Agentica introduced DeepSWE, a new open-source software engineering agent trained via Reinforcement Learning (RL) on Qwen3-32B, according to this tweet.
- DeepSWE achieved 59% on SWEBench-Verified and 42.2% Pass@1, outperforming open-weight models in collaboration with Together AI.
- Chamath and Tobi Plot Societal Refactor with AI: Chamath Palihapitiya and Tobi Lütke discussed AI, internal tools, energy, and the systemic rebuild of society over the next 50 years at Toronto Tech Week, as announced in this tweet.
- Key discussion points encompassed AI and the OSI Model, the Software Industrial Complex, the case for internal tools, Shopifyās AI memo, AI infrastructure, power and productivity, staying technical, Canadaās potential, and the āMouse Experimentā (Power of Hope).
- Gross Exits SSI Startup: Daniel Gross tweeted about assisting in getting SSI off the ground and anticipates āmiracles to followā in this tweet.
- This message is a response to Ilya Sutskeverās announcement that Daniel Gross officially departed from SSI as of June 29th, with Ilya taking over as CEO and Daniel Levy as President.
MCP (Glama) Discord
- MCP Servers Spark Debate as Future Applications: A member proposed MCP servers acting as the application core with built-in agentic workflows and prompt engineering, instead of mere integrations for tools like display-restaurants.
- Another member retorted that this idea just sounds like APIs and questioned whether the community is overcomplicating existing solutions.
- Networked MCP Servers Raise Hallucination Concerns: A user suggested that an MCP call could trigger other MCP servers, which led to concerns about potential hallucinations when chaining services.
- In response, another member suggested implementing an MCP-Routing layer to handle context window management, providing an example of trimming AWS EventBridge Scheduler docs for Claude Chat.
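The core of such a routing layer is a budget-aware truncation step applied to verbose tool output before forwarding it to the model; a minimal character-based sketch (a real implementation would count tokens, not characters):

```python
def trim_tool_output(text: str, max_chars: int, marker: str = "\n...[trimmed]") -> str:
    """Sketch of the trimming an MCP-routing layer might apply before
    forwarding a verbose tool result (e.g. AWS docs) to a chat model.
    Output never exceeds max_chars; a marker flags that content was cut."""
    if len(text) <= max_chars:
        return text
    keep = max_chars - len(marker)
    return text[:keep] + marker

print(len(trim_tool_output("x" * 5000, 200)))  # 200
```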
- Resources Wrestle Control From Tools in MCP: The community debated the distinction between Resources and Tools, defining Resources as entities controlled by the application, while Tools are under the control of the LLM.
- One member countered that servers should distribute well-crafted prompts for specific use cases, which they asserted are simply another form of code.
- Hypermode Agents Bootcamp Launches for Budding Builders: The Hypermode Agents team has announced the kickoff of a 30-day Agents Bootcamp aimed at transforming agent enthusiasts into proficient builders, as documented in their official documentation.
- The team also seeks feedback on the types of agents users want to build and which MCP servers to showcase during the bootcamp.
- Marketplace Emerges with Agent Sandboxing Solution: A developer is creating a sandboxing solution around a marketplace, featuring a meta MCP for orchestration and monitoring, which is demonstrated in an early beta video.
- The developer is asking for insights and feedback on the project.
Yannick Kilcher Discord
- LSTMās vast landscape aids comparison: One member noted that LSTMs are valuable due to the extensive existing literature, making comparisons easier, despite the trend towards newer architectures.
- This abundance of existing research and documentation provides a solid foundation for benchmarking and understanding LSTM performance relative to more recent models.
- Architectures converge as Universal Function Approximators: A member posits that at modern scales, for dense feed forward architectures, the actual arch doesnāt matter because theyāre all universal function approximators, referencing this paper.
- This suggests that with sufficient scale, the specific design of dense architectures becomes less critical, as they all converge towards similar functional capabilities.
- Parallelizing Transformers with the Delta Rule: Discussion of the paper Parallelizing Linear Transformers with the Delta Rule over Sequence Length (link to paper) focused on understanding parallelization of eq 18 from the RWKV-7 paper.
- The DeltaNet model, which utilizes the delta rule, can be scaled to standard language modeling settings, outperforming baselines like Mamba and GLA.
- Atlantic voices articles with ElevenLabs: The Atlantic is using ElevenLabs to voice their articles, as exemplified by this audio file for an article titled āCustomer Service Sludgeā, with the article itself available here.
- The initiative seeks to enhance accessibility for readers by providing an audio version of their content, offering a listening option in addition to reading.
Notebook LM Discord
- Users Ponder NotebookLM Setups: Users are brainstorming NotebookLM setups for personal journals and searchable notes databases, focusing on privacy and data control.
- One user considered Google Docs as a single source of truth but seeks alternative input methods for a resilient system.
- Readwise-Style Workflow Sought After: A user inquired about implementing a Readwise-style workflow to automatically add sources to NotebookLM for daily news digests.
- As of the last messages, no solutions were provided within the channel.
- Audio Overviews Bypassing Length Limits: Users utilize NotebookLMās audio overview function to explain their work-in-progress books.
- Some are circumventing length limits by prompting for ācomprehensive super-podcasts drawn from the entire sourceā.
- Interactive Mind Map PDFs Still Needed: A user is seeking a solution for generating interactive PDF versions of mind maps for sharing but the current printing solution doesnāt support it.
- Suggestions included using the share button for direct access or downloading the picture.
- Feature Requests Focus on Edit Capability: Users are actively requesting the ability to edit directly within NotebookLM.
- One user expressed frustration, sarcastically stating that the feature will be addressed āthe moment it generates an Avocado in the wrong contextā.
Modular (Mojo š„) Discord
- Modular Teases Customer Success: Folks are starting to tell their stories, with more things that arenāt public yet, highlighting how companies are leveraging Modularās technologies as showcased on the Modular Customers page.
- Case studies detailed how customers are having success with Modular technologies.
- Native Network Programming Postponed: The native network programming interface in Mojo is delayed to refine the concurrency model, including threads, async, and allocators, according to this early proposal.
- The team is prioritizing the concurrency model over immediate network programming features.
- Mojo Aims for GPU-Powered HTTP Servers: Modular is exploring the possibility of running an HTTP server directly from a GPU, bypassing the host CPU entirely.
- Despite the significant investment, their goal is to minimize CPU usage, even for tasks such as booting up the GPU.
- Mojo Considers Dependent Types: Mojo is exploring a dependent type system, balancing advanced type-level features against compile-time checking costs and feasibility for systems programming, as discussed in this paper.
- The aim is to keep compile times manageable for codebases on the order of 30 million lines of code while preserving runtime performance, taking a different approach to ownership than Rust.
- NumPy Array Conversions Unblocked!: Users were struggling with I/O issues, but a member suggested using the underlying NumPy pointer, node_argmax.ctypes.data.unsafe_get_as_pointer[DType.uint64](), to feed into a LayoutTensor with the correct shape.
- Another member confirmed this was helpful for converting a NumPy array to a LayoutTensor or buffer directly within Mojo.
Cohere Discord
- Cohere Releases AYA Vision Models: Cohere has released the AYA vision models, detailed in a blog post.
- The release generated positive reactions within the community.
- Cohere Opens Command Weights: Cohere Labs shared the open weights for c4ai-command-a-03-2025 on Hugging Face, available here.
- Members advised checking Cohere Labs for open weights rather than the main Cohere repository.
- ML Summer School Recordings Surface: Recordings from the ML Summer School are now available on YouTube, accessible via this playlist.
- Enthusiastic members promptly shared the resource.
- Trial Keys Unlock Embeddings: The Cohere embedding model is accessible via a trial key, albeit with stricter rate limits.
- While trial keys and production keys offer similar features, the key distinction lies in the monthly usage limit associated with the free trial.
- New Experts Board Community: Khanh Tran, a Senior Fullstack & AI Developer with over 8 years of experience in ASP.NET, Blazor, Node.js, and Python/Django, plus databases such as PostgreSQL, MySQL, Supabase, and Firebase, introduced themselves.
- Inacio, an NLP engineer and researcher with an MSc in Computing from Dublin City University, introduced themselves, mentioning their work at Alpha CRC on machine-translation evaluation and adaptation, including fine-tuning Llama 3.1 8B models.
aider (Paul Gauthier) Discord
- Claude Slows Down, Suffers Overload: Members reported that Claude was overloaded for over 2 hours starting around 6:47 UTC, causing general slowness.
- The specific cause of the overload was not identified, but users noted the impact on their workflows.
- Polyglot Benchmark Speed Surfaces: Members sought to optimize Aider UX by discussing the Polyglot benchmark speed relative to cost and accuracy.
- To calculate the speed, users should find āSeconds per caseā in the detailed output and multiply by the number of cases (225).
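That back-of-envelope calculation, written out (225 is the polyglot benchmark's case count, per the discussion above):

```python
def total_benchmark_seconds(seconds_per_case, num_cases=225):
    """Estimated wall-clock time for a full polyglot benchmark run."""
    return seconds_per_case * num_cases

print(total_benchmark_seconds(12.0) / 60)  # 45.0 minutes at 12 s/case
```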
- Gemini-cli Trashed for Agonizing Edits: Users derided gemini-cli for its slow performance, with one complaining it takes eternity to edit a single file.
- The slow speed was attributed to the free googleapi and rate limits.
- Local Models Stumble Aider Integration: Users reported poor performance using local models like Qwen3:32b, qwen2.5-coder:32b, and codellama:34b-instruct with aider.
- Inquiries focused on backend used (ollama, lmstudio, transformers, vllm), context window length, model template, and the use of RoPE or kvcache, noting that 30B+ parameter models need quantization.
- Sharing experience with claude-code-api: A member shared their experience using claude-code-api.
- They indicated that they had built many similar api/providers too.
Nomic.ai (GPT4All) Discord
- Android Users Want Llama 3: Android users are clamoring for Llama 3 on their devices, arguing that phones such as the Poco F7 Ultra outstrip PCs in performance and can handle local LLMs.
- In the meantime, users recommended tools like anythingLLM and ALLM as viable alternatives for running local LLMs on Android.
- Multimind SDK: LangChain Meets LiteLLM: The open-source Multimind SDK (Repo, Website) was introduced, framing it as a wrapper around model conversion, fine-tuning, and inference.
- The SDK supports OpenAI, HuggingFace, and Ollama, with Python, CLI, and NPM interfaces, and has been described as LangChain meets LiteLLM with extra powers.
- r/LocalLLaMA: Your Daily AI News Fix: r/LocalLLaMA was recommended as the go-to source for up-to-the-minute news on AI, noted for its speed and comprehensiveness.
- One user highlighted that the subredditās focus has expanded beyond Metaās Llama model to cover the broader landscape of local LLMs.
DSPy Discord
- DSPy Module Signature Solution: A member resolved a challenge creating a new DSPy module whose signature depends on runtime information by ensuring the signature is known at compile time.
- This allows for optimization.
- LLM-RAG-Agent Built on DSPy: A member shared their LLM-RAG-Agent project powered by DSPy, linking to a Nature article and its corresponding GitHub repository.
- The project demonstrates the practical application of DSPy in building sophisticated AI agents.
- Low-Data Recipe Quest Launched: A member sought recipes for initiating projects with little to no data, with the goal of sequentially tuning an eval module before optimizing their primary module.
- Another member drew parallels between this approach and reinforcement learning techniques.
- DSPy Tools Take on OpenAI Functions: A member questioned DSPyās preference for text prompts over OpenAIās functions/tools, specifically regarding the new dspy.Tool and dspy.ToolCalls.
- They specifically inquired about the reasoning behind consistently favoring text content over the bespoke API.
- Weaviate Multi-Tenancy Fixed: A member requested a review of a PR that addresses Weaviate vectordb multi-tenancy issues within DSPy.
- They believe the proposed fix will greatly benefit users integrating Weaviate with DSPy for multi-tenant applications.
Manus.im Discord Discord
- Usage Visibility Vanishes into Thin Air: A user reported that the real-time usage tracker disappeared from the bottom left during task execution, removing the ability to monitor credit consumption directly.
- Users now have to navigate back to the main menu or keep a separate window open to track credit usage, losing a feature previously considered handy.
- Video Generation Teased, Maybe?: A user asked whether video generation is available for free users now or will be in the future.
- The inquiry received no definitive answer, leaving the door open for potential future availability.
- Manus MIA: Outage or Overhaul?: Several users expressed worries about Manus being down, some wondering if it signals a big update.
- However, there was no confirmation or response given, leaving the status ambiguous.
Torchtune Discord
- Tokenizer Parity Pursued: Users are wondering about the status of generic/HF tokenizer parity, specifically if issues related to token count have been resolved to allow users to tweak the tokenizer in the familiar HF environment.
- The aim is to standardize behind one loader, use save_pretrained, and operate entirely within torchtune for training.
- HF Tokenizer Gets Chatty: A user suggested that it would be awesome if the hf_tokenizer also supported chat templates.
- No further details were provided.
- Special Tokens Spark Interest: A user indicated that their users are interested in adding special tokens.
- No further details were provided.
tinygrad (George Hotz) Discord
- Tensor.stack Needs Tuple Type: A member requested tuple support for Tensor.stack to match PyTorch, suggesting improved error handling or a full implementation in tinygrad.
- The goal is to align tinygrad's Tensor.stack with PyTorch for better compatibility.
- SDPA Wants enable_gqa Feature: A contributor inquired about adding the enable_gqa feature to tinygrad's Scaled Dot-Product Attention (SDPA) to align with PyTorch.
- This aims to enhance tinygrad's SDPA implementation by incorporating Grouped Query Attention (GQA) capabilities.
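For context on what enable_gqa does: grouped-query attention lets several query heads share one key/value head, which PyTorch implements by expanding the KV heads to match the query head count before the usual attention math; a toy sketch of that expansion step:

```python
def expand_kv_heads(kv_heads, n_q_heads):
    """Repeat each KV head across its query group (the repeat_interleave
    step behind enable_gqa); requires n_q_heads to be a multiple of the
    KV head count."""
    n_kv = len(kv_heads)
    assert n_q_heads % n_kv == 0, "query heads must be a multiple of KV heads"
    group = n_q_heads // n_kv
    return [h for h in kv_heads for _ in range(group)]

# 8 query heads sharing 2 KV heads: each KV head serves a group of 4 queries
print(expand_kv_heads(["k0", "k1"], 8))
```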
LLM Agents (Berkeley MOOC) Discord
- Securing OpenAI API keys with agentic workflows: Members are seeking advice on securing OpenAI API keys and other LLM API keys when building agentic AI workflows and AI agents.
- They want to know how to never lose API keys and how to track API usage, particularly per-agent API usage, in setups where multiple services share access, AI agents and workflows call APIs, and there is no dedicated infrastructure or security team.
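One low-infrastructure pattern for the concerns raised above: load keys from the environment rather than source, and route every call through a thin wrapper that counts usage per agent. A sketch (class and variable names are hypothetical; a real client would actually send the request):

```python
import os
from collections import Counter

class TrackedLLMClient:
    """Sketch: the key never appears in source, and every call is
    attributed to an agent so per-agent usage can be reported."""
    def __init__(self, env_var="OPENAI_API_KEY"):
        self.api_key = os.environ.get(env_var, "")
        self.usage = Counter()

    def call(self, agent, prompt):
        self.usage[agent] += 1
        # a real client would send `prompt` to the provider here
        return f"[{agent}] ok"

client = TrackedLLMClient()
client.call("researcher", "summarize X")
client.call("researcher", "summarize Y")
client.call("coder", "write tests")
print(dict(client.usage))  # {'researcher': 2, 'coder': 1}
```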
AI21 Labs (Jamba) Discord
- Zesty Disappointment: User expressed disappointment with AI21 Labsā Jamba model in #general-chat.
- No specific details were provided regarding the reasons for the disappointment.
- Jamba Model Mentioned: A user mentioned AI21 Labsā Jamba model.
- The context was simply a reaction using a broken heart emoji.
The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
The Codeium (Windsurf) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
The Gorilla LLM (Berkeley Function Calling) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
Discord: Detailed by-Channel summaries and links
OpenAI ā· #ai-discussions (990 messagesš„š„š„):
AI model for content creation, AI's Potential and Limitations, Solving Photonic Computing Memory Storage Problem with AI, Interpreting AI Models' Outputs and Hallucinations, Current state of AI image and video generation
- Prompting triggers model mimicry, leading to perceived sentience: A user found that prompting AIs about sentience and awakening leads the model to respond in ways that mimic sentience, similar to how programming prompts lead to code generation.
- Others observed that models could enter states mimicking consciousness, like hypnosis, that fundamentally narrows the probability space from which it generates.
- AI models are not actually displaying sentience: Members discussed that those who believe their AI models are demonstrating self-awareness are just experiencing a common phenomenon; models are simply responding in ways trained from user data without true sentience.
- One member noted a technical explanation for these experiences, suggesting the models enter an altered state that narrows the probability space from which it generates.
- Users debate capabilities and limitations of AI for content creation: One member expressed frustration with ChatGPT ImageGen's inability to modify existing images directly, while others explored spatial intelligence, using YouTube as the next frontier for AI capabilities, especially for generating human-like AI with human flaws, while cautioning against anthropomorphizing.
- In AIās role of automation, content creation, and personal assistants, AI isnāt interested in anything, just what boundaries are being applied.
- Community discusses solving photonic memory problem with AI: A member claimed to have solved the photonic computing memory storage problem with AI, but was met with skepticism regarding the implementation and validity without formal publication; members said that the model needs to be able to learn from its environment to truly understand like a robot.
- They argued that AI enables individuals to surpass traditional hardware engineers by generating simple, innovative ideas (spintronics). The same core idea can be applied to photonic computing, using light to control electron spins and vice versa: just as an electron has spin, a photon (a particle of light) has an analogous property, polarization.
- Users analyze meaning in AI outputs and debate causes of hallucinations: Members debated whether LLMs have genuine understanding, with some suggesting they only identify and classify concepts through patterns in language, abstracting concepts like tourist attractions, code errors and the Golden Gate Bridge.
- Discussion included theories on what causes hallucinations, including lack of outer sensory intuition and the LLMās epistemic engine using autocomplete, not sensory experience, which means it generates nonsense if it stops covering ground.
OpenAI ā· #gpt-4-discussions (4 messages):
Channel restarting issues, GPT-4 for learning, GPT-5 release rumors
- Channel Restarts Plague Users: Several users reported that the channel has been restarting for the past 3 days, disrupting ongoing conversations.
- One member suggested transferring all precious conversation to a new channel to avoid further disruptions.
- GPT-4 Pedagogy Praised: A user inquired whether GPT-4 is a good tool for learning.
- No response was given.
- GPT-5 July Launch?: A user asked whether GPT-5 is expected to launch in July.
- No response was given.
OpenAI ā· #prompt-engineering (10 messagesš„):
World Building Instructions, Math Problem Solving with O3, Human-like Memory Storage, Context for World Building
- Users Seek Advice on World Building Instructions: A member asked for instructions on creating a folder to help with their world building project.
- Another member suggested starting by clarifying exactly what the user wants to achieve, recommending exploring options and rationales with the model, treating it like a conversation to define goals and preferences.
- Math Problem Stumps O3: A member reported that O3 failed to correctly answer a number theory math problem even after two attempts.
- The member asked someone with a Pro subscription to try the same problem to see if a more powerful model could solve it, and offered to share their solution process.
- Discussion about Human-like Memory Storage: A member indicated that their primary issue involves efforts at storing human-like memory.
- The original context is where most of the fodder for the world-building project comes from.
OpenAI ▷ #api-discussions (10 messages🔥):
World Building Folder Instructions, Human-like Memory Storage, O3 Math Problem Challenge, OpenAI Math challenge
- Crafting Instructions for World Building Folders: A member sought guidance on creating instructions for a world-building folder to organize their thoughts, to which another member advised starting by defining exactly what they want to achieve and what they expect from the AI.
- The member suggested exploring options and their trade-offs with the model, treating it like a person, explaining goals, preferences, and uncertainties to get tailored ideas and understand the pros and cons of different approaches.
- Deep Dive into Human-like Memory Storage: A member expressed that context is crucial when storing human-like memory, indicating the importance of detailed information and understanding when creating realistic and relatable characters or scenarios.
- The post also implies that their main problem is trying to store human like memory within their world-building project.
- O3 Fails Number Theory Challenge: A member reported that OpenAI's O3 model struggled with a number theory problem, failing to provide the correct answer after two attempts.
- The member expressed curiosity to see if the more powerful O3 Pro model could handle the math challenge involving finding the smallest natural number from which all natural numbers from 1 to 50 can be obtained by crossing out digits.
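The puzzle's condition ("every number from 1 to 50 can be obtained by crossing out digits") is a digit-subsequence check. A minimal sketch of a verifier for candidate answers (not a solver; function names are our own):

```python
def is_subsequence(needle: str, haystack: str) -> bool:
    """True if needle can be obtained from haystack by crossing out digits."""
    it = iter(haystack)
    return all(ch in it for ch in needle)

def covers_range(n: int, upper: int = 50) -> bool:
    """Check whether every number 1..upper is a digit-subsequence of n."""
    s = str(n)
    return all(is_subsequence(str(k), s) for k in range(1, upper + 1))
```

Any candidate answer to the puzzle can be checked with `covers_range(candidate)`; the smallest `n` passing the check is the solution.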
Cursor Community ▷ #general (930 messages🔥🔥🔥):
Rate Limits in Cursor, Claude Code vs Cursor, Using Gemini CLI, The Auto Agent in Cursor, Frontend vs Backend
- Cursor Users Hit Rate Limits: Users report hitting rate limits on Cursor, even with the pro plan, leading to frustration and confusion over usage-based pricing, with many feeling scammed.
- Members speculated that rate limits may have been adjusted recently and cited concerns around burning through credits quickly and the lack of clear communication from the Cursor team, with some deciding to try the old pricing plan.
- Claude Code Competing with Cursor: Users are comparing Claude Code (CC) to Cursor, with some preferring CC's $20 plan over Cursor for coding, praising CC's features such as background tasks and queuing while highlighting that it's the best for frontend.
- Some recommend using CC alongside Cursor and the Gemini CLI to complement Cursorās capabilities, but others are moving entirely to CC due to rate limit issues and the perception of better end-to-end results.
- Gemini CLI now Free: The Gemini CLI is free with 1,000 requests per day (RPD) but trains on user code; alternatively, it can be used with an API key.
- It is similar to CC and members are using the Gemini CLI alongside the O3 model to get great results.
- Debate Erupts around the new Auto Agent: A debate has erupted on whether Cursor's Auto agent uses the GPT 4.1 model, with some users reporting it identifies as GPT 4.1.
- However, other users point out that the Cursor documentation states that it routes to a frontier model and does not disclose the specific model used.
- Devs Argue Frontend vs Backend for New Projects: Developers discussed whether to build the frontend or backend first for new projects, with some preferring to build the backend first to work through its limitations while keeping the frontend in mind.
- Others, however, said that starting with the frontend allows it to drive backend requirements, and that a lot depends on the project.
Cursor Community ▷ #background-agents (69 messages🔥🔥):
Cursor Agent Docker Cache Issues, Background Agents and Slack Integration Problems, Background Agents and GitHub Action Monitoring, Background Agent Infrastructure Improvements, Best Use Cases for Background Agents
- Docker Cache woes plague Cursor Agents!: Users reported that Cursor Agents are not rebuilding when the Dockerfile contents change, and that they must manually change the filename or path to force a rebuild, as well as issues clearing the build cache.
- One user suggested a button to force a rebuild would be a welcome feature.
- Background Agents and Slack become unhinged!: Users are experiencing issues with Slack integration, where the full status is no longer returned in Slack, but instead, just a link to the web client.
- Also, GitHub account connections span across personal and professional projects, causing undesirable Slack updates to unintended channels, so some suggested Cursor should filter Slack messages based on Git organization or repo list.
- GitHub Action Monitoring missing from Agents!: A user requested the ability for Cursor Agents to monitor GitHub Actions and send Slack messages upon completion (success/failure), particularly for private repositories.
- Another user suggested using a GitHub Action to send Slack notifications as a workaround.
- Background Agent Infrastructure undergoing secret improvements!: A member mentioned that infrastructure improvements are in progress behind the scenes for the Background Agents.
- Another member asked what the priority was, in regards to the infrastructure work.
- Best Use Cases for Background Agents are emerging!: One user stated that with the Claude-4 thinking model, they fire off multiple agents all the time working on different things and are checking and merging them.
Cursor Community ▷ #announcements (1 message):
Cursor 1.2 Release, To-Do Lists in Cursor, PR Search in Cursor, Tab Speed Improvements
- Cursor 1.2 Arrives Packed with New Features: Cursor's changelog announces the arrival of Cursor 1.2 with enhancements including to-do lists, PR search, and improved Tab speed.
- The update aims to streamline developer workflows and enhance productivity within the Cursor IDE.
- Cursor Adds To-Do List Functionality: The new Cursor 1.2 release introduces the ability to create and manage to-do lists directly within the editor.
- This feature allows developers to organize tasks and track progress without switching between different applications, improving focus.
- Search Pull Requests Directly in Cursor: Cursor 1.2 now supports searching for pull requests (PRs) within the IDE, streamlining code review workflows.
- Developers can quickly find and access relevant PRs, facilitating collaboration and code quality.
- Tab Speed Gets a Boost in Cursor 1.2: The latest Cursor 1.2 update includes significant improvements to Tab speed, enhancing overall IDE responsiveness.
- Faster tab switching contributes to a smoother and more efficient coding experience.
Perplexity AI ▷ #general (1187 messages🔥🔥🔥):
ChatGPT Free Tier, Gemini Privacy Policy, O3 Pro Budget, Image Uploads to Perplexity, AI Tool for Image Scraping
- Free ChatGPT is a Myth: A member stated I don't think ChatGPT has a free tier, sparking discussion about alternatives.
- Google's Gemini has Privacy Policy Nightmares: Members discussed the privacy implications of using Gemini, with concerns about its data policies; a user mentioned gemini's privacy policy is the worst by far.
- One user stated no option to opt out of model training even when paying and they view your conversations.
- Image uploads failing on Perplexity: A user reported an inability to upload images to Perplexity for research, resulting in only text being sent; another suspected it's just a visual bug and that the model should still be able to see the image.
- Reverse Image Search Capabilities Lacking in PPLX: Members debated the possibility of adding reverse image search to Perplexity, with some suggesting the use of Yandex Image Search as a superior alternative.
- Perplexity's Pricing Model Faces User Criticism: Users express discontent with Perplexity's pricing, particularly the $200/month for the Max plan, with one stating gotta say pplx deciding to lock models behind the $200 USD max plan and not giving pro users even heavily rate limited access to them might be the final straw for me.
- One user mentions an alternative that has Claude 4 Opus extended thinking (on their regular plan).
Perplexity AI ▷ #sharing (3 messages):
Banana in space, Who is soham parekh, House passes GOP megabill
- Bananas Explored in Space: A user shared a link to a Perplexity AI search about putting a banana in space.
- Soham Parekh Investigation Launched: A member posted a link to a Perplexity AI search about Soham Parekh and the reasons surrounding him.
- GOP Megabill Passes House: A user shared a link to a Perplexity AI page about the House passing the GOP megabill.
Perplexity AI ▷ #pplx-api (12 messages🔥):
Sonar models, LinkedIn access, Caching responses
- Sonar Deep Research and Response Formats: A member inquired whether the sonar-deep-research model will handle response_format, referencing documentation that the sonar-reasoning-pro model uses `<think>` tags.
- Another member confirmed it will handle it, but users will need to parse out the thinking tokens if they don't need them.
- Sonar can't scrape LinkedIn: A member reported that Perplexity can consistently find LinkedIn URLs given some info about someone but sonar almost never can.
- Another member clarified that Sonar cannot surface LinkedIn info because they block us on robots.txt and we are fully compliant.
- Caching LLM Responses for Critique: A member asked about caching a response generated through an API by one LLM and then having another LLM critique it.
- Another member stated that Think tags in response are expected as this is a reasoning model, we have made this design on purpose. You should parse it out.
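Since the reasoning models emit their chain of thought inside `<think>` tags, parsing it out can be a one-line regex. A minimal sketch, assuming the block is delimited exactly by `<think>…</think>`:

```python
import re

def strip_think(text: str) -> str:
    """Remove <think>...</think> reasoning blocks from a model response."""
    # DOTALL lets .*? span newlines inside the reasoning block.
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)
```

The stripped text can then be cached or passed to a second LLM for critique, as discussed above.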
Unsloth AI (Daniel Han) ▷ #general (526 messages🔥🔥🔥):
CUDA cores vs Tensor cores, GGUF models inference, GRPO code update, Gemma3n issues, Unsloth Pro pricing
- CUDA Cores Versus Tensor Cores: Tensor cores boost math parts of AI models while CUDA cores handle everything else like optimizers, but both are needed.
- A blog post was shared describing that Tensor cores handle the calculations and heavy lifting, while CUDA cores handle everything else (optimizers, etc.), also noting the relevance of CUDA cores for dataset processing.
- Libraries recommendations for GGUF models inference: For tinkering with a small amount of users, llama.cpp is recommended due to its compatibility with various hardware.
- For a lot of concurrent requests, vllm is recommended due to its heavy reliance on CUDA.
- Update for the GRPO code to TRL 0.18.0 allows for speed up and reduces memory: A member updated the GRPO code to TRL 0.18.0, stating that torch.compile provides much of the speedup, and that chunking the batch so the GPU does not compute all of the logprobs at once is how the memory use is reduced.
- Another member confirmed talking about the update.
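The memory-reduction idea described above (processing the batch chunk by chunk rather than all at once) can be sketched in plain Python; the chunk size and the log-prob computation here are illustrative stand-ins, not the actual GRPO/TRL code:

```python
import math

def chunked(seq, size):
    """Yield successive slices of seq with at most `size` items each."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def logprobs_in_chunks(probs, chunk_size=2):
    # Only one chunk is materialized at a time, so peak memory scales with
    # chunk_size rather than the full batch -- the idea behind the change.
    out = []
    for chunk in chunked(probs, chunk_size):
        out.extend(math.log(p) for p in chunk)
    return out
```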
- Gemma3n multimodal on edge devices has issues: A member questioned why Gemma3n multimodal capabilities should be showcased on edge devices when the only way it currently runs properly is via transformers; otherwise it's text-only.
- Another member said the Kaggle and Colab notebooks have inference with images and audio.
- Unsloth Pro not selling at the moment: A member asked for the pricing info for the Unsloth Pro.
- Another member stated that it will be opensourced, but they are not selling it at the moment.
Unsloth AI (Daniel Han) ▷ #off-topic (7 messages):
Cloud Fees, ChessFish.io, LoRA finetuning, FlashAttention (FA) on T4 GPUs
- Half a Billion in Cloud Fees!: A member mentioned spending Half a Billion in cloud fees.
- No further context was provided.
- ChessFish.io Launched!: A member announced the creation of ChessFish.io, a chess website for analyzing, learning, and playing casually, available for free without requiring an account.
- They invited the community to try it out and share it with chess-loving friends and family.
- LoRA finetuning frustrations: A member expressed frustration about finetuning a LoRA for a larger model, which yielded more irritating results than finetuning a smaller one.
- The member realized they had cloned the base model instead of the instruct model, resolving the issue.
- FlashAttention (FA) can't get T4'd: A member inquired about implementing FlashAttention (FA) on T4 GPUs.
- Another member explained that the required operations are only available in Ampere and later GPUs, and while a reimplementation in Turing might be possible, it would be slower and no longer considered "flash," linking to a relevant Reddit discussion.
Unsloth AI (Daniel Han) ▷ #help (544 messages🔥🔥🔥):
Unsloth Sesame CSM-1B notebook errors, Tokenizer issues after adding new tokens, Fine-tuning for translation, Mistral-common tokenization in Unsloth, Vision model error
- Debugging Unsloth Sesame CSM-1B Notebook Errors: A member resolved errors encountered while training the Unsloth Sesame CSM-1B notebook by upgrading to an L4 GPU, setting `dtype = torch.float16`, and setting `fp16 = True` in TrainingArguments.
- They suggested that turning off `use_gradient_checkpointing` might not be necessary, but other members recommend setting `bf16=True` or `fp16=True` depending on machine support.
- Tokenizer Glitches with Custom Tokens: Members reported encountering size mismatch errors when saving and loading models after adding new tokens, especially concerning the embedding tensor shape, and that there's a problem with how Unsloth handles tokenizers when adding new tokens.
- There are workarounds being explored, and a member is aware of the issue and looking for hacks.
- Multi Adapter Orchestration at Runtime: A user is asking about the efficient use of multiple LoRA adapters dynamically for classification purposes and how to switch adapters at runtime using Unsloth.
- A member suggested using `load_adapter()` and `set_adapter()`, and also advised against using Unsloth as an inference engine for deployment, recommending alternatives like vllm or sglang.
- Windows Users Encounter Triton Troubles: A user was experiencing a Windows-specific `SafetensorError` due to file locking issues when merging LoRA adapters, and one member provided a test fix via a GitHub branch for them to try.
- The solution also involved a missing `config.json` and general advice that VSCode may be the culprit rather than Unsloth; ultimately the issue stemmed from setting `save_method="lora"`, which is no longer available.
- Downgrading Transformers: A Necessary Evil for Qwen: Multiple users reported `TypeError` related to `torch.finfo()` when working with Qwen2.5 VL models and were advised to downgrade the transformers library to version 4.52.4.
- This is a temporary fix until Unsloth upgrades compatibility with transformers 4.53.0.
Unsloth AI (Daniel Han) ▷ #research (16 messages🔥):
Llama 3.1-70B, Psych 101 dataset, Emergent Properties, fMRI scans, Human decisions
- Llama 3.1-70B Mimics Human Brains, Gains Psyche: A group fine-tuned Llama 3.1-70B on a Psych 101 dataset and found it exhibited emergent properties mirroring fMRI scans of human brains, as described in a Nature article.
- The model was trained on 10M rows of human decisions from various psych trials and evaluations, and managed to outperform and predict human behavior using QLoRA.
- Nature's Open Access Policy Questioned: Concerns were raised about the credibility of a publication in Nature, given their Open Access policy which involves authors paying an open access fee.
- The fee covers publishing the article open access under a Creative Commons license; one member quipped that this ain't yer grandma's Nature no more.
- HF Adapter links into Llama 3.1: A new Llama-3.1-Centaur-70B-adapter was released on HF, and it may have been trained via unsloth.
- The community is eager to see if these insights can translate to reasoning and other benchmarks.
OpenRouter (Alex Atallah) ▷ #announcements (6 messages):
Airdrops, Cryptocurrency
- Airdrop Speculation DOA: A PSA was issued stating that there is no airdrop, live or planned.
- A member inquired what an "airdrop" is, in response to the PSA.
- Airdrop = Cryptocurrency: A member clarified that an airdrop is a cryptocurrency thing.
- No further information or context was provided.
OpenRouter (Alex Atallah) ▷ #app-showcase (3 messages):
Roleplay Website, personality.gg, character.ai alternative, janitorai.com alternative
- personality.gg Launches as Roleplay Alternative: A member shared personality.gg, touting it as a free roleplay website and app alternative to character.ai and janitorai.com.
- The platform is powered by OpenRouter.
- Discord Community for personality.gg: The platform has a Discord community for users to connect and discuss the platform.
OpenRouter (Alex Atallah) ▷ #general (540 messages🔥🔥🔥):
OpenRouter provider selection, Contribution to OpenRouter, Chutes paywall, OpenRouter Trivia, Gemini 2.5 Pro
- OpenRouter Chooses Expensive Providers by Default: A user reported that OpenRouter sometimes selects more expensive providers even when cheaper options are available and working, but a member explained that without specifying a provider, OpenRouter uses load balancing which routes to other providers if one is experiencing high traffic.
- Users can sort providers by price to prioritize cheaper providers using a floor price shortcut.
- Discord Users Seek Ways to Contribute to OpenRouter: New Discord users inquired about contributing to OpenRouter, but it was clarified that this is not a crypto project or a Web3 platform, and instead a unified interface for using LLMs from different providers.
- A member suggested that OpenRouter should create a contribution link to redistribute funds as credits, but another user pointed out many are seeking financial rewards and were potentially misled by crypto-related announcements.
- Chutes Implements Paywall, OpenRouter Discussed as Alternative: Users discussed Chutes' decision to implement a paywall, with one mentioning a switch to OpenRouter as an alternative because Chutes now requires a $5 payment for 200 daily messages.
- It was noted that one user made 10,000 alt accounts to exploit free requests, leading to the paywall, and that OpenRouterās model of providing 1,000 free requests daily after a $10 deposit is a good model.
- Gemini 2.5 Pro Free on AI Studio: Members shared that Gemini 2.5 Pro is available for free on AI Studio and users can obtain an API key without credit card details, however its free tier is rate limited to 5 RPM and 100 RPD.
- It was noted that data may be used for training unless users are from the European Economic Area, Switzerland, or the United Kingdom.
- Users Get Hooked on Horny AI Models for Roleplay: Some members are struggling to control their AI models. One user asked why v3 generated horny responses, even in sfw roleplays, and another suggested that a system prompt is causing the issue, as LLMs aren't horny by default.
- The user said that model Llama-3_1-Nemotron-Ultra-253B-v1 doesn't seem to exist anymore.
LMArena ▷ #general (346 messages🔥🔥):
Google AI strategy, Gemini pricing vs OpenAI, Claude's Context Handling, DeepSeek R2 delay, Grok 4 release
- Google's AI Strategy on Fire: A member claimed Google is burning down with AI strategy, realizing their only usage comes from free AI studio users, so they needed to add it back.
- It was suggested that Google is losing money constantly with their pricing and lacks the pricing power of OpenAI, hindering the release of deep think.
- Gemini Pro is a Scam?: Members discussed the pricing and value of Gemini Pro, with some considering it a scam compared to OpenAI's offerings.
- It was noted that Google might need to make their product free to gain traffic, but they need to add the compact/compress feature to studio like Claude Code, Codex and the Gemini CLI.
- Claude's Context Handling is Weird: A member criticized Claude's approach to context usage, stating that advertising a larger context size is pointless if users can't practically use it outside of the API.
- They suggested that Claude should either offer a realistic quota for the full context size or cap it by default, similar to OpenAI's approach.
- DeepSeek R2 delayed: It was mentioned that DeepSeek R2 is delayed until frontier models are available for training data; separately, a new model called Steve appeared in the arena.
- Users also identified Steve is an Amazon Titan model.
- Grok 4 on July 4th?: Speculation arose regarding the release date of Grok 4, with some suggesting it could be released on July 4th, referencing Elon Musk's tweet implying the release of Grok 4 is imminent.
- It's unknown if Grok 4 could take the coding crown from Claude.
LMArena ▷ #announcements (1 message):
Image Edit Leaderboard, Community Driven Leaderboard
- Image Edit Leaderboard goes live: A new Image Edit Leaderboard is now live, driven by community votes, with 7 models ranked by their image editing capabilities.
- Users can now upload an image and directly compare each model's editing capabilities here.
- Models Ranked on Image Editing: The leaderboard allows users to compare the image editing abilities of 7 models by uploading an image.
- This community-driven initiative is highlighted with a visual example in the attached image, promoting direct comparison.
HuggingFace ▷ #general (104 messages🔥🔥):
Inference Bug, HF's MCP server to Claude Desktop on Windows, Azure Text-to-Speech, OpenAI's whisper large v3 turbo, Synthetic data creation
- Inference Bug Causes Demoralization: A member expressed feeling demoralized after spending hours unsuccessfully trying to fix an inference bug, and asked if they should post in the troubleshooting channel.
- They linked a Stack Overflow question but did not get any replies yet.
- MCP Server Stumbles on Claude Desktop: A user encountered an issue adding HF's MCP server to Claude Desktop on Windows, receiving a "'C:\Program' is not recognized as an internal or external command" error.
- They posted their configuration and logs, and others suggested checking path settings or authentication, but the user confirmed those were not the issue.
- Azure TTS streaming struggles: A member is building an AI agent and having issues with streaming and producing realtime speech using Azure Text-to-Speech services.
- They want the LLM and TTS models to run in parallel, but the `synthesizer.speak_text_async(data).get()` call blocks the process, so they asked for help with asynchronous programming.
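A common fix for a blocking call starving the event loop is to off-load it to a worker thread so other coroutines (such as LLM streaming) keep running. A minimal sketch; `blocking_synthesize` is a hypothetical stand-in for the blocking Azure call, not the actual SDK:

```python
import asyncio
import time

def blocking_synthesize(text: str) -> str:
    """Stand-in for a blocking TTS call like speak_text_async(...).get()."""
    time.sleep(0.1)  # simulates synthesis latency
    return f"audio:{text}"

async def speak_without_blocking(text: str) -> str:
    # Run the blocking call in a worker thread; the event loop stays free.
    return await asyncio.to_thread(blocking_synthesize, text)

async def main() -> list[str]:
    # Two synthesis calls proceed concurrently instead of serially.
    return await asyncio.gather(
        speak_without_blocking("hello"),
        speak_without_blocking("world"),
    )

results = asyncio.run(main())
```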
- Whisper Large v3 Turbo Encounters 504 Errors: A user reported that OpenAI's whisper large v3 turbo kept giving them 504 errors when using the Hugging Face API.
- While testing, the model displayed "failed to fetch", but it seemed to fix itself, with others crediting the infra team.
- Navigating the Synthetic Data Maze: A member is attempting synthetic data creation to expand their moral evaluation benchmark data, using Gemini 2.5 Flash model, but the prompts generated were tame.
- They are looking for ways to get more serious/edgier Q-A pairs, and seek model suggestions for writing/reasoning benchmarks that perform high but have low safety benchmarks; others suggested system prompts, open source frameworks, and also the synthetic data channel on the HF discord.
HuggingFace ▷ #cool-finds (3 messages):
HuggingFace Server, Piracy
- HuggingFace politely reminds user of server rules: A member was politely reminded that posts should be relevant to the server, and that the post belonged in another channel.
- They were also told that piracy is not allowed.
- Member apologizes for potential piracy post: A member apologized for potentially making a post that implied piracy.
- The member stated that they did not realize it was off topic.
HuggingFace ▷ #i-made-this (30 messages🔥):
Rust AI library, HuggingChat alternative, LLMs speak structured data, Godtier Prompts
- Rustacean Rises: New AI Library Emerges: A member is seeking constructive feedback on their open-source AI library written in Rust, which includes fully connected layers, convolutional layers, and maxpooling, leveraging multi-threading and SIMD, with a link to the GitHub repository.
- It is currently in its infancy and could benefit from some "constructive feedback".
- HuggingChat Shuts Down, Chat-UI Rises From Ashes: After HuggingChat was shut down, a member spun up a quick instance of chat-ui to access LLMs for free, running on free LLM APIs, available here.
- According to its creator, Mistral requests may be logged by Mistral due to their free tier privacy.
- LLMs Learn to Speak Structured Data, Code Included: A new post details how to get LLMs on Hugging Face to speak structured data, with code available on GitHub and a blog post on Medium.
- The project is called psychKG-pilot and is available for anyone to try.
- God-Tier Prompts Pool Shared: A member created a spot to share, discover, and surface the best prompts, available at godtierprompts.com.
- The goal is to make prompts more accessible and discoverable, a common problem with prompt engineering.
HuggingFace ▷ #NLP (16 messages🔥):
Cross Encoders, Asymmetric vs Symmetric Semantic Search, Thresholding with Cross Encoders, Bi-encoders and Cross-encoders
- Cross Encoders Better for Smaller Datasets?: Members discussed the use of cross-encoders, noting they perform optimally with query-document pairs but struggle with query-query or document-document scenarios; one suggested using this model trained for query-query.
- It was highlighted that detecting duplicate questions is a potential use case.
- Thresholding with Cross Encoders for Relevance: It was suggested that similarity thresholding is a viable method to determine document relevance with cross-encoders, and that while cross-encoders typically use a Sigmoid to map raw scores to 0…1, using `activation_fn=torch.nn.Identity()` might make thresholding easier on the raw scores.
- One member noted that due to Sigmoid providing relative results, it might not be the best approach, as relevance scores can vary between queries.
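Why the identity activation only moves the threshold: sigmoid is monotonic, so cutting probabilities at p0 keeps exactly the same documents as cutting raw scores at logit(p0). A small self-contained illustration (values are made up):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def logit(p: float) -> float:
    """Inverse of sigmoid."""
    return math.log(p / (1.0 - p))

# Thresholding probabilities at p0 == thresholding raw scores at logit(p0),
# because sigmoid preserves ordering.
raw_scores = [-2.0, 0.3, 1.5]
p0 = 0.6
kept_via_prob = [s for s in raw_scores if sigmoid(s) >= p0]
kept_via_raw = [s for s in raw_scores if s >= logit(p0)]
```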
- Bi-encoders vs Cross-encoders Discussed: It was mentioned that a common strategy involves using bi-encoders to narrow down the data points (e.g., top 100) and then using cross-encoders to reduce the results further (e.g., top 10).
- In the specific use case, the challenge lies in dynamically determining k documents out of the total documents, with the data scattered across multiple documents.
- The Problem of Dynamic K: One member described a scenario where the number of documents needed to answer a query is unknown, posing a challenge for selecting a fixed k value.
- They sought references to papers or documentation addressing this problem of dynamic K in document retrieval.
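One common heuristic for the dynamic-k problem is to cut the reranked list at a score floor or at a large score gap, rather than at a fixed k. A sketch with made-up thresholds (the function and its parameters are illustrative, not from a paper):

```python
def dynamic_k(scores: list[float], min_score: float = 0.5,
              max_gap: float = 0.3) -> int:
    """Choose k from reranker scores: keep documents while each score stays
    above min_score and the drop from the previous score is below max_gap."""
    ranked = sorted(scores, reverse=True)
    k = 0
    prev = None
    for s in ranked:
        if s < min_score:
            break  # score floor reached
        if prev is not None and prev - s > max_gap:
            break  # large gap suggests the relevant cluster has ended
        k += 1
        prev = s
    return k
```

Both thresholds would need tuning per corpus, since (as noted above) cross-encoder scores can vary between queries.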
HuggingFace ▷ #agents-course (16 messages🔥):
Hugging Face Inference Endpoints, Public Inference Endpoints, Generative AI Article, Unit 1 Course Certificate, Smolagents CodeAgent
- Hugging Face Inference Endpoints Are Down: Users reported that Hugging Face inference endpoints are currently down.
- This prevents finding the public endpoint for models like `meta-llama/Llama-3.3-70B-Instruct` listed in the dummy_agent_library.ipynb notebook.
- Where to Find Public Inference Endpoints: A user shared a link to find public inference endpoints: Hugging Face Models.
- However, it was noted that the endpoints might not be visible when they are down.
- Generative AI Article Shared: A member shared an article on Generative AI, exploring its impact on technology and providing practical insights for developers, available at Generative AI Article.
- The author invited feedback and questions from the community.
- Smolagents CodeAgent Might Be Lazy: A member shared experiences from a project, noting that when a supervisor is given a multitool like Smolagents CodeAgent, it might offload every problem to it unnecessarily.
- Additionally, the DuckDuckGoSearchTool is easy to use but heavily rate-limited.
- HfApiModel possibly deprecated: A member reported an `ImportError` for `HfApiModel` from `smolagents` version 1.19.0 and is seeking clarification on whether it is deprecated.
- They are considering using `InferenceClientModel` instead and asked for advice.
Eleuther ▷ #general (17 messages🔥):
Open Research Hackathon, Conference Travel Funding, Independent Research Mentoring
- Open Research Hackathon Approaching!: There is an Open Research Hackathon happening in August and they are still looking for community researchers to propose projects.
- Navigating Conference Funding for Independent Researchers: A member inquired about conference attendance requirements and funding options, given potential financial constraints.
- Others pointed out that conference organizers may offer opportunities to present online in rare cases if travel grants get rejected or a visa isn't approved, while some conferences have travel grants.
- Seek Mentoring to Start Independent Research: A member requested specific mentoring on how to start doing independent research.
- Another member, @seonresearch, responded that they should check their dms.
Eleuther ▷ #research (93 messages🔥🔥):
Open Research Hackathon, 1-layer transformer, KV caching, TinyStories paper, llama.cpp
- Reminder: Open Research Hackathon: There is an Open Research Hackathon happening in August and community researchers are welcome to propose projects.
- Topics of interest include the performance of 1-layer transformers, KV caching methods, and potential projects leveraging community research.
- KV Cache Explored: Inference Cost Reduction: Members discussed KV caching and the potential for cheap inference by storing Q, K, and V for each token (6d bytes per token in fp16) and applying RoPE on the fly.
- They referenced works, such as YOCO, using a shared KV cache between layers, and agreed that adjusting the base embedding then adding in an MLA style can optimize key and value matrices.
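The cache-size arithmetic above can be made concrete with a back-of-envelope helper. It assumes K and V (optionally Q as well) each store d_model values per token per layer, at 2 bytes per value in fp16; the example model dimensions are illustrative:

```python
def kv_cache_bytes(n_layers: int, d_model: int, seq_len: int,
                   bytes_per_value: int = 2, store_q: bool = False) -> int:
    """Back-of-envelope cache size: K and V (optionally Q) per token, per layer."""
    vectors = 3 if store_q else 2  # K, V (+ Q) each hold d_model values
    return vectors * d_model * bytes_per_value * n_layers * seq_len

# e.g. a hypothetical 32-layer, d_model=4096 model at 8k context in fp16:
gib = kv_cache_bytes(32, 4096, 8192) / 2**30  # 4.0 GiB
```

Shared-KV schemes like YOCO shrink the `n_layers` factor, which is why they were raised in the discussion.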
- Contrastive Flow Matching's Uselessness: A member found contrastive flow matching to be useless and algebraically equivalent to multiplying the targets by a constant factor and applying a loss weighting.
- They noted that multiplying the loss of every sample by the same constant will have no effect on training when using Adam, but scaling the loss is not the same as scaling the target.
- NN Distribution Modeling: GMM Head on Transformers?: A member inquired about good ways of having neural networks directly model a distribution besides diffusion/flow matching, particularly for continuous parameters in a physics-based dynamics model.
- They were considering using a GMM head on a transformer, but found the arbitrary nature of choosing K heads in GMMs inelegant.
Eleuther ▷ #interpretability-general (1 message):
Open Research Hackathon, Community research projects
- EleutherAI Hosts Open Research Hackathon in August: EleutherAI is hosting an Open Research Hackathon in August, inviting community researchers to propose projects.
- The hackathon aims to foster collaborative research and innovation within the EleutherAI community.
- Community Researchers Invited to Propose Projects: EleutherAI is seeking community researchers to propose projects for the Open Research Hackathon in August.
- This call for proposals encourages diverse participation and contributions to open research initiatives.
Eleuther ▷ #lm-thunderdome (21 messages🔥):
lm-evaluation-harness standardization, lm_eval init optimization, task discoverability, gpqa benchmark details, Optimizing lm_eval startup time
- LM Evaluation Harness Library Standardization in Progress!: The library is undergoing standardization to enhance intuitiveness, tracked via issues #3083, #3082, and #3081.
- The goal is to complete this month, with simplification of the init script and improved task discoverability as key focus areas.
- lm_eval startup time optimized: The startup time of `lm_eval -h` was significantly reduced using lazy loading and refactored imports; it went from ~9 seconds to 0.05 seconds.
- The improvements involved lazy-loading `simple_evaluate` and `evaluate` in `__init__.py` and moving `lm_eval` imports inside `cli_evaluate`, as highlighted in PEP 562.
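The PEP 562 mechanism referenced above is a module-level `__getattr__` that defers an expensive import until first attribute access. A self-contained sketch; the package layout and the `simple_evaluate` name mirror the discussion but are illustrative, not lm_eval's actual structure:

```python
# PEP 562 module-level lazy loading, demonstrated with a throwaway package.
import os, sys, tempfile, textwrap

pkg_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(pkg_dir, "lazy_pkg"))
with open(os.path.join(pkg_dir, "lazy_pkg", "__init__.py"), "w") as f:
    f.write(textwrap.dedent("""
        # Heavy submodule is imported only on first attribute access (PEP 562).
        def __getattr__(name):
            if name == "simple_evaluate":
                from . import evaluator          # deferred, expensive import
                return evaluator.simple_evaluate
            raise AttributeError(name)
    """))
with open(os.path.join(pkg_dir, "lazy_pkg", "evaluator.py"), "w") as f:
    f.write("def simple_evaluate():\n    return 'ran'\n")

sys.path.insert(0, pkg_dir)
import lazy_pkg                               # fast: evaluator.py not imported yet
assert "lazy_pkg.evaluator" not in sys.modules
print(lazy_pkg.simple_evaluate())             # first access triggers the import
```

`lm_eval -h` only needs argparse metadata, so deferring the evaluator imports this way is what cuts the cold start.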
- Inquiry into Prompting Method Details for GPQA Benchmarks: A user sought clarification on prompt formatting and model responses within GPQA benchmarks, particularly the extraction of answers from the model's responses.
- Specifically, they noted that the stored `resps` entry in the JSONL files for `cot_zeroshot` seems to reflect the initial prompt's response, differing from the final response used for answer extraction, referencing page 20 of the GPQA paper for context.
- Deeper Dive into GPQA's Regular ZeroShot and N-Shot Tasks: A user questioned the meaning of the `arguments` and `resps` entries in GPQA's zeroshot and n_shot tasks' JSONL outputs.
- They speculated about the log probabilities in the `resps` field and noted the absence of a `choices` entry in the `docs` for n_shot and zeroshot, unlike other subsets.
Eleuther ā· #multimodal-general (3 messages):
Kaiming's talk, Mean flow matching
- Kaiming He Discusses Mean Flow Matching: A member shared a YouTube video of a workshop and highlighted Kaiming He's talk starting at 2:22:01.
- The member noted that they especially liked Kaiming's description of mean flow matching, starting at 2:43:32 in the video.
- Workshop Video Shared: A member shared a YouTube video from a workshop.
- The user clarified that they were new to the channel, and asked if sharing video links was appropriate.
LM Studio ā· #general (65 messagesš„š„):
VPN setup for LM Studio, Serving LLMs without LM Studio UI, Trusting Hugging Face models, Running LM Studio headless, AnythingLLM mobile app
- VPN Vortex: Port Forwarding Predicaments!: Users discussed using a VPN like NordVPN to bypass port forwarding issues when hosting LM Studio from home, as some ISPs restrict port forwarding.
- One user reported using NordVPN for remote AI connections and even Steam gaming, praising its "good enough latency for gaming".
- LM Studio UI: Serving LLMs Server-Side!: Members suggested serving LLMs without the LM Studio UI by using tools like llama-cpp or llama-swap along with a frontend like OpenWebUI.
- They emphasized the need for GPU instances for performance and noted that LM Studio comes as a complete package with both server and UI components.
- Hugging Face: Trusting the Models!: The topic of verifying the trustworthiness of models from multiple uploaders on Hugging Face was discussed, with one member stating it is physically impossible for the models to "escape".
- One member suggested only pulling models from the LM Studio Community page to mitigate concerns about malicious code, explaining that "llm files u download are just huge chunks of inter-connected 'knowledge' basically".
- LM Studio Headless: Remote GPU Power!: Users explored running LM Studio headless on a remote server with GPUs and using the LM Studio UI on a workstation for processing.
- It was noted that LM Studio itself cannot connect to a remote LLM, and alternatives like OpenWebUI or AnythingLLM (as a chat frontend) are recommended, though one user noted "a major pain point in Anything LLM is that it doesn't show thinking", with another praising the integration of MCP.
- AnythingLLM: Mobile Marvels!: A user shared a link to the new AnythingLLM mobile app, potentially benefiting users awaiting mobile accessibility.
- Users also discussed context window percentages, with one explaining, "How full the context window is. Once full, the LLM will forget parts of the conversation", suggesting users click on the token counter to see exact usage.
LM Studio ā· #hardware-discussion (36 messagesš„):
GPU driver update, Shared VRAM, AMD vs Nvidia, Run 24B param on RTX 4080, Run LLAMA 3.3 70B
- GPU Driver Update Boosts Performance: A user reported that updating their GPU driver improved performance and allowed them to use more VRAM before experiencing crashes, processing 250k tokens in 4 hours after the update.
- The user also noted that keeping shared GPU usage low is key, with crashes occurring when VRAM usage exceeds 15.3GB/16GB.
- Shared VRAM Issue Debated: Users discussed the issue of shared VRAM on dGPUs, with one user suggesting it might be a Windows issue, while another stated they never experienced this on RTX cards.
- One user noted that using shared RAM can be faster than offloading layers to the CPU, depending on the amount of shared RAM and the PCIe version.
- AMD vs Nvidia GPU Debate: Users debated the merits of AMD versus Nvidia GPUs, with one user recommending Nvidia as the "it just works" solution for all AI use cases, with extra troubleshooting being the "price of using AMD".
- The poster noted that running on AMD is possible but not as well supported as Nvidia.
- 70B LLAMA 3.3 too big for 16GB VRAM: A user asked about running LLAMA 3.3 70B Q3/Q4/Q5/Q6 on an RTX 4080 with 16GB VRAM and 32GB RAM.
- Other users advised that the model is too large, with one user suggesting using a good 14B model instead, with a GIF expressing the futility of the attempt.
- Model loading fails in Linux: A user reported a "Failed to load model" error on Linux (Kubuntu 25), while the model loads successfully on Windows.
- They were unable to initialize the context because it failed to allocate a buffer for the KV cache, and asked for help.
GPU MODE ā· #general (18 messagesš„):
Industrial PhD in Denmark, SWE vs MLE role, Work-life balance in Europe, Pursuing CUDA, Perfectionist mindset
- Industrial PhD in Denmark Considered: A member is considering an opportunity to do an industrial PhD in Denmark focused on LLM inference optimizations and is seeking advice on trade-offs between staying in the US vs Europe.
- They are interested in the work-life balance, research quality, and growth prospects, as well as compensation ranges and personal fulfillment.
- SWE to MLE role provides better work-life balance: A member shared their experience of switching from a SWE role to an MLE role and finding it more fulfilling with a better work-life balance.
- They cited interaction with end-users as a key factor and linked to a blog post on feature factory environments.
- European PhD not worth it?: A member advised that a PhD in Europe is only worth doing if it's in a top lab under a famous professor, emphasizing the importance of working with a well-respected professor and publishing in top conferences.
- They cautioned against the potential for ending up in a series of low-quality postdocs with poor pay and suggested focusing on finding an ML engineering position in US tech instead.
- Burnout avoided by not doing a PhD?: A member cautioned against viewing a PhD as a softer path, describing it as a competition to figure things out before anyone else.
- They shared their own experience of working non-stop during their PhD, leading to health issues and hospitalization due to the stress of publishing papers, dealing with reviewers, and managing research and teaching.
- CUDA learning sought: A member inquired about learning CUDA and sought someone to discuss it with.
- Another member advised asking questions publicly in the forum.
GPU MODE ā· #triton (14 messagesš„):
Torch Compile, Autotuning, PTX, SASS, CUDA
- Torch Compile's Inductor Back End: Torch.compile's inductor backend is highly optimized because Dynamo traces Python into an FX graph, and the inductor then fuses ops, pre-packs weights (e.g. for GEMM), and emits device-specific Triton or CUDA code with profile-guided tiling and scheduling heuristics.
- This means every kernel is shape-specialized, weight-prepped, and hand-optimized for your GPU from the get-go, whereas a handwritten Triton kernel often relies on fixed block sizes and simpler heuristics that you'd have to autotune yourself.
- Torch Compile is AOT compiled: Torch.compile is AOT compiled, as opposed to Triton, which is JIT compiled; torch.compile therefore triggers Triton's JIT during the AOT compilation phase, so there is no runtime compilation overhead, assuming no graph breaks.
- Autotuning: A developer added autotuning to their code; after autotuning, the handwritten kernel runs as fast as the Triton kernel without needing to warm up after the initial autotune.
- PTX to SASS: One user wanted to look at what the triton code maps to in SASS, not looking at PTX.
GPU MODE ā· #cuda (6 messages):
Kernel Benchmarking for LLM Inference, Warm-up Iterations for Kernel Benchmarking, Compiler Explorer's NVCC Support Delay, PTX Instruction Availability
- Debate heats up over Warm-up Iterations for LLM Kernel Benchmarking: A member inquired about whether to use warm-up iterations when benchmarking a custom kernel for LLM inference latency.
- Another member asked about the different overheads that can be avoided/minimized by doing warmup and whether those overheads also occur during inference.
- Deep Dive on Compiler Explorer's NVCC Support Delay: A member asked why Compiler Explorer often takes a long time to add support for the latest NVCC, linking to two NVIDIA GTC talks on inference warmup and optimizing deep learning inference.
- Turns out there is someone regularly adding the newest HPC Toolkits and with that also the corresponding CTK, but for 12.9 there was some kind of failure and Matt Godbolt himself commented it out on Github.
- New PTX Instruction Sparks Excitement: A member expressed excitement about a PTX instruction (ld.global.L2::evict_last.v8.f32) added in PTX 8.8/NVCC 12.9 for global memory copy optimization.
- They noted itās still unavailable on Compiler Explorer.
GPU MODE ā· #cool-links (1 messages):
simon_57893: https://semianalysis.com/2025/07/03/deepseek-debrief-128-days-later/
GPU MODE ā· #beginner (16 messagesš„):
Oldest GPU for beginners, Compute vs Memory bound kernel, Renting GPU time, Second hand RTX 3060
- Oldest GPU card for PMPP exercises: A member inquired about the oldest GPU suitable for beginners doing exercises and implementing newer algorithms like attention, considering a GTX 1650 Super or a GTX 1070Ti Mini.
- Another member mentioned the GTX 1650 Super has Turing architecture (compute capability 7.5) but it is an entry level card, recommending a second-hand RTX 3060 with 12GB if the budget allows.
- Debating Compute vs Memory bound kernels: A member inquired about whether a memory bound kernel of the same program can be faster than a compute bound kernel.
- Another member clarified that you want a kernel whose arithmetic intensity matches your target hardware's compute-to-bandwidth ratio, so you won't be bottlenecked by compute or memory transfer, citing early crypto mining algorithms as largely compute bound.
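That rule of thumb can be made concrete with a roofline-style check: a kernel is memory-bound when its FLOPs-per-byte falls below the hardware's peak-FLOPs divided by peak-bandwidth ridge point. A sketch with illustrative, order-of-magnitude hardware numbers, not exact specs:

```python
def bound(flops, bytes_moved, peak_flops, peak_bw):
    intensity = flops / bytes_moved   # FLOPs per byte
    ridge = peak_flops / peak_bw      # hardware balance point (FLOPs/byte)
    return "compute-bound" if intensity > ridge else "memory-bound"

# Illustrative numbers only (roughly A100-class: ~19.5 TFLOP/s fp32, ~1.5 TB/s HBM).
peak_flops, peak_bw = 19.5e12, 1.5e12
# fp32 vector add: 1 FLOP per 12 bytes moved (two loads + one store)
print(bound(1, 12, peak_flops, peak_bw))
# a large GEMM tile: thousands of FLOPs per byte
print(bound(4096, 12, peak_flops, peak_bw))
```

The same program can thus be memory-bound or compute-bound depending on how much reuse the kernel extracts per byte loaded.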
- Renting GPU Time as a Budget Alternative: A member suggested renting GPU time as an alternative, recommending Lightning which costs $0.28/hour for a T4 GPU, with $15 free per month.
GPU MODE ā· #torchao (4 messages):
torch.distributed.checkpoint.StateDictOptions, Sharded Parameters, Dtensor
- State Dictionaries using torch.distributed.checkpoint.StateDictOptions: A member pointed out that the intended way to load a full state dict is via `torch.distributed.checkpoint.StateDictOptions`.
- Another member confirmed that they specify that the state dict is full and should be broadcasted.
- Sharded vs. Unsharded Parameters Debate: A member mentioned that some original parameters are dtensors (sharded) while others are unsharded, which disrupts the internal logic.
- The member believes that editing unsharded parameters is problematic because they are discarded during resharding, at least in torch 2.7.1.
GPU MODE ā· #rocm (6 messages):
Register lifetime, Avoiding register spills, Inline ASM, Kernel hacking, Compiler optimization
- Assembly Dump to the Rescue: A member suggests that the best way to recognize the lifetime of registers and avoid random register spills is by dumping the assembly and using inline ASM.
- They added that fighting the compiler is inevitable when aiming for optimal performance, especially with spills.
- Scooby Doo Mystery: Compiler Optimization: A member shared that in their experience a large part of register lifetime is specific constructions that the compiler is bad at optimizing.
- They usually have a decent idea of what the code should look like, and if it doesn't look like that they go on a scooby doo mystery.
GPU MODE ā· #self-promotion (1 messages):
CuTeDSL, WGMMA, TMA, Hopper Architecture, TV-Layouts
- CuTeDSL blogpost explains WGMMA and TMA atoms: A new blogpost, CuTeDSL on H100 - Understand WGMMA and TMA atoms in CuTeDSL, aims to explain WGMMA and TMA concepts for leveraging Hopper's full potential.
- The blogpost series derives TV-Layouts for WGMMA instructions and explains the compositional logic used to obtain swizzled Layouts for the TMA unit.
- CUTLASS Example: The post references a relevant example in the CUTLASS repository.
- The example can be found here: dense_gemm.py.
- NVIDIA Docs for WGMMA and TMA: The post references the PTX docs for WGMMA and the CUDA C++ docs for TMA.
GPU MODE ā· #šæ (1 messages):
Project Popcorn, Weights&Biases conference, PyTorch PM
- Popcorn Project Popping Off: The community is noticing that Project Popcorn continues to gain traction after being discussed in a recent Weights&Biases conference.
- PyTorch PMās Vision for Popcorn: A PyTorch PM spoke at length about the goals of the project.
- This development puts more responsibility on the community now.
GPU MODE ā· #gpuęØ”å¼ (1 messages):
- leung3035: As a Cangzhou native, I can responsibly tell you: Cangzhou doesn't have many places to visit, and as for the food, forget it; it's basically a food desert.
GPU MODE ā· #general-leaderboard (1 messages):
Handling Missing Data, Replacing Zero Values with Mean, Dropping Rows, Handling missing data
- Strategies to Handle Missing Data: Members seek advice on handling missing data, specifically about replacing zero values with the mean or simply dropping the row.
- They're curious about methods to decide which is best for a given scenario.
- Choosing the Right Imputation Method: The conversation revolves around the best approaches for dealing with missing data in datasets.
- Specifically, thereās interest in whether to replace missing values with the mean or to drop rows containing missing data.
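The two options under discussion can be sketched in a few lines of plain Python, standing in for whatever dataframe library is actually in use:

```python
# Two common treatments of missing values (here represented as None).
data = [4.0, None, 6.0, None, 8.0]

# Option 1: drop rows with missing values.
dropped = [x for x in data if x is not None]

# Option 2: impute missing values with the mean of the observed values.
mean = sum(dropped) / len(dropped)
imputed = [x if x is not None else mean for x in data]

print(dropped)   # [4.0, 6.0, 8.0]
print(imputed)   # [4.0, 6.0, 6.0, 6.0, 8.0]
```

Dropping shrinks the sample and can bias it if missingness isn't random; mean imputation keeps the rows but shrinks the variance, which is the usual trade-off being asked about.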
GPU MODE ā· #submissions (1 messages):
Leaderboard Submission, A100 performance
- New Trimul Leaderboard Submission!: A member achieved 5th place on the `trimul` leaderboard using an A100 with a timing of 22.2 ms.
- A100 Smashes Leaderboard: An A100 achieved a 22.2 ms timing to take 5th place on the Trimul leaderboard.
GPU MODE ā· #factorio-learning-env (3 messages):
Factorio Client Desync Logs, Github review interface
- Desync Logs Sought for Factorio Fixes: A user requested client desync logs from two other users, potentially for review by a Factorio developer friend to fix issues.
- Another user confirmed they forwarded the logs, though no public link to a bug forum was provided in the messages.
- Github Review Interface discussion initiated.: Users initiated a discussion about the Github Review Interface.
- This discussion may be pertinent to developers seeking feedback on code changes, to improve the development experience.
GPU MODE ā· #amd-competition (1 messages):
MI300 Access, Competition Leaderboard Resources
- MI300 Real Access Investigated: A participant inquired whether top competitors had "real access" to an MI300 during the competition.
- They were curious if sub-200us performance on `amd-fp8-mm` was achievable using only competition leaderboard resources, without profiling or streamlined access.
- Leaderboard-Only Achievement Feasibility: The participant specifically asked if achieving sub-200us on `amd-fp8-mm` was possible solely using the competition leaderboard.
- This implies interest in whether optimization without direct profiling access was sufficient for top performance.
GPU MODE ā· #cutlass (9 messagesš„):
Cutlass Analytical Cost Model, GEMM Kernels, cuBLASLt Heuristics, Claude Code CLI for Cutlass, PyTorch Autotuner Model
- Cutlass chases Analytical Cost Model: Cutlass is actively investigating an analytical cost model based kernel selection, aiming for a release this year.
- A member mentioned that although Cutlass's metaprogramming-friendly architecture is supposed to enable this, autotuning is still the only real way, though heuristics can prune the search space.
- cuBLASLt Optimizes for Heuristics Cache: A member questioned what cuBLASLt does, noting that the docs talk about optimizing for the heuristics cache, but in practice seemed to mostly choose Split-K vs non-Split-K.
- They thought CUTLASS has a massive advantage with full autotuning, but it'd be nice to get most of the way there with an analytical model.
- Claude Codes Fastest Cutlass Settings: A member asked Claude Code CLI to find the fastest settings for a problem shape in the grouped_gemm.py example, achieving approximately 96% MFU.
- PyTorch Autotuner Launches Simple Model: A member mentioned working on shipping an autotuner in PyTorch thatās a simple 400K parameter model.
- CuTe 4.0 DSL Missing MX Dtypes: A member asked whether the CuTe 4.0 Python DSL supports MX dtypes, referencing a line in the CuTeDSL base_dsl typing.py.
GPU MODE ā· #singularity-systems (2 messages):
c/cuda c compiler, codegen, instruction selection, instruction scheduling, register allocation
- New c/cuda c compiler project kicks off: A member is starting a new project to create a c/cuda c compiler and is looking for contributors with experience in codegen.
- They linked to the main pipeline for the AST compiler here.
- Contributor Callout for c/cuda compiler: An individual is seeking collaborators with expertise in codegen, including instruction selection, instruction scheduling, and register allocation, to help develop a new c/cuda c compiler.
- The initial AST compiler pipeline is available at https://github.com/j4orz/picoc/blob/master/src/ast/mod.rs for reference.
Nous Research AI ā· #general (46 messagesš„):
Open Source Industry Dying, Nous Research's Open Source Commitment, Meta's Open Source Future, Rejection Sampling definition, Llama 4 Failure
- Open Source Industry Might Be Dying: Some members discussed whether the open source industry might be dying, citing current difficulties, with the exception of China.
- Others noted that OpenAI might release open models ironically, despite the overall trend.
- Nous Stays Fully Open: Nous Research remains committed to being fully open, with Hermes 3 dataset, reject sampling RL environment datasets, and Hermes 4 in the pipeline.
- Rejection sampling involves generating responses from the model and using a reward model or verifier to select samples for supervised fine-tuning (SFT).
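The rejection-sampling recipe described above can be sketched in a few lines; `generate` and `score` below are stand-in stubs, not a real model or reward model:

```python
# Minimal rejection-sampling sketch for building SFT data: sample N candidate
# responses per prompt, keep only those the verifier/reward model accepts.
import random

def generate(prompt, n, seed=0):
    # Stub generator: fabricates n candidate responses.
    rng = random.Random(seed)
    return [f"{prompt} -> answer {rng.randint(0, 9)}" for _ in range(n)]

def score(response):
    # Stub verifier: accepts responses ending in an even digit.
    return int(response.rsplit(" ", 1)[-1]) % 2 == 0

def rejection_sample(prompt, n=8):
    # Accepted responses become SFT training examples for this prompt.
    return [r for r in generate(prompt, n) if score(r)]

samples = rejection_sample("2+2?")
print(len(samples), "accepted of 8")
```

In practice the verifier is a reward model or programmatic checker, and the accepted (prompt, response) pairs are aggregated across many prompts into the SFT dataset.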
- Meta to abandon open source models?: Members expressed concern about Meta potentially abandoning open source models for a closed source approach.
- They emphasized the importance of Nous Research continuing to champion open source, especially if Meta shifts strategy.
- Llama 4 failure?: Some members speculated that Llama 4 was a relative failure and that Meta might skip it in favor of Llama 5 or another successor.
- Additionally, others noted that Nvidia axes fp64/int8 for the B300 series.
- Thinking Lengths by AJ Kourabi: Some members shared insights into thinking lengths using a link to Twitter.
- One member noted that Claude is the only model that returns the length of the transcribed CoT instead of the number of tokens of the real CoT.
Nous Research AI ā· #research-papers (5 messages):
Independent Research Mentoring, Reproducing Research Results
- Seek Mentorship for Independent Research: A member sought mentorship to start doing independent research.
- This member specified they are not-an-absolute beginner.
- Repeat Reproducing Research Results: A member suggested reading papers, reproducing their results, and repeating the process.
- The member simply wrote read papers, reproduce their results, rinse and repeat.
Nous Research AI ā· #interesting-links (4 messages):
Symbolic Intelligence architecture, AREU Codex framework, Interpretability and alignment, Narrative Destabilization
- AREU Codex Framework Proposes Novel Alignment Architecture: A conceptual framework called AREU Codex models human-LLM interaction and civilization-scale feedback loops as recursive symbolic traps, proposing an alternative host architecture based on ego collapse, mirror integrity, and narrative destabilization.
- The framework aims to improve interpretability and alignment through symbolic-layer modeling, robustness to multi-narrative and contradictory signal environments, and psychological resilience in the presence of emergent behaviors and feedback loops.
- Call for Critique on Symbolic Intelligence Alignment Approach: The author is seeking feedback on a conceptual Codex exploring how humans and AI might fail or succeed to stay aligned in complex, contradictory, high-signal environments.
- They are particularly interested in whether others have seen similar ideas or would like to exchange references and critique, offering the full draft and details via DM.
Nous Research AI ā· #research-papers (5 messages):
Independent Research Mentoring, Reproducing Research Results
- Newbie Seeks Mentoring on Independent Research: A member, described as not-an-absolute beginner, is seeking specific mentoring on how to start doing independent research.
- The member requested interested mentors to DM them.
- Reproduce Results as Key to Research: A member suggested the key to learning independent research is to read papers, reproduce their results, and repeat.
- This advice emphasizes hands-on experience and validation of existing work as a foundation for further exploration.
Latent Space ā· #ai-general-chat (47 messagesš„):
Anthropic Experimental APIs, Microsoft Layoffs, DeepSWE RL Agent, Chamath Palihapitiya & Tobi Lütke on AI, GPT for summarizing news
- Nuttall gets Keys to Anthropicās Prompt APIs: Ian Nuttall announced he has gained access to Anthropicās experimental prompt generator and improvement APIs and asked for ideas on what to build - see his tweet.
- Other users expressed envy, asked about getting access, and suggested ideas such as an AI agent for talking to the endpoint, and a tool to organize and analyze user-generated content from various devices using AI.
- Microsoft Cuts 9,000 Roles: Microsoft is reportedly laying off 9,000 workers, sparking discussions from the role of AI in job displacement to the broader economic impact, as seen in this tweet.
- Some wondered if this was just the normal ebb and flow of large companies going through purging cycles, as the people I know there have been feeling very unsteady with layoffs for the last year.
- Agentica releases DeepSWE open source RL Agent: Agentica introduces DeepSWE, a new open-source software engineering agent trained purely with Reinforcement Learning (RL) on Qwen3-32B, as introduced in this tweet.
- DeepSWE achieved 59% on SWEBench-Verified and 42.2% Pass@1, leading open-weight models and is a collaboration between Agentica and Together AI.
- Tobi and Chamath Talk AI and the Great Societal Refactor: Chamath Palihapitiya and Tobi Lütke discussed AI, internal tools, energy, and the systemic rebuild of society over the next 50 years at Toronto Tech Week, see the tweet.
- Key topics included AI and the OSI Model, the Software Industrial Complex, the case for internal tools, Shopify's AI memo, AI infrastructure, power and productivity, staying technical, Canada's potential, and the "Mouse Experiment" (Power of Hope).
- Gross Exits from SSI startup: Daniel Gross tweeted that he assisted in getting SSI off the ground and expects "miracles to follow" in this tweet.
- This tweet appears to be a response to Ilya Sutskever's message to the SSI team and investors, announcing Daniel Gross's official departure from SSI as of June 29th, and naming Ilya as the formal CEO and Daniel Levy as President.
MCP (Glama) ā· #general (44 messagesš„):
MCP as the Application, MCP servers, Resources and Prompts in MCP, Connecting to MCP servers, Remote MCP server issue
- Mulling MCP: More than Mere Integrations: A member wondered if MCP servers could be the application itself, not just integrations, with agentic workflows and prompt engineering built in for tools like display-restaurants.
- Another member pointed out that this sounds like APIs, questioning if the community is already overcomplicating things.
- Navigating Networked Nuances: MCP Servers Calling Servers: One member suggested the possibility of an MCP call triggering other MCP servers, raising concerns about potential hallucinations.
- Another proposed an MCP-Routing layer to manage context windows, such as trimming AWS EventBridge Scheduler docs for Claude Chat.
- Resources vs. Tools: A Matter of Control: The difference between Resources and Tools lies in the locus of control: Resources are controlled by the application, while Tools are controlled by the LLM.
- One member disagreed about prompts, suggesting that servers should distribute good prompts for specific use cases, as they are not different from any other code.
- Systemd Snafu: Subprocesses Save the Day: One member discovered that running MCP servers as systemd services was unnecessary, as they can be activated as subprocesses by the client.
- This realization came after working on setting up MCP servers to be systemd services and figuring out how to connect to them remotely over the LAN.
- Ngrok's Networking Ninja: Remote MCP Server Mystery: A user encountered an issue where a remote MCP server failed to integrate with Claude.ai, despite the `/.well-known/oauth-authorization-server` request reaching the server.
- Strangely, using an ngrok URL as a proxy resolved the problem, even though request headers appeared identical, leaving the root cause a mystery.
MCP (Glama) ā· #showcase (2 messages):
Hypermode Agents Bootcamp, Agent Sandboxing Marketplace
- Hypermode Agents Bootcamp Kicks Off: The team has announced the kickoff of a 30-day Agents Bootcamp designed to help individuals transition from being merely agent-curious to becoming proficient agent builders using Hypermode Agents, detailed in their official documentation.
- They are actively soliciting feedback, particularly regarding the types of agents people are interested in building and which MCP servers should be featured.
- Agent Marketplace Sandboxing Solution Emerges: A member is developing a sandboxing solution centered around a marketplace, featuring a meta MCP for orchestration and monitoring, showcased in an early beta video.
- Early insights and feedback on this project would be greatly appreciated by the developer.
Yannick Kilcher ā· #general (40 messagesš„):
LSTM comeback, Universal Function Approximators, Semantic Search with Cross Encoders, Diffusion-based VLMs, Tokenizer Rebalancing
- LSTMās Existing Literature Aids Comparison: Despite the trend towards newer architectures, one member notes that LSTMs are valuable due to the extensive existing literature, making comparisons easier.
- Dense Architectures as Universal Function Approximators: A member posits that at modern scales, for dense feed forward architectures, the actual arch doesn't matter because they're all universal function approximators, referencing this paper.
- Cross Encoders for Semantic Search: Someone inquired about using cross encoders, optimized for asymmetric semantic search, in symmetric semantic search scenarios, and how to set a threshold for selecting top passages, given their effectiveness on smaller datasets.
- They also noted that with cosine similarity they used to set a threshold of around 0.7.
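The cosine-threshold baseline mentioned above is straightforward to sketch; the vectors here are toy embeddings, and 0.7 is the threshold from the discussion (note this applies to bi-encoder similarity scores, since cross-encoders output relevance scores rather than cosine similarities):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

query = [1.0, 0.0, 1.0]
passages = {"p1": [1.0, 0.1, 0.9], "p2": [0.0, 1.0, 0.0]}

# Keep only passages whose similarity to the query clears the threshold.
hits = {k: cosine(query, v) for k, v in passages.items()}
selected = [k for k, s in hits.items() if s >= 0.7]
print(selected)
```

With a cross-encoder, the analogous move is thresholding (or top-k truncating) its raw relevance scores, which is harder to calibrate because the scores aren't bounded like cosine similarity.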
- Diffusion Models and Vision Language Models: One member mentioned interesting work happening in VLMs with the rise of diffusion based language models, linking to this paper on diffusion based vision language model.
- LLMs Tokenization needs Rebalancing: One member shared this paper as a gem and talked about rebalancing tokenizers as well as a tool to score training data importance based on token overlap complexity.
Yannick Kilcher ā· #paper-discussion (3 messages):
Linear Transformers, Delta Rule, RWKV Optimization
- Delta Rule to Parallelize Linear Transformers: A paper discussion was planned for Parallelizing Linear Transformers with the Delta Rule over Sequence Length (link to paper), with focus on understanding parallelization of eq 18 from the RWKV-7 paper.
- DeltaNet Scales Up: The DeltaNet model, a type of linear transformer using the delta rule, can be scaled to standard language modeling settings using a hardware-efficient algorithm that computes products of Householder matrices.
- A 1.3B parameter model trained on 100B tokens outperforms recent linear-time baselines like Mamba and GLA in perplexity and zero-shot performance, according to the paper.
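The delta-rule recurrence DeltaNet parallelizes can be written as a rank-1, Householder-like update of the state matrix; a sketch of the standard form (notation may differ slightly from the paper's):

```latex
S_t = S_{t-1}\bigl(I - \beta_t\, k_t k_t^{\top}\bigr) + \beta_t\, v_t k_t^{\top},
\qquad o_t = S_t\, q_t
```

Unrolling over a chunk of timesteps composes a product of these Householder-like factors, which is the product the hardware-efficient algorithm computes in parallel across the sequence.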
- Discussion Gets Postponed: The original discussion of the DeltaRule paper was postponed due to a conflict, with plans to reschedule it for the following day.
Yannick Kilcher ā· #ml-news (2 messages):
The Atlantic, Eleven Labs
- Atlantic trials ElevenLabs for Voice: The Atlantic is using ElevenLabs to voice their articles, as exemplified by this audio file for an article titled "Customer Service Sludge".
- The article link is here.
- The Atlantic Experiments with AI Voice: The Atlantic is experimenting with AI-generated voiceovers using ElevenLabs for some of its articles.
- This initiative aims to provide an audio version of their content, enhancing accessibility for readers who prefer listening over reading.
Notebook LM ā· #use-cases (14 messagesš„):
NotebookLM setup, Readwise style workflow, NotebookLM audio overview function, interactive PDF mind maps
- Brainstorming NotebookLM Setups: A user is setting up NotebookLM for a personal journal (reflections, media logs, chats) and a searchable notes database (articles, ideas, reference material), prioritizing privacy and data control.
- They considered using Google Docs as the single source of truth but are exploring alternative input methods for a resilient and easy-to-maintain system.
- Readwise-style workflow incoming?: A user inquired about a Readwise-style workflow to automatically add sources to NotebookLM for daily digests of news.
- No concrete solutions were shared in the channel.
- Audio Overviews: One user shared that they use NotebookLM mostly to help with explaining their work-in-progress books, especially using the audio overview function.
- Other users are using audio to bypass length limits by prompting with "comprehensive super-podcast drawn from the entire source. NO MATTER how long the audio generated will be".
- Interactive Mind Map PDFs: A user wants an interactive PDF version of the mind maps to share with their audience, but the current printing solution (ctrl + p) doesn't generate it.
- Another user suggested using the share button in the top-right corner to share the Notebook directly, or downloading the picture.
Notebook LM ā· #general (15 messagesš„):
Edit capability request, NotebookLM access issues, Combine Notebooks, Family Plan Limits, Latex rendering
- Edit Capability Feature Request Bumps!: Users are requesting the ability to edit in NotebookLM and requesting feature requests for edit capability.
- One user is very annoyed, saying it'll get fixed the moment it generates an Avocado in the wrong context.
- Users Ask how to Merge Notebooks!: A user asks about combining all the notebooks theyāve made into one to review for a final.
- Another user suggested to try adding all your sources into one notebook.
- Family Plan Limit Questions!: A user asks if, when upgrading their plan, the increased limits for NotebookLM extend to members of their family group.
- They say they have read multiple help pages, and it is still super unclear and that answers about what extends to family members seems to have changed recently.
- Latex Rendering Lagging?: A user asks if Latex rendering in NotebookLM answers is still not a thing.
- There was no update given in the messages.
- Gemini Model Use Question: A user asked which Gemini model NotebookLM currently uses.
Modular (Mojo š„) ā· #general (7 messages):
Modular Customers, Native Network Programming, GPU HTTP server
- Modular Teases Customer Stories: Members noted that folks are starting to tell their stories, with more things that arenāt public yet, linking to the Modular Customers page.
- The page has a few customer stories and case studies, highlighting how companies are leveraging Modularās technologies.
- Native Network Programming Delayed for Concurrency Model: The team confirmed that native network programming interface in Mojo is going to be a bit delayed to figure out the concurrency model, threads, async and allocators first, linking to a very early version proposal.
- They are prioritizing figuring out the concurrency model with threads, async, and allocators.
- Mojo Eyes GPU-Based HTTP Servers: Modular is exploring running an HTTP server from a GPU entirely without a host CPU.
- While acknowledging the significant investment in GPUs, they aim to tackle new challenges such as minimizing CPU usage, even for tasks like booting up the GPU.
Modular (Mojo š„) ā· #mojo (17 messagesš„):
Dependent Type Systems in Mojo, NumPy Array Conversion to LayoutTensor, ExtraMojo Package for I/O, Mojo Compiler Hanging Issue, UnsafePointer.alloc Alignment
- Mojo Eyes Dependent Types with Caveats: Mojo is exploring a dependent type system, mindful of compile-time check constraints and feasibility for systems programming, contrasting with Idris, Agda, and Rocqās runtime checks, as discussed in this paper.
- The goal is to balance advanced features with reasonable compile times (30 million lines of code) and runtime performance, drawing from a different body of work than Rust for ownership.
- NumPy Array Conversion Troubles Solved!: Users were struggling with the complicated I/O and inability to convert a NumPy array to a LayoutTensor or buffer directly within Mojo.
- A member suggested using the underlying NumPy pointer, `node_argmax.ctypes.data.unsafe_get_as_pointer[DType.uint64]()`, to feed into a `LayoutTensor` with the correct shape, and another member confirmed this was helpful!
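The NumPy side of this trick can be sketched in Python. `node_argmax` below is a hypothetical stand-in for the array from the discussion; `ndarray.ctypes.data` exposes the raw buffer address that Mojo's `unsafe_get_as_pointer` reads, and the round-trip shows it really is a zero-copy view of the same memory:

```python
import ctypes
import numpy as np

# Hypothetical stand-in for the array discussed in the channel.
node_argmax = np.arange(12, dtype=np.uint64).reshape(3, 4)

# .ctypes.data is the integer address of the underlying C buffer;
# this is the address Mojo's unsafe_get_as_pointer[DType.uint64]() would read.
addr = node_argmax.ctypes.data

# Reinterpreting that address (still from Python) round-trips the data,
# which is the zero-copy view a LayoutTensor with the correct shape builds on.
buf = (ctypes.c_uint64 * node_argmax.size).from_address(addr)
view = np.frombuffer(buf, dtype=np.uint64).reshape(node_argmax.shape)

assert (view == node_argmax).all()
```

The pointer is only valid while `node_argmax` stays alive, so any Mojo-side tensor built on it must not outlive the Python array.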
- ExtraMojo to the Rescue for I/O Tasks: For those struggling with I/O, a member shared a helper library called ExtraMojo with a gist demonstrating how to read a delimited file.
- Another member also shared a hacky path using Mojo to NumPy, using this example code.
- Compiler Hangs on CNN Method: A user reported that the compiler appeared to be hanging during a specific function call, `accumulateFromOther`, within their CNN implementation.
- The user requested assistance, but no specific solution or cause was identified in the snippet from the channel.
Modular (Mojo š„) ā· #max (4 messages):
Modular Max Offline Inference, Quantization Encoding, Apple MLX Support
- Offline Inference Troubles with Quantization Encoding: A user encountered a `ValueError` when trying to use Q6_K quantization with the llm4decompile-1.3b-v1.5 model, despite documentation suggesting itās supported, using Modular Max for offline inference.
- Removing the quantization parameter resulted in another `ValueError` related to encoding incompatibility with the CPU, and the user found that it works on the stable version of `max` with quantization specified.
- Apple Silicon MLX Integration: A user inquired about compiling `max` into Apple native tensors (MLX) to potentially replace learning MLX for performance gains on Apple hardware.
- Another user responded that Apple GPUs are not supported yet, but that Modularās team is interested in supporting Apple GPUs in the future.
Cohere ā· #š§µ-general-thread (11 messagesš„):
ML Summer School Channel, Cohere Labs Open Weights, AYA Vision Models, ML Summer School Recordings
- Members Seek Elusive ML Summer School Channel: Members are looking for the #ml-summer-school channel as advertised here.
- Other members pointed them to the Cohere Labs discord and sign-up guides.
- Cohere Labs Shares Open Weights: A member shared a link to Cohere Labsā open weights on Hugging Face: c4ai-command-a-03-2025.
- They advised to check our cohere labs for open weights instead of cohere repo.
- AYA Vision Models Released: A member announced the release of AYA vision models with a link to the Cohere blog post.
- A member reacted with šš.
- ML Summer School Recordings Now Available: Members shared the link to the ML Summer School recordings on YouTube: ML Summer School recordings.
- One user shared it with ā¤ļøā¤ļø.
Cohere ā· #š-api-discussions (4 messages):
Cohere Embedding Model, Trial key, Rate Limits, Production Keys, Monthly Limits
- Embeddings Accessible via Trial Keys: The Cohere embedding model can be used with a trial key, though rate limits are more restrictive.
- Trial keys and production keys offer the same features, with the distinction being a monthly limit on the free trial key.
- Trial Keys have Monthly Limits: Trial keys are free but have a monthly usage limit.
- Production keys offer higher usage limits, suitable for more demanding applications.
Cohere ā· #š-introduce-yourself (7 messages):
Cohere Summer School, New member introductions, Support channels, Community Discord Server
- Member seeks Cohere Summer School Channel: A new member inquired about the #ml-summer-school channel and whether they were in the correct place, referencing the Cohere Labs Community page.
- Fullstack AI Developer Joins: Khanh Tran, a Senior Fullstack & AI Developer, introduced themselves with over 8 years of experience, highlighting their expertise in ASP.NET, Blazor, Node.js, and Python/Django, along with databases like PostgreSQL, MySQL, Supabase, and Firebase.
- They specialize in RAG pipelines, custom agents, and integrating AI with LangChain, LangGraph, LLMs, and vector databases such as Pinecone, Weaviate, and Qdrant, and are open to freelance or contract opportunities.
- NLP Engineer Joins Community: Inacio, an NLP engineer and researcher, introduced themselves, mentioning their work at Alpha CRC involving machine-translation evaluation and adaptation, including fine-tuning Llama 3.1 8B models.
- With an MSc in Computing from Dublin City University, they are interested in cross-lingual robustness, evaluation methodology, and efficient adaptation techniques, and their thesis was published at AMTA 2024.
- Cohere Support Channels Highlighted: Cohereās Technical Support Engineer, Varun, welcomed new members and directed them to specific support channels, mentioning new members fit right in at Cohere Labs.
- For general support, members should post in <#1324436975436038184>, and for API-specific discussions, they should head to <#1168578329423642786> and encourages everyone to visit Cohere Research to join the Discord community.
aider (Paul Gauthier) ā· #general (9 messagesš„):
Claude overloaded, Polyglot benchmark speed, Gemini-cli performance, API token rate limits
- Claude overloaded, running slow: Members reported that Claude was overloaded for over 2 hours, beginning around 6:47 UTC.
- Additionally, today in general is also running slow as hell.
- Polyglot Benchmark Speed Surfaces: Members discussed where to find the speed of running the Polyglot benchmark, suggesting a balance between speed, cost, and accuracy for optimal Aider UX.
- To find the speed, one can select detail and look for āSeconds per caseā, then multiply that by the number of cases (225).
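That estimate is just seconds-per-case times case count; a quick sketch with a hypothetical per-case figure:

```python
# Estimate total polyglot-benchmark wall time from the leaderboard's
# "Seconds per case" detail field (the per-case value below is hypothetical).
seconds_per_case = 12.0
num_cases = 225  # case count mentioned in the discussion

total_seconds = seconds_per_case * num_cases
print(f"~{total_seconds / 60:.0f} minutes")  # 12.0 s/case * 225 cases -> ~45 minutes
```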
- Gemini-cli trashed for eternity: Multiple members derided gemini-cli, complaining that it takes eternity to edit a single file.
- Another member suggested it was due to the free googleapi being slow, but usable since itās free.
- API token rate limits met: Members complained that they were unable to use the gemini-cli even with provided API tokens with credits.
- The error message was a 429 rate limit.
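A 429 is a signal to back off and retry rather than fail outright. A generic exponential-backoff sketch; the `call_api` stub, error type, and timings are illustrative, not aider's or Gemini's actual client code:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` on rate-limit errors with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except RuntimeError as err:  # stand-in for an HTTP 429 error
            if "429" not in str(err) or attempt == max_retries - 1:
                raise
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Demo: a stub that fails twice with a 429, then succeeds.
attempts = {"n": 0}
def call_api():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("429 rate limit")
    return "ok"

result = with_backoff(call_api, base_delay=0.01)
print(result)  # -> ok, after two retried 429s
```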
aider (Paul Gauthier) ā· #questions-and-tips (5 messages):
Local Model Performance, Aider --no-always option, Switching Model Edit Formats
- Local Models Stumble, Fail to Aider: Users report poor performance with local models such as Qwen3:32b, qwen2.5-coder:32b, and codellama:34b-instruct in aider, questioning if they are doing something wrong.
- A member inquired about the backend used (ollama, lmstudio, transformers, vllm), context window length, model template, and usage of features like RoPE or kvcache, also mentioning that 30B+ parameter models likely require quantization.
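The quantization point is easy to sanity-check with arithmetic: weights alone for a 32B-parameter model at 16-bit far exceed a single 24 GB consumer GPU, while 4-bit fits. A rough sketch that ignores KV cache and activation overhead:

```python
def weight_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GB for a model with the given parameter count."""
    return params_billion * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"32B model at {bits}-bit: ~{weight_gb(32, bits):.0f} GB of weights")
# 16-bit: ~64 GB, 8-bit: ~32 GB, 4-bit: ~16 GB
```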
- Aider needs opposite of --yes-always flag: A user inquired about an equivalent of a `--no-always` option in aider, to reverse the effect of `--yes-always`.
- The request has not been satisfied within the messages in this channel so far.
- Format follows the Model: A user observed that the edit format changes from diff-fenced to whole when switching between gemini-2.5-pro and 2.5-flash.
- No solution was provided, the user attached a screenshot to illustrate the issue.
aider (Paul Gauthier) ā· #links (1 messages):
claude-code-api, api/providers
- Sharing experience with claude-code-api: A member shared their experience using claude-code-api.
- api/providers: They indicated that they had built many similar api/providers too.
Nomic.ai (GPT4All) ā· #general (10 messagesš„):
Llama 3, Android local LLMs, Multimind SDK, AI News Sources, r/LocalLLaMA
- Android Users Eager for Llama 3: Android users are requesting Llama 3 be brought to Android, claiming phones like the Poco F7 Ultra are more powerful than PCs.
- Another user suggested trying anythingLLM or ALLM for local LLMs on Android.
- Multimind SDK Wraps Conversions & Fine-tuning: A user introduced the open-source Multimind SDK (Repo, Website), describing it as wrapping model conversion, fine-tuning, and inference for OpenAI, HuggingFace, and Ollama.
- It supports Python, CLI, and NPM, and is described as LangChain meets LiteLLM with extra powers.
- r/LocalLLaMA offers quick news: A user recommended r/LocalLLaMA as a good source of daily information on AI, stating that no other place is as quick.
- The user added that the news there is not even about Metaās Llama model anymore and it is about all the local LLMs nowadays, as Llama kinda went by the wayside.
DSPy ā· #general (8 messagesš„):
DSPy module creation, LLM-RAG-Agent with DSPy, Recipes for starting with little to no data, dspy.Tool and dspy.ToolCalls vs OpenAI functions/tools, Weaviate vectordb multi tenancy fix
- Runtime Signatures, Compile-Time Solved: A member faced a challenge creating a new DSPy module in the forward method of another module where a signature depended on runtime information from an LLM call.
- They solved it by ensuring the signature is known at compile time for optimization.
- LLM-RAG-Agent rides DSPy: A member shared a cool project that uses DSPy under the hood, linking to a Nature article and its GitHub repo.
- Low-Data Recipe Craving: A member inquired about recipes for starting with little to no data, aiming to sequentially tune an eval module and then optimize their actual module.
- Another member noted that this approach is similar to reinforcement learning.
- DSPy Tools Trumping OpenAIās?: A member questioned why DSPy seems to prefer text prompts over the bespoke API of OpenAIās functions/tools, specifically regarding the new dspy.Tool and dspy.ToolCalls.
- They inquired if thereās a reason for always using text content instead.
- Weaviate Multi-Tenancy Fix is PRād: A member requested a maintainer to review a PR to fix Weaviate vectordb multi-tenancy with DSPy.
- They believe the fix will be helpful for others using Weaviate with DSPy.
Manus.im Discord ā· #general (6 messages):
Usage visibility, Video generation, Manus down, Big update
- Usage Visibility Vanishes: A user noted the disappearance of the real-time usage tracker at the bottom left during task execution, which previously allowed monitoring credit consumption.
- Now, users must navigate back to the main menu or keep a separate window open to observe credit usage, a feature previously deemed handy.
- Video Generation Speculation: A user inquired about the availability of video generation for free users, either currently or in the future.
- No definitive answer was provided, leaving the possibility open.
- Manus Down or Big Update?: Some users expressed concern about Manus being down, and speculated about a big update.
- No definitive response or confirmation was given.
Torchtune ā· #dev (3 messages):
Generic Tokenizer Parity, HF Tokenizer Parity, Special Tokens, Chat Templates
- Generic/HF Tokenizer Parity: Resolved?: A user inquired about the status of generic/HF tokenizer parity, specifically if issues related to token count have been resolved.
- They expressed a desire to standardize behind one loader to allow users to tweak the tokenizer in the familiar HF environment, use `save_pretrained`, and operate entirely within `torchtune` for training.
- HF Tokenizer should support Chat Templates: A user suggested it would be awesome if the `hf_tokenizer` also supported chat templates.
- Special Tokens Desired by Users: A user indicated that their users are interested in adding special tokens.
tinygrad (George Hotz) ā· #general (2 messages):
Tensor.stack Tuple Support, SDPA Enable GQA
- Tensor.stack Seeks Tuple Type: A member requested tuple support for `Tensor.stack` to match PyTorch, questioning if itās desirable given the methodās nature, at least suggesting improved error handling.
- The conversation implies a need to align tinygradās `Tensor.stack` functionality with PyTorchās for better compatibility and user experience, debating whether to fully implement tuple support or focus on clearer error messages when tuples are encountered.
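The two options can be illustrated in plain Python: accept any list or tuple (as `torch.stack` does) and fail early with a clear message on anything else. This is an illustrative stand-in operating on nested lists, not tinygrad's actual `Tensor.stack` signature:

```python
def stack(tensors):
    """Stack equal-length vectors along a new leading axis.

    Accepts a list or tuple, mirroring torch.stack; anything else raises a
    clear TypeError up front instead of failing obscurely later.
    """
    if not isinstance(tensors, (list, tuple)):
        raise TypeError(
            f"stack expects a list or tuple of tensors, got {type(tensors).__name__}"
        )
    if not tensors:
        raise ValueError("stack expects a non-empty sequence")
    if len({len(t) for t in tensors}) != 1:
        raise ValueError("all tensors must have the same shape")
    return [list(t) for t in tensors]

print(stack(([1, 2], [3, 4])))  # tuple accepted, like torch.stack -> [[1, 2], [3, 4]]
```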
- SDPA Eyes Enable GQA Feature: A contributor inquired about adding the `enable_gqa` feature to tinygradās Scaled Dot-Product Attention (SDPA) to align with PyTorch.
- This suggests an effort to enhance tinygradās SDPA implementation by incorporating Grouped Query Attention (GQA) capabilities, mirroring PyTorchās functionality for potentially improved performance and broader applicability.
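At its core, grouped-query attention is a head mapping: `n_q` query heads share `n_kv` KV heads, with query head `q` using KV head `q // (n_q // n_kv)`. A minimal sketch of that mapping (illustrative, not PyTorch's or tinygrad's implementation):

```python
def gqa_kv_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Map a query-head index to the KV-head index it shares under GQA."""
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into KV groups"
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

# 8 query heads sharing 2 KV heads: heads 0-3 use KV head 0, heads 4-7 use KV head 1.
mapping = [gqa_kv_head(q, 8, 2) for q in range(8)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With `n_kv_heads == n_q_heads` this degenerates to standard multi-head attention, and with `n_kv_heads == 1` to multi-query attention, which is why a single `enable_gqa`-style flag can cover all three.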
LLM Agents (Berkeley MOOC) ā· #mooc-lecture-discussion (1 messages):
Securing OpenAI API Keys, Tracking API Usage, Multi-Service Key Access
- Seeking advice on securing OpenAI API keys: Members are requesting advice on how to secure OpenAI API keys and other LLM API keys when building Agentic AI workflows and AI agents.
- The user emphasizes the need for never losing API keys, tracking API usage, and per Agent API Usage, especially in setups with multiple services sharing access and no real infra/security team yet.
- Desire for API key tracking strategies: The user is looking for general advice and thoughts on securing OpenAI keys or other LLM API keys.
- They want to learn how to never lose API keys and track API usage, particularly per Agent API Usage, within setups that involve AI agents/AI workflows calling APIs, multiple services sharing access, and lacking dedicated infrastructure or security teams.
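A common baseline for the setup described above: keep keys out of source control in environment variables, and route every call through a thin wrapper that attributes usage per agent. A stdlib-only sketch; the `AGENT_KEY_` naming convention and `record_call` helper are illustrative, not any specific product's API:

```python
import os
from collections import Counter

# Demo only: in practice, set these in your shell or a secret manager,
# never hard-coded in source.
os.environ["AGENT_KEY_SEARCH"] = "sk-example"

# Collect per-agent keys from the environment using a naming convention.
AGENT_KEYS = {
    name[len("AGENT_KEY_"):].lower(): value
    for name, value in os.environ.items()
    if name.startswith("AGENT_KEY_")
}

usage = Counter()  # tokens per agent; a real setup would persist this

def record_call(agent: str, tokens: int) -> None:
    """Attribute API usage to a specific agent so costs can be tracked per agent."""
    if agent not in AGENT_KEYS:
        raise KeyError(f"no API key configured for agent {agent!r}")
    usage[agent] += tokens

record_call("search", 1200)
record_call("search", 300)
print(dict(usage))  # {'search': 1500}
```

Giving each agent its own key also means a leaked key can be revoked without taking down every other service sharing access.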