Company: "github"
DeepSeek V3.1: 840B token continued pretrain, beating Claude 4 Sonnet at 11% of its cost
deepseek-v3.1 seed-oss-36b computerrl gemini-2.5-pro gpt-5 claude-code gpt-oss-120b gpt-oss-20b deepseek bytedance zhipu-ai github microsoft anthropic together-ai baseten huggingface token-efficiency coding agentic-benchmarks long-context reinforcement-learning developer-tools fine-tuning multinode-training model-release teortaxestex rasbt lukehoban burkeholland _catwu cline winglian
DeepSeek released DeepSeek V3.1, a quietly rolled-out open model with a 128K context window and improvements in token efficiency, coding, and agentic benchmarks. ByteDance launched the permissive Seed-OSS 36B model on Hugging Face, noted for long-context and reasoning capabilities. Zhipu AI introduced ComputerRL, a reinforcement learning framework for computer-use agents, achieving strong benchmark results. In developer tooling, GitHub Copilot expanded globally, Microsoft VS Code integrated Gemini 2.5 Pro and updated GPT-5 agent prompts, and Anthropic launched Claude Code seats with spend controls. Open-source fine-tuning advances include Together AI adding SFT for gpt-oss-120B/20B and Baseten enabling multinode 120B training with Truss CLI. The community noted mixed performance and ongoing post-training adjustments for DeepSeek V3.1.
Creating a LLM-as-a-Judge
claude-3.5-sonnet claude-3.5 notebooklm simpleqa recraft-v3 anthropic openai deepmind apple zep perplexity-ai github critique-shadowing llm-judging domain-experts dataset-creation prompt-engineering error-analysis temporal-knowledge-graphs memory-layer ai-agent-memory hallucination-reduction integration hamel-husain swyx
Anthropic released details on Claude 3.5 SWEBench+SWEAgent, while OpenAI introduced SimpleQA and DeepMind launched NotebookLM. Apple announced new M4 MacBooks, and a new SOTA image model, Recraft v3, emerged. Hamel Husain presented a detailed 6,000-word treatise on creating LLM judges using a method called critique shadowing to align LLMs with domain experts, addressing the problem of untrusted and unused data in AI teams. The workflow involves expert-reviewed datasets and iterative prompt refinement. Additionally, Zep introduced a temporal knowledge graph memory layer to improve AI agent memory and reduce hallucinations. Anthropic also integrated Claude 3.5 Sonnet with GitHub Copilot, expanding access to Copilot Chat users.
GitHub Copilot Strikes Back
claude-3-5-sonnet gemini-1.5-pro o1-preview gemini-flash-8b github anthropic google-deepmind openai weights-biases model-picker-ui multi-model-integration natural-language-applications deployment-free-hosting model-prompting multimodal-observability audio-tracing codebase-optimization price-performance-ratio cassidy-williams fchollet rohanpaul_ai jxmnop
GitHub's tenth annual Universe conference introduced the Multi-model Copilot featuring Anthropic's Claude 3.5 Sonnet, Google's Gemini 1.5 Pro, and OpenAI's o1-preview models in a new picker UI, allowing developers to choose from multiple companies' models. The event also showcased GitHub Spark, an AI-native tool for building natural language applications with deployment-free hosting and integrated model prompting. Additionally, GitHub updated its Copilot Workspace with new agents and security Autofix features. Weights & Biases launched Weave with multimodal observability supporting audio, text, and images, integrating the OpenAI Realtime API. Twitter recaps highlighted tinygrad's codebase optimization and discussions on GenAI adoption and Gemini Flash-8B's cost efficiency at $0.0375 per million tokens.