All tags
Topic: "batch-inference"
not much happened today
gpt-5.2-codex glm-4.7 openai cursor github cerebras modal artificial-analysis vllm long-running-tasks autonomous-agents code-generation inference-speed latency batch-inference gpu-scaling model-evaluation agent-systems operational-scaling swyx kevinweil pierceboggan mntruell scaling01
OpenAI launched the GPT-5.2-Codex API, touted as its strongest coding model for long-running tasks and cybersecurity. Cursor integrated GPT-5.2-Codex and let it run a browser autonomously for a week, producing over 3 million lines of Rust code. GitHub incorporated the model into its coding tools, easing enterprise adoption. Discussions highlighted the importance of review loops in agent systems and debated evaluation metrics for coding models. OpenAI partnered with Cerebras to improve inference speed and latency, with Cerebras serving GLM-4.7 at 1,445 tokens/sec with low latency. Provider benchmarking revealed tradeoffs among throughput, latency, and context window size. Modal shared operational insights from scaling a self-hosted inference fleet of 20k GPUs, focused on batch-inference optimization with vLLM and the FlashInfer backend (a minimal sketch of that pattern follows). Overall, the day centered on inference infrastructure, long-horizon autonomous agents, and coding-model evaluation.
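For readers unfamiliar with the pattern, here is a minimal sketch of offline batch inference with vLLM, selecting the FlashInfer attention backend via an environment variable. The model name, prompts, and sampling settings are illustrative assumptions, not details from Modal's post.

```python
# Minimal vLLM offline batch-inference sketch (illustrative; not Modal's code).
import os

# Select the FlashInfer attention backend; must be set before vLLM initializes.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# Model name is a placeholder; any vLLM-supported checkpoint works.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

prompts = [
    "Summarize the benefits of continuous batching in one sentence.",
    "Explain paged attention to a systems engineer.",
]
params = SamplingParams(temperature=0.7, max_tokens=128)

# generate() pushes all prompts through the engine's continuous-batching
# scheduler in one call, rather than decoding them one at a time.
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```

At fleet scale, the same offline API is typically wrapped in a work queue so that each GPU worker stays saturated with full batches.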
not much happened today
gemini-robotics-1.5 gemini-live embeddinggemma veo-3 gemini-2.5-flash code-world-model-32b qwen3-coder-30b vllm-v1 mlx-lm flashattention-4 google meta-ai-fair perplexity-ai baseten spatial-reasoning temporal-reasoning agentic-ai code-semantics code-execution-traces coding-infrastructure runtime-optimization batch-inference embedding-latency api model-optimization model-performance osanseviero _anniexie rmstein scaling01 giffmana cline redhat_ai awnihannun charles_irl bernhardsson akshat_b aravsrinivas
Google released a dense September update: Gemini Robotics 1.5 with enhanced spatial and temporal reasoning, Gemini Live, EmbeddingGemma, and Veo 3 reaching general availability to power creative workflows. It also introduced agentic features such as restaurant-reservation agents and cut pricing for Gemini 2.5 Flash. Meta AI unveiled the open-weight Code World Model (CWM) 32B, which excels at code semantics and math benchmarks and innovates by training code models on execution traces. Local-first coding setups highlighted Qwen3-Coder-30B running efficiently on consumer GPUs, paired with tools like Cline and LM Studio. Runtime improvements included vLLM v1 support for hybrid models and mlx-lm gaining batch inference on Apple silicon (see the sketch below). On the infrastructure side, FlashAttention 4 was reverse-engineered, revealing a ~20% speedup from architectural optimizations. Perplexity AI advanced its independent web index and browsing API, with feed refreshes coming. Superhuman cut embedding latency using Baseten.
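Since mlx-lm's batch inference is the Apple-silicon highlight above, here is a hedged sketch of what batched generation looks like. It assumes a batch_generate helper along the lines of recent mlx-lm releases; the exact function name, signature, and return type may differ by version, and the model path is illustrative.

```python
# Hedged sketch of batched generation with mlx-lm on Apple silicon.
# Assumes a batch_generate helper as described in recent mlx-lm releases;
# the exact API may differ across versions.
from mlx_lm import load, batch_generate

# Model path is illustrative; any MLX-converted checkpoint (e.g. from the
# mlx-community Hugging Face org) should work.
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

prompts = [
    "Write a haiku about unified memory.",
    "Explain KV caching in two sentences.",
]

# One call decodes all prompts together, instead of looping generate()
# once per prompt, which is what makes it batch inference.
result = batch_generate(model, tokenizer, prompts=prompts, max_tokens=64)
print(result)
```

The appeal on Apple silicon is that unified memory lets a single machine hold both the weights and several concurrent KV caches, so batching raises throughput without a discrete-GPU memory split.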