Model: "mimo"

eagle-3.1 unigram-tokenizer qwen-3.5 deepseek-v4-pro mimo deep-agents-v0.6 397b-parameter-model eaglecorp vllm_project perplexity_ai alibaba lightseek nvidia mooncake flashattention kimmonismus deepseek xiaomi langchain baseten trajectory clay harvey decagon mercor rogo rlm inference-optimization long-context speculative-decoding tokenization attention-mechanisms kv-cache cache-hierarchy agent-engineering model-harness-memory-fit continual-learning quantization autoscaling memory-centric-agents evaluation-automation kimmonismus _luofuli vtrivedy10

Inference optimization is increasingly architectural. EAGLE 3.1, developed in collaboration with vLLM and TorchSpec, improves speculative decoding and long-context handling. Perplexity open-sourced a rebuilt Unigram tokenizer that cuts CPU use by 5–6× and reaches 63 µs at 514 tokens. Qwen3.5 hits 580 tokens/s through a joint effort by Alibaba, LightSeek, NVIDIA, Mooncake, and FlashAttention-4 contributors. API price cuts from Chinese labs are sustainable because they rest on structural KV-cache and attention improvements, as shown by DeepSeek V4-Pro and Xiaomi MiMo significantly reducing caching costs. Agent engineering is shifting its focus from model quality to model-harness-memory fit: LangChain released Deep Agents v0.6, and tools like LangSmith Engine automate evaluation loops. Trajectory launched a continual learning platform with $15M in funding and partners such as Clay and Harvey; it supports large models, including a 397B-parameter model deployed on autoscaled H100 infrastructure. Open-source memory-centric agents and minimal training harnesses also gained attention.
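
To put the quoted figures on a common scale, here is a rough back-of-the-envelope sketch. It assumes that "63 µs at 514 tokens" means one tokenizer call on a 514-token input takes 63 µs, which the summary does not state explicitly, and that 580 tokens/s is a decode rate for a single stream. The function names are illustrative and do not come from any of the projects mentioned.

```python
# Figures quoted in the summary; their interpretation is an assumption (see lead-in).
TOKENIZER_LATENCY_US = 63.0      # microseconds for one tokenizer call
TOKENIZER_INPUT_TOKENS = 514     # tokens produced by that call
QWEN_DECODE_TOKENS_PER_S = 580.0 # reported Qwen3.5 decode rate


def per_token_us(latency_us: float, n_tokens: int) -> float:
    """Average tokenizer time per token, in microseconds."""
    return latency_us / n_tokens


def tokens_per_second(latency_us: float, n_tokens: int) -> float:
    """Implied tokenizer throughput if calls ran back to back on one core."""
    return n_tokens / (latency_us * 1e-6)


def ms_per_token(tokens_per_s: float) -> float:
    """Average time per generated token, in milliseconds."""
    return 1000.0 / tokens_per_s


if __name__ == "__main__":
    print(f"tokenizer: {per_token_us(TOKENIZER_LATENCY_US, TOKENIZER_INPUT_TOKENS):.3f} us/token")
    print(f"tokenizer: {tokens_per_second(TOKENIZER_LATENCY_US, TOKENIZER_INPUT_TOKENS):,.0f} tokens/s")
    print(f"decode:    {ms_per_token(QWEN_DECODE_TOKENS_PER_S):.2f} ms/token")
```

Under these assumptions, tokenization costs roughly a tenth of a microsecond per token while decoding costs under two milliseconds per token. That gap suggests the tokenizer rewrite matters mainly for CPU load rather than end-to-end latency.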

