Topic: "sparse-moe"
Mistral 3: Mistral Large 3 + Ministral 3B/8B/14B open weights models
mistral-large-3 ministral-3 clara-7b-instruct gen-4.5 claude-code mistral-ai anthropic apple runway moondream sparse-moe multimodality benchmarking open-source model-licensing model-performance long-context inference-optimization instruction-following local-inference code-generation model-integration anjney_midha _akhaliq alexalbert__ _catwu mikeyk
Mistral has launched the Mistral 3 family, including the Ministral 3 models (3B/8B/14B) and Mistral Large 3, a sparse MoE model with 675B total parameters and a 256k context window, all released under the Apache 2.0 license. Early benchmarks rank Mistral Large 3 at #6 among open models, with strong coding performance. The launch ships with broad ecosystem support, including vLLM, llama.cpp, Ollama, and LM Studio integrations. Meanwhile, Anthropic acquired the open-source Bun runtime to accelerate Claude Code, which reportedly reached a $1B run-rate in ~6 months. Anthropic also announced discounted Claude plans for nonprofits and shared internal insights on AI's impact on work.
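For local inference, a minimal sketch of serving one of the smaller Ministral 3 checkpoints through vLLM's Python API follows; the Hugging Face repo name and context cap are illustrative assumptions, not the official identifiers from the release.

```python
# Minimal local-inference sketch with vLLM's Python API.
# The model ID is a placeholder; substitute the actual Hugging Face
# repo name published with the Ministral 3 release.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Ministral-3-8B-Instruct",  # hypothetical repo name
    max_model_len=32768,                        # cap context to fit local memory
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Explain sparse mixture-of-experts routing in one paragraph."], params
)
print(outputs[0].outputs[0].text)
```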
MiniMax M2 230BA10B — 8% of Claude Sonnet's price, ~2x faster, new SOTA open model
minimax-m2 hailuo-ai huggingface baseten vllm modelscope openrouter cline sparse-moe model-benchmarking model-architecture instruction-following tool-use api-pricing model-deployment performance-evaluation full-attention qk-norm gqa rope reach_vb artificialanlys akhaliq eliebakouch grad62304977 yifan_zhang_ zpysky1125
MiniMax M2, an open-weight sparse MoE model from MiniMax (Hailuo AI), launches with ~230B total parameters and 10B active parameters, delivering performance close to frontier closed models and ranking #5 overall on the Artificial Analysis Intelligence Index v3.0. It targets coding and agent tasks, is MIT-licensed, and is available via API at competitive pricing. The architecture uses full attention, QK-Norm, GQA, partial RoPE, and sigmoid routing, with day-0 support in vLLM and deployment on platforms like Hugging Face and Baseten. Despite its verbosity and the lack of a tech report, it marks a significant win for open models.
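To make the routing description concrete, here is a toy sketch of sigmoid top-k routing in a sparse MoE layer; the layer sizes, top-k value, and renormalization step are illustrative and not taken from MiniMax's implementation.

```python
# Toy sigmoid top-k routing for a sparse MoE layer (PyTorch).
# Dimensions and normalization choices are illustrative only.
import torch
import torch.nn as nn

class SigmoidRoutedMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=32, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        scores = torch.sigmoid(self.router(x))   # per-expert scores in (0, 1), no softmax
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens whose slot routes to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Only the top-k experts run per token, which is how a model with ~230B total parameters keeps only ~10B parameters active per forward pass.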
Qwen 1.5 Released
qwen-1.5 mistral-7b sparsetral-16x7b-v2 bagel-7b-v0.4 deepseek-math-7b-instruct deepseek qwen mistral-ai hugging-face meta-ai-fair quantization token-context multilinguality retrieval-augmented-generation agent-planning code-generation sparse-moe model-merging fine-tuning direct-preference-optimization character-generation ascii-art kanji-generation vr retinal-resolution light-field-passthrough frozen-networks normalization-layers
Chinese AI models Yi, DeepSeek, and Qwen are gaining attention for strong performance, with Qwen 1.5 offering up to 32k token context and compatibility with Hugging Face transformers and quantized models. The TheBloke Discord discussed quantization of a 70B LLM, the introduction of Sparsetral, a sparse MoE model based on Mistral, debates on merging vs. fine-tuning, and Direct Preference Optimization (DPO) for character generation. The Nous Research AI Discord covered challenges in Japanese Kanji generation, AI scams on social media, and Meta's VR headset prototypes showcased at SIGGRAPH 2023. Discussions also included fine-tuning frozen networks and new models such as bagel-7b-v0.4, DeepSeek-Math-7b-instruct, and Sparsetral-16x7B-v2.
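Since the discussion highlights Qwen 1.5's compatibility with Hugging Face transformers and quantized models, here is a hedged sketch of loading one of the chat checkpoints with 4-bit quantization; the exact repo name and bitsandbytes settings are assumptions for illustration, not settings from the discussion.

```python
# Sketch: load a Qwen 1.5 chat model 4-bit quantized via transformers + bitsandbytes.
# Model ID and quantization settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen1.5-7B-Chat"  # assumed checkpoint name
quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize sparse MoE in two sentences."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
print(tok.decode(model.generate(inputs, max_new_tokens=128)[0], skip_special_tokens=True))
```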