Topic: "hybrid-architecture"
NVIDIA Nemotron 3: hybrid Mamba-Transformer completely open source models from 30B to 500B
nemotron-3-nano qwen3-30b-a3b-base nvidia huggingface togethercompute baseten vllm llamaindex hybrid-architecture mixture-of-experts reinforcement-learning long-context model-release open-source-models model-training model-optimization benchmarking agent-training ctnzr andrew_n_carr awnihannun
NVIDIA has released Nemotron 3 Nano, a fully open-source hybrid Mamba-Transformer Mixture-of-Experts (MoE) model with 30B parameters and a 1-million-token context window. The release includes open weights, training recipes, datasets, and an RL environment suite called NeMo Gym, and it permits commercial use under the NVIDIA Open Model License. The model achieves state-of-the-art results on benchmarks such as SWE-Bench and the Artificial Analysis Intelligence Index, outperforming Qwen3-30B-A3B. Ecosystem support is immediate, with integrations into inference stacks like vLLM, llama.cpp, and Baseten. Upcoming larger models, Nemotron Super and Ultra, will feature NVFP4 pretraining and LatentMoE routing to optimize compute. The release marks a significant milestone for open-source American AI, pairing comprehensive open assets with an advanced hybrid architecture.
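For readers who want to try the day-one vLLM path, here is a minimal offline-inference sketch. The Hugging Face model ID below is a placeholder assumption, not a confirmed repo name from the release, and the context length is deliberately capped well below 1M tokens for memory reasons.

```python
# Minimal vLLM offline-inference sketch for a Nemotron 3 Nano-style checkpoint.
# NOTE: "nvidia/Nemotron-3-Nano-30B" is a placeholder model ID; substitute the
# actual Hugging Face repo name from NVIDIA's release.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Nemotron-3-Nano-30B",  # placeholder ID
    trust_remote_code=True,              # hybrid Mamba-Transformer layers may ship custom code
    max_model_len=32768,                 # raise toward the 1M-token window only if memory allows
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)
outputs = llm.generate(["Summarize the Nemotron 3 Nano release in two sentences."], params)
print(outputs[0].outputs[0].text)
```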
Qwen3-Next-80B-A3B-Base: Towards Ultimate Training & Inference Efficiency
qwen3-next qwen3 mixtral-8x7b gemini-2.5-pro alibaba mistral-ai deepseek snowflake hugging-face baseten nvidia mixture-of-experts model-sparsity gated-attention hybrid-architecture rmsnorm model-stability model-training inference-optimization multi-token-prediction model-deployment justinlin610 teortaxestex yuchenj_uw
MoE (Mixture of Experts) models have become essential in frontier AI systems, and Qwen3-Next pushes sparsity further by activating only about 3.7% of parameters (3B out of 80B) per token, using a hybrid architecture that combines Gated DeltaNet and Gated Attention layers. The design uses 512 experts in total, with 10 routed experts plus 1 shared expert active per token, Zero-Centered RMSNorm for training stability, and improved MoE router initialization, yielding roughly 10× cheaper training and 10× faster inference than previous models. Alibaba reports that Qwen3-Next outperforms Gemini-2.5-Flash-Thinking and approaches the flagship 235B model's performance, with deployments on Hugging Face and Baseten and native vLLM support for efficient inference.
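To make the sparsity arithmetic concrete, below is an illustrative top-k routing sketch, not Qwen3-Next's actual implementation (which fuses experts for speed and omits the per-token loop). The expert count and top-k follow the figures above; the hidden sizes are arbitrary placeholders.

```python
# Illustrative high-sparsity MoE layer: 512 experts, top-10 routed + 1 shared per token.
import torch
import torch.nn.functional as F

d_model, d_ff, num_experts, top_k = 64, 128, 512, 10   # hidden sizes are placeholders

def make_expert():
    return torch.nn.Sequential(
        torch.nn.Linear(d_model, d_ff), torch.nn.SiLU(), torch.nn.Linear(d_ff, d_model)
    )

router = torch.nn.Linear(d_model, num_experts, bias=False)
experts = torch.nn.ModuleList(make_expert() for _ in range(num_experts))
shared_expert = make_expert()                           # always active, alongside routed experts

def moe_layer(x):                                       # x: [num_tokens, d_model]
    probs = F.softmax(router(x), dim=-1)                # [num_tokens, num_experts]
    weights, idx = torch.topk(probs, top_k, dim=-1)     # pick 10 routed experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)
    outs = []
    for t in range(x.size(0)):                          # naive per-token loop: clear, not fast
        routed = sum(w * experts[e](x[t]) for w, e in zip(weights[t], idx[t].tolist()))
        outs.append(shared_expert(x[t]) + routed)
    return torch.stack(outs)

print(moe_layer(torch.randn(4, d_model)).shape)         # torch.Size([4, 64])
# Only 11 of 512 experts run per token; at the 80B scale that works out to
# roughly 3B active parameters, i.e. 3 / 80 ≈ 3.75%, matching the quoted ~3.7%.
```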
Mixtral 8x22B Instruct sparks efficiency memes
mixtral-8x22b llama-2-7b olmo-7b mistral-ai hugging-face google microsoft intel softbank nvidia multilinguality math code-generation context-window model-performance model-release retrieval-augmented-generation deepfake ai-investment ai-chip hybrid-architecture training-data guillaume-lample osanseviero _philschmid svpino
Mistral released an instruct-tuned version of its Mixtral 8x22B model, notable for activating only 39B parameters during inference; it outperforms larger models while supporting five languages, a 64k-token context window, and strong math and code capabilities. The model is available on Hugging Face under an Apache 2.0 license for local use. Elsewhere, Google plans to invest over $100 billion in AI, with Microsoft, Intel, and SoftBank also making large investments. The UK criminalized non-consensual deepfake porn, prompting debate over enforcement. A former Nvidia employee claims Nvidia's AI chip lead cannot be matched this decade. AI companions could become a $1 billion market, and AI has surpassed humans on several basic tasks while still lagging on complex ones. Zyphra introduced Zamba, a novel 7B-parameter hybrid model that outperforms LLaMA-2 7B and OLMo-7B with less training data, trained on 128 H100 GPUs over 30 days. The GroundX API advances retrieval-augmented generation accuracy.
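A minimal sketch of the "local use via Hugging Face" path with transformers follows; the repo ID should be checked against the Hub, and running the full 8x22B checkpoint in practice requires multiple high-memory GPUs or quantization, which this sketch glosses over with device_map="auto".

```python
# Minimal transformers sketch for Mixtral-8x22B-Instruct (Apache 2.0 on Hugging Face).
# device_map="auto" (requires the accelerate package) shards the model across
# whatever accelerators are visible; expect to need several large GPUs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x22B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a short Python function that reverses a string."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```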