not much happened today
gemma-4 google huggingface intel ollama unsloth reasoning agentic-workflows multimodality on-device-ai local-inference model-benchmarking moe vision audio-processing memory-optimization open-source model-performance fchollet demishassabis clementdelangue quixiai googlegemma ggerganov osanseviero maartengr basecampbernie prince_canuma measure_plan kimmonismus anemll arena stochasticchasm reach_vb zeneca everlier erick_lindberg_ anomalistg
Gemma 4 was launched by Google under an Apache 2.0 license, a significant open-model release focused on reasoning, agentic workflows, multimodality, and on-device use. It reportedly outperforms models 10x its size and shipped with immediate ecosystem support across vLLM, llama.cpp, Ollama, Intel hardware, Unsloth, and Hugging Face Inference Endpoints. Local inference benchmarks showed strong performance on consumer hardware, including the RTX 4090 and Mac mini M4, and early benchmarking praised its efficiency and ranking gains over previous versions. Meanwhile, Hermes Agent emerged as a popular open-source agent harness, noted for stability and capability on long tasks, with users switching to it from OpenClaw.
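A rough sketch (not from the source) of why local inference on a 24 GB card hinges on quantization: at fp16 the weights of a 31B-parameter model alone exceed an RTX 4090's VRAM, while 4-bit quantization brings them under budget. Bytes-per-weight figures are the standard ones for each precision; KV cache and runtime overhead are ignored.

```python
# Back-of-envelope weight-memory estimate for a 31B-parameter model
# on a 24 GB GPU (the RTX 4090 mentioned in the benchmarks above).
# Assumption: pure weight storage, no KV cache or activation overhead.

PARAMS = 31e9  # Gemma-4-31B dense, per the text
VRAM_GB = 24   # RTX 4090

def weights_gb(bits_per_weight: float) -> float:
    """Approximate weight memory in GB at a given precision."""
    return PARAMS * bits_per_weight / 8 / 1e9

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    verdict = "fits" if weights_gb(bits) < VRAM_GB else "exceeds"
    print(f"{label:>6}: {weights_gb(bits):5.1f} GB -> {verdict} {VRAM_GB} GB VRAM")
```

This yields roughly 62 GB at fp16, 31 GB at int8, and 15.5 GB at 4-bit, which is why the consumer-hardware numbers cited in the ecosystem (llama.cpp, Ollama) typically come from quantized builds.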
Gemma 4
gemma-4 gemma-4-31b gemma-4-26b-a4b google-deepmind multimodality long-context model-architecture moe local-inference model-optimization function-calling quantization jeffdean _philschmid rasbt ggerganov clattner_llvm julien_c clementdelangue
Google DeepMind released Gemma 4, a family of open-weight multimodal models with long-context support up to 256K tokens under an Apache 2.0 license, marking a major capability and licensing shift. The lineup includes a 31B dense model, a 26B MoE (A4B), and two edge models (E4B, E2B) optimized for local and edge deployment, with native multimodal support (text, vision, audio). Early benchmarks place Gemma-4-31B at #3 among open models, with strong scientific reasoning reflected in an 85.7% score on GPQA Diamond. Day-0 ecosystem support includes llama.cpp, Ollama, vLLM, and LM Studio, with notable local inference performance on hardware such as the M2 Ultra and RTX 4090. The architecture combines hybrid attention with MoE layering, diverging from the standard transformer stack. Community and developer engagement is high, with rapid adoption and tooling integration.
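A hypothetical sketch of the dense-vs-MoE trade-off in the lineup above, assuming the "A4B" suffix denotes roughly 4B active parameters per token (a common naming convention, not confirmed by the source) and using the standard ~2·N FLOPs-per-token approximation for a decoder forward pass:

```python
# MoE compute sketch: total parameters set the memory footprint,
# active parameters set the per-token compute.
# Assumptions: A4B = ~4B active params; FLOPs/token ~ 2 * active params.

TOTAL_PARAMS = 26e9    # Gemma-4-26B-A4B: all experts resident in memory
ACTIVE_PARAMS = 4e9    # parameters actually used per token (assumed from "A4B")
DENSE_PARAMS = 31e9    # Gemma-4-31B dense sibling

flops_moe = 2 * ACTIVE_PARAMS    # per-token compute for the MoE model
flops_dense = 2 * DENSE_PARAMS   # per-token compute for the dense model

print(f"MoE per-token FLOPs:   {flops_moe:.1e}")
print(f"Dense per-token FLOPs: {flops_dense:.1e}")
print(f"Compute ratio (dense/MoE): {flops_dense / flops_moe:.2f}x")
```

Under these assumptions the MoE variant does roughly 7.75x less compute per token than the dense 31B model while still holding 26B parameters in memory, which is the usual rationale for MoE on throughput-bound local hardware.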