All tags
Topic: "sliding-window-attention"
not much happened today
gpt-oss-120b gpt-oss-20b kimi-k2 deepseek-r1 qwen-3-32b openai huggingface microsoft llamaindex ollama baseten fireworksai cerebras groq together anthropic google uk-aisi sliding-window-attention mixture-of-experts rope context-length mxfp4-format synthetic-data reasoning-core-hypothesis red-teaming benchmarking coding-benchmarks model-performance fine-tuning woj_zaremba sama huybery drjimfan jxmnop scaling01 arunv30 kevinweil xikun_zhang_ jerryjliu0 ollama basetenco reach_vb gneubig shxf0072 _lewtun
OpenAI released its first open-weight models since GPT-2, gpt-oss-120b and gpt-oss-20b, which quickly trended on Hugging Face. Microsoft supports these models via Azure AI Foundry and Windows Foundry Local. Key architectural features include sliding window attention, mixture of experts (MoE), a RoPE variant, and a 256k context length. The models ship in the new MXFP4 quantization format, supported by llama.cpp. Hypotheses suggest gpt-oss was trained largely on synthetic data to enhance safety and performance, supporting the Reasoning Core Hypothesis. OpenAI announced a $500K red-teaming bounty with partners including Anthropic, Google, and the UK AISI. Performance critiques highlight inconsistent benchmark results, with gpt-oss-120b scoring 41.8% on the Aider Polyglot coding benchmark, trailing competitors like Kimi-K2 and DeepSeek-R1. Some users note the model excels at math and reasoning but lacks common sense and practical utility.
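Since sliding window attention recurs throughout this topic page, a minimal sketch of the core idea may help: each token attends only to the most recent `window` positions instead of the full causal prefix. This is an illustrative NumPy mask, not OpenAI's implementation; the function name and parameters are hypothetical.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean causal sliding-window mask: position i may attend to
    positions j satisfying i - window < j <= i (toy illustration)."""
    i = np.arange(seq_len)[:, None]  # query positions (rows)
    j = np.arange(seq_len)[None, :]  # key positions (columns)
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 3)
# Each row has at most `window` True entries, ending at the diagonal.
```

In a real model this mask (or an equivalent attention-kernel restriction) caps per-token attention cost at O(window) rather than O(sequence length).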
Anthropic releases Claude 4 Sonnet and Opus: Memory, Agent Capabilities, Claude Code, Redteam Drama
claude-4 claude-4-opus claude-4-sonnet claude-3.5-sonnet anthropic instruction-following token-accounting pricing-models sliding-window-attention inference-techniques open-sourcing model-accessibility agent-capabilities-api extended-context model-deployment
Anthropic has officially released Claude 4 in two variants: Claude Opus 4, a high-capability model for complex tasks priced at $15/$75 per million input/output tokens, and Claude Sonnet 4, optimized for efficient everyday use. The release emphasizes instruction following and sustained work sessions of up to 7 hours. Community discussions highlight concerns about token pricing, token-accounting transparency, and calls to open-source the Claude 3.5 Sonnet weights to support local model development. The news also covers Claude Code reaching general availability, a new Agent Capabilities API, and various livestreams and reports detailing these updates. There is notable debate around sliding window attention and advanced inference techniques for local deployment.
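One reason sliding window attention matters for local deployment is memory: because no token attends further back than `window` positions, the key/value cache can be a fixed-size rolling buffer rather than growing with sequence length. A toy sketch of that idea, assuming a hypothetical `RollingKVCache` class (not Anthropic's or any production implementation):

```python
from collections import deque

class RollingKVCache:
    """Rolling-buffer KV cache sketch: with sliding-window attention,
    only the last `window` key/value pairs are ever attended to, so
    older entries can be evicted automatically."""
    def __init__(self, window: int):
        self.window = window
        self.keys = deque(maxlen=window)    # oldest entries fall off
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def context(self):
        return list(self.keys), list(self.values)

cache = RollingKVCache(window=4)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")
ks, vs = cache.context()
# Only the 4 most recent entries survive, bounding memory regardless
# of how long generation runs.
```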
1/4/2024: Jeff Bezos backs Perplexity's $520m Series B.
wizardcoder-33b-v1.1 mobilellama-1.4b-base shearedllama tinyllama mixtral-8x7b perplexity anthropic google nous-research mistral-ai hugging-face document-recall rnn-memory synthetic-data benchmarking multi-gpu-support context-length model-architecture sliding-window-attention model-parallelism gpu-optimization jeff-bezos
Perplexity announced their Series B funding round with notable investor Jeff Bezos, who previously invested in Google 25 years ago. Anthropic is raising $750 million, projecting at least $850 million in annualized revenue next year and implementing "brutal" changes to its Terms of Service. Discussions in the Nous Research AI Discord cover document recall limits when working with gigabytes of data, RNN memory/compute trade-offs, synthetic datasets, and benchmarking of models like WizardCoder-33B-V1.1, MobileLLaMA-1.4B-Base, ShearedLLaMA, and TinyLLaMA. Other highlights include Unsloth optimizations for multi-GPU systems, AI rap voice models, context-extending code, and architectural innovations like applying Detectron/ViT backbones to LLMs, sliding window attention in Mistral, and parallelizing Mixtral 8x7b with FSDP and HF Accelerate.
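Mixtral 8x7b, mentioned above, is a mixture-of-experts model: a learned router selects a small number of experts per token and mixes their outputs. A simplified sketch of Mixtral-style top-k gating (the function name is hypothetical, and this assumes softmax renormalization over only the selected experts):

```python
import numpy as np

def topk_moe_gate(logits: np.ndarray, k: int = 2):
    """Toy MoE router: pick the k highest-scoring experts and
    renormalize their softmax weights to sum to 1."""
    idx = np.argsort(logits)[::-1][:k]        # indices of top-k experts
    w = np.exp(logits[idx] - logits[idx].max())  # numerically stable softmax
    w /= w.sum()
    return idx, w

# Four experts; the router sends this token to experts 1 and 3.
idx, w = topk_moe_gate(np.array([0.0, 3.0, 1.0, 2.0]), k=2)
```

Only the selected experts' feed-forward blocks run for that token, which is how MoE models keep inference cost far below their total parameter count.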