Topic: "model-inference"
The new OpenAI Agents Platform
reka-flash-3 o1-mini claude-3-7-sonnet llama-3-3-70b sonic-2 qwen-chat olympiccoder openai reka-ai hugging-face deepseek togethercompute alibaba ai-agents api model-releases fine-tuning reinforcement-learning model-training model-inference multimodality voice-synthesis gpu-clusters model-distillation performance-optimization open-source sama reach_vb
OpenAI introduced a comprehensive suite of new tools for AI agents, including the Responses API, Web Search Tool, Computer Use Tool, File Search Tool, and an open-source Agents SDK with integrated observability tools, marking a significant step towards the "Year of Agents." Meanwhile, Reka AI open-sourced Reka Flash 3, a 21B parameter reasoning model that outperforms o1-mini and powers their Nexus platform, with weights available on Hugging Face. The OlympicCoder series surpassed Claude 3.7 Sonnet and much larger models on competitive coding benchmarks. DeepSeek built a 32K GPU cluster capable of training V3-level models in under a week and is exploring AI distillation. Hugging Face announced Cerebras inference support, achieving over 2,000 tokens/s on Llama 3.3 70B, 70x faster than leading GPUs. Cartesia's Sonic-2 voice AI model delivers 40ms latency via the Together API. Alibaba's Qwen Chat enhanced its multimodal interface with video understanding up to 500MB, voice-to-text, guest mode, and expanded file uploads. Sam Altman praised OpenAI's new API as "one of the most well-designed and useful APIs ever."
not much happened today
helium-1 qwen-2.5 phi-4 sky-t1-32b-preview o1 codestral-25.01 phi-3 mistral llama-3 gpt-3.5 llmquoter kyutai-labs lmstudio mistralai llamaindex huggingface langchainai hyperbolic-labs replit fchollet philschmid multilinguality token-level-distillation context-windows model-performance open-source reasoning coding retrieval-augmented-generation hybrid-retrieval multiagent-systems video large-video-language-models dynamic-ui voice-interaction gpu-rentals model-optimization semantic-deduplication model-inference reach_vb awnihannun lior_on_ai sophiamyang omarsar0 skirano yuchenj_uw
Helium-1 Preview by kyutai_labs is a 2B-parameter multilingual base LLM that outperforms Qwen 2.5, trained on 2.5T tokens with a 4096-token context window using token-level distillation from a 7B teacher model. A 4-bit build of Phi-4 was released in lmstudio and runs notably fast on an M4 Max. Sky-T1-32B-Preview is an open-source reasoning model trained for $450 that matches o1's performance with strong benchmark scores. Codestral 25.01 by mistralai is a new SOTA coding model supporting 80+ programming languages at 2x the generation speed.
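Token-level distillation, as described for Helium-1, trains a small student to match the teacher's full next-token distribution rather than only its sampled outputs. A minimal sketch of the standard KL-divergence objective in plain Python (toy logits and names are illustrative, not Kyutai's training code):

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(teacher_logits, student_logits):
    """Token-level distillation loss: mean KL(teacher || student)
    over sequence positions."""
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)  # teacher distribution (the soft target)
        q = softmax(s_row)  # student distribution
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher_logits)

# Toy example: two sequence positions, vocabulary of three tokens.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.1, 3.0]]
student = [[1.5, 0.7, -0.5], [0.0, 0.3, 2.5]]
loss = distill_loss(teacher, student)  # positive; 0.0 when student == teacher
```

In practice this loss is computed over minibatches of teacher and student logits and minimized with gradient descent, alongside or instead of the usual cross-entropy on hard labels.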
Innovations include AutoRAG for optimizing retrieval-augmented generation pipelines, Agentic RAG for autonomous query reformulation and critique, Multiagent Finetuning using societies of models like Phi-3, Mistral, LLaMA-3, and GPT-3.5 for reasoning improvements, and VideoRAG incorporating video content into RAG with LVLMs.
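Agentic RAG, unlike a single-shot retrieve-then-generate pipeline, lets the model critique its own retrieval and reformulate the query before answering. A toy sketch of that loop, where word-overlap scoring stands in for a real retriever and LLM critic (not taken from any specific framework):

```python
def retrieve(query, corpus, k=2):
    """Rank documents by naive word overlap with the query."""
    words = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: len(words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def critique(query, docs):
    """Accept the retrieval only if some document shares >= 2 query words."""
    words = set(query.lower().split())
    return any(len(words & set(d.lower().split())) >= 2 for d in docs)

def agentic_rag(query, corpus, reformulations, max_steps=3):
    """Retry retrieval with reformulated queries until the critique passes."""
    for _ in range(max_steps):
        docs = retrieve(query, corpus)
        if critique(query, docs):
            return query, docs
        if reformulations:
            query = reformulations.pop(0)  # agent rewrites its own query
    return query, docs

corpus = ["mamba state space models",
          "retrieval augmented generation pipelines",
          "gpu rental pricing"]
final_query, evidence = agentic_rag("cheap compute", corpus,
                                    reformulations=["gpu rental pricing"])
```

The first query retrieves nothing relevant, the critique rejects it, and the reformulated query succeeds; in a real system both the critique and the rewrite would be LLM calls.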
Applications include a dynamic UI AI chat app by skirano on Replit, LangChain tools like DocTalk for voice PDF conversations, AI travel agent tutorials, and news summarization agents. Hyperbolic Labs offers competitive GPU rentals including H100, A100, and RTX 4090. LLMQuoter enhances RAG accuracy by identifying key quotes.
Infrastructure updates include MLX export for LLM inference from Python to C++ by fchollet and SemHash semantic text deduplication by philschmid.
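Semantic deduplication of the kind SemHash performs keeps a text only if nothing already kept is too similar in embedding space. A toy sketch using bag-of-words cosine similarity in place of a real embedding model (illustrative only, not SemHash's implementation):

```python
import math
from collections import Counter

def embed(text):
    """Stand-in 'embedding': a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def dedupe(texts, threshold=0.8):
    """Keep a text only if it is below the similarity threshold
    against every text kept so far."""
    kept = []
    for t in texts:
        e = embed(t)
        if all(cosine(e, embed(k)) < threshold for k in kept):
            kept.append(t)
    return kept

docs = ["the cat sat on the mat",
        "the cat sat on the mat",
        "gpu pricing update"]
unique = dedupe(docs)
```

A production system would use dense sentence embeddings and approximate nearest-neighbor search instead of this O(n²) loop.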
Skyfall
gemini-1.5-pro gemini-1.5-flash yi-1.5 kosmos-2.5 paligemma falcon-2 deepseek-v2 hunyuan-dit gemini-1.5 google-deepmind yi-ai microsoft hugging-face langchain maven multimodality mixture-of-experts transformer model-optimization long-context model-performance model-inference fine-tuning local-ai scaling-laws causal-models hallucination-detection model-distillation model-efficiency hamel-husain dan-becker clement-delangue philschmid osanseviero arankomatsuzaki jason-wei rohanpaul_ai
Between 5/17 and 5/20/2024, key AI updates include Google DeepMind's Gemini 1.5 Pro, a sparse multimodal MoE model with context up to 10M tokens, and Gemini 1.5 Flash, a dense Transformer decoder that is 3x faster and 10x cheaper. Yi AI released Yi-1.5 models with extended context windows of 32K and 16K tokens. Other notable releases include Kosmos 2.5 (Microsoft), PaliGemma (Google), Falcon 2, DeepSeek v2 lite, and the HunyuanDiT diffusion model. Research highlights feature an Observational Scaling Laws paper predicting model performance across families, a Layer-Condensed KV Cache technique boosting inference throughput by up to 26×, and the SUPRA method converting LLMs into RNNs for reduced compute costs. Hugging Face expanded local AI capabilities enabling on-device AI without cloud dependency. LangChain updated its v0.2 release with improved documentation. The community also welcomed a new LLM Finetuning Discord by Hamel Husain and Dan Becker for Maven course users. "Hugging Face is profitable, or close to profitable," enabling $10 million in free shared GPUs for developers.
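For context, the KV cache that techniques like Layer-Condensed KV Cache compress works by storing each past token's key/value vectors, so decoding step t attends over all t cached pairs instead of recomputing them. A bare-bones sketch of the plain, uncompressed cache (toy vectors; not the Layer-Condensed variant itself):

```python
import math

def attend(q, keys, values):
    """Single-query scaled dot-product attention over cached keys/values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
              for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    return [sum(wi * v[d] for wi, v in zip(w, values))
            for d in range(len(values[0]))]

class KVCache:
    """Append-only cache: each decoding step reuses all previous K/V
    pairs instead of recomputing them, trading memory for throughput."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

cache = KVCache()
out1 = cache.step([1.0, 0.0], [1.0, 0.0], [2.0, 2.0])  # attends over 1 pair
out2 = cache.step([0.0, 1.0], [0.0, 1.0], [4.0, 0.0])  # attends over 2 pairs
```

The cache's memory grows linearly with sequence length, which is exactly what condensation and compression techniques target.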
DeepSeek-V2 beats Mixtral 8x22B with >160 experts at HALF the cost
deepseek-v2 llama-3-120b llama-3-400b gpt-4 mistral phi claude gemini mai-1 med-gemini deepseek-ai mistral-ai microsoft openai scale-ai tesla nvidia google-deepmind mixture-of-experts multi-head-attention model-inference benchmarking overfitting robotics teleoperation open-source multimodality hallucination-detection fine-tuning medical-ai model-training erhartford maximelabonne bindureddy adcock_brett drjimfan clementdelangue omarsar0 rohanpaul_ai
DeepSeek V2 introduces a new state-of-the-art MoE model with 236B total parameters (21B active per token) and a novel Multi-Head Latent Attention mechanism that compresses the KV cache, achieving faster inference and surpassing GPT-4 on AlignBench. Llama 3 120B shows strong creative writing skills, while Microsoft is reportedly developing a 500B parameter LLM called MAI-1. Research from Scale AI highlights overfitting issues in models like Mistral and Phi, whereas GPT-4, Claude, Gemini, and Llama maintain benchmark robustness. In robotics, Tesla Optimus advances with superior data collection and teleoperation, LeRobot marks a move toward open-source robotics AI, and Nvidia's DrEureka automates robot skill training. Multimodal LLM hallucinations are surveyed with new mitigation strategies, and Google's Med-Gemini achieves SOTA on medical benchmarks with fine-tuned multimodal models.
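The "huge total parameters, cheap inference" property of MoE models like DeepSeek V2 and Mixtral comes from routing each token to only a few experts. A minimal sketch of standard top-k gating with toy scalar "experts" (DeepSeek's actual router and its MLA attention are considerably more involved):

```python
import math

def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their
    gate weights with a softmax, as in standard top-k MoE routing."""
    idx = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i],
                 reverse=True)[:k]
    m = max(gate_logits[i] for i in idx)
    exps = {i: math.exp(gate_logits[i] - m) for i in idx}
    z = sum(exps.values())
    return {i: exps[i] / z for i in idx}

def moe_layer(x, experts, gate_logits, k=2):
    """Only the routed experts run on the token -- the rest of the
    model's parameters stay idle for this input."""
    weights = top_k_route(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Toy experts operating on a scalar "hidden state".
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * 0.5]
gates = [0.1, 2.0, -1.0, 1.5]        # toy router scores for one token
y = moe_layer(10.0, experts, gates)  # blends experts 1 and 3 only
```

With k=2 of 4 experts active, per-token compute scales with the active parameters, not the total, which is how a 236B-parameter model can serve tokens at small-model cost.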
12/8/2023 - Mamba v Mistral v Hyena
mistral-8x7b-moe mamba-3b stripedhyena-7b claude-2.1 gemini gpt-4 dialogrpt-human-vs-machine cybertron-7b-v2-gguf falcon-180b mistral-ai togethercompute stanford anthropic google hugging-face mixture-of-experts attention-mechanisms prompt-engineering alignment image-training model-deployment gpu-requirements cpu-performance model-inference long-context model-evaluation open-source chatbots andrej-karpathy tri-dao maxwellandrews raddka
Three new AI models are highlighted: Mistral's 8x7B MoE model (Mixtral), Mamba models up to 3B by Together, and StripedHyena 7B, a competitive subquadratic attention model from Stanford's Hazy Research. Discussions on Anthropic's Claude 2.1 focus on its prompting technique and alignment challenges. The Gemini AI from Google is noted as potentially superior to GPT-4. The community also explores Dreambooth for image training and shares resources like the DialogRPT-human-vs-machine model on Hugging Face. Deployment challenges for large language models, including CPU performance and GPU requirements, are discussed with references to Falcon 180B and transformer batching techniques. User engagement includes meme sharing and humor.