Topic: "gpu-inference"
Mozilla's AI Second Act
llama-3 claude-3-opus gemini-1.5 deepseek-coder-v2 gpt-4 mozilla llamaindex anthropic etched-ai sohu deepseek openai vector-search inference-speed hardware-benchmarks context-windows open-source-models coding reasoning model-benchmarking gpu-inference agentic-ai justine-tunney stephen-hood tim-dettmers bindureddy
Mozilla showcased detailed live demos of llamafile and announced sqlite-vec for vector search integration at the AIE World's Fair. LlamaIndex launched llama-agents. Anthropic introduced new UI features and Projects for Claude with a 200K context window. Etched AI revealed a specialized inference chip claiming 500k tokens/sec, though its benchmark claims have been questioned. The Sohu chip is said to enable 15 agent trajectories/sec. Tim Dettmers shared theoretical GPU inference limits of ~300k tokens/sec for 8xB200 NVLink on 70B Llama. DeepSeek Coder V2 outperforms Gemini and GPT-4 variants in coding and reasoning. The PyTorch documentary launched to little attention.
Not much happened today
jamba-v0.1 command-r gpt-3.5-turbo openchat-3.5-0106 mixtral-8x7b mistral-7b midnight-miqu-70b-v1.0.q5_k_s cohere lightblue openai mistral-ai nvidia amd hugging-face ollama rag mixture-of-experts model-architecture model-analysis debate-persuasion hardware-performance gpu-inference cpu-comparison local-llm stable-diffusion ai-art-bias
RAGFlow, a deep document understanding RAG engine with 16.3k context length and natural language instruction support, was open sourced. Jamba v0.1, a 52B parameter MoE model by Lightblue, was released to mixed user feedback. Command-R from Cohere is available on the Ollama library. An analysis of the GPT-3.5-Turbo architecture reveals about 7 billion parameters and an embedding size of 4096, comparable to OpenChat-3.5-0106 and Mixtral-8x7B. AI chatbots, including GPT-4, outperform humans in debates on persuasion. Mistral-7B made amusing mistakes on a math riddle. Hardware highlights include a discounted HGX H100 640GB machine with 8 H100 GPUs bought for $58k, and CPU comparisons between the Epyc 9374F and Threadripper 1950X for LLM inference. GPU recommendations for local LLMs focus on VRAM and inference speed, with users testing a 4090 GPU and the Midnight-miqu-70b-v1.0.q5_k_s model. Stable Diffusion influences gaming habits, and AI art evaluation shows bias favoring human-labeled art.
12/24/2023: Dolphin Mixtral 8x7b is wild
dolphin glm3 chatglm3-ggml mistral-ai ollama google openai fine-tuning hardware-compatibility gpu-inference local-model-hosting model-integration rocm-integration performance-issues autogen linux model-training eric-hartford
Mistral models are recognized for being uncensored, and Eric Hartford's Dolphin series applies uncensoring fine-tunes to these models, gaining popularity on Discord and Reddit. The LM Studio Discord community discusses various topics, including hardware compatibility (especially GPU performance, where Nvidia is preferred), fine-tuning and training models, and troubleshooting issues with LM Studio's local model hosting capabilities. Integration efforts with GPT Pilot and a beta release for ROCm integration are underway. Users also explore the use of Autogen for group chat features and share resources like the Ollama NexusRaven library. Discussions highlight challenges with running LM Studio on different operating systems and model performance issues, as well as external tools such as Google Gemini and the compilation of ChatGLM3.