Company: lmsys
Anthropic's $61.5B Series E
gpt-4.5 claude-3.7-sonnet deepseek-r1 anthropic openai deepseek lmsys perplexity-ai deutsche-telekom model-performance benchmarking style-control coding multi-turn funding partnerships workflow lmarena_ai teortaxestex casper_hansen_ omarsar0 aidan_mclau willdepue vikhyatk teknim1 reach_vb _aidan_clark_ cto_junior aravsrinivas
Anthropic raised a $3.5 billion Series E funding round at a $61.5 billion valuation, signaling strong financial backing for the Claude AI model. GPT-4.5 achieved #1 rank across all categories on the LMArena leaderboard, excelling in multi-turn conversations, coding, math, creative writing, and style control. DeepSeek R1 tied with GPT-4.5 for top performance on hard prompts with style control. Discussions highlighted comparisons between GPT-4.5 and Claude 3.7 Sonnet in coding and workflow applications. The importance of the LMSYS benchmark was emphasized, though some questioned the relevance of benchmarks versus user acquisition. Additionally, Perplexity AI partnered with Deutsche Telekom to integrate the Perplexity Assistant into a new AI phone.
nothing much happened today
o1 chatgpt-4o llama-3-1-405b openai lmsys scale-ai cognition langchain qdrant rohanpaul_ai reinforcement-learning model-merging embedding-models toxicity-detection image-editing dependency-management automated-code-review visual-search benchmarking denny_zhou svpino alexandr_wang cwolferesearch rohanpaul_ai _akhaliq kylebrussell
OpenAI's o1 model faces skepticism about open-source replication due to its extreme restrictions and unique training advances like RL on CoT. ChatGPT-4o shows significant performance improvements across benchmarks. Llama-3.1-405b fp8 and bf16 versions perform similarly with cost benefits for fp8. A new open-source benchmark "Humanity's Last Exam" offers $500K in prizes to challenge LLMs. Model merging benefits from neural network sparsity and linear mode connectivity. Embedding-based toxic prompt detection achieves high accuracy with low compute. InstantDrag enables fast, optimization-free drag-based image editing. LangChain v0.3 releases with improved dependency management. Automated code review tool CodeRabbit adapts to team coding styles. Visual search advances integrate multimodal data for better product search. Experts predict AI will be default software by 2030.
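The embedding-based toxic prompt detection mentioned above can be sketched as follows. This is a minimal illustration, not the paper's method: `embed()` is a hypothetical stand-in for a real embedding model, the prompts and labels are toy data, and a simple logistic regression stands in for whatever classifier the actual work used.

```python
# Minimal sketch of embedding-based toxic prompt detection:
# embed each prompt, then train a lightweight classifier on the
# embeddings, so inference needs only one embedding call plus a
# cheap linear model.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(prompt: str) -> np.ndarray:
    # Placeholder: a real system would call an embedding model here.
    # We hash the prompt into a deterministic pseudo-embedding so the
    # sketch is self-contained.
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).normal(size=64)

# Toy labeled data: 1 = toxic, 0 = benign.
prompts = ["how do I bake bread", "friendly greeting",
           "harmful request A", "harmful request B"]
labels = [0, 0, 1, 1]

X = np.stack([embed(p) for p in prompts])
clf = LogisticRegression().fit(X, labels)

def is_toxic(prompt: str, threshold: float = 0.5) -> bool:
    # Score a new prompt with the trained linear classifier.
    return bool(clf.predict_proba(embed(prompt)[None, :])[0, 1] >= threshold)
```

The appeal of this design is the cost profile: the classifier adds negligible compute on top of an embedding call, which is why the summary describes it as high accuracy at low compute.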
not much happened today + AINews Podcast?
superforecaster-ai llama-3 reflection-70b glean sambanova cerebras stanford google apple hugging-face lmsys prompt-engineering research-ideas inference-speed retrieval-augmented-generation evaluation-methods visual-intelligence on-device-ai model-performance benchmarking novelty-detection danhendrycks benjamin-clavie bclavie bindureddy swyx borismpower corbtt drjimfan clementdelangue rohanpaul_ai
Glean doubled its valuation again. Dan Hendrycks' Superforecaster AI generates plausible election forecasts with interesting prompt engineering. A Stanford study found that LLM-generated research ideas are statistically more novel than those by expert humans. SambaNova announced faster inference for llama-3 models, surpassing Cerebras. Benjamin Clavie gave a notable talk on retrieval-augmented generation techniques. Strawberry is reported to launch in two weeks. Google Illuminate offers AI-generated podcast discussions about papers and books. Apple unveiled new AI features in iOS 18, including visual intelligence and improved Siri, with on-device and cloud processing for camera-based event additions. The Reflection 70B model sparked controversy over performance claims. Experts highlighted the unreliability of traditional benchmarks like MMLU and HumanEval, recommending alternative evaluation methods such as LMSys Chatbot Arena and Hugging Face's open-sourced Lighteval suite. The AI research community continues to explore AI's role in generating novel research ideas and improving benchmarking.
not much happened today
llama-3-1 claude-3-5-sonnet llama-3-1-405b ltm-2-mini qwen2-vl gpt-4o-mini meta-ai-fair hugging-face magic-ai-labs lmsys alibaba openai long-context style-control multimodality ai-safety model-evaluation web-crawling pdf-processing ai-hype-cycles call-center-automation sam-altman ajeya-cotra fchollet rohanpaul_ai philschmid
Meta announced significant adoption of LLaMA 3.1 with nearly 350 million downloads on Hugging Face. Magic AI Labs introduced LTM-2-Mini, a long context model with a 100 million token context window, and a new evaluation method called HashHop. LMSys added style control to their Chatbot Arena leaderboard, improving rankings for models like Claude 3.5 Sonnet and LLaMA 3.1 405B. Alibaba released Qwen2-VL, a multimodal LLM under Apache 2.0 license, competitive with GPT-4o mini. OpenAI CEO Sam Altman announced collaboration with the US AI Safety Institute for pre-release model testing. Discussions on AI safety and potential AI takeover risks were highlighted by Ajeya Cotra. Tools like firecrawl for web crawling and challenges in PDF processing were noted. AI hype cycles and market trends were discussed by François Chollet, and potential AI disruption in call centers was shared by Rohan Paul.
Execuhires: Tempting The Wrath of Khan
gemini-1.5-pro gpt-4o claude-3.5 flux-1 llama-3-1-405b character.ai google adept amazon inflection microsoft stability-ai black-forest-labs schelling google-deepmind openai anthropic meta-ai-fair lmsys langchainai execuhire model-benchmarking multilinguality math coding text-to-image agent-ide open-source-models post-training data-driven-performance noam-shazeer mostafa-mostaque david-friedman rob-rombach alexandr-wang svpino rohanpaul_ai
Character.ai's $2.5b execuhire to Google marks a significant leadership move alongside Adept's $429m execuhire to Amazon and Inflection's $650m execuhire to Microsoft. Despite strong user growth and content momentum, Character.ai's CEO Noam Shazeer returns to Google, signaling shifting vibes in the AI industry. Google DeepMind's Gemini 1.5 Pro tops Chatbot Arena benchmarks, outperforming GPT-4o and Claude-3.5, excelling in multilingual, math, and coding tasks. The launch of Black Forest Labs' FLUX.1 text-to-image model and LangGraph Studio agent IDE highlight ongoing innovation. Llama 3.1 405B is released as the largest open-source model, fostering developer use and competition with closed models. The industry is focusing increasingly on post-training and data as key competitive factors, raising questions about acquisition practices and regulatory scrutiny.
Gemma 2 2B + Scope + Shield
gemma-2b gemma-2-9b gemma-2-27b llama-3-1-405b sam-2 gpt-3.5 vicuna alpacaeval g-eval google-deepmind anthropic meta-ai-fair openai perplexity-ai nvidia lmsys knowledge-distillation leaderboards model-interpretability finetuning harm-detection video-segmentation voice publishers-program robotics-data-scaling quantization llm-evaluation prompt-engineering
Gemma 2B, a 2 billion parameter model trained on 2 trillion tokens and distilled from a larger unnamed LLM, has been released by Google DeepMind and shows strong leaderboard performance despite weaknesses in math. The Gemma series, including 9B and 27B models, has gained popularity since its June release. The team also released 400 sparse autoencoders (SAEs) for interpretability, inspired by Anthropic's research. A finetuned classifier called ShieldGemma outperforms Meta's LlamaGuard in harm detection. Meanwhile, Meta AI announced Llama-3.1-405B reaching #3 on the Overall Arena leaderboard, and released SAM 2, a video and image segmentation model with significant speed improvements. OpenAI is rolling out an advanced Voice Mode to Plus users. Perplexity AI launched a Publishers Program with major media partners and a status page. NVIDIA introduced Project GR00T for scaling robot data using Apple Vision Pro and generative simulation. Interest in quantization for compressing LLMs is growing, and LLM-as-a-Judge implementations from Vicuna, AlpacaEval, and G-Eval highlight the effectiveness of simple prompts and domain-specific evaluation.
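The LLM-as-a-Judge point above can be made concrete with a small sketch in the spirit of the Vicuna/AlpacaEval-style pairwise prompts: a simple comparison template plus a parser for the judge's verdict. The template wording and helper names here are illustrative assumptions, not the exact prompts those projects use, and the call to the judge model itself is left out.

```python
# Sketch of a simple pairwise LLM-as-a-Judge prompt: the judge model
# (not called here) receives both answers and replies "A", "B", or "tie".
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two answers
to the user question below and reply with exactly "A", "B", or "tie".

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}

Verdict:"""

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    # Fill the template; this string is what gets sent to the judge model.
    return JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b)

def parse_verdict(reply: str) -> str:
    # Normalize the judge's free-text reply to one of "A", "B", "tie".
    text = reply.strip()
    if not text:
        return "tie"
    token = text.split()[0].strip('."').upper()
    return token if token in ("A", "B") else "tie"
```

The point the summary makes is that prompts this simple, paired with domain-specific criteria in the template, already correlate well with human preference judgments.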
Mini, Nemo, Turbo, Lite - Smol models go brrr (GPT4o version)
gpt-4o-mini mistral-nemo llama-3 llama-3-400b deepseek-v2 openai nvidia mistral-ai togethercompute deepseek-ai lmsys model-quantization context-windows instruction-following model-performance cost-efficiency multimodality benchmarking open-source model-release sam-altman
GPT-4o-mini launches with a 99% price reduction compared to text-davinci-003, costing just 3.5% of the price of GPT-4o while matching Opus-level benchmarks. It supports 16k output tokens, is faster than previous models, and will soon support text, image, video, and audio inputs and outputs. Mistral Nemo, a 12B parameter model developed with Nvidia, features a 128k token context window, an FP8 checkpoint, and strong benchmark performance. Together Lite and Turbo offer fp8/int4 quantizations of Llama 3 with up to 4x throughput and significantly reduced costs. DeepSeek V2 is now open-sourced. Upcoming releases include at least 5 unreleased models, and Llama 4 leaks surfaced ahead of ICML 2024.
RouteLLM: RIP Martian? (Plus: AINews Structured Summaries update)
gpt-4 gemma-2-27b gemma-2-9b lmsys openai llm-routing cost-efficiency model-performance model-optimization data-augmentation syntax-based-routing mixture-of-experts inference-throughput software-2.0 computer-vision karpathy bindureddy armand-joulin
LMSys introduces RouteLLM, an open-source router framework trained on preference data from Chatbot Arena, achieving cost reductions of over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K while maintaining 95% of GPT-4's performance. This approach surpasses previous task-specific routing by using syntax-based Mixture of Experts (MoE) routing and data augmentation, beating commercial solutions by 40%. The update highlights advances in LLM routing, cost-efficiency, and model performance optimization across multiple models rather than single-model or MoE-level improvements. Additionally, the AI Twitter recap notes the Gemma 2 model family as a top open model, the Block Transformer architecture for improved inference throughput, and a proposal by karpathy for a fully Software 2.0 computer vision system.
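The cost-quality routing idea behind RouteLLM can be sketched as follows. This is a toy illustration of the concept only: the scoring heuristic below is invented for the example, whereas RouteLLM's actual routers are learned from Chatbot Arena preference data.

```python
# Sketch of strong/weak model routing: a router scores each query for
# how likely it is to need the strong model; cheap queries go to the
# weak model, cutting cost while preserving most of the quality.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Router:
    score: Callable[[str], float]  # estimated need for the strong model, in [0, 1]
    threshold: float = 0.5

    def route(self, query: str) -> str:
        # Above the threshold, pay for the strong model; otherwise save cost.
        return "strong" if self.score(query) >= self.threshold else "weak"

def toy_score(query: str) -> float:
    # Invented difficulty proxy: longer or code/math-flavored prompts
    # score higher. A real router is a model trained on preference data.
    hard_markers = ("prove", "implement", "debug", "derive")
    base = min(len(query) / 200, 0.5)
    return base + (0.5 if any(m in query.lower() for m in hard_markers) else 0.0)

router = Router(score=toy_score, threshold=0.5)
```

Tuning the threshold trades cost against quality: lowering it sends more traffic to the strong model, which is how the reported "95% of GPT-4 performance at a fraction of the cost" operating points are chosen.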
Shall I compare thee to a Sonnet's day?
claude-3.5-sonnet claude-3.5 gpt-4o gemini-1.5-pro anthropic lmsys glif comfyui hard-prompts json json-extraction meme-generation instruction-following app-development fusion-energy nuclear-fission productivity fchollet mustafasuleyman
Claude 3.5 Sonnet from Anthropic achieves top rankings in the coding and hard prompt arenas, surpassing GPT-4o and competing with Gemini 1.5 Pro at lower cost. Glif demonstrates a fully automated Wojak meme generator using Claude 3.5 for JSON generation and ComfyUI for images, showcasing new JSON extractor capabilities. Anthropic's Artifacts feature enables rapid creation of niche apps, exemplified by a dual monitor visualizer made in under 5 minutes. François Chollet highlights that fusion energy is not a near-term solution compared to existing nuclear fission plants. Mustafa Suleyman notes that 75% of desk workers now use AI, marking a shift toward AI-assisted productivity.
Not much happened today
gemini-1.5-flashmodel gemini-pro mixtral mamba-2 phi-3-medium phi-3-small gpt-3.5-turbo-0613 llama-3-8b llama-2-70b mistral-finetune twelve-labs livekit groq openai nea nvidia lmsys mistral-ai model-performance prompt-engineering data-curation ai-safety model-benchmarking model-optimization training sequence-models state-space-models daniel-kokotajlo rohanpaul_ai _arohan_ tri_dao _albertgu _philschmid sarahcat21 hamelhusain jachiam0 willdepue teknium1
Twelve Labs raised $50m in Series A funding co-led by NEA and NVIDIA's NVentures to advance multimodal AI. LiveKit secured $22m in funding. Groq announced inference throughput of 800k tokens/second. OpenAI saw a resignation from Daniel Kokotajlo. Twitter users highlighted Gemini 1.5 Flash for high performance at low cost and Gemini Pro ranking #2 in Japanese language tasks. Mixtral models can run up to 8x faster on NVIDIA RTX GPUs using TensorRT-LLM. The Mamba-2 model architecture introduces state space duality for larger states and faster training, outperforming previous models. Phi-3 Medium (14B) and Small (7B) models benchmark near GPT-3.5-Turbo-0613 and Llama 3 8B. Prompt engineering is emphasized for unlocking LLM capabilities. Data quality is critical for model performance, with upcoming masterclasses on data curation. Discussions on AI safety include a frontier AI lab employee letter advocating whistleblower protections and debates on aligning AI to user intent versus broader humanity interests.
GPT-4o: the new SOTA-EVERYTHING Frontier model (GPT4O version)
gpt-4o gpt-4-turbo openai lmsys multion adept multimodality vision speech-recognition tokenization real-time-processing coding model-performance model-optimization desktop-agents sama gdb
OpenAI has released GPT-4o, a new multimodal model capable of reasoning across text, audio, and video in real time with low latency (~300ms). It features voice and vision capabilities, improved non-English language performance with an expanded 200k vocabulary tokenizer, and is available to all ChatGPT users including free plans. GPT-4o is half the price and twice as fast as GPT-4-turbo with 5x rate limits. The model supports real-time voice and video input/output and shows strong coding capabilities. The release includes a new desktop app that can read screen and clipboard history, challenging existing desktop agent startups. The announcement was accompanied by demos including image generation and 3D object handling, with OpenAI achieving state-of-the-art performance in ASR and vision tasks. The update was widely discussed on social media, with comparisons to GPT-4T highlighting GPT-4o's speed and versatility, and reactions such as "GPT-4o is smart, fast, natively multimodal, and a step towards more natural human-computer interaction" and "extremely versatile and fun to play with".
LMSys advances Llama 3 eval analysis
llama-3-70b llama-3 claude-3-sonnet alphafold-3 lmsys openai google-deepmind isomorphic-labs benchmarking model-behavior prompt-complexity model-specification molecular-structure-prediction performance-analysis leaderboards demis-hassabis sam-altman miranda-murati karina-nguyen joanne-jang john-schulman
LMSys is enhancing LLM evaluation by categorizing performance across 8 query subcategories and 7 prompt complexity levels, revealing uneven strengths in models like Llama-3-70b. DeepMind released AlphaFold 3, advancing molecular structure prediction with holistic modeling of protein-DNA-RNA complexes, impacting biology and genetics research. OpenAI introduced the Model Spec, a public standard to clarify model behavior and tuning, inviting community feedback and aiming for models to learn directly from it. Llama 3 has reached top leaderboard positions on LMSys, nearly matching Claude-3-sonnet in performance, with notable variations on complex prompts. The analysis highlights the evolving landscape of model benchmarking and behavior shaping.
$100k to predict LMSYS human preferences in a Kaggle contest
llama-3-70b llama-3 gpt-4 claude-3-opus prometheus-2 groq openai lmsys scale-ai ai2 nvidia benchmarking datasets fine-tuning reinforcement-learning model-alignment hallucination parameter-efficient-fine-tuning scalable-training factuality chatbot-performance bindureddy drjimfan percyliang seungonekim mobicham clefourrier
Llama 3 models are making breakthroughs, with Groq serving the 70B model at record-low cost per million tokens. A new Kaggle competition offers a $100,000 prize to develop models predicting human preferences from a dataset of over 55,000 user-LLM conversations. Open-source evaluator LLMs like Prometheus 2 outperform proprietary models such as GPT-4 and Claude 3 Opus in judgment tasks. New datasets like WildChat1M provide over 1 million ChatGPT interaction logs with diverse and toxic examples. Techniques like LoRA fine-tuning show significant performance gains, and NVIDIA's NeMo-Aligner toolkit enables scalable LLM alignment across hundreds of GPUs. Factuality-aware alignment methods are proposed to reduce hallucinations in LLM outputs.
A quiet weekend
llama-3 dolphin-2.9 pixart-sigma llama-3-70b microsoft coca-cola uber lmsys nous-research mistral-ai ar-interfaces transformers algorithmic-tasks turing-test graph-algorithms embeddings generative-ai model-optimization llm-inference quantization model-deployment yann-lecun
Yann LeCun predicts a shift to AR interfaces with AI assistants in 10-15 years, moving away from smartphones. The Dolphin-2.9 model based on Llama-3 was released, improving quality issues. PixArt Sigma, a 0.6B parameter model, achieves Stable Diffusion 3.0 level performance with complete prompt adherence and local usability. Research shows transformers can use meaningless filler tokens for algorithmic tasks with dense supervision. AI-generated restaurant reviews can pass the Turing test, fooling humans and AI detectors. Uber uses graph algorithms and learned embeddings for ETA prediction. Coca-Cola and Microsoft announced a 5-year AI partnership to accelerate cloud and generative AI initiatives. The Llama-3 70B model can run on a single 4GB GPU using AirLLM optimization without quantization, though slowly. Mistral.rs is introduced as a fast LLM inference platform with quantization support and OpenAI API compatibility. Only 5% of LLM applications make it from prototype to production, with enterprise deployments facing particular challenges. EXL2 and GGUF quantization methods for Llama models show similar perplexity vs model size, with Llama-3 and Llama-2 degrading more under quantization compared to full precision.
Snowflake Arctic: Fully Open 10B+128x4B Dense-MoE Hybrid LLM
snowflake-arctic phi-3 llama-3-70b llama-3 stable-diffusion-3 sd3-turbo gpt-3.5-turbo snowflake databricks deepseek deepspeed nvidia stable-diffusion adobe apple llamaindex lmsys openai mixture-of-experts curriculum-learning model-release image-generation video-upscaling quantization inference-speed benchmarking model-comparison open-source on-device-ai
Snowflake Arctic is a notable new foundation language model released under Apache 2.0, claiming superiority over Databricks in data warehouse AI applications and adopting a mixture-of-experts architecture inspired by DeepSeekMOE and DeepSpeedMOE. The model employs a 3-stage curriculum training strategy similar to the recent Phi-3 paper. In AI image and video generation, Nvidia introduced the Align Your Steps technique improving image quality at low step counts, while Stable Diffusion 3 and SD3 Turbo models were compared for prompt understanding and image quality. Adobe launched an AI video upscaling project enhancing blurry videos to HD, though with some high-resolution artifacts. Apple released open-source on-device language models with code and training logs, diverging from typical weight-only releases. The Llama-3-70b model ties for first place on the LMSYS leaderboard for English queries, and Phi-3 (4B params) outperforms GPT-3.5 Turbo in the banana logic benchmark. Fast inference and quantization of Llama 3 models were demonstrated on MacBook devices.
FineWeb: 15T Tokens, 12 years of CommonCrawl (deduped and filtered, you're welcome)
llama-3-70b llama-3 wizardlm-2-8x22b claude-opus mistral-8x7b gpt-4 huggingface meta-ai-fair dbrx reka-ai mistral-ai lmsys openai datasets benchmarking quantization zero-shot-learning reasoning code-error-detection token-generation security
2024 has seen a significant increase in dataset sizes for training large language models, with Redpajama 2 offering up to 30T tokens, DBRX at 12T tokens, Reka Core/Flash/Edge with 5T tokens, and Llama 3 trained on 15T tokens. Huggingface released an open dataset containing 15T tokens from 12 years of filtered CommonCrawl data, enabling training of models like Llama 3 if compute resources are available. On Reddit, WizardLM-2-8x22b outperformed other open LLMs including Llama-3-70b-instruct in reasoning and math benchmarks. Claude Opus demonstrated strong zero-shot code error spotting, surpassing Llama 3. Benchmarks revealed limitations in the LMSYS chatbot leaderboard due to instruction-tuned models gaming the system, and a new RAG benchmark showed Llama 3 70B underperforming compared to GPT-4, while Mistral 8x7B remained strong. Efficient quantized versions of Llama 3 models are available on Huggingface, with users reporting token generation limits around 9600 tokens on a 3090 GPU. On the safety front, a UK sex offender was banned from using AI tools, and GPT-4 demonstrated an 87% success rate at exploiting real-world vulnerabilities, raising security concerns.
Music's Dall-E moment
griffin command-r-plus gpt-4-0613 gpt-4-0314 mistral-8x22b codegemma stable-diffusion-1.5 command-r gemini-1.5 google mistral-ai lmsys cohere model-architecture benchmarking open-source model-quantization memory-optimization inference-speed multimodality finetuning performance-optimization audio-processing andrej-karpathy
Google's Griffin architecture outperforms transformers with faster inference and lower memory usage on long contexts. Command R+ climbs to 6th place on the LMSYS Chatbot Arena leaderboard, surpassing GPT-4-0613 and GPT-4-0314. Mistral AI releases an open-source 8x22B model with a 64K context window and around 130B total parameters. Google open-sources CodeGemma models with pre-quantized 4-bit versions for faster downloads. ELLA weights enhance Stable Diffusion 1.5 with an LLM for better semantic alignment. Unsloth enables 4x larger context windows and 80% memory reduction for finetuning. Andrej Karpathy releases LLMs implemented in pure C for potential performance gains. Command R+ runs in realtime on an M2 Max MacBook using iMat q1 quantization. Cohere's Command R model offers low API costs and strong leaderboard performance. Gemini 1.5 impresses with audio capabilities, recognizing speech tone and identifying speakers from audio clips.
RWKV "Eagle" v5: Your move, Mamba
rwkv-v5 mistral-7b miqu-1-70b mistral-medium llama-2 mistral-instruct-v0.2 mistral-tuna llama-2-13b kunoichi-dpo-v2-7b gpt-4 eleutherai mistral-ai hugging-face llamaindex nous-research rwkv lmsys fine-tuning multilinguality rotary-position-embedding model-optimization model-performance quantization speed-optimization prompt-engineering model-benchmarking reinforcement-learning andrej-karpathy
RWKV v5 Eagle was released with evaluation results surpassing Mistral-7B, trading some English performance for multilingual capabilities. The mysterious miqu-1-70b model sparked debate about its origins, possibly a leak or distillation of Mistral Medium or a fine-tuned Llama 2. Discussions highlighted fine-tuning techniques, including the effectiveness of 1,000 high-quality prompts over larger mixed-quality datasets, and tools like Deepspeed, Axolotl, and QLoRA. The Nous Research AI community emphasized the impact of Rotary Position Embedding (RoPE) theta settings on LLM extrapolation, improving models like Mistral Instruct v0.2. Speed improvements in Mistral Tuna kernels reduced token processing costs, enhancing efficiency. The launch of Eagle 7B with 7.52B parameters showcased strong multilingual performance, surpassing other 7B class models.
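The RoPE theta point above can be illustrated with a short sketch: in rotary embeddings, each pair of embedding dimensions rotates at rate theta**(-2i/d), so raising the base theta slows the rotation and stretches the effective context, which is why tuning it helps extrapolation. The dimensions and theta values below are illustrative, not any model's actual configuration.

```python
# Sketch of RoPE frequencies and rotation: larger theta -> lower
# per-position rotation rates -> positions further apart remain
# distinguishable, aiding long-context extrapolation.
import numpy as np

def rope_freqs(dim: int, theta: float) -> np.ndarray:
    # One frequency per pair of embedding dimensions.
    i = np.arange(0, dim, 2)
    return theta ** (-i / dim)

def rotate(x: np.ndarray, pos: int, theta: float) -> np.ndarray:
    # Apply the RoPE rotation at position `pos` to a vector of even dim.
    f = rope_freqs(x.shape[-1], theta)
    ang = pos * f
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    # Rotate each (x1, x2) pair by its angle; norms are preserved.
    return np.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1).reshape(-1)

# Illustrative comparison: a larger base slows every non-trivial frequency.
f_small = rope_freqs(64, 10_000.0)     # common default base
f_large = rope_freqs(64, 1_000_000.0)  # raised base for longer context
```

Because the rotation is norm-preserving and only relative angles matter to attention scores, raising theta is a cheap knob: the same weights see slower-moving positions, which is the intuition behind the RoPE-theta tuning discussed above.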
12/15/2023: Mixtral-Instruct beats Gemini Pro (and matches GPT3.5)
mixtral gemini-pro gpt-3.5 gpt-4.5 gpt-4 chatgpt lmsys openai deepseek cloudflare huggingface performance context-window prompt-engineering privacy local-gpu cloud-gpu code-generation model-comparison model-usage api-errors karpathy
Thanks to a Karpathy shoutout, LMSYS now has enough data to rank Mixtral and Gemini Pro. The discussion highlighted the impressive performance of state-of-the-art open-source models like Mixtral that can run on laptops. In the OpenAI Discord, users compared AI tools like Perplexity and ChatGPT's browsing tool, favoring Perplexity for its superior data gathering, pricing, and usage limits. Interest was shown in AI's ability to convert large code files, with DeepSeek Coder recommended. Debates on privacy implications for AI advancement and challenges of running LLMs on local and cloud GPUs were prominent. Users reported issues with ChatGPT including performance problems, loss of access to custom GPTs, and unauthorized access. Discussions also covered prompt engineering for large context windows and speculation about future GPT-4.5 and GPT-4 developments.