All tags
Topic: "code-benchmarks"
Inflection-2.5 at 94% of GPT4, and Pi at 6m MAU
inflection-2.5 claude-3-sonnet claude-3-opus gpt-4 yi-9b mistral inflection anthropic perplexity-ai llamaindex mistral-ai langchain retrieval-augmented-generation benchmarking ocr structured-output video-retrieval knowledge-augmentation planning tool-use evaluation code-benchmarks math-benchmarks mustafa-suleyman amanda-askell jeremyphoward abacaj omarsar0
Mustafa Suleyman announced Inflection 2.5, which achieves more than 94% the average performance of GPT-4 despite using only 40% the training FLOPs. Pi's user base is growing about 10% weekly, with new features like realtime web search. The community noted similarities between Inflection 2.5 and Claude 3 Sonnet. Claude 3 Opus outperformed GPT-4 in a 1.5:1 vote and is now the default for Perplexity Pro users. Anthropic added experimental tool calling support for Claude 3 via LangChain. LlamaIndex released LlamaParse JSON Mode for structured PDF parsing and added video retrieval via VideoDB, enabling retrieval-augmented generation (RAG) pipelines. A paper proposed knowledge-augmented planning for LLM agents. New benchmarks like TinyBenchmarks and the Yi-9B model release show strong code and math performance, surpassing Mistral.