All tags
Topic: "ocr"
Cognition's DeepWiki, a free encyclopedia of all GitHub repos
o4-mini perception-encoder qwen-2.5-vl dia-1.6b grok-3 gemini-2.5-pro claude-3.7 gpt-4.1 cognition meta-ai-fair alibaba hugging-face openai perplexity-ai vllm vision text-to-speech reinforcement-learning ocr model-releases model-integration open-source frameworks chatbots model-selector silas-alberti mervenoyann reach_vb aravsrinivas vikparuchuri lioronai
Silas Alberti of Cognition announced DeepWiki, a free encyclopedia of all GitHub repos providing Wikipedia-like descriptions and Devin-backed chatbots for public repos. Meta released Perception Encoders (PE) with A2.0 license, outperforming InternVL3 and Qwen2.5VL on vision tasks. Alibaba launched the Qwen Chat App for iOS and Android. Hugging Face integrated the Dia 1.6B SoTA text-to-speech model via FAL. OpenAI expanded deep research usage with a lightweight version powered by o4-mini model, now available to free users. Perplexity AI updated their model selector with Grok 3 Beta, o4-mini, and support for models like gemini 2.5 pro, claude 3.7, and gpt-4.1. vLLM project introduced OpenRLHF framework for reinforcement learning with human feedback. Surya OCR alpha model supports 90+ languages and LaTeX. MegaParse open-source library was introduced for LLM-ready data formats.
not much happened today
o3 o4-mini gpt-5 sonnet-3.7 gemma-3 qwen-2.5-vl gemini-2.5-pro gemma-7b llama-3-1-405b openai deepseek anthropic google meta-ai-fair inference-scaling reward-modeling coding-models ocr model-preview rate-limiting model-pricing architectural-advantage benchmarking long-form-reasoning attention-mechanisms mixture-of-experts gpu-throughput sama akhaliq nearcyan fchollet reach_vb philschmid teortaxestex epochairesearch omarsar0
OpenAI announced that o3 and o4-mini models will be released soon, with GPT-5 expected in a few months, delayed for quality improvements and capacity planning. DeepSeek introduced Self-Principled Critique Tuning (SPCT) to enhance inference-time scalability for generalist reward models. Anthropic's Sonnet 3.7 remains a top coding model. Google's Gemma 3 is available on KerasHub, and Qwen 2.5 VL powers a new Apache 2.0 licensed OCR model. Gemini 2.5 Pro entered public preview with increased rate limits and pricing announced, becoming a preferred model for many tasks except image generation. Meta's architectural advantage and the FrontierMath benchmark challenge AI's long-form reasoning and worldview development. Research reveals LLMs focus attention on the first token as an "attention sink," preserving representation diversity, demonstrated in Gemma 7B and LLaMa 3.1 models. MegaScale-Infer offers efficient serving of large-scale Mixture-of-Experts models with up to 1.90x higher per-GPU throughput.
Cohere's Command A claims #3 open model spot (after DeepSeek and Gemma)
command-a mistral-ai-small-3.1 smoldocling qwen-2.5-vl cohere mistral-ai hugging-face context-windows multilinguality multimodality fine-tuning benchmarking ocr model-performance model-releases model-optimization aidangomez sophiamyang mervenoyann aidan_mclau reach_vb lateinteraction
Cohere's Command A model has solidified its position on the LMArena leaderboard, featuring an open-weight 111B parameter model with an unusually long 256K context window and competitive pricing. Mistral AI released the lightweight, multilingual, and multimodal Mistral AI Small 3.1 model, optimized for single RTX 4090 or Mac 32GB RAM setups, with strong performance on instruct and multimodal benchmarks. The new OCR model SmolDocling offers fast document reading with low VRAM usage, outperforming larger models like Qwen2.5VL. Discussions highlight the importance of system-level improvements over raw LLM advancements, and MCBench is recommended as a superior AI benchmark for evaluating model capabilities across code, aesthetics, and awareness.
DeepSeek's Open Source Stack
qwen-qwq-32b start character-3 gemini gemini-2.0 mercury-coder gpt-4.5 jamba-mini-1.6 gemini-2.0-flash gpt-4o-mini mistral-small-3 mistral-ocr deepseek pyspur hugging-face togethercompute hedra-labs google-deepmind deeplearningai openai ai21-labs mistral-ai fine-tuning benchmarking multimodality code-generation diffusion-models model-performance model-optimization ocr embedding-models context-windows runtime-limits _akhaliq lmarena_ai reach_vb danielhanchen _philschmid aidan_mclau vikhyatk jerryjliu0
DeepSeek's Open Source Week was summarized by PySpur, highlighting multiple interesting releases. The Qwen QwQ-32B model was fine-tuned into START, excelling in PhD-level science QA and math benchmarks. Character-3, an omnimodal AI video generation model by Hedra Labs and Together AI, enables realistic animated content creation. Google DeepMind introduced the Gemini embedding model with an 8k context window, ranking #1 on MMTEB, alongside the Gemini 2.0 Code Executor supporting Python libraries and auto-fix features. Inception Labs' Mercury Coder is a diffusion-based code generation model offering faster token processing. OpenAI released GPT-4.5, their largest model yet but with less reasoning ability than some competitors. AI21 Labs launched Jamba Mini 1.6, noted for superior output speed compared to Gemini 2.0 Flash, GPT-4o mini, and Mistral Small 3. A new dataset of 1.9M scanned pages was released for OCR benchmarking, with Mistral OCR showing competitive but not top-tier document parsing performance compared to LLM/LVM-powered methods. "Cracked engineers are all you need."
not much happened today
jamba-1.6 mistral-ocr qwq-32b o1 o3-mini instella llama-3-2-3b gemma-2-2b qwen-2-5-3b babel-9b babel-83b gpt-4o claude-3-7-sonnet ai21-labs mistral-ai alibaba openai amd anthropic hugging-face multimodality ocr multilinguality structured-output on-prem-deployment reasoning benchmarking api open-source model-training gpu-optimization prompt-engineering function-calling
AI21 Labs launched Jamba 1.6, touted as the best open model for private enterprise deployment, outperforming Cohere, Mistral, and Llama on benchmarks like Arena Hard. Mistral AI released a state-of-the-art multimodal OCR model with multilingual and structured output capabilities, available for on-prem deployment. Alibaba Qwen introduced QwQ-32B, an open-weight reasoning model with 32B parameters and cost-effective usage, showing competitive benchmark scores. OpenAI released o1 and o3-mini models with advanced API features including streaming and function calling. AMD unveiled Instella, open-source 3B parameter language models trained on AMD Instinct MI300X GPUs, competing with Llama-3.2-3B and others. Alibaba also released Babel, open multilingual LLMs performing comparably to GPT-4o. Anthropic launched Claude 3.7 Sonnet, enhancing reasoning and prompt engineering capabilities.
not much happened today
grok-3 deepseek-r1 siglip-2 o3-mini-high r1-1776 llamba-1b llamba-3b llamba-8b llama-3 alphamaze audiobox-aesthetics xai nvidia google-deepmind anthropic openai bytedance ollama meta-ai-fair benchmarking model-releases performance reasoning multimodality semantic-understanding ocr multilinguality model-distillation recurrent-neural-networks visual-reasoning audio-processing scaling01 iscienceluvr philschmid arankomatsuzaki reach_vb mervenoyann wightmanr lmarena_ai ollama akhaliq
Grok-3, a new family of LLMs from xAI using 200,000 Nvidia H100 GPUs for advanced reasoning, outperforms models from Google, Anthropic, and OpenAI on math, science, and coding benchmarks. DeepSeek-R1 from ByteDance Research achieves top accuracy on the challenging SuperGPQA dataset. SigLIP 2 from GoogleDeepMind improves semantic understanding and OCR with flexible resolutions and multilingual capabilities, available on HuggingFace. OpenAI's o3-mini-high ranks #1 in coding and math prompts. Perplexity's R1 1776, a post-trained version of DeepSeek R1, is available on Ollama. The Llamba family distills Llama-3.x into efficient recurrent models with higher throughput. AlphaMaze combines DeepSeek R1 with GRPO for visual reasoning on ARC-AGI puzzles. Audiobox Aesthetics from Meta AI offers unified quality assessment for audio. The community notes that Grok 3's compute increase yields only modest performance gains.
OLMo 2 - new SOTA Fully Open LLM
llama-3-1-8b olmo-2 qwen2-5-72b-instruct smolvlm tulu-3 ai2 huggingface intel reinforcement-learning quantization learning-rate-annealing ocr fine-tuning model-training vision
AI2 has updated OLMo-2 to roughly Llama 3.1 8B equivalent, training with 5T tokens and using learning rate annealing and new high-quality data (Dolmino). They credit Tülu 3 and its "Reinforcement Learning with Verifiable Rewards" approach. On Reddit, Qwen2.5-72B instruct model shows near lossless performance with AutoRound 4-bit quantization, available on HuggingFace in 4-bit and 2-bit versions, with discussions on MMLU benchmark and quantization-aware training. HuggingFace released SmolVLM, a 2B parameter vision-language model running efficiently on consumer GPUs, supporting fine-tuning on Google Colab and demonstrating strong OCR capabilities with adjustable resolution and quantization options.
Common Corpus: 2T Open Tokens with Provenance
qwen-2.5-coder claude-3.5-sonnet janusflow-1.3b ocronos-vintage pleais huggingface langchainai deepseek alibaba anthropic provenance ocr multilingual-datasets prompt-engineering multimodality image-generation code-generation quantization model-scaling inference-efficiency tim-dettmers tom-doerr omarsar0 swyx madiator reach_vb
Pleais via Huggingface released Common Corpus, the largest fully open multilingual dataset with over 2 trillion tokens including detailed provenance information. They also introduced OCRonos-Vintage, a 124M-parameter OCR correction model that efficiently fixes digitization errors on CPU and GPU, unlocking knowledge from PDFs. On AI tools, LangChainAI launched Prompt Canvas for collaborative prompt engineering, while DeepSeek released JanusFlow 1.3B, a unified multimodal LLM integrating autoregressive and rectified flow models for enhanced image understanding and generation. Alibaba Cloud announced Qwen2.5-Coder, a code-focused LLM with advanced coding capabilities, and Claude 3.5 Sonnet was highlighted for superior code generation. Discussions on quantization challenges and scaling laws for precision by Tim Dettmers and others emphasized the impact of low-precision training on model scalability and inference efficiency. "Scaling Laws for Precision" paper insights and alternative efficiency methods were also noted.
The AI Nobel Prize
claude-3.5-sonnet reka-flash got openai anthropic reka-ai zep artificial-neural-networks nobel-prize knowledge-graphs memory-layers real-time-voice-api vision fine-tuning prompt-caching multimodality function-calling ocr open-source single-sign-on software-testing ai-assisted-coding ai-ethics geoff-hinton john-hopfield philschmid alexalbert mervenoyann clementdelangue svpino bindureddy ylecun rohanpaul_ai
Geoff Hinton and John Hopfield won the Nobel Prize in Physics for their work on Artificial Neural Networks. The award citation spans 14 pages highlighting their contributions. Zep released a new community edition of their low-latency memory layer for AI agents, emphasizing knowledge graphs for memory. At OpenAI's DevDay, new features like real-time voice API, vision model fine-tuning, and prompt caching with a 50% discount on reused tokens were introduced. Anthropic's Claude 3.5 Sonnet was recognized as the best model currently. Reka AI Labs updated their Reka Flash model with enhanced multimodal and function calling capabilities. The GOT (Generic OCR Transformer) achieved 98.79% accuracy on OCR benchmarks. Discussions on open-source AI models highlighted their role in fostering competition and decentralization. Software development insights included the importance of Single Sign-On (SSO), thorough testing, and AI-assisted coding workflows. Ethical and societal topics covered critiques of tax policies and the appointment of France's first Minister of AI.
Pixtral 12B: Mistral beats Llama to Multimodality
pixtral-12b mistral-nemo-12b llama-3-1-70b llama-3-1-8b deeps-eek-v2-5 gpt-4-turbo llama-3-1 strawberry claude mistral-ai meta-ai-fair hugging-face arcee-ai deepseek-ai openai anthropic vision multimodality ocr benchmarking model-release model-architecture model-performance fine-tuning model-deployment reasoning code-generation api access-control reach_vb devendra_chapilot _philschmid rohanpaul_ai
Mistral AI released Pixtral 12B, an open-weights vision-language model with a Mistral Nemo 12B text backbone and a 400M vision adapter, featuring a large vocabulary of 131,072 tokens and support for 1024x1024 pixel images. This release notably beat Meta AI in launching an open multimodal model. At the Mistral AI Summit, architecture details and benchmark performances were shared, showing strong OCR and screen understanding capabilities. Additionally, Arcee AI announced SuperNova, a distilled Llama 3.1 70B & 8B model outperforming Meta's Llama 3.1 70B instruct on benchmarks. DeepSeek released DeepSeek-V2.5, scoring 89 on HumanEval, surpassing GPT-4-Turbo, Opus, and Llama 3.1 in coding tasks. OpenAI plans to release Strawberry as part of ChatGPT soon, though its capabilities are debated. Anthropic introduced Workspaces for managing multiple Claude deployments with enhanced access controls.
That GPT-4o Demo
gpt-4o gemma-2 meta-code-llama openai google-deepmind meta-ai-fair voice-generation ocr screen-sharing vision code-understanding model-customization efficiency textual-intelligence multimodal-agents sft distillation rlhf model-merging model-optimization safety romain-huet fchollet
Romain Huet demonstrated an unreleased version of GPT-4o on ChatGPT Desktop showcasing capabilities like low latency voice generation, whisper tone moderation, camera mode streaming video to GPT-4o, rapid OCR, screen sharing with ChatGPT for programming help, clipboard reading, and vision-based code conversation. OpenAI's four investment areas highlighted include textual intelligence, efficiency/cost, model customization, and multimodal agents. Google DeepMind released Gemma 2 models in 9B and 27B sizes trained on 8T and 13T tokens respectively, using SFT, distillation, RLHF, and model merging, optimized for TPUv5e with strong performance and safety measures. Meta AI announced the Meta LLM Compiler built on Meta Code Llama with enhanced code optimization and compiler features.
Zero to GPT in 1 Year
gpt-4-turbo claude-3-opus mixtral-8x22b zephyr-141b medical-mt5 openai anthropic mistral-ai langchain hugging-face fine-tuning multilinguality tool-integration transformers model-evaluation open-source-models multimodal-llms natural-language-processing ocr model-training vik-paruchuri sam-altman greg-brockman miranda-murati abacaj mbusigin akhaliq clementdelangue
GPT-4 Turbo reclaimed the top leaderboard spot with significant improvements in coding, multilingual, and English-only tasks, now rolled out in paid ChatGPT. Despite this, Claude Opus remains superior in creativity and intelligence. Mistral AI released powerful open-source models like Mixtral-8x22B and Zephyr 141B suited for fine-tuning. LangChain enhanced tool integration across models, and Hugging Face introduced Transformer.js for running transformers in browsers. Medical domain-focused Medical mT5 was shared as an open-source multilingual text-to-text model. The community also highlighted research on LLMs as regressors and shared practical advice on OCR/PDF data modeling from Vik Paruchuri's journey.
Inflection-2.5 at 94% of GPT4, and Pi at 6m MAU
inflection-2.5 claude-3-sonnet claude-3-opus gpt-4 yi-9b mistral inflection anthropic perplexity-ai llamaindex mistral-ai langchain retrieval-augmented-generation benchmarking ocr structured-output video-retrieval knowledge-augmentation planning tool-use evaluation code-benchmarks math-benchmarks mustafa-suleyman amanda-askell jeremyphoward abacaj omarsar0
Mustafa Suleyman announced Inflection 2.5, which achieves more than 94% the average performance of GPT-4 despite using only 40% the training FLOPs. Pi's user base is growing about 10% weekly, with new features like realtime web search. The community noted similarities between Inflection 2.5 and Claude 3 Sonnet. Claude 3 Opus outperformed GPT-4 in a 1.5:1 vote and is now the default for Perplexity Pro users. Anthropic added experimental tool calling support for Claude 3 via LangChain. LlamaIndex released LlamaParse JSON Mode for structured PDF parsing and added video retrieval via VideoDB, enabling retrieval-augmented generation (RAG) pipelines. A paper proposed knowledge-augmented planning for LLM agents. New benchmarks like TinyBenchmarks and the Yi-9B model release show strong code and math performance, surpassing Mistral.
Trust in GPTs at all time low
llama-3 mistral-medium llava-1.6 miquella-120b-gguf tinymodels miqumaid harmony-4x7b-bf16 smaug-34b-v0.1 openai hugging-face mistral-ai nous-research bittensor context-management fine-tuning model-merging quantization gpu-servers visual-reasoning ocr dataset-release incentive-structures nick-dobos manojbh teknium arthurmensch
Discord communities were analyzed with 21 guilds, 312 channels, and 8530 messages reviewed, saving an estimated 628 minutes of reading time. Discussions highlighted challenges with GPTs and the GPT store, including critiques of the knowledge files capability and context management issues. The CUDA MODE Discord was introduced for CUDA coding support. Key conversations in the TheBloke Discord covered Xeon GPU server cost-effectiveness, Llama3 and Mistral Medium model comparisons, LLaVA-1.6's visual reasoning and OCR capabilities, and the leaked Miqu 70B model. Technical topics included fine-tuning TinyLlama and MiquMaid+Euryale models, and model merging with examples like Harmony-4x7B-bf16 and Smaug-34B-v0.1. The Nous Research AI Discord discussed style influence in LLMs, quantization issues, Bittensor incentives for AI model improvements, and the identification of MIQU as Mistral Medium. The release of the Open Hermes 2.5 dataset on Hugging Face was also announced. "Discussions pointed towards the need for better context management in GPTs, contrasting with OpenAI's no-code approach."