Person: "mark_k"
GPT-Image-2
gpt-image-2 qwen3-1.7b codex openai hugging-face figma canva adobe nous-research image-generation multilingual-models model-integration benchmarking agent-infrastructure multi-process-systems fine-tuning scientific-reasoning healthcare-ai hierarchical-decomposition clementdelangue lewtun gdb nickaturley mark_k petergostev tekninum mayank_022
OpenAI launched GPT-Image-2, which improves image generation with better text rendering, layout fidelity, editing, multilingual support, and "thinking" capabilities. It can generate slides, infographics, diagrams, UI mockups, and QR codes, and integrates with tools such as Figma, Canva, Adobe Firefly, and Hermes Agent. Benchmarks show GPT-Image-2 leading image generation tasks with a +242 Elo advantage. Hugging Face released ml-intern, an open-source agent that automates post-training research loops, with significant reported gains on scientific reasoning and healthcare benchmarks. Hermes is evolving into a richer local/open agent platform with enhanced multi-process orchestration capabilities.
not much happened today
arc-agi-3 claude-code anthropic langchain arcprize primeintellect agentic-reasoning interactive-environments benchmarking efficiency-metrics zero-preparation-generalization agent-infrastructure trainable-agents classifier-approval fchollet mikeknoop scaling01 _rockt mark_k andykonwinski bradenjhancock jeremyphoward togelius bracesproul hwchase17 caspar_br _catwu
The ARC-AGI-3 benchmark, introduced by @arcprize and François Chollet, resets the frontier for general agentic reasoning: humans solve 100% of tasks versus under 1% for current models. It focuses on zero-preparation generalization and human-like learning efficiency. The scoring protocol sparked debate because its efficiency-based metric is harsher than those of prior ARC versions and other benchmarks such as NetHack. The community acknowledges that the benchmark highlights weaknesses of current LLM agents in interactive, sparse-feedback environments. Concurrently, agent infrastructure is advancing: LangChain launched Fleet shareable skills for reusable domain knowledge, and Anthropic revealed Claude Code auto mode, which uses classifier-mediated approval to balance autonomy against manual confirmation. Browser and coding agents are evolving into trainable systems beyond prompt wrappers, exemplified by the BrowserBase and Prime Intellect collaboration.