Topic: "coding-agents"
Gemini's AlphaEvolve agent uses Gemini 2.0 to find new Math and cuts Gemini cost 1% — without RL
gemini gpt-4.1 gpt-4o-mini o3 o4-mini google-deepmind openai algorithm-discovery coding-agents matrix-multiplication optimization reinforcement-learning model-weights training-efficiency safety-evaluations instruction-following coding-tasks model-releases _philschmid scott_swingle alex_dimakis henry jason_wei kevinweil michpokrass scaling01 gdb
DeepMind's AlphaEvolve, a 2025 update to AlphaTensor and FunSearch, is a Gemini-powered coding agent for algorithm discovery that designs faster matrix multiplication algorithms, solves open math problems, and improves data center and AI training efficiency. It achieves a 23% kernel speedup in Gemini training and surpasses the state of the art on 20% of the problems it was applied to, including improvements on the Minimum Overlap Problem and the kissing number problem. Unlike deep RL, it optimizes pieces of code rather than model weights. Meanwhile, OpenAI released GPT-4.1 in ChatGPT, specializing in coding and instruction following, with the faster GPT-4.1 mini replacing GPT-4o mini for all users. OpenAI also launched the Safety Evaluations Hub and the OpenAI to Z Challenge, which uses the o3, o4-mini, and GPT-4.1 models to discover archaeological sites. "Maybe midtrain + good search is all you need for AI for scientific innovation" - Jason Wei.
OpenAI o3, o4-mini, and Codex CLI
o3 o4-mini gemini-2.5-pro claude-3-sonnet chatgpt openai reinforcement-learning performance vision tool-use open-source coding-agents model-benchmarking multimodality scaling inference sama aidan_mclau markchen90 gdb aidan_clark_ kevinweil swyx polynoamial scaling01
OpenAI launched the o3 and o4-mini models, emphasizing improvements in reinforcement-learning scaling and overall efficiency; o4-mini is cheaper and better across the prioritized metrics. The models show enhanced vision and tool-use capabilities, though API access for these features is pending. The release includes Codex CLI, an open-source coding agent that integrates with these models to convert natural language into working code. The models are available to ChatGPT Plus, Pro, and Team users, and o3 is notably more expensive than Gemini 2.5 Pro. Performance benchmarks highlight the intelligence gains from scaling inference, with comparisons against models like Sonnet and Gemini. The launch has been well received despite some less favorable evaluation results.
not much happened today
phi-4 reinforce++ arc-agi-2 ai21-labs ollama langchain togethercompute groq reinforcement-learning ppo model-optimization memory-efficiency python-packages vision text-extraction frontend-code-generation workflow-automation coding-agents compute-cost-reduction ethical-ai agi-benchmarks scam-alerts sebastien-bubeck fchollet tom-doerr arohan_ bindureddy hwchase17 jonathanross321 clementdelangue vikhyatk
Sebastien Bubeck introduced REINFORCE++, which enhances classical REINFORCE with PPO-inspired techniques for 30% faster training. Phi-4, Microsoft's model, was released under the MIT License and is accessible via Ollama. François Chollet announced plans for ARC-AGI-2 and a next-generation AGI benchmark. LangChain launched 10 new integration packages to boost LLM application development. Tom Doerr introduced Ollama-OCR, a Python package for text extraction using vision language models. Arohan optimized Shampoo for memory efficiency, reducing usage from 20 to 6 bytes per parameter. Bindu Reddy showcased CodeLLM v1 for frontend code generation and highlighted LlamaIndex Workflows for academic summarization and slide generation. Hwchase17 collaborated with Together Compute to enhance WebDev Arena with complex coding agents for LLM coding evaluations. Jonathan Ross detailed Groq's mission to reduce compute costs by 1000x amid rising generative AI spending. Clement Delangue warned about scams involving false claims of association with AI21. Vikhyat K raised concerns about the ethical implications and trade-offs of AGI. Memes and humor included creative AI prompts and critiques of LLM behaviors.
ReALM: Reference Resolution As Language Modeling
flan-t5 gpt-4 apple openai hugging-face stability-ai reference-resolution finetuning quantization retrieval-augmented-generation open-source coding-agents podcast-generation image-generation ai-industry-trends takuto-takizawa
Apple is advancing in AI with a new approach called ReALM: Reference Resolution As Language Modeling, which improves understanding of ambiguous references using three contexts and finetunes a smaller FLAN-T5 model that outperforms GPT-4 on this task. In Reddit AI news, the open-source coding agent SWE-agent achieves 12.29% on the SWE-bench benchmark, and RAGFlow introduces a customizable retrieval-augmented generation engine. A new quantization method, QuaRot, enables efficient 4-bit inference. AI applications include a t-shirt design generator, podgenai for GPT-4-based podcast generation, and an open-source model from Hugging Face that runs without a GPU. Industry discussions focus on the impact of large language models on the AI field and on efforts to decentralize AI development. Takuto Takizawa joins Stability AI Japan as Head of Sales & Partnerships.