Company: "princeton"
not much happened today
claude-mythos opus-4.8 opus-4.7 gpt-5.5 gemini-3.1-pro gemini-3.5-flash claude-opus-4.7 anthropic sakana-ai meta-ai-fair princeton recursive-self-improvement benchmarking agent-evaluation long-horizon-tasks reliability reinforcement-learning sample-efficiency economically-meaningful-tasks agent-coherence anti-reward-hacking tooling rl-environments kimmonismus lechmazur teortaxestex hardmaru andrew_n_carr steverab pauliusztin_
Anthropic's Mythos/Opus cycle drew mixed reactions: praise for Claude Mythos's one-shot workflows and concern over Opus 4.8 benchmark regressions. Opus 4.7 showed strong performance on chemistry tasks, "making Claude a chemist." Sakana AI launched an RSI Lab focused on recursive self-improvement under compute constraints, establishing RSI as a formal research program. New benchmarks such as Agents' Last Exam (ALE) and SWE-Marathon test agents on long-horizon, economically meaningful tasks and reveal low pass rates and coherence problems. Princeton's ICML 2026 paper found that models such as GPT 5.5, Gemini 3.1 Pro / 3.5 Flash, and Claude Opus 4.7 still show no meaningful reliability improvements. Tooling trends favor RL-environment-style frameworks for agent evaluation, exemplified by Meta's OpenEnv.
not much happened today
kimi-linear kimi-delta-attention minimax-m2 looped-llms aardvark-gpt-5 moonshot-ai minimax bytedance princeton mila openai cursor cognition hkust long-context attention-mechanisms agentic-ai tool-use adaptive-compute coding-agents performance-optimization memory-optimization reinforcement-learning model-architecture kimi_moonshot scaling01 uniartisan omarsar0 aicodeking songlinyang4 iscienceluvr nrehiew_ gdb embeddedsec auchenberg simonw
Moonshot AI released Kimi Linear (KDA) with day-0 infrastructure support and strong long-context metrics, achieving up to 75% KV cache reduction and up to 6x decoding throughput. MiniMax M2 pivoted to full attention for multi-hop reasoning while maintaining strong agentic coding performance with 200k context and ~100 TPS. ByteDance, Princeton, and Mila introduced Looped LLMs showing efficiency gains comparable to larger transformers. OpenAI's Aardvark (GPT-5) entered private beta as an agentic security researcher for scalable vulnerability discovery. Cursor launched faster cloud coding agents, though transparency concerns arose over base-model provenance. Cognition released a public beta of Devin, a desktop/mobile tool-use agent. The community discussed advanced attention mechanisms and adaptive compute techniques.
We Solved Hallucinations
gpt-2 flashattention-3 lynx meta-ai-fair nvidia princeton colfax patronus-ai databricks mosaic-ai openai compute-hardware gpu-optimization flashattention llm-evaluation hallucination-detection vision benchmarking synthetic-data model-training karpathy tri_dao giffmana vikhyatk dbrxmosaicai
Reddit's URL structure caused link errors in AI-generated summaries, especially with NSFW content, affecting models like Claude and GPT-4. The team fixed this glitch while still using LLMs to summarize Reddit content. The cost of training GPT-2 has dropped dramatically to ~$672, thanks to H100 GPUs and software improvements like CUDA and FlashAttention. FlashAttention-3, developed collaboratively by Meta, NVIDIA, Princeton, and Colfax, was released, achieving up to 740 TFLOPS on H100 GPUs, with FP8 nearing 1.2 PFLOPS. Hopper GPUs enable major speedups through new hardware features. Recent research suggests that synthetic data may not improve vision tasks. The Avocado360 benchmark evaluates vision-language models' ability to detect avocados in images. Lynx, a hallucination detection model for LLMs, was introduced for real-world healthcare and fintech applications; it was trained by Patronus AI on Databricks Mosaic AI using Composer.