Person: "giffmana"
Apple exposes Foundation Models API and... no new Siri
chatgpt apple openai langchain llamaindex on-device-ai foundation-models reasoning reinforcement-learning voice translation software-automation agentic-workflows gdb scaling01 giffmana kevinweil
Apple released on-device foundation models for iOS developers, though its recent "The Illusion of Thinking" paper faced significant backlash over flawed methodology in evaluating LLM reasoning. OpenAI updated ChatGPT's Advanced Voice Mode with a more natural voice and improved translation, demonstrated by Greg Brockman. LangChain and LlamaIndex launched new AI agents and tools, including a SWE Agent for software automation and an Excel agent that uses reinforcement learning for data transformation. The AI community engaged in heated debate over the reasoning capabilities of LLMs, highlighting challenges in evaluation methods.
not much happened today
gemini-2.5-flash gemini-2.0-flash mistral-medium-3 llama-4-maverick claude-3.7-sonnet qwen3 pangu-ultra-moe deepseek-r1 o4-mini x-reasoner google-deepmind mistral-ai alibaba huawei openai microsoft deepseek model-performance reasoning cost-analysis reinforcement-learning chain-of-thought multilinguality code-search model-training vision model-integration giffmana artificialanlys teortaxestex akhaliq john__allard
Gemini 2.5 Flash shows a 12-point increase in the Artificial Analysis Intelligence Index but costs about 150x more than Gemini 2.0 Flash, due to 9x more expensive output tokens and 17x higher token usage during reasoning. Mistral Medium 3 competes with Llama 4 Maverick, Gemini 2.0 Flash, and Claude 3.7 Sonnet, offering better coding and math reasoning at a significantly lower price. Alibaba's Qwen3 family supports reasoning and multilingual tasks across 119 languages and includes a Web Dev tool for app building. Huawei's Pangu Ultra MoE matches DeepSeek R1 performance on Ascend NPUs, with new compute and upcoming V4 training. OpenAI's o4-mini now supports Reinforcement Fine-Tuning (RFT) using chain-of-thought reasoning. Microsoft's X-REASONER, post-trained on general-domain text, enables generalizable reasoning across modalities. Deep research integration with GitHub repos in ChatGPT enhances codebase search and reporting. The AI Engineer World's Fair offers an Early Bird discount for upcoming tickets.
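The roughly 150x figure quoted above follows from multiplying the two ratios in the summary. A minimal sketch of that arithmetic, assuming the cost is dominated by output tokens (input-token pricing is ignored here, and the function and variable names are illustrative only):

```python
# Ratios quoted in the summary above (Gemini 2.5 Flash vs. Gemini 2.0 Flash).
output_price_ratio = 9   # output tokens are ~9x more expensive
token_usage_ratio = 17   # ~17x more tokens are used during reasoning


def cost_multiplier(price_ratio, usage_ratio):
    """Combined cost ratio when both per-token price and token count scale.

    Assumption: total cost is approximated by output-token cost alone.
    """
    return price_ratio * usage_ratio


multiplier = cost_multiplier(output_price_ratio, token_usage_ratio)
print(f"approximate cost multiplier: {multiplier}x")
```

Under that assumption the product is 153, which the summary rounds to about 150x.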
not much happened today
gpt-4o deepseek-v3 claude-3.7-sonnet o3-mini gemini-2.5-pro openai deepseek anthropic google-deepmind togethercompute hypertecgroup coreweave cursor-ai windsurf-ai coding instruction-following image-generation policy-compliance long-context audio-processing video-processing gpu-clusters ai-infrastructure api-access sama kevinweil joannejang nrehiew_ giffmana _philschmid scaling01 saranormous
GPT-4o was praised for its improved coding, instruction following, and freedom, becoming the leading non-reasoning coding model surpassing DeepSeek V3 and Claude 3.7 Sonnet in coding benchmarks, though it still lags behind reasoning models like o3-mini. Concerns about policy compliance in image generation were noted, with efforts to improve adherence. Gemini 2.5 Pro was highlighted for its advanced audio and video understanding, long context capabilities, and integration with platforms like Cursor AI and Windsurf AI. AI infrastructure developments include a partnership between Together AI and Hypertec Group to deliver large-scale GPU clusters, and CoreWeave's IPO was celebrated for advancing AI infrastructure. GPU and TPU usage is expected to increase significantly. "GPT-4o's transparency and background generation feature" and "Gemini 2.5 Pro scored above 50% on Simple-Bench AI Explanation" were key highlights.
not much happened today
gpt-4o deepseek-v3-0324 gemini-2.5-pro gemini-3 claude-3.7-sonnet openai hugging-face sambanova google-cloud instruction-following image-generation content-filtering model-performance api coding model-deployment benchmarking model-release abacaj nrehiew_ sama joannejang giffmana lmarena_ai _philschmid
OpenAI announced the new GPT-4o model with enhanced instruction-following, complex problem-solving, and native image generation capabilities. The model shows improved performance in math, coding, and creativity, with features like transparent background image generation. Discussions around content filtering and policy for image generation emphasize balancing creative freedom and harm prevention. DeepSeek V3-0324 APIs, available on Hugging Face and powered by SambaNovaAI, are reported to outperform models like Gemini 2.0 Pro and Claude 3.7 Sonnet on benchmarks. Gemini 2.5 Pro is recommended for coding, and Gemma 3 can be deployed easily on Google Cloud Vertex AI via the new Model Garden SDK. The Gemma 3 Technical Report has been released on arXiv.
AI Engineer Summit Day 1
grok-3 o3-mini deepseek-r1 qwen-2.5-vl openai anthropic xai togethercompute alibaba sakana-ai benchmarking model-performance cuda model-training open-source debugging inference-speed batch-size reinforcement-learning aidan_mclau giffmana nrehiew_ teortaxestex epochairesearch andrew_n_carr borismpower yuhu_ai_
The AIE Summit in NYC highlighted key talks including Grace Isford's Trends Keynote, Neo4j/Pfizer's presentation, and OpenAI's first definition of Agents. Speakers announced $930 million in funding. On AI Twitter, discussions focused on Grok-3 and o3-mini models, with debates on performance and benchmarking, including Grok-3's record compute scale of 4e26 to 5e26 FLOP. The o3-mini model uncovered a critical CUDA kernel bug in Sakana AI's code. DeepSeek-R1 was promoted as an open-source alternative with notable training batch sizes. Additionally, Alibaba announced the Qwen 2.5-VL model release.
not much happened today
o1-full sora gpt-4.5 gpt-4 claude-3.5-sonnet llama-3-1-nemotron-51b llama-3-1 llama-3 nemotron-51b openai google-deepmind anthropic nvidia huggingface vision model-performance neural-architecture-search model-optimization multimodality model-release model-training reinforcement-learning image-generation lucas-beyer alexander-kolesnikov xiaohua-zhai aidan_mclau giffmana joannejang sama
OpenAI announced its "12 Days of OpenAI" event with daily livestreams and potential releases including the full o1 model, the Sora video model, and GPT-4.5. Google DeepMind released the GenCast weather model, capable of 15-day forecasts in 8 minutes using TPU chips, and launched Genie 2, a model that generates playable 3D worlds from single images. Leading vision researchers Lucas Beyer, Alexander Kolesnikov, and Xiaohua Zhai moved from DeepMind to OpenAI, which is opening a Zürich office. Criticism arose over OpenAI's strategy and model quality compared to Anthropic and Claude 3.5 Sonnet. On Reddit, a modified llama.cpp supports Nvidia's Llama-3_1-Nemotron-51B, which matches the performance of larger 70B models via NAS optimization.
not much happened today
claude-3.5-sonnet opencoder anthropic microsoft sambanova openai langchain llamaindex multi-agent-systems natural-language-interfaces batch-processing harmful-content-detection secret-management retrieval-augmented-generation error-analysis memory-management web-scraping autonomous-agents sophiamyang tom_doerr omarsar0 _akhaliq andrewyng giffmana
This week in AI news, Anthropic launched Claude 3.5 Sonnet, enabling desktop app control via natural language. Microsoft introduced Magentic-One, a multi-agent system built on the AutoGen framework. OpenCoder was unveiled as an AI-powered code cookbook for large language models. SambaNova is sponsoring a hackathon with prizes up to $5000 for building real-time AI agents. Sophiamyang announced new Batch and Moderation APIs with 50% lower cost and multi-dimensional harmful text detection. Open-source tools like Infisical for secret management, CrewAI for autonomous agent orchestration, and Crawlee for web scraping were released. Research highlights include SCIPE for error analysis in LLM chains, Context Refinement Agent for improved retrieval-augmented generation, and MemGPT for managing LLM memory. The week also saw a legal win for OpenAI in the RawStory copyright case, affirming that facts used in LLM training are not copyrightable.
not much happened today
llama-3-2-vision gpt-2 meta-ai-fair ollama amd llamaindex gemini gitpod togethercompute langchainai weights-biases stanfordnlp deeplearningai model-scaling neural-networks multi-gpu-support skip-connections transformers healthcare-ai automated-recruitment zero-trust-security small-language-models numerical-processing chain-of-thought optical-character-recognition multi-agent-systems agent-memory interactive-language-learning bindureddy fstichler stasbekman jxmnop omarsar0 giffmana rajammanabrolu
This week's AI news highlights Ollama 0.4, which supports Meta's Llama 3.2 Vision models (11B and 90B) for applications like handwriting recognition. Self-Consistency Preference Optimization (ScPO) was introduced to improve model consistency without human labels. Discussions covered model scaling, the resurgence of neural networks, and AMD's multi-GPU bandwidth challenges. The importance of skip connections in Transformers was emphasized. In healthcare, it was argued that less regulation plus AI could revolutionize disease treatment and aging. Tools like LlamaParse and Gemini aid automated resume insights. Gitpod Flex demonstrated a zero-trust architecture for secure development environments. Research includes surveys on Small Language Models (SLMs), number understanding in LLMs, and DTrOCR, which uses a GPT-2 decoder for OCR. Multi-agent systems in prediction markets were discussed by TogetherCompute and LangChainAI. Community events include a NeurIPS Happy Hour, NLP seminars, and courses on Agent Memory with LLMs as operating systems.
not much happened this weekend
o1-preview claude-3.5-sonnet 21b-flash-model openai meta-ai-fair reka langchainai entropix prompting-techniques finetuning entropy-based-sampling temporal-understanding native-audio tool-use instruction-chaining multimodality retrieval-augmented-generation synthetic-data-generation rnn parallel-training biologically-inspired-ai-safety text-to-video-generation video-editing lex-fridman imrat jjitsev giffmana _philschmid karpathy rasbt adcock_brett glennko rohanpaul_ai labenz
AI news from 10/4/2024 to 10/7/2024 highlights several developments: OpenAI's o1-preview shows strong performance on complex tasks but struggles with simpler ones, while Claude 3.5 Sonnet can match its reasoning through advanced prompting techniques. Meta introduced Movie Gen, a cutting-edge media foundation model for text-to-video generation and editing. Reka updated their 21B Flash Model with temporal video understanding, native audio, and tool use capabilities. Interest grows in "open o1" reproductions focusing on prompting and finetuning, with Entropix exploring entropy-based sampling. LangChainAI demonstrated a Retrieval Agent for complex Q&A, and synthetic data generation research surveyed 417 models. A resurgence in RNNs shows efficient parallel training making them competitive with Transformers. Biologically-inspired AI safety approaches were also noted. "A quiet weekend and air conditioning is all you need."
The DSPy Roadmap
dspy litel-lm gemini chatgpt-4o grok-2 hermes-3 databricks mit google openai x-ai nous-research astribot apple sakana-ai model-optimization fine-tuning optimizers interactive-optimization robotics autonomous-systems voice image-generation open-source-models scientific-research streaming caching omar-khattab giffmana
Omar Khattab announced that he is joining Databricks before starting his MIT professorship and outlined the roadmap for DSPy 2.5 and 3.0+, which focuses on improving core components like LMs, signatures, optimizers, and assertions, with features such as adopting LiteLLM to reduce code and enhance caching and streaming. The roadmap also includes developing more accurate, cost-effective optimizers, building tutorials, and enabling interactive optimization tracking. On AI Twitter, Google launched Gemini Live, a mobile conversational AI with 10 voices, alongside Pixel Buds Pro 2 with a custom Tensor A1 chip. OpenAI updated ChatGPT-4o, reclaiming the top spot on LMSYS Arena. xAI released Grok-2 in beta, achieving SOTA in image generation with FLUX.1. Nous Research released open-source Hermes 3 models in 8B, 70B, and 405B sizes, with the 405B model achieving SOTA. Robotics updates include Astribot's humanoid robot and Apple's tabletop robot with Siri voice commands. Sakana AI introduced "The AI Scientist," an autonomous AI research system.
We Solved Hallucinations
gpt-2 flashattention-3 lynx meta-ai-fair nvidia princeton colfax patronus-ai databricks mosaic-ai openai compute-hardware gpu-optimization flashattention llm-evaluation hallucination-detection vision benchmarking synthetic-data model-training karpathy tri_dao giffmana vikhyatk dbrxmosaicai
Reddit's URL structure causes link errors in AI-generated summaries, especially with NSFW content affecting models like Claude and GPT-4. The team fixed this glitch while still leveraging LLMs for summarizing Reddit content. GPT-2 training costs have dramatically dropped to ~$672 using H100 GPUs and software improvements like CUDA and FlashAttention. FlashAttention-3 was released, achieving up to 740 TFLOPS on H100 GPUs, with FP8 nearing 1.2 PFLOPS, developed collaboratively by Meta, NVIDIA, Princeton, and Colfax. Hopper GPUs enable major speedups with new hardware features. Synthetic data may not improve vision tasks, as shown in recent research. The Avocado360 benchmark evaluates vision-language models' ability to detect avocados in images. Lynx, a hallucination detection model for LLMs, was introduced for real-world healthcare and fintech applications, trained by Patronus AI on Databricks Mosaic AI using Composer.