not much happened today
qwen-o1 qvq claude-3.5-sonnet gpt-4o o3 o3-mini alibaba openai mit idsia llamaindex ollama vision benchmarking llm-calibration intentionality alignment-faking deliberative-alignment artificial-life gdpr-compliance contract-review-agent app-creation synthetic-data post-transformers smol-models agents bret-taylor
The Qwen team launched QVQ, a vision-enabled version of their experimental o1-style reasoning model QwQ, which benchmarks comparably to Claude 3.5 Sonnet. Discussions include Bret Taylor's insights on autonomous software development as distinct from the Copilot era. The Latent Space LIVE! talks cover the 2024 highlights in AI startups, vision, open models, post-transformers, synthetic data, smol models, and agents. Twitter recaps by Claude 3.5 Sonnet highlight proposals for benchmarks measuring LLM calibration and confidence in falsehoods, with QVQ outperforming GPT-4o and Claude 3.5 Sonnet. AI alignment debates focus on intentionality and critiques of alignment faking in models like Claude. Updates from OpenAI include the new o3 and o3-mini models and a deliberative alignment strategy. The ASAL project is a collaboration between MIT, OpenAI, and the Swiss AI Lab IDSIA to automate the discovery of artificial life. Personal stories reveal frustrations with USCIS green card denials despite high qualifications. New tools like GeminiCoder enable rapid app creation, and a contract review agent built with Reflex and LlamaIndex checks contracts for GDPR compliance. Holiday greetings and memes were also shared.
The DSPy Roadmap
dspy litellm gemini chatgpt-4o grok-2 hermes-3 databricks mit google openai x-ai nous-research astribot apple sakana-ai model-optimization fine-tuning optimizers interactive-optimization robotics autonomous-systems voice image-generation open-source-models scientific-research streaming caching omar-khattab giffmana
Omar Khattab announced he is joining Databricks ahead of his MIT professorship and outlined the roadmap for DSPy 2.5 and 3.0+, which focuses on improving core components like LMs, signatures, optimizers, and assertions, including adopting LiteLLM to reduce code and improve caching and streaming. The roadmap also covers developing more accurate, cost-effective optimizers, building tutorials, and enabling interactive optimization tracking. On AI Twitter, Google launched Gemini Live, a mobile conversational AI with 10 voices, alongside Pixel Buds Pro 2 with a custom Tensor A1 chip. OpenAI updated ChatGPT-4o, reclaiming the top spot on LMSYS Arena. xAI released Grok-2 in beta, achieving SOTA in image generation powered by FLUX.1. Nous Research released open-source Hermes 3 models in 8B, 70B, and 405B sizes, with the 405B model achieving SOTA. Robotics updates include Astribot's humanoid robot and Apple's tabletop robot with Siri voice commands. Sakana AI introduced "The AI Scientist," an autonomous AI research system.
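For context on what those core DSPy components look like in practice, below is a minimal sketch assuming the DSPy 2.5-era API (the LiteLLM-backed `dspy.LM` client, class-based signatures, modules, and the `BootstrapFewShot` optimizer); the model name, example data, and metric are illustrative and not taken from the roadmap itself.

```python
import dspy

# DSPy 2.5 routes LM calls through a LiteLLM-backed client; the model name is illustrative.
lm = dspy.LM("openai/gpt-4o-mini", cache=True)  # response caching is built into the client
dspy.configure(lm=lm)

# A signature declares a task's inputs and outputs; the docstring serves as the instruction.
class AnswerQuestion(dspy.Signature):
    """Answer the question concisely."""
    question = dspy.InputField()
    answer = dspy.OutputField()

# Modules (here, chain-of-thought prompting) turn a signature into a runnable program step.
qa = dspy.ChainOfThought(AnswerQuestion)
print(qa(question="What does DSPy use LiteLLM for?").answer)

# Optimizers compile a program against a small trainset and a metric,
# tuning prompts and demonstrations rather than model weights.
trainset = [dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question")]
metric = lambda gold, pred, trace=None: gold.answer.lower() in pred.answer.lower()
compiled_qa = dspy.BootstrapFewShot(metric=metric).compile(qa, trainset=trainset)
```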
Evals: The Next Generation
gpt-4 gpt-5 gpt-3.5 phi-3 mistral-7b llama-3 scale-ai mistral-ai reka-ai openai moderna sanctuary-ai microsoft mit meta-ai-fair benchmarking data-contamination multimodality fine-tuning ai-regulation ai-safety ai-weapons neural-networks model-architecture model-training model-performance robotics activation-functions long-context sam-altman jim-fan
Scale AI highlighted issues with data contamination in benchmarks like MMLU and GSM8K, proposing a new benchmark on which Mistral overfits while Phi-3 performs well. Reka released the VibeEval benchmark for multimodal models, addressing the limitations of multiple-choice benchmarks. Sam Altman of OpenAI described GPT-4 as "dumb" and hinted at GPT-5, with AI agents as a major breakthrough. Researchers jailbroke GPT-3.5 via fine-tuning. Global calls emerged to ban AI-powered weapons, with US officials urging human control over nuclear arms. Ukraine launched an AI consular avatar, while Moderna partnered with OpenAI on medical AI. Sanctuary AI and Microsoft are collaborating on AI for general-purpose robots. MIT introduced Kolmogorov-Arnold networks, which promise improved neural network efficiency. Meta AI is training Llama 3 models with over 400 billion parameters, featuring multimodality and longer context.