Topic: "cybersecurity"
not much happened today
gpt-5.5 codex thinking-machines openai anthropic multimodality real-time-interaction visual-proactivity deployment cybersecurity threat-modeling automation continuous-audio-video-text-processing security-models field-engineering enterprise-ai johnschulman2 soumithchintala chillee liliyu_lili rown kimmonismus giffmana swyx eliebakouch gdb sama therundownai lukolejnik matvelloso
Thinking Machines previewed new native interaction models built for full-duplex multimodal interaction: the models listen, speak, watch, think, search, and react concurrently in real time, marking a shift beyond turn-based AI. The approach emphasizes continuous audio, video, and text processing, with innovations such as visual proactivity and background tool use, and is implemented using SGLang. Meanwhile, OpenAI announced the OpenAI Deployment Company, a new unit with 150 Forward Deployed Engineers and a $4B initial investment to help enterprises deploy frontier models, signaling a move into the deployment layer of the AI economy. OpenAI also launched Daybreak, a security-focused initiative that integrates GPT-5.5 and Codex for cyber defense, threat modeling, and automated patching, with differentiated access tiers including GPT-5.5-Cyber. This contrasts with Anthropic's more restrictive approach to cyber capabilities, highlighting tensions between AI security strategies.
not much happened today
gpt-5.5 gpt-image-2 gpt-5.5-pro gpt-5.5-instant gpt-realtime-2 gpt-5.5-cyber codex zaya1-74b-preview zaya1-vl-8b qwen3-omni openai zyphra amd deepseek vllm_project model-release model-training mixture-of-experts inference model-optimization sandboxing alignment cybersecurity agent-runtime throughput quantization telemetry real-time-detection reach_vb dhh gdb patience_cave ithilgore cryps1s sama deredleritt3r
OpenAI rapidly expanded the GPT-5.5 family with multiple variants, including gpt-image-2, GPT-5.5 Pro, and GPT-5.5-Cyber, and received positive feedback for efficiency and usability. Codex evolved into a long-running agent runtime with a new /goal mechanism, achieving 61% success on ARC-AGI-3 games after extensive testing. OpenAI also introduced cybersecurity-focused models such as GPT-5.5-Cyber, targeting enterprise and government sectors. Meanwhile, Zyphra released the open model ZAYA1-74B-Preview, a 74B-parameter mixture-of-experts model trained on AMD hardware and licensed under Apache 2.0, alongside the vision-language model ZAYA1-VL-8B. Competition in inference infrastructure intensified, with vLLM updates improving throughput and latency, adding support for DeepSeek V4, and enhancing quantization and backends.
GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs
gpt-realtime-2 gpt-5.5 codex openai anthropic goodfireai scale-ai voice-models streaming-translation transcription benchmarking context-windows browser-automation cybersecurity interpretability neural-geometry manifolds ai-safety rlhf micahcarroll milesbrundage ryanpgreenblatt
OpenAI released GPT-Realtime-2, a voice model with GPT-5-class reasoning, tool use, interruption handling, and context windows of up to 128K tokens, achieving top scores on the Big Bench Audio and Conversational Dynamics benchmarks. OpenAI also launched a Chrome plugin for Codex that enables browser control and multitasking, and introduced GPT-5.5 with Trusted Access for Cyber for secure defensive workflows and red teaming. Anthropic introduced Natural Language Autoencoders, which render model activations as human-readable text to aid interpretability and debugging, while Goodfire proposed a neural geometry research agenda that treats manifolds as primitives for neural network behavior. Anthropic also announced The Anthropic Institute to advance research on AI safety and economic resilience.
not much happened today
gpt-5.5 claude-mythos-preview gpt-5.5-pro qwen3.6-27b hy3-preview grok-4.3 gemma-4-31b glm-5.1 deepseek-v4-flash openai anthropic x-ai tencent deepseek cybersecurity model-efficiency multimodality model-benchmarking agentic-ai model-cost-optimization context-windows model-performance open-weight-models software-integration security-updates sama scaling01 cryps1s polynoamial ajambrosino arix
OpenAI's GPT-5.5 achieves top-tier performance on long-horizon cyber tasks, matching or surpassing Claude Mythos Preview with a 71.4% pass rate and continuing to improve beyond 100M inference tokens. OpenAI also released an Advanced Account Security update for ChatGPT that enhances phishing resistance. The Codex update expands beyond coding to general computer tasks, improving speed by up to 42% and introducing role-based onboarding and app integrations. On cost, GPT-5.5 Pro shows a slight SOTA improvement on CritPt with ~60% lower cost and token use compared to GPT-5.4 Pro. Among open-weight models, Qwen3.6 27B leads the under-150B-parameter class with an Intelligence Index score of 46, featuring 262K context, native multimodal input, and efficient BF16 weights. Tencent's Hy3-preview (an MoE with 295B total and 21B active parameters) scores 42 on the Intelligence Index, with strong scientific reasoning on CritPt. xAI's Grok 4.3 shows sharp improvements on agentic benchmarks at reduced cost.
not much happened today
mythos anthropic openai langchain nous-research cybersecurity sandboxing reinforcement-learning agent-architecture memory-management model-deployment software-security evaluation-methods kimmonismus paul_cal gneubig kentonvarda boazbaraktcs ylecun deanwball hwchase17 vtrivedy10 sarahcat21 aijoey
Anthropic's Mythos and OpenAI's upcoming restricted cyber-capable models are central to recent discussions, with debates over their security realism and evaluation methods. LangChain's Deep Agents deploy introduces an open-memory, model-agnostic agent harness architecture that emphasizes open protocols and memory ownership. Sandboxes are gaining prominence as core infrastructure for reinforcement learning, with labs running up to 100K concurrent sandboxes and aiming for 1M. The Hermes Agent by Nous continues to gain traction with new integrations and features such as a web-based HUD and token cost tracking.
not much happened today
claude-opus-4.6 capybara glm-5.1 qwen-3.5-14b qwen-27b qwen3.5-35b anthropic google zhipu model-scaling coding academic-reasoning cybersecurity quantization local-inference model-benchmarking inference-optimization model-performance agent-products scaling01 yuchenj_uw kimmonismus m1astra dejavucoder iscienceluvr gaoj0017
Anthropic is reportedly introducing a new AI model tier called Capybara, which is larger and more intelligent than Claude Opus 4.6 and shows improved performance in coding, academic reasoning, and cybersecurity. The model is speculated to have around 10 trillion parameters, and Google is potentially funding Anthropic's data center expansion. Meanwhile, Zhipu released GLM-5.1, advancing open coding models and narrowing the gap with closed models. Local inference economics are improving, highlighted by efficient deployments of Qwen 3.5 14B, Qwen 27B, and Qwen3.5-35B models with quantization techniques like TurboQuant vLLM. However, TurboQuant's benchmarking claims face criticism from researchers. Overall, aggressive scaling, local model deployment, and agent products are all gaining traction.
not much happened today
gpt-5.3-codex claude-opus-4.6 openai anthropic cursor_ai github microsoft builder-tooling cybersecurity api-access model-rollout agentic-ai long-context serving-economics throughput-latency token-efficiency workflow-design sama pierceboggan kylebrussell natolambert omarsar0 sam_altman
OpenAI launched GPT-5.3-Codex with a Super Bowl ad emphasizing "You can just build things" as a product strategy, focusing on builder tooling over chat interfaces. The model is rolling out across Cursor, VS Code, and GitHub with phased API access and is flagged as their first "high cybersecurity capability" model. Sam Altman reported over 1M Codex app downloads in the first week and strong weekly user growth. Meanwhile, Anthropic's Claude Opus 4.6 is recognized as a leading "agentic generalist" model, topping text and code leaderboards but noted for high token usage. Discussions around serving economics and "fast mode" behavior highlight practical deployment considerations. Additionally, Recursive Language Models (RLMs) introduce a novel approach using a second programmatic context space to extend long-context capabilities.
not much happened today
claude-3-sonnet claude-3-opus gpt-5-codex grok-4-fast qwen-3-next gemini-2.5-pro sora-2-pro ray-3 kling-2.5 veo-3 modernvbert anthropic x-ai google google-labs openai arena epoch-ai mit luma akhaliq coding-agents cybersecurity api model-taxonomy model-ranking video-generation benchmarking multi-modal-generation retrieval image-text-retrieval finbarrtimbers gauravisnotme justinlin610 billpeeb apples_jimmy akhaliq
Anthropic announces a new CTO. Frontier coding agents see updates, with Claude Sonnet 4.5 showing strong cybersecurity performance and polished UX but trailing GPT-5 Codex in coding capability. xAI's Grok Code Fast claims higher edit success at lower cost. Google's Jules coding agent launches a programmable API with CI/CD integration. Qwen clarifies its model taxonomy and API tiers. Vision/LM Arena rankings show tight competition among Claude Sonnet 4.5, Claude Opus 4.1, Gemini 2.5 Pro, and OpenAI's latest models. In video generation, Sora 2 Pro leads App Store rankings with rapid iteration and a new creator ecosystem; early tests show it answers GPQA-style questions at 55% accuracy versus GPT-5's 72%. Video Arena adds new models such as Luma's Ray 3 and Kling 2.5 for benchmarking. Ovi, a Veo-3-like multimodal video+audio generation model, is released. Among retrieval models, ModernVBERT from MIT offers efficient image-text retrieval. Notable quotes: "Claude Sonnet 4.5 is basically the same as Opus 4.1 for coding" and "Jules is a programmable team member".
Ideogram 2 + Berkeley Function Calling Leaderboard V2
llama-3-70b gpt-4 phi-3.5 functionary-llama-3-70b llama-3 ideogram midjourney berkeley openai hugging-face microsoft meta-ai-fair baseten kai claude functionary function-calling benchmarking image-generation model-optimization vision multimodality model-performance fine-tuning context-windows cybersecurity code-analysis ai-assisted-development
Ideogram returns with a new image generation model featuring color palette control, a fully controllable API, and an iOS app, and has reached a milestone of 1 billion images created. Meanwhile, Midjourney released a Web UI but still lacks an API. In function calling, the Berkeley Function Calling Leaderboard (BFCL) updated to BFCL V2 • Live, adding 2,251 live, user-contributed function docs and queries to improve evaluation quality. GPT-4 leads the leaderboard, but the open-source Functionary Llama 3-70B finetune from Kai surpasses Claude. On model releases, Microsoft launched three Phi-3.5 models with impressive reasoning and context window capabilities, while Meta AI FAIR introduced UniBench, a unified benchmark suite for over 50 vision-language model tasks. Baseten improved Llama 3 inference speed by up to 122% using Medusa. A new cybersecurity benchmark, Cyberbench, featuring 40 CTF tasks, was released. Additionally, Codegen was introduced as a tool for programmatic codebase analysis and AI-assisted development. "Multiple functions > parallel functions" was highlighted as a key insight in function calling.