ThursdAI - Highest signal weekly AI news show
ThursdAI - The top AI news from the past week
April 16 - Codex uses your mac in the background, Opus 4.7 release not quite Mythos + 3 interviews
0:00
-1:59:14

April 16 - Codex uses your mac in the background, Opus 4.7 release not quite Mythos + 3 interviews

From Weights & Biases - 3 interviews, 3 Breaking news (Opus 4.7 and Codex Computer use) + a discussion about ZL Continuums among the cohosts. Let us catch you up!

Hey ya’ll, Alex here with your weekly AI news catch up.

It’s one of those Thursday’s where no matter how well I prep, the big AI labs are hell bent to show up before each other. Alibaba dropped Qwen 3.6 with Apache 2, confirming their commitment to Open Source, then Anthropic released Claude Opus 4.7 (not quite Mythos) and OpenAI followed with a huge Codex update that includes Computer Use among other things. The highlight of Computer User is the background usage, more on that below. This is all just from today!

Previously in the week we had 2 incredible 3D world generators, Lyra 2.0 from Nvidia and HYWorld 2 from Tencent, Windsurf dropping 2.0 version with Devin integration and Google releasing a Gemini TTS, with over 90+ languages support and incredible emotions range, and Baidu open sources Ernie Image, rivaling Nano Banana.

Today on the show we had 3 awesome guests, Theodor from Cognition joined to cover the new Windsurf, Kwindla is back on the show to talk about “the side project that escaped containment” Gradient-Bang, a multi agent, voice based space game and Trevor from Marimo joined to talk about pairing your agents with a Marimo notebook. Let’s dive in! 👇

ThursdAI - We’re over 16K on YT today, my goal is to get to parity with Substack, please subscribe.

Codex can now really use your computer: OpenAI updates Codex with CUA, Image Generation, Browser, SSH (X, Blog)

Codex from OpenAI has been the major focus inside OpenAI for a while now. We’ve reported previously that OpenAI is closing down SORA and other “side-quests” to focus, and that they will join Codex, ChatGPT and the Atlas browser into one “superapp” and today, it seems, that we’ve gotten an early glimpse of what that app will be.

The Codex team (which seems to be growing from day to day), have been on a TEAR feature wise lately, trying to beat Claude Code, and they pushed an update with a LOT of features and updates, among them a new memory system, internal browser and image generation.

The highlight for me though, was absolutely the polished computer use experience. Computer use is not new, Claude has a computer use feature flag, many others. Hell, we told you about computer use with Open Interpreter, back in Sep of 2023. But, this.... this feels different.

You see, OpenAI has quietly purchased a company called Software Apps Inc, that almost launched a macos AI companion a year ago called Sky. This team is obsessed with Mac, and somehow, they were able to build a magical experience, a huge part of which, is the fact that they are controlling the mac, in the background. This is like black magic stuff. You work on one document, Codex clicks buttons and does things in another, without interrupting you.

You may ask, Alex, why do you even care so much about computer use, when most of the work happens in the browser anyway, and Claude (and Codex) can control my browser anyway?

Well, true, but not ALL work is happening there, for example, file system integration. It’s notoriously big part of browser automation that fails, when you need to upload/download files. I’ve spent countless cycles trying to get this to work with OpenClaw, and this, just does it. This closes the loop between knowledge work in the browser (yes, this thing can use your browser) and the broader OS.

It’s so so polished, I truly recommend you try it. It’s as easy as @ tagging any app that you have running and asking Codex to do stuff there. Pro Tip: Enable fast mode for a much smoother experience.

Anthropic Opus 4.7 is here, not quite Mythos, 64.3% Swe-bench Pro, tuned for long running tasks (X, System Card)

Claude Opus 4.7 Benchmarks

What is there to say? Is this the model we expected from Anthropic after releasing the news about Claude Mythos last week? no. But hey, we’ll take it. I new Claude Opus, with a significantly improved multimodality capabilities, and a long horizon coding task improvements? For the same price?

Image

Well, not quite! Apparently, this model could be a “from scratch” trained model, given that the tokenizer (the thing that converts words into tokens for the LLM to understand) is a different one. It also uses 1.3x more tokens for the same tasks, which means, that the new and default model from Anthropic became effectively more expensive (A note they acknowledged by raising the usage limits, to an unknown amount in Anthropic subscription plans, but it’ll still be a token tax on the API use)

Image

How about performance? Well, hard to judge on Evals alone, but they are great. A huge jump in Swe-bench Pro, over 10% improvement, puts this model as the best out there, except Mythos. It’s also the best at real world knowledge via GPQA Diamond (except Mythos). Are you seeing a trend here? Anthropic released a preview of a model, but for the first time, it’s not their “absolute best” model, and in a weird move, they have compared it on Evals to an unreleased model (presumably 10x the size?)

As far as we’ve tested this, it gave an incredibly detailed response on the Mars question we constantly test on, both for me and Nisten, Opus 4.7 produced an incredibly detailed 3D rendered result, much better than out previous tries. I’ll be keeping an eye on this model and keep you guys up to date on what else we find. Vibe checks are .. it’s more expensive, long context is unclear but it’s a great vibe model.

Alibaba is back - Qwen 3.6 is Apache 2.0 35B with 3B active parameters (X, HF, Blog)

The coolest thing about this release is not the evals (though they claim to outperform the much denser Qwen 3.5-27B on multple benchmarks) is that Alibabab is putting models with open weights and an Apache 2.0 license!

We previouly reported on rumors from inside Alibaba, that a few internal restructuring caused many of us to doubt if they would commit to OSS, and they answered!

Another highlight for me in this model, is that Alibaba has an OpenClaw bench (that they are promising to release soon) and that this model does as well as the dense model and beating Gemma 4 by a wide margin on that task.

This model is also natively multimodal, with 262K context extensible to 1M via YaRN.

MiniMax M2.7 Open Weights - 230B MoE with only 10B active (X, HF)

Our friends at MiniMax finally dropped M2.7 in open weights (technically not fully Apache, commercial use requires their authorization, but free for research, personal, and coding agents). It’s a 230B parameter MoE with only 10B active parameters, and it’s matching GPT-5.3-Codex on SWE-Pro at 56.22%. On Terminal-Bench 2 it hits 57%. But the real story here, the part that made me stop scrolling, is the self-evolution piece.

They let an internal version of M2.7 run its own RL optimization loop for 100+ rounds with zero human intervention. The model analyzed its own failure trajectories, modified its own scaffold code, ran evals, and decided whether to keep or revert changes. It got a 30% performance improvement on internal metrics. The model improved itself.

Shoutout to the MiniMax team — longtime friends of the pod and they keep delivering (as they promised to release the weights for this one and they did)


This weeks buzz - news from Weights & Biases from CoreWeave

This week was a very big one in our corner of the AI world. Our parent company CoreWeave announced not one, not two but 3 major deals, including one with Anthropic, a renewed commitment from Meta and a renewal from Jane Street.

CoreWeave now serves 9 out of the top 10 AI model providers in the world. 🎉

Oh and a small plug, if you want to get tokens powered by the same infrastructure, our Coreweve Inference service is open and very cheap, and we’ve recently added Gemma 4 and GLM 5.1 both to our inference service.

This week on the pod, I’ve chatted with Trevor, founding engineer at Marimo Notebooks (also part of CW) about their recent highlight of pairing an AI agent with Marimo notebooks, they went quite viral on hacker news and I wanted to understand why. I understood why, it’s really cool. Check Trevor out on the pod starting around 01:05:00 timestamp.


Tools & Agentic Engineering

Windsurf 2.0 - Agent Command Center + Devin in the IDE - interview with Theodor Marcu (X, Blog)

The first big post-Cognition-acquisition move for Windsurf dropped this week, and I got to chat with Theodor Marcu from Cognition about it on the show. The headline: Windsurf 2.0 brings an Agent Command Center; think Kanban-style mission control for all your agents, plus native Devin integration baked right into the IDE, and Spaces (persistent project containers that group your agent sessions, PRs, files, and context).

The framing Theodor gave me: local agents are pair programmers bounded by your attention (they stop when you close the laptop), while cloud agents are independent hires. Windsurf 2.0 tries to unify both paradigms in one interface. You can plan locally with Cascade using the Socratic method — going back and forth, challenging assumptions, building up context — and then with one click, hand off execution to Devin which runs in its own cloud VM, opens PRs, runs tests, and even tests its own work using computer use on its own Linux desktop. You can close your laptop and it keeps shipping.

One reality check from the community: Devin is great but not cheap. One early tester burned $25 in credits for a 15-20 minute bug fix that produced “okay” results. Something to watch on the Max plan economics. Devin access is rolling out gradually to Windsurf users over 48 hours from launch.

Shoutout to Swyx that helped design the Spaces three months ago whilst at Cognition!

Warp terminal now supports any CLI agent with vertical tabs and mobile control (X, Blog)

This one is for the terminal enjoyers. Warp, which in my opinion is the best terminal experience out there, just shipped first-class support for any CLI agent — Claude Code, Codex, OpenCode, Gemini CLI, all running side by side in vertical tabs with live status indicators.

The killer feature here, and this solves what I think is the single worst part about using Claude Code, is notifications when agents need you. If you’ve used Claude Code you know the pain of constantly checking if it’s waiting for a permission or input. Warp notifies you. You step in, approve, go back to what you were doing. They also added integrated code review inside the terminal, a rich multimodal input editor, and — this is wild — remote control from mobile. Monitor and interact with your running CLI agents from your phone.

Voice & Audio

Gradient Bang - the first massively multiplayer LLM-driven game, interview with Kwindla (X, Play it)

Kwindla, co-CEO of Daily and maintainer of Pipecat, came on the show to talk about Gradient Bang, a game he described as “a side project that escaped containment.” He told me about this back in December, and folks, it’s finally live and it’s genuinely the first fully LLM-driven multiplayer game I’ve seen. It’s inspired by an old BBS door game called Trade Wars that Kwindla used to play as a baby programmer on a 386 DX, but reimagined so your ship’s computer is an LLM you can just… talk to.

You pilot a spaceship through a procedurally generated universe, but instead of clicking buttons, you talk to the thing, and say things like “take me to the nearest mega port and trade along the way” — and your ship AI delegates to sub-agents to actually do the work. You can run corporations, buy more ships, task them to do 5 exploration loops while you do trade runs. It’s Factorio-meets-Ender’s-Game-meets-voice-AI. I’ve been playing it, my ship is currently roaming the universe as we speak (with 0 credits as someone robbed me!)

What makes this technically fascinating is that it’s basically a production-grade stress test for multi-agent orchestration. Sub-agents with shared context, episodic memory across sessions, dynamic LLM-generated UIs (the React front-end is literally rendered from JSON thrown over by a UI agent LLM), and long-running contexts that go for weeks. The architecture is now shipping as a Pipecat library called Pipecat Sub-Agents. Tech stack: Deepgram for STT, GPT-4.1 for the voice agent, GPT-5.2 medium-thinking for task agents, and a dedicated benchmark called GB Benchmarks because tasking these agents is genuinely hard.

Fun detail: Kwindla’s rule for this project was to not write or read any code since November. His colleague John lasted about one day before he broke and started reading React. The Z/L Continuum claims another victim. Go play it, it’s free and fun: gradientbang.com.

Google launches Gemini 3.1 Flash TTS (X, Blog, Try it)

Google dropped a new TTS model this week and folks, it’s not quite the speed-of-light real-time conversational TTS we’re all dreaming of (it’s about 3 seconds time-to-first-token, so batch-mode only), but the controllability is wild. We’re talking inline audio tags — [laughs], [sighs], [gasp] — natural language scene direction, two distinct speakers per generation, 70+ languages with auto-detection, and you can switch emotion and pacing mid-sentence with natural language.

I tested it live on the show with a “shocked/whispering” tag combo asking “Who came to ThursdAI?” and it absolutely nailed it.

It hit 1,211 Elo on the Artificial Analysis TTS Arena, 4 points behind Inworld TTS 1.5 Max and ahead of ElevenLabs v3. Pricing is about $0.03 per 60 seconds of audio, roughly 4.7x cheaper than ElevenLabs v3.Kwindla’s take: this is part of the broader shift from traditional TTS architectures toward fully steerable, prompt-able speech models — which is great for expressive use cases but means you need to test heavily for hallucinations and word skipping.

AI Art, Video & 3D

Tencent HYWorld 2.0 and NVIDIA Lyra 2.0 - actual 3D worlds from one image

This week we got not one but two major single-image-to-3D-world open releases, and they’re genuinely different from the video world models (Genie 3, Cosmos) we’ve been covering.

Tencent HYWorld 2.0 takes a single image (or text, or video) and produces actual 3D Gaussian Splats, meshes, and point clouds that you can import directly into Unity, Unreal, Blender, or NVIDIA Isaac Sim. Not video. Real editable 3D assets. Their framing: “watch a video, then it’s gone” vs “build a world, keep it forever.” The WorldMirror 2.0 reconstruction model is a 1.2B parameter feed-forward model that predicts dense point clouds, depth, normals, camera params, and 3DGS in a single pass. All open source.

NVIDIA Lyra 2.0 (Apache 2.0) takes a single image and progressively generates an explorable 3D world as you navigate through it. The breakthrough here is solving two classic failure modes of generative world models: spatial forgetting (hallucinating new structures when you revisit an area) and temporal drifting (errors accumulating until the scene turns to mush). They solve both with per-frame 3D geometry retrieval and this elegant self-augmented training trick where they train the model on its own degraded outputs so it learns to correct drift. DMD distillation gets you 4-step inference. Apache 2.0, Hugging Face, code and weights.

Both of these together feel like the end of video-only world models as the state of the art. We’re going straight to editable, persistent, importable 3D worlds.

Baidu open-sources ERNIE-Image - 8B parameter text-to-image (HF)

Not to be outdone, Baidu dropped ERNIE-Image, an 8B parameter DiT that’s now #1 on GenEval among open-weight models (0.8856), beating Qwen-Image, FLUX.2-klein, and Z-Image. Built from scratch in 3 months. Runs on a 24GB consumer GPU, and someone already quantized it to NF4 so it runs under 10GB VRAM on an RTX 3060. The text rendering story is the headline — clean multilingual text rendering for posters, infographics, comics, the stuff every other model has been historically terrible at. There’s also a Turbo variant that does it in 8 inference steps.

The craziest AI video I’ve ever seen - “Pi Hard” (X)

You have to watch this AI video. It’s one of the crazier ones I ever saw, and I do reporting on AI for a living. I showed this to my Fiancee Darya, and she only asked me “is this AI” in the middle of it, after saying “yeah, let’s watch this 😂)

Closing thoughts

What a week. Opus 4.7 dropped live on the show, Codex is now controlling your mac in the background like black magic, Qwen gave us another Apache 2.0 banger, MiniMax shipped a self-evolving model, and we got two “image-to-actual-3D-world” open source releases on the same week. Oh and a shoe company is now an AI compute company.

The Z/L Continuum keeps shifting — I feel like every week I drift a little more toward L, especially after seeing Kwindla ship Gradient Bang without reading code since November. And every week the agents get better at babysitting themselves (Claude Code Routines, Windsurf’s Agent Command Center, Warp’s unified CLI agent UX, Codex’s computer use in the background), which means more FOMAT for all of us.

Thanks for reading, share this with a friend, and if you enjoyed this, drop a comment with what you want more or less of. Feedback keeps me going.

— Alex

TL;DR - ThursdAI, April 16, 2026

  • Hosts and Guests

  • Show Notes

    • Recap essay on the Z/L Continuum from AI Engineer Europe (Blog): should AI engineers still read code? Ryan Lopopolo says no, Mario Zechner says yes for critical paths, everyone in between has FOMAT.

    • Mario Zechner talk is finally live on AI Engineer youtube (Watch)

    • Super Gemma 4 26B Uncensored v2 by @songjunkr — trending on HF, 0/100 refusals, fixed tool calls (HF GGUF, HF MLX 4bit)

    • Gemma 4 21B REAP — 20% expert-pruned Gemma 4 26B MoE by 0xSero using Cerebras REAP (HF)

    • Parcae (Together AI + UCSD) — stable looped transformer architecture with scaling laws, matches 2x-sized transformer quality (Paper/blog)

    • Claude Desktop app — rewritten from scratch, completely new app

    • Gemma 4 on W&B Inference — reply on the announcement post with code Gem Drop for $20 in inference credits, also supports LoRA inference via link

  • Big CO LLMs + APIs

    • Anthropic launches Claude Opus 4.7 - 87.6% SWE-bench Verified, 64.3% SWE-bench Pro, 3x vision resolution, new xhigh effort level, /ultrareview in Claude Code, same pricing as 4.6 but new tokenizer uses ~1.0-1.35x more tokens (X, Blog)

    • OpenAI Codex major update: macOS background computer use, 90+ plugins, gpt-image-1.5 image generation, in-app browser, memory, self-scheduling automations, multi-terminal SSH (X, Blog)

    • CoreWeave signs deals with Anthropic (multibillion), Meta ($21B expansion, $35B+ total), and Jane Street ($6B cloud + $1B equity), now serves 9 of the top 10 AI providers

  • Open Source LLMs

    • Qwen 3.6-35B-A3B - Apache 2.0, 35B MoE with 3B active, 73.4% SWE-bench Verified, natively multimodal, 262K context extensible to 1M (X, HF, Blog)

    • MiniMax M2.7 open weights - 230B MoE with 10B active, 56.22% SWE-Pro matching GPT-5.3-Codex, self-evolved via 100+ rounds of autonomous RL (X, HF)

  • Tools & Agentic Engineering

    • Windsurf 2.0 with Agent Command Center and Devin integration - interview with Theodor Marcu (X, Blog)

    • Warp now supports any CLI agent with vertical tabs, notifications, code review, mobile remote control (X, Blog)

    • Claude Code Routines - cron, GitHub event, and API-triggered autonomous agents running on Anthropic’s cloud (Docs)

  • This Week’s Buzz - Weights & Biases / CoreWeave

    • Marimo Pair - drop Claude Code / Codex / OpenCode agents directly inside reactive Python notebooks - interview with Trevor Manz (Blog, GitHub)

    • Gemma 4 now live on W&B Inference on CoreWeave infrastructure, with LoRA inference support

  • Vision & Video

    • Craziest AI video of the year: Pi Hard / Neil deGrasse Tyson (X)

  • Voice & Audio

    • Gradient Bang - first massively multiplayer fully LLM-driven game, Pipecat sub-agents - interview with Kwindla (Play, GitHub)

    • Google Gemini 3.1 Flash TTS - 1,211 Elo on TTS Arena, inline audio tags, 70+ languages, ~$0.03/60s (Blog)

  • AI Art, Diffusion & 3D

    • Baidu ERNIE-Image - 8B DiT, #1 GenEval among open models, precise multilingual text rendering (HF)

    • Tencent HYWorld 2.0 - single image to editable 3D Gaussian Splats/meshes, Unity/Unreal/Isaac Sim ready (GitHub)

    • NVIDIA Lyra 2.0 - single image to explorable persistent 3D worlds, Apache 2.0 (Project, HF)

  • Other news

    • Unitree humanoid breaks 100m dash world record at ~10m/s (X)

    • Allbirds shoe company loses 99.5%, rebrands as “NewBird AI”, raises $50M to buy GPUs, stock up 600-800% (X)

Discussion about this episode

User's avatar

Ready for more?