Hola AI aficionados, it's yet another ThursdAI, and yet another week FULL of AI news, spanning open source LLMs, multimodal video and audio creation, and more!
Shiptember, as they call it, does seem to deliver; it was hard even for me to keep up with all the news, not to mention we had like 3-4 breaking news items during the show today!
This week was yet another Qwen-mas, with Alibaba absolutely dominating across open source, but also NVIDIA promising to invest up to $100 Billion into OpenAI.
So let's dive right in! As a reminder, all the show notes are posted at the end of the article for your convenience.
Table of Contents
This Week's Buzz: W&B Fully Connected is coming to London and Tokyo & Another hackathon in SF
Vision & Video: Wan 2.2 Animate, Kling 2.5, and Wan 4.5 preview
Open Source AI
This was a Qwen-and-friends week. I joked on stream that I should just count how many times "Alibaba" appears in our show notes. It's a lot.
Qwen3-VL Announcement (Qwen3-VL-235B-A22B-Thinking): (X, HF, Blog, Demo)
Qwen 3 launched earlier as a text-only family; the vision-enabled variant just arrived, and it's not timid. The "thinking" version is effectively a reasoner with eyes, built on a 235B-parameter backbone with around 22B active (their mixture-of-experts trick). What jumped out is the breadth of evaluation coverage: MMMU, video understanding (Video-MME, LVBench), 2D/3D grounding, doc VQA, chart/table reasoning, pages of it. They're showing wins against models like Gemini 2.5 Pro and GPT-5 on some of those reports, and doc VQA is flirting with "nearly solved" territory in their numbers.
Two caveats. First, whenever scores get that high on imperfect benchmarks, you should expect healthy skepticism; known label issues can inflate numbers. Second, the model is big. Incredible for server-side grounding and long-form reasoning with vision (they're talking about scaling context to 1M tokens for two-hour video and long PDFs), but not something you throw on a phone.
Still, if your workload smells like "reasoning + grounding + long context," Qwen 3 VL looks like one of the strongest open-weight choices right now.
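If you want to kick the tires on the doc VQA side yourself, a model this size is really a server-side thing, and most hosts expose these weights behind an OpenAI-compatible endpoint. Here's a minimal sketch of that pattern; the base URL, environment variables, and exact model id are placeholders for whatever provider you end up using, not official values:

```python
# Minimal sketch: ask a Qwen3-VL-style model a question about a document image
# via an OpenAI-compatible chat endpoint. The endpoint URL and model id below
# are assumptions, not official values -- check your provider's docs.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("VLM_BASE_URL", "https://example-provider.invalid/v1"),  # hypothetical
    api_key=os.environ["VLM_API_KEY"],
)

resp = client.chat.completions.create(
    model="qwen3-vl-235b-a22b-thinking",  # assumed model id, varies by host
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the invoice total, and which line item is the largest?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/invoice.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```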
Qwen3-Omni-30B-A3B: end-to-end SOTA omni-modal AI unifying text, image, audio, and video (HF, GitHub, Qwen Chat, Demo, API)
Omni is their end-to-end multimodal chat model that unites text, image, audio, and video, and crucially, it streams audio responses in real time while thinking separately in the background. Architecturally, it's a 30B MoE with around 3B active parameters at inference, which is the secret to why it feels snappy on consumer GPUs.
In practice, that means you can talk to Omni, have it see what you see, and get sub-250 ms replies in nine speaker languages while it quietly plans. It claims to understand 119 languages. When I pushed it in multilingual conversational settings it still code-switched unexpectedly (Chinese suddenly appeared mid-flow), and it occasionally suffered the classic "stuck in thought" behavior we've been seeing in agentic voice modes across labs. But the responsiveness is real, and the footprint is exciting for local speech streaming scenarios. I wouldn't replace a top-tier text reasoner with this for hard problems, yet being able to keep speech native is a real UX upgrade.
Qwen Image Edit, Qwen TTS Flash, and Qwen-Guard
Qwen's image stack got a handy upgrade with multi-image reference editing for more consistent edits across shots, useful for brand assets and style-tight workflows. TTS Flash (API-only for now) is their fast speech synth line, and Qwen-Guard is a new safety/moderation model from the same team. It's notable because Qwen hasn't really played in the moderation-model space before; historically Meta's Llama Guard led that conversation.
DeepSeek V3.1 Terminus: a surgical bugfix that matters for agents (X, HF)
The DeepSeek whale resurfaced to push a small 0.1 update to V3.1 that reads like a "quality and stability" release, but those matter if you're building on top. It fixes a code-switching bug (the "sudden Chinese" syndrome you'll also see in some Qwen variants), improves tool-use and browser execution, and, importantly, makes agentic flows less likely to overthink and stall. On the numbers, Humanity's Last Exam jumped from 15 to 21.7, while LiveCodeBench dipped slightly. That's the story here: they traded a few raw points on coding for more stable, less dithery behavior in end-to-end tasks. If you've invested in their tool harness, this may be a net win.
Liquid Nanos: small models that extract like they're big (X, HF)
Liquid Foundation Models released "Liquid Nanos," a set of open models from roughly 350M to 2.6B parameters, including "extract" variants that pull structure (JSON/XML/YAML) from messy documents. The pitch is cost-efficiency with surprisingly competitive performance on information extraction tasks versus models 10× their size. If you're doing at-scale doc ingestion on CPUs or small GPUs, these look worth a try.
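To give a flavor of what "extract" means in practice, here's a minimal sketch using the Hugging Face transformers text-generation pipeline on CPU. The model id, prompt format, and output handling are assumptions based on how small instruct models are typically served, so check the actual Liquid Nanos model cards before relying on it:

```python
# Minimal sketch: structured extraction with a small "extract" model on CPU.
# The model id below is an assumption -- look up the exact names in the
# Liquid Nanos collection on Hugging Face; the prompt format may also differ.
from transformers import pipeline

pipe = pipeline("text-generation", model="LiquidAI/LFM2-350M-Extract", device="cpu")  # assumed id

messy_doc = """Invoice #4821 -- ACME Corp
Date: 09/25/2025   Total due: $1,240.50
Contact: billing@acme.example"""

messages = [
    {"role": "system", "content": "Extract invoice_number, vendor, date, and total as a JSON object."},
    {"role": "user", "content": messy_doc},
]

out = pipe(messages, max_new_tokens=256)
# Recent transformers versions return the full chat; the last message is the model's reply.
print(out[0]["generated_text"][-1]["content"])
```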
Tiny IBM OCR model that blew up the charts (HF)
We also saw a tiny IBM model (about 250M parameters) for image-to-text document parsing trending on Hugging Face. Run in 8-bit, it squeezes into roughly 250 MB, which means Raspberry Pi and "toaster" deployments suddenly get decent OCR/transcription against scanned docs. It's the kind of tiny-but-useful release that tends to quietly power entire products.
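For the curious, here's roughly what an 8-bit load of a tiny vision-to-text model looks like with transformers and bitsandbytes. The model id and prompt wording are assumptions (the trending IBM model appears to be in the Granite Docling family, but verify against the model card), and bitsandbytes needs a GPU; on a Raspberry Pi you'd more likely reach for a GGUF/llama.cpp-style quant instead:

```python
# Minimal sketch: load a ~250M image-to-text document model in 8-bit.
# Model id and prompt are assumptions -- check the model card for the exact
# id, architecture class, and expected prompt/chat format.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig

model_id = "ibm-granite/granite-docling-258M"  # assumed id of the trending IBM model
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    device_map="auto",
)

image = Image.open("scanned_page.png")
messages = [{
    "role": "user",
    "content": [{"type": "image"}, {"type": "text", "text": "Convert this page to markdown."}],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=[image], text=prompt, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```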
Meta's 32B Code World Model (CWM) released for agentic code reasoning (X, HF)
Nisten got really excited about this one, and once he explained it, I understood why. Meta released a 32B code world model that doesn't just generate code - it understands code the way a compiler does. It's thinking about state, types, and the actual execution context of your entire codebase.
This isn't just another coding model - it's a fundamentally different approach that could change how all future coding models are built. Instead of treating code as fancy text completion, it's actually modeling the program from the ground up. If this works out, expect everyone to copy this approach.
Quick note, this one was released with a research license only!
Evals & Benchmarks: agents, deception, and code at scale
A big theme this week was "move beyond single-turn Q&A and test how these things behave in the wild," with a bunch of new evals released. I wanted to cover them all in a separate segment.
OpenAI's GDP Eval: "economically valuable tasks" as a bar (X, Blog)
OpenAI introduced GDP Eval to measure model performance against real-world, economically valuable work. The design is closer to how I think about "AGI as useful work": 44 occupations across nine sectors, with tasks judged against what an industry professional would produce.
Two details stood out. First, OpenAI's own models didn't top the chart in their published screenshot: Anthropic's Claude Opus 4.1 led with roughly a 47.6% win rate against human professionals, while GPT-5-high clocked in around 38%. Releasing a benchmark where you're not on top earns respect. Second, the tasks are legit. One example was a manufacturing engineer flow where the output required an overall design with an exploded view of components, the kind of deliverable a human would actually make.
What I like here isn't the precise percent; it's the direction. If we anchor progress to tasks an economy cares about, we move past "trivia with citations" and toward "did this thing actually help do the work?"
GAIA 2 (Meta Super Intelligence Labs + Hugging Face): agents that execute (X, HF)
MSL and HF refreshed GAIA, the agent benchmark, with a thousand new human-authored scenarios that test execution, search, ambiguity handling, temporal reasoning, and adaptability, plus a smartphone-like execution environment. GPT-5-high led across execution and search; Kimi's K2 was tops among open-weight entries. I like that GAIA 2 bakes in time and budget constraints and forces agents to chain steps, not just spew plans. We need more of these.
Scale AI's "SWE-Bench Pro" for coding in the large (HF)
Scale dropped a stronger coding benchmark focused on multi-file edits, 100+ line changes, and large dependency graphs. On the public set, GPT-5 (not Codex) and Claude Opus 4.1 took the top two slots; on a commercial set, Opus edged ahead. The broader takeaway: the action has clearly moved to test-time compute, persistent memory, and program-synthesis outer loops to get through larger codebases with fewer invalid edits. This aligns with what we're seeing across ARC-AGI and SWE-bench Verified.
The "Among Us" deception test (X)
One more that's fun but not frivolous: a group benchmarked models on the social deception game Among Us. OpenAI's latest systems reportedly did the best job both lying convincingly and detecting others' lies. This line of work matters because social inference and adversarial reasoning show up in real agent deployments: security, procurement, negotiations, even internal assistant safety.
Big Companies, Bigger Bets!
Nvidia's $100B pledge to OpenAI for 10GW of compute
Let's say that number again: one hundred billion dollars. Nvidia announced plans to invest up to $100B into OpenAI's infrastructure build-out, targeting roughly 10 gigawatts of compute and power. Jensen called it the biggest infrastructure project in history. Pair that with OpenAI's Stargate-related announcements (five new datacenters with Oracle and SoftBank, and a flagship site in Abilene, Texas) and you get to wild territory fast.
Internal notes circulating say OpenAI started the year around 230MW and could exit 2025 north of 2GW operational, while aiming at 20GW in the near term and a staggering 250GW by 2033. Even if those numbers shift, the directional picture is clear: the GPU supply and power curves are going vertical.
Two reactions. First, yes, the "infinite money loop" memes wrote themselves: OpenAI spends on Nvidia GPUs, Nvidia invests in OpenAI, the market adds another $100B to Nvidia's cap for good measure. But second, the underlying demand is real. If we need 1-8 GPUs per "full-time agent" and there are 3+ billion working adults, we are orders of magnitude away from compute saturation. The power story is the real constraint, and that's now being tackled in parallel.
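To make "orders of magnitude" concrete, here's a back-of-envelope sketch using only the numbers from the paragraph above plus one loudly labeled assumption (roughly 1 kW of facility power per deployed GPU, including cooling and overhead); tweak the inputs and the conclusion barely moves:

```python
# Back-of-envelope: how far is 10 GW from "an agent for every worker"?
# The ~1 kW per deployed GPU (including datacenter overhead) figure is an
# assumption for illustration, not a measured number.
GW = 1e9  # watts

pledged_power_w = 10 * GW                 # the Nvidia/OpenAI build-out target
watts_per_gpu = 1_000                     # assumed: GPU + cooling + overhead
gpus_from_pledge = pledged_power_w / watts_per_gpu   # ~10 million GPUs

workers = 3e9                             # 3+ billion working adults
demand_low = workers * 1                  # 1 GPU per agent  -> 3 billion GPUs
demand_high = workers * 8                 # 8 GPUs per agent -> 24 billion GPUs

print(f"GPUs powered by 10 GW: ~{gpus_from_pledge:,.0f}")
print(f"GPUs for one agent per worker: {demand_low:,.0f} to {demand_high:,.0f}")
print(f"Gap: ~{demand_low / gpus_from_pledge:,.0f}x to {demand_high / gpus_from_pledge:,.0f}x")
```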
OpenAI: ChatGPT Pulse: Proactive AI news cards for your day (X, OpenAI Blog)
In a #BreakingNews segment, we got an update from OpenAI: ChatGPT Pulse, which currently works only for Pro users but will come to everyone soon. It's a proactive AI that learns from your chats, email, and calendar, and will show you a new "feed" of interesting things every morning based on your likes and feedback!
Pulse marks OpenAI's first step toward an AI assistant that brings the right info before you ask, tuning itself with every thumbs-up, topic request, or app connection. I've tuned mine for today; we'll see what tomorrow brings!
P.S. - Huxe is a free app from the creators of NotebookLM (Raiza was on our podcast!) that does a similar thing, so if you don't have Pro, check out Huxe; they just launched!
xAI Grok 4 Fast - 2M context, 40% fewer thinking tokens, shockingly cheap (X, Blog)
xAI launched Grok 4 Fast, and the name fits. Think "top-left" on the speed-to-cost chart: up to 2 million tokens of context, a reported 40% reduction in reasoning token usage, and a price tag that's roughly 1% of some frontier models on common workloads. On LiveCodeBench, Grok 4 Fast even beat Grok 4 itself. It's not the most capable brain on earth, but as a high-throughput assistant that can fan out web searches and stitch answers in something close to real time, it's compelling.
Alibaba Qwen-Max and plans for scaling (X, Blog, API)
Back in the Alibaba camp, they also released their flagship API model, Qwen 3 Max, and showed off their future roadmap.
Qwen 3 Max is over 1T parameters, an MoE that scores 69.6 on SWE-bench Verified and outperforms GPT-5 on LMArena!
And their plan is simple: scale. They're planning to go from 1 million to 100 million token context windows and scale their models into the terabytes of parameters. It culminated in a hilarious moment on the show where we all put on sunglasses to salute a slide from their presentation that literally said, "Scaling is all you need." AGI is coming, and it looks like Alibaba is one of the labs determined to scale their way there. Their release schedule lately (as documented by Swyx from Latent.Space) is insane.
This Week's Buzz: W&B Fully Connected is coming to London and Tokyo & Another hackathon in SF
Weights & Biases (now part of the CoreWeave family) is bringing Fully Connected to London on Nov 4-5, with another event in Tokyo on Oct 31. If you're in Europe or Japan and want two days of dense talks and hands-on conversations with teams actually shipping agents, evals, and production ML, come hang out. Readers got a code on stream; if you need help getting a seat, ping me directly.
Links: fullyconnected.com
We are also opening up registrations for our second WeaveHacks hackathon in SF, October 11-12. Yours truly will be there; come hack with us on self-improving agents! Register HERE
Vision & Video: Wan 2.2 Animate, Kling 2.5, and Wan 4.5 preview
This is the most exciting space in AI week-to-week for me right now. The progress is visible. Literally.
Moondream-3 Preview - Interview with co-founders Vik & Jay
While I've already reported on Moondream-3 in last week's newsletter, this week we got the pleasure of hosting Vik Korrapati and Jay Allen, the co-founders of Moondream, to tell us all about it. Tune in for that conversation on the pod starting at 00:33:00.
Wan open sourced Wan 2.2 Animate (aka "Wan Animate"): motion transfer and lip sync
Tongyi's Wan team shipped an open-source release that the community quickly dubbed "Wanimate." It's a character-swap/motion transfer system: provide a single image for a character and a reference video (your own motion), and it maps your movement onto the character with surprisingly strong hair/cloth dynamics and lip sync. If you've used Runway's Act One, you'll recognize the vibe, except this is open, and the fidelity is rising fast.
The practical uses are broader than "make me a deepfake." Think onboarding presenters with perfect backgrounds, branded avatars that reliably say what you need, or precise action blocking without guessing at how an AI will move your subject. You act it; it follows.
Kling 2.5 Turbo: cinematic motion, cheaper and with audio
Kling quietly rolled out a 2.5 Turbo tier that's 30% cheaper and finally brings audio into the loop for more complete clips. Prompts adhere better, physics look more coherent (acrobatics stop breaking bones across frames), and the cinematic look has moved from "YouTube short" to "film-school final." They seeded access to creators and re-shared the strongest results; the consistency is the headline. (Source X: @StevieMac03)
I chatted with my kiddos today over FaceTime while they were building Minecraft creepers. I took a screenshot, sent it to Nano Banana to turn their creepers into actual Minecraft ones, and then animated the explosions with Kling. They LOVED it! The animations were clear, and while VEO refused to even let me upload their images, Kling didn't care haha
Wan 4.5 preview: native multimodality, 1080p 10s, and lip-synced speech
Wan also teased a 4.5 preview that unifies understanding and generation across text, image, video, and audio. The eye-catching bit: generate a 1080p, 10-second clip with synced speech from just a script. Or supply your own audio and have it lip-sync the shot. I ran my usual "interview a polar bear dressed like me" test and got one of the better results I've seen from any model. We're not at "dialogue scene" quality, but "talking character shot" is getting... good.
The audio generation (not just text + lip sync) is one of the best I've seen outside of VEO; it's really great to see how quickly this is improving. Sad that this one wasn't open sourced! And apparently it supports "draw text to animate" (Source: X)
Voice & Audio
Suno V5: we've entered the "I can't tell anymore" era
Suno calls V5 a redefinition of audio quality. I'll be honest, I'm at the edge of my subjective hearing on this. I've caught myself listening to Suno streams instead of Spotify and forgetting anything is synthetic. The vocals feel more human, the mixes cleaner, and the remastering path (including upgrading V4 tracks) is useful. The last 10% to "you fooled a producer" is going to be long, but the distance between V4 and V5 already makes me feel like I should re-cut our ThursdAI opener.
MiMI Audio: a small omni-chat demo that hints at the floor
We tried a MiMI Audio demo live, a 7B-ish model with speech in/out. It was responsive but stumbled on singing and natural prosody. I'm leaving it in here because it's a good reminder that the open floor for "real-time voice" is rising quickly even for small models. And the moment you pipe a stronger text brain behind a capable, native speech front-end, the UX leap is immediate.
Ok, another DENSE week that finishes up Shiptember: tons of open source, Qwen (Tongyi) shines, and video is getting so, so good. This is all converging folks, and honestly, I'm just happy to be along for the ride!
This week was also Rosh Hashanah, the Jewish new year, and I shared on the pod that I found my X post from 3 years ago, made with the state-of-the-art AI models of the time. WHAT A DIFFERENCE 3 years make; just take a look, I had to scale down the 4K one from this year just to fit it into the pic!
Shana Tova to everyone who's reading this, and we'll see you next week 🫡
ThursdAI - Sep 25, 2025 - TL;DR & Show notes
Hosts and Guests
Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
Co Hosts - @yampeleg @nisten @ldjconfirmed @ryancarson
Guest - Vik Korrapati (@vikhyatk) - Moondream
Open Source AI (LLMs, VLMs, Papers & more)