Playback speed
Share post
Share post at current time

📅 ThursdAI - Jan 24 - ⌛Diffusion Transformers,🧠 fMRI multimodality, Fuyu and Moondream1 VLMs, Google video generation & more AI news

From Weights & Biases, join me, Alex Volkov as we explore the incredible advancements in the world of AI this last week, and oh boy we had such a great show! This one is worth a listen folks!

What A SHOW folks, I almost don't want to write anything in the newsletter to MAKE you listen haha but I will I know many of you don't like listening to be babble.

But if you chose one episode to listen to instead of just skimming the show-notes, make it this one.

We've had 2 deep dives, one into the exciting world of multi-modalilty, we chatted with the creator of Moondream1, Vik and the co-founders of Prophetic, Wes and Eric about their EEG/fMRI multimodal transformer (that's right!) and then we had a DEEP dive into the new Hourglass Diffusion Transformers with Tanishq from MedArc/Stability.

More than 1300 tuned in to the live show 🔥 and I've got some incredible feedback on the fly, which I cherish so if you have friends who don't already know about ThursdAI, why not share this with them as well?

TL;DR of all topics covered:

  • Open Source LLMs

    • Stability AI releases StableLM 1.6B params (X, Blog, HF)

    • InternLM2-Math - SOTA on math LLMs (90% GPT4 perf.) (X, Demo, Github)

    • MedArc analysis for best open source use for medical research finds Qwen-72 the best open source doctor (X)

  • Big CO LLMs + APIs

    • Google teases LUMIERE - incredibly powerful video generation (TTV and ITV) (X, Blog, ArXiv)

    • 🤗 HuggingFace announces Google partnership (Announcement)

    • OpenAi 2 new embeddings models, tweaks turbo models and cuts costs (My analysis, Announcement)

    • Google to add 3 new AI features to Chrome (X, Blog)

  • Vision & Video

    • Adept Fuyu Heavy - Third in the world MultiModal while being 20x smaller than GPT4V, Gemini Ultra (X, Blog)

    • FireLLaVa - First LLaVa model with commercial permissive license from fireworks (X, Blog, HF, DEMO)

    • Vikhyatk releases Moondream1 - tiny 1.6B VLM trained on Phi 1 (X, Demo, HF)

  • This weeks's buzz 🐝🪄 - What I learned in WandB this week

    • New course announcement from Jason Liu & WandB - LLM Engineering: Structured Outputs (Course link)

  • Voice & Audio

  • AI Art & Diffusion & 3D

    • Instant ID - zero shot face transfer diffusion model (Demo)

    • 🔥 Hourglass Diffusion (HDiT) paper - High Resolution Image synthesis - (X, Blog, Paper, Github)

  • Tools & Others

    • Prophetic announces MORPHEUS-1, their EEG/fMRI multimodal ultrasonic transformer for Lucid Dream induction (Announcement)

    • NSF announces NAIRR with partnership from all major government agencies & labs including, OAI, WandB (Blog)

    • Runway adds multiple motion brushes for added creativity (X, How to)

Open Source LLMs

Stability releases StableLM 1.6B tiny LLM

Super super fast tiny model, I was able to run this in LMStudio that just released an update supporting it, punches above it's weight specifically on other languages like German/Spanish/French/Italian (beats Phi)

Has a very surprisingly decent MT-Bench score as well

License is not commercial per se, but a specific Stability AI membership

I was able to get above 120tok/sec with this model with LM-Studio and it was quite reasonable and honestly, it’s quite ridiculous how fast we’ve gotten to a point where we have an AI model that can weight less that 1GB and has this level of performance 🤯

Vision & Video & Multimodality

Tiny VLM Moonbeam1 (1.6B) performs really well (Demo)

New friend of the pod Vik Hyatk trained Moonbeam1, a tiny multimodal VLM with LLaVa on top of Phi 1 (not 2 cause.. issues) and while it's not commercially viable, it's really impressive in how fast and how quite good it is. Here's an example featuring two of my dear friends talking about startups, and you can see how impressive this TINY vision enabled model can understand this scene. This is not cherry picked, this is literally the first image I tried with and my first result.

The image features two men sitting in chairs, engaged in a conversation. One man is sitting on the left side of the image, while the other is on the right side. They are both looking at a laptop placed on a table in front of them. The laptop is open and displaying a presentation, possibly related to their discussion.

In the background, there is a TV mounted on the wall, and a cup can be seen placed on a surface nearby. The scene suggests a casual and collaborative environment where the two men are sharing ideas or discussing a topic.

Vik joined us on the pod to talk about why he didn't go with Phi-2, he also mentioned that Phi-1.5 was retroactively also MIT'd, it's license literally says MIT now on HF 👏 Great conversation, tune in for that at around 00:31:35

Adept is teasing FuYu Large - their CHONKY VLM


Adept previously released Persimmon, and then Fuyu VLM (which is a type of persimmon we see you adept) and now tease the release for Fuyu Heavy, a much bigger model that can compete or come close to GPT4V and GeminiUltra on MMMU and MMLU (text) while being 20x smaller approx.

While we don't yet get to play with this, they show some great promise in the benchmarks


⭐️ Performance: Excels at multimodal reasoning and matches/exceeds text-based benchmarks.
❗️ Challenges Faced: Dealt with issues related to image data, model stability, and pre-training data scarcity.
✅ Evaluations: Outperforms Gemini Pro on MMLU and MMMU benchmarks.
AI Summary by Arc Browser (haha see how I cheated here? I sometimes do shortcut summaries using Arc Max, it's dope, try it)

Fireworks AI releases FireLLaVa - with a commercially available license

 FireLLaVA is the first commercially permissive open-source LLaVA model, a type of multi-modality model called a Vision-Language Model (VLM) that can understand both visual and textual inputs.

  • The original LLaVA model was limited for commercial use as it was trained on data generated by GPT-4, which has non-commercial licenses. 

  • recreated the LLaVA training data using an open-source language model, CodeLlama 34B Instruct, to make a commercially viable version.-

  • FireLLaVA performs comparably to the original LLaVA model on benchmarks, showing open-source models can generate high-quality data for VLM training.

  • FireLLaVA is available via HuggingFace and through's prediction API, enabling new visual capabilities for applications.

Vik and I chatted about this, and while Fireworks didn't release datasets, they did release an example of how to start collecting them, and it's clear that everyone is clamoring after great vision / image datasets 👏

Really hoping that many great dataset for multimodal AIs will come out in 2024 giving us increasingly better multi modal LMMs 👏

Big CO LLMs + APIs (Blog)

GOOGLE announces LUMIERE video generation model that shows incredible push in consistency

Supports multiple tasks like image to video, text to video, video inpainting, Video stylezation and more, looks incredible. It seemed that they have cracked both spatial and temporal consistency, something that's severly lacking in previous video generation attempts, and makes character consistency quite remarkable. Of course, as with other google incredible papers, we never know if we'll ever see this model or be able to play with it, here's hoping 🤞

Google will add 3 new AI features to chrome

A contact form for an apartment rental. The initial draft says “im interested in this place - do you allow dogs?” Chrome’s “Help me write” feature provides suggested text.
  • Chrome is introducing 3 new experimental AI features to make browsing more efficient:

  • Tab Organizer: Chrome will automatically group similar tabs to help with multitasking

  • Custom themes: Users can generate unique browser themes using text prompts and AI image generation

  • Writing help: Chrome will offer suggestions to help users draft messages and posts on websites

- They are currently only available to US users who opt-in on the Experimental Features page 

I think this development is super super important because making AI accessible via the incredible Chrome platform to billions of people, is going to put Gemini in front of grandmas, students, everyone. Qutie impressive and the compute needed to pull something like this off is also quite mindboggling! 👏

Of course, they are not the first browser to add AI, I love the Arc Browser and it has AI previews that I use quite often!

This weeks Buzz (What I learned with Weights & Biases this week)

Have you like many of us have trouble getting structure output (JSON, other stuctures) from LLMS? Jason also had this problem, that's why he authored the Instructor Library, which makes it easy to guide the LLM to give structured output using Pydantic. Jason has presented at Ai Engineer conference, and recently collaborated with Weights & Biases to launch a free course in how to guide your LLM to give structured outputs!


Jason is also an independent consultant working with companies on their AI implementations and has many battle tested examples from implementations across the board, which he shared with us on the pod.

Give this short course a try if you haven't yet, it's really high quality content, in addition to tons of other stuff we have there, for free 👏

Voice & Audio

11Labs has a new overdub studio and it's really working well

Check out this short segment of myself, speaking in dubbed Russian! It’s really sounds like me, sent to my mom to see if she falls for it 😆 She didn’t

AI Art & Diffusion

Hourglass Diffusion Transformers

New high resolution diffusion architecture from K-diffusion and RoPE team (X, Blog, Paper, Github)


Paper presents a new method called HDiT ( HourGlass Diffusion Transformers) that shows promise in training models with high resolution images without incurring the significant hardware costs that go with scaling image sizes, replaces the latent diffusion models enabling O(n) complexity and scaling well.


Utilizing tricks and best practices for transformers architectures, like RoPe (that we've covered on ThursdAI before) cosine similarity self-attention, RMSNorm, GeGLU, etc. and using something called local self attention, this paper shows incredible promise for high resolution architectures for image creation tools.

We had the pleasure to host Tanishq Abraham, one of the co-authors (and CEO of MedArc, Director of research with Stability + PHD at 19) to walk us through the paper, explain the problem and the solution. Additionally, friend of the pod

is co-author as well 👏 and Alex Birch joined us silently from the audience 👂while giving commentary in the group chat.

P.S - All of these co-authors attribute the bulk of the work to Katherine Crowson from k-diffusion 👏

Tools & Others

Prophetic introduces Morpheus-1 - multimodal foundational model trained on fMRI and EEG signals

In a breaking news fashion, the folks behind Prophetic, a new startup that just announced MORPHEUS-1 as we were hopping into the space, came to chat with us.

They are working on a new multimodal ultrasound transformer! That's right, multimodaliy is not only about images/text folks, we've covered this before but these chads are actually trying this out, they have trained a transformer architecture to take EEG and fMRI signals and output directions for the ultrasound to activate areas of the brain to induce Lucid dreaming. And they are asking for beta testers!

It's all quite futuristic, and if you're in NY, reach out to them (and then let us know if you had Lucid dreams!)

Definitely worth a listen on the pod and check out their video announcement for mode details, was really quite an incredible conversation with Wes and Eric.

National Science Foundation launches NAIRR pilot (Blog)

Partnering with 10 other federal agencies as well as 25 private sector, nonprofit and philanthropic organizations, the NAIRR pilot will provide access to advanced computing, datasets, models, software, training and user support to U.S.-based researchers and educators

Basically, this is a huge governmental endeavor to provide resources about AI, make sure companies collaborate and keep AI accessible across the board and tons of government agencies as well as private sector companies have joined hands in this. Just look at this list, it's a veritable who & who of AI in US (notably, Tesla/X is missing)

And that’s all folks, that’s all she wrote (or I guess, I wrote) today! What an incredible show, really thankful for folks who came out, guests and co-hosts and see you next week!

If you scrolled all the way to here and want to show me that you did, your emoji of the week is 🍊 (only cause persimmons don’t have emojis) so DM or reply with this and share this pod with 1 friend or tag us on social media!

Full Transcription below:


[00:00:00] Alex Volkov: right, folks, it's time for the sound. Let's get it started today.

[00:00:11] Alex Volkov: Welcome, everyone. Welcome to

[00:00:13] Alex Volkov: this live recording of ThursdAI, the Twitter space, podcast, and newsletter that brings you. everything that happened. the AI world, every Thursday, literally almost every Thursday. My name is Alex Volkov, an AI evangelist with Weights Biases, and

[00:00:33] Alex Volkov: this is ThursdAI

[00:00:37] Recap & TL;DR

[00:00:37] Alex Volkov: Alright, recap, here we go. Taking a deep breath. We've talked about incredible amount of stuff here on Thursday AI for January 24th. We've talked about the areas of open source LLMs was very interesting. We've talked about stability AI, releasing a stable LLM, tiny version, 1. 6 billion parameters. That's really good at different languages, the European languages as well.

[00:00:58] Alex Volkov: And it's not commercially viable. For open source, but it is under the stability membership. So if you have that's a great model for you. We've talked about Intern LM2 for a state of the art on math LLMs. We briefly mentioned this, but it's getting 90 percent of GPT 4 performance on math, which is, was quite incredible.

[00:01:16] Alex Volkov: We also had the pleasure of Tanishq, Abraham to join us from MedArk for the analysis of open source models as it relates to the medical field. And it turns out that the model called Quen72 from Alibaba, Quen72 is the best open source doctor that we have achieving like incredible and beating even MedPalm1, which was back then by Google trained as one of the best medical LLMs.

[00:01:42] Alex Volkov: We also. were a very multi modal heavy space today like a lot we had the like we had the folks from Prometheus lab join us and talk about their multi modality which is not Trans, which is transformer based, but not LLM based so their multimodality is EEG signals and fMRI signals as they work on hyper focused ultrasound to induce a lucid dream state in your brain.

[00:02:11] Alex Volkov: Their multimodal model is basically taking inputs from EEG and outputs in, in the directions or where to focus this ultrasound is super cool. And I definitely advise you to listen to them. It wasn't planned. I just saw the post. I just commented, Hey, we're going to talk about this. They jumped on Prometheus looks like a cool multimodal attempt, nothing to do with vision, but also we talked about vision multimodality as well.

[00:02:34] Alex Volkov: So we've covered Adept the company who was founded by a few folks from the original Transformers paper and they have previously released. Per semen models. And then EU eight B was a multimodel that did not use a vision encoder like a different architecture. They released an announcement. They didn't release any code or weights or the way for us to try this yet, but they released something called Fool You Heavy, or they announced something called FU You Heavy, which is an extension of the previously released fool you eight B.

[00:03:00] Alex Volkov: Significantly more trained. And they talked about how difficult it is to train multimodal models and they claim to have a third. Place in the world after GPT 4 and Gemini Ultra on a bunch of the multi modal metrics and evaluations like MMU and MMLU. They also talked about the process, how difficult it is to train these models at scale.

[00:03:20] Alex Volkov: So cool from Adept and we're waiting for some ways to test this. We also talked about fire lava, which is, if you remember, we've talked about lava before multiple times. Lava is a Open source way to train models in multimodal and like Baklava from Focus on Stage here, Nissen and Farrell, and Obsidian from LDJ who's also on here and also Moondream.

[00:03:39] Alex Volkov: Like all of the things we've talked about are based on Lava. Lava was not commercially permissive licensed because of the data set. Fire Lava decided or released the first Lava model with commercial permissive license from Fireworks AI. And we also had it. Quite an interesting chat with Vic, who is the author of Moondream 1, which is a tiny 1.

[00:03:59] Alex Volkov: 6 billion parameter vision language model, also on top of Lava, that has Phi 1 as 1. 6 billion. The foundational kind of brain, the LLM brain in it the conversation with Wick was very interesting. So shout out Wick. Thanks for coming up. Specifically because he also mentioned that Phi 1 Microsoft, if you guys remember Phi 2 was MIT licensed back in December.

[00:04:20] Alex Volkov: It was a surprise to all of us. And apparently they went back and also changed the the License on Phi 1, which is super cool, and Vic told us that he saw this. So Moondream is a very capable, very tiny vision model that works quite well. Definitely worth listening to this conversation with Vic.

[00:04:36] Alex Volkov: We also announced in the This Week's Buzz category of ours, or segment of ours, about Everything Weights Biases, we announced a new course in our academy from Jason Liu, the author of the Instructor Library. And he has a course now that was released today called LLM Engineering Structural Outputs.

[00:04:54] Alex Volkov: And as Nissen , pointed out a bunch of the folks in open source are learning from these free YouTube videos and definitely worth checking out Weights Biases Academy because there's a bunch of knowledge there. And it's all for free and just join and just register. It's super, super cool. And then we had an incredible honor again of having one of the authors of this paper.

[00:05:12] Alex Volkov: As always, I love when we discuss stuff and the authors of the stuff come to chat with us. So we had Tanishq Abraham. But also we had Alex Birch in the audience listening to us while he was working and sending us DMs from the new paper called Hourglass Diffusion High Resolution Image Synthesis.

[00:05:30] Alex Volkov: And this paper will be in the show notes and Dinesh went through the kind of the in depth of the problem he tries to solve. And they. They talked about integrating transformers and diffusion models previously to separate areas and they haven't came up with the first one, but they definitely used a bunch of the techniques to optimize transformers into the diffusion world and create a pixel space, high resolution image synthesis, which is, shows great promise going forward.

[00:05:59] Alex Volkov: Incredibly insightful conversation from Tanishq, definitely worth a listen. We also covered in this area, we covered Instant ID, which is a one, one shot or zero shot face transition into diffusion models. So you can upload one picture of yourself and get quite incredible results in image diffusion.

[00:06:17] Alex Volkov: Or like generative images with your faces or your kid's faces, which is super cool. I haven't tried my cat. I don't know if it like works on cat's faces. I'll try it out. We covered a new, a state of the art. Automatic speech recognition system that beats Whisper or at least runs 30 times faster than Whisper on different tasks.

[00:06:36] Alex Volkov: We're going to add this to the show notes as well. And a little bit about deepfake audio with 11 labs have a dubbing studio released. And some conversation about whether or not or how it already affects politics. And then the last thing we've covered is the National Science Foundation, NSF, announces a new partnership from all major labs and government agencies around AI, and includes DOD and DOA, and includes OpenAI and Tropic, includes open source folks like Hug and Face, and MetaAI is also participating in this.

[00:07:11] Alex Volkov: And also Ways and Biases is part of that huge partnership, governmental partnership. So I think this is all the stuff that we've covered in this space.

[00:07:19] Show starts with house keeping and structure breakdown

[00:07:19] Alex Volkov: We have quite the show for you today, and as always there's no boring weeks in AI, is there? And some weeks start slow and then pick up, some weeks start Crazy from the get go. If you remember, there's one week where one Friday had a bunch of releases, and this week we had a very full week, full of very cool innovations, but also exciting stuff.

[00:07:47] Alex Volkov: And then we have some authors of those stuff here with us today, and we're gonna talk about a bunch of multimodality, which we've been talking about for a while. Obviously the space started with the multimodal GPT 4 and then we just kicked it into high gear. I think that it's time to get started with our default segment. So for those who are new to Thursday AI, we usually segment this to five or six segments, the biggest one being open source LLMs. And then we have big companies LLMs and API. So we usually cover the Google stuff and OpenAI stuff.

[00:08:18] Alex Volkov: Mistral has been here and there, been [00:08:20] in the open source, now is the big company as well. So depends on what they release, that's where Mistral stuff falls. And then we talk about vision and video, which is Basically, we'll recover the multimodality stuff and that section is going to be the, I think, the main one today.

[00:08:36] Alex Volkov: There's so much stuff. It's crazy. We also have tthis com this corner I call This Week's Buzz. I feel like I have to explain this. Maybe people don't get this dad joke that I put in there. Buzz, as in bees, right? So bees, Buzz. And Weights and Biases, the shorthand for Weights and Biases is WandB.

[00:08:54] Alex Volkov: Weights and Biases, W and B. And for a very funny reason, there's a mascot of ours that's a bee that's holding a wand, because it's WandB. And like this little joke has been Prevalent like in many places. I think I haven't explained it yet. And so this week's buzz is actually the corner about everything that I've learned with Weights Biases every week.

[00:09:13] Alex Volkov: And so this corner we're going to chat with Jason and announce some cool stuff. The next corner we have is voice and audio, which we usually have a bunch of stuff. We have VB from Hug Face usually join us. He's like the AI audio person over there. There's not a lot of voice and audio stuff.

[00:09:29] Alex Volkov: So I actually don't have anything voice and audio related in my notes. However if you guys know like very cool things that happened. This week with voice and audio, please let me know, we're going to talk about them. We're going to move to AI art and diffusion in the next segment. We're going to talk about some cool things there.

[00:09:45] Alex Volkov: And then the last segment is like a free for all, it's tools and others. So I usually put agents in there. I usually put like super cool things. So I have two, two, two exciting things to talk about there. So this is usually the structure.

[00:09:58] Nisten Tahiraj: I do have, is one more thing there, and it's the W2V, the BERT speech encoder. think it's for meta, and it's about, it's supposed to be like 30 times faster than than Whisper. So yeah, it's another very efficient automatic recognition ASR model. So I'll I'll post it in the links

[00:10:20] Alex Volkov: And I think also we had 11Labs announce like a yeah, I had a tweet about actually ThursdAI Content, that I spoke in English, obviously, and then I asked it to translate to Russian. We'll cover this, 11Labs has a dubbing studio.

[00:10:33] Alex Volkov: .

[00:10:33] Open Source LLMS

[00:10:33] Alex Volkov: And then, let's go to open source, folks. I think let's go to open source.

[00:10:55] Alex Volkov: All right, let's start with our open source segment here. And I think the first thing we should probably quickly mention is our dear friends at Stability AI, folks who've Made a dent on the industry with Stable Diffusion, obviously but they're training a bunch of other stuff. We've talked about multiple stuff they did.

[00:11:12] Stable LM 1.3B

[00:11:12] Alex Volkov: We've talked about Stable Video Diffusion and like how open source lags behind closed source, but not by that much. And Stability released a new LLM, which they had the Stable LLM before, I think, Nistan, have you used Stability stuff before? For the LLM stuff?

[00:11:31] Nisten Tahiraj: I have Months ago, so I'm not up to date on

[00:11:35] Alex Volkov: Yeah, so

[00:11:36] Nisten Tahiraj: used it on Google collabs and

[00:11:37] Alex Volkov: Yeah, so they're not like, they haven't changed the industry in the LLM world as much as they have in the image diffusion world, for sure. However, there's a big however, they're working on multiple fronts. And it looks like, I had a chance to actually chat with Imad for almost 20 minutes.

[00:11:52] Alex Volkov: Imad is this like very incredible person who knows a lot about a lot. And it's like the conversation there is like basically a stream of consciousness conversation, which I had. No trouble in following up because we talk about everything here on ThursdAI. But the folks who were with me and talking to Imad, they looked at me and was like, How do you know all this?

[00:12:11] Alex Volkov: And I'm looking at Imad and was like, How does Imad know all this? That's what happens when you're on stability. So they released they're training a bunch of different models. This week they gave us Stable LLM, which is a tiny model, 1. 6 billion parameters model. It's really we've been saying this previously.

[00:12:24] Alex Volkov: It's really funny to say small LLM, right? If you expand the LLM abbreviations, like a small large language model. But this one is tiny. It runs super fast on, on multiple devices. I think their point is to actually like edge device running. So obviously we've covered multiple small. LLMs before, we've covered PHY, if you remember PHY 1, we're gonna talk about PHY with Vik in a second.

[00:12:47] Alex Volkov: We also talked about like PHY 2, I think there's like a few others StabilityRelease, there's It's pretty good. It's pretty good. I was itching to play with this, they released a GGUF. Apparently I dunno if you knew this name, but apparently stability has their own CPP and their like GGF file, which is like a, for those who are not following all the AT acronyms.

[00:13:11] Alex Volkov: GGF is a quantized version of models. So apparently stability has, like stability. CPP is incompatible with Lama cpp . And so apparently Elm Studio had to add a specific support for this and they did. And so if you wanna play with stability, AI. Stable LM, now you can , with LM Studio, and LM Studio at least in my experience, gave me ridiculous performance.

[00:13:34] Alex Volkov: I got, on, on this Macbook M3, M3 Max I got more than 130 tokens per second, which was like ridiculously fast. And the model was fairly capable for a small model. I was very impressed. So if you want to play with a small model, you want to do some stuff with this, stability is definitely an interesting one.

[00:13:53] Alex Volkov: Support in Elm Studio. Yeah, go ahead.

[00:13:56] Nisten Tahiraj: yeah, it's a 1. 6B. So in that means it's 1. 6 gigs to run at eight bit without losing much accuracy. However, the, that means that it has a lot more applications for tiny stuff, because then you can get that down to 800 megs. And so on. So this is people did find some issues. Again, it's a tiny model, but they found issues with it being able to continue the conversation.

[00:14:24] Nisten Tahiraj: However, for one shot answers, it was extremely capable. So just keep that in mind when using it. It is probably right now the best model for that size. Just keep in mind if you're going to do something with it. Don't expect much in terms of follow up stuff. Just if you can do it in one shot, great.

[00:14:48] Nisten Tahiraj: Use that. And yeah that's about all I have to say.

[00:14:51] Alex Volkov: Yeah. And additional things that it punches above its weight on other languages. So if you folks remember when we talked about Mistral, for example, getting compared to open the eye on Tropic, et cetera Mixtral medium, the model is like specifically for the German, the European language, the German, Spanish, French, Italian, all those it's significantly better.

[00:15:11] Alex Volkov: Stability is also playing in that market looks like for the smaller size. And so this. Out this tiny model beats the five versions of three billion parameters. So it beats models twice its size, even some seven billion parameters, specifically for , European languages,

[00:15:25] Alex Volkov: and if you remember, we've talked about MPT from Mosaic, was that? Yeah. So this model beats the Mosaic MPT 7B, which was probably back in May was like the coolest like open source model. So that was 7 billion. This beats that on empty bench and everything.

[00:15:40] Alex Volkov: It's quite incredible. It beats Falcon 40B. It's really, the speed, the reason why we bring you these models is not only Hey, use this one. Because Nissen said this one may not be exactly good for your commercial stuff. Also, it's not really commercially viable. There's a specific stability license that you have.

[00:15:58] Alex Volkov: Stability membership, they call it. They have to apply for stability AI membership. And then based on the size of your business you're able to use, they have to make money somehow. But we bring this to you also to show that how fast we're moving from a 30 billion parameter model to a 77 billion parameter model, and now to a 1.

[00:16:13] Alex Volkov: 6 billion parameter model, that compresses like incredible amounts of trillions of like words from the human knowledge into just, listen, do we say like this can go down to like less than a gig, right? If we look super quick,

[00:16:28] Nisten Tahiraj: Yep. At 4 bit, it should be 800 So we're getting to the point where they'll just fit in a Raspberry Pi Zero with 512 megs and they'll be conversational [00:16:40] and useful and even multi modal. So we're almost there.

[00:16:43] Alex Volkov: Yeah, it's quite incredible. And then, okay, so this is stability stuff. Meanwhile, I'll say hi to a new guest of ours that I just saw on my timeline.

[00:16:51] Prophetic announces MORPHEUS-1 an EEG/fMRI multimodal to induce lucid dreams via hyperfocused ultrasound

[00:16:51] Alex Volkov: What's up Wes, how are you?

[00:16:53] Wes Louis: Hey

[00:16:54] Wes Louis: guys, how are you?

[00:16:55] Alex Volkov: Hey. Hey welcome. Folks maybe saw my tweet, maybe didn't as that I love planning for Thursday, but I also love breaking news. As I was planning, I was going through my feed, and thankfully my Twitter feed is back at his own, like giving me the best AI stuff. And Wess and I think your co-founder is also here.

[00:17:10] Alex Volkov: Eric, yeah. Let me add you real

[00:17:12] Alex Volkov: quick. I didn't plan on this folks. I just literally just like tagged and they came. The video that you guys posted came through my timeline and I would love to go and give you a stage for a minute or two to explain what prophetic is because the transformer stuff that you discussed with the EEG and fMRI signals, I really dig.

[00:17:30] Alex Volkov: Could you summarize that video for us for a brief, like two sentences? That would be super cool, I think.

[00:17:38] Wes Louis: So

[00:17:38] Wes Louis: this has been something we've been working on for a while.

[00:17:40] Wes Louis: It's really a, essentially,

[00:17:42] Wes Louis: a multimodal transformer model that is designed entirely for neural data. And so basically, what we've done is, we built a data set of EEG and fMRI and, what we're designing is a neural simulation device to basically induce lucid dreams.

[00:17:59] Wes Louis: And so we build the data set on heightened prefrontal cortex activity. This is, the neural correlate of lucid dreaming. And we basically built a model where you prompt it with your current brain state. We have a set of sensors on the device, and then we output targets for the neurostimulation.

[00:18:17] Alex Volkov: That's quite incredible. So for folks in the audience, we talk about multimodality often and oftentimes we just mean VLMs, like we mean like vision and text, which we're going to cover like a bunch today. But today I think the highlight of today's Thursday is multimodality applies to many things. So you guys are, your multimodality is not even there's no text in there at all, right?

[00:18:36] Alex Volkov: This is just EEG signals and fMRI signals. Is that correct?

[00:18:41] Wes Louis: Yeah, it's purely prompted with EEG. And one thing I'll say is, everyone talks about multimodal. And, so you're using, let's say, an LLM, and you're prompting it with a photo, for example. This is similar in many ways because neural imaging data, particularly EEG, is you can nicely get, you can get it into, it's a neural image you can get it into an image format.

[00:19:02] Wes Louis: And then prompt the model that way, but then on the generation side of things that's entirely, we use a pretty unique fMRI embedding process that we've come up with ourselves and ultimately the idea there is that you take this heightened neural activity, And those are candidates for targets for nerve simulation.

[00:19:20] Wes Louis: And, we

[00:19:21] Alex Volkov: What do you, sorry, what do you mean, what do you mean by targets for folks who have no idea what this means?

[00:19:26] Wes Louis: Yeah. We're using this is the other big technology that makes all this work is FocusUltraSound. FocusUltraSound, for those that don't know, is this Really, cutting edge neurosimulation technique that can get, quite deep into the brain, other techniques, people who may be familiar with, direct current, alternating current, really get soaring to the surface.

[00:19:47] Wes Louis: Of the brain, whereas focus ultrasound can get quite deep, but there's also this ability to steer the beam and also create acoustic holograms. And so when we think of heightened neural activity it really takes the form of these 3D figures. And the idea being that we can create these outputs of fMRI targets and then translate those over to the focus ultrasound.

[00:20:12] Alex Volkov: This multi modal transformer takes on the input EEG signals, and on the output it prints out those targets. Those are targets for this technology to then stimulate the brain to go into a specific state.

[00:20:31] Wes Louis: Yes, and all of this is closed loop so in that, once you create the simulation, the model is prompted again with the current brain state and this is continuous. Process of learning and figuring out what sets of tokens lead to this heightened state and that heightened state is really identified as gamma frequencies and that's really the fastest band of activity.

[00:20:53] Wes Louis: So it's this continuous process until someone gets to a lucid state.

[00:20:58] Alex Volkov: That's quite incredible. So you guys announced the LLM today, but it's not like you're not releasing the open source. This is just an announcement of your efforts, correct? Anything else you want to add here? And I think you started talking about folks can join the beta if they want to.

[00:21:12] Nisten Tahiraj: Yeah, that's what I

[00:21:12] Wes Louis: would point out is that we have a beta program that, that this is really the purpose of this announcement is we're looking for people to sign up. We've had 200 or so in the last, Two hours. And so this spring we'll have it working. And if you're a New York based or you're willing to come out to New York we'd be, more than happy to have you test out the product.

[00:21:31] Alex Volkov: That's awesome. Congrats folks. Actually, you want to add anything?

[00:21:33] Eric Wollberg: Alex. Hey, how's it going? This is Eric. I'm a

[00:21:36] Alex Volkov: Oh, Eric, yeah.

[00:21:37] Eric Wollberg: with West. Yeah. Hi thanks for doing this. Yeah, one thing that's just I think, the sequence of how we've released these things, we showcased in October our prototype that we designed with Card79 notably did, Neuralink for Elon, and then we, Also worked with Max Hodak at Science.

[00:21:52] Eric Wollberg: Max Hodak used to run Neuralink for Elon and then spun out Science. So really top consumer VCI kind of design folks. And so then now we have this model, right? This ultrasonic transformer where now we're going to be migrating that on to, the technically working prototype and beginning neuromodulation.

[00:22:08] Eric Wollberg: So that's what the beta user program is all about. We've got, yeah, like 225 people signing up in the first two hours we're really looking for we're excited to have people on board and begin to do this you have an opportunity if you're, especially if you're early up on that list to be the first person to achieve an ultrasonically induced lucid dream, which is You know, I think it's going to be a pretty watershed moment.

[00:22:28] Alex Volkov: That's super cool. I've tried to, to lucid dream a lot of times in my life and I never actually got to a stable one. So I'm excited to follow you guys, but also excited from the technology application of this, because we talk about transformers and a lot of this is going to LLMs.

[00:22:42] Alex Volkov: Now we're going to, this week we're going to talk about Transformers as applied to the fusion models as well. And here you are like doing like full multimodality out, out of the left field. So I love it. And hopefully you guys will do some cool things and keep us up to date and welcome to, to join on Thursday.

[00:22:55] Alex Volkov: I, to talk about this.

[00:22:57] Nisten Tahiraj: Awesome. Thanks, Alex. Thank you, Alex.

[00:22:58] Alex Volkov: Thanks for hopping on, folks. And as folks, as I love breaking news here on Thursday. This is like a tiny breaking news. Thank you, Wes. Thank you, Eric, for joining folks. If you want to try, the future, sign up for the beta, because why not?

[00:23:09] Alex Volkov: And I think it's it feels like non invasive, right? You put this headset on, and then hopefully you go to sleep, and hopefully you're able to control your dreams, which is like what Vision Pro will do for outside world, but this is like inside your dream, it's super cool. All right, let's move on to, I think we're moving on to the big, no, actually we're moving on to the big category for multimodality as we're already here.

[00:23:33] Alex Volkov: Vision and video and multimodal, or at least VLM multimodal.

[00:23:38] Adept teases Fuyu Heavy, their flagship multimodal catching up to Gemini Ultra and GPT4V

[00:23:38] Alex Volkov: I'm gonna start with the big dog here, ADEPT. If you guys remember ADEPT Labs were co founded by a few folks from the original Transformer paper. I don't think they're no longer there, but I have to, I feel like I have to add this.

[00:23:52] Alex Volkov: Prefix every time we talk about adept, adapt released a few models for us. If you guys remember, Persson was a seven B model or eight B, eight B it was weird, but they released an 8 billion parameter model. It was like very interesting back then. They also then on top of this released fio, which is persson is the type of fruit, F is the type of tree that persimmon grows on.

[00:24:10] Alex Volkov: So we see you adept, we see your jokes here. Also. I love the LLM naming and then they raised Fuo back then. And FIO was. Interesting from the perspective of it didn't use a vision encoder, it did something else. It was very interesting that their approach to vision models allowed them to use Non standard image sizes, because they didn't train it on such a thing.

[00:24:31] Alex Volkov: So back then, that was what was interesting. And now, they've announced, they haven't released anything. They haven't said, hey, here, use this. I wasn't even able to use this. But they announced Fuyu Heavy. Fuyu Heavy, according to them. And so far, Adept have been trustworthy enough for us to trust.

[00:24:48] Alex Volkov: What they say this is the third in the world multi modal or I guess VLM. So not multi modal like, like Wes and Eric just told us, but a multi modal in the sense of like images plus text together. This is the [00:25:00] third in the world model behind GPT 4 Vision and Gemini Ultra. Which Gemini Ultra we haven't yet tried, obviously, we don't have access.

[00:25:08] Alex Volkov: If you have access in the audience for Gemini Ultra, and you want to help me, help a brother out, let me try and play with this, please let me know. But so they're announcing, AdeptFuyu is announcing that Fuyu Heavy, their model, is 20 sizes smaller than GPT 4 Vision. I have no idea how they even know what size GPT 4 Vision is.

[00:25:28] Alex Volkov: They say that around 20 to 30 sizes smaller. And comes very close in the multimodality stuff. And they talk about the challenges of creating like large multimodal image based model. The challenges are stemming from there's not a lot of assets properly to test. There's not a lot of the tooling instrumentation stuff are really hard for images as well.

[00:25:47] Alex Volkov: And so they announced this they showed some very incredible performance. And I will remind folks that Adept specifically started with tools to make you run your computer. So their models are specifically tuned on UX, UI and web stuff. And expecting to hear more from them and finally getting to play with this.

[00:26:06] Alex Volkov: Go ahead, Faro.

[00:26:09] Far El: I just

[00:26:09] Far El: want to say that,

[00:26:10] Far El: Demos are easy. I'm going to take it with a

[00:26:14] Far El: grain of salt until I actually see the model or are able to test it. The thing is that there is no indication of actual like speed of the inference or whether these examples were cherry picked or not, right? There's a lot of question marks about this, especially when you just come out and, make a marketing announcement without actual access to the model.

[00:26:37] Far El: Yeah, it looks cool, but I'm not, I'm not hyped just because it's not like it's not verified or validated

[00:26:43] Nisten Tahiraj: in any way.

[00:26:44] Alex Volkov: Yeah, I'm with you, I'm with you. Specifically I will say though, about Adept specifically, we've seen stuff from them, we've seen papers from them before, and they did, folks started asking like, Hey, where's the weights? Where's the weights? And they did say that, stuff is coming, but they want to like, keep a competitive edge.

[00:27:00] Alex Volkov: But we see, we've seen like at least a new architecture from them, if you remember with Fuyu. And so we know

[00:27:05] Nisten Tahiraj: Oh, of course.

[00:27:06] Alex Volkov: yeah, the Fuyu architecture is legit, like they literally was able to. create a multi modal without an image encoder thing back then. We're definitely going to listen to this. But based on the metric that they released, if this actually performs as well on MMMU, which is the kind of the equivalent of MMLU.

[00:27:25] Alex Volkov: For multi modal stuff it's going to be very exciting their heavy model, definitely.

[00:27:29] Fireworks releases FireLLaVa with a fully commercially viable license

[00:27:29] Alex Volkov: Moving on, actually, Pharrell we'd love to hear what you think about this. And actually, Vic, this is wrapping you up to the next conversation. Fireworks AI that I haven't actually used, but they released the first Lava model with commercial permissive license from Fireworks.

[00:27:43] Alex Volkov: So Lava was released. Lava, we've talked about Lava is the architecture. That allows many of these models to be trained in a multi modal fashion, correct? Lava was released, it was not with a commercial license because it was trained on a bunch of I want to say that wasn't marked for commercial and open source licensing.

[00:28:01] Alex Volkov: So a lot of these models that we get, we cannot actually use in production. And FireLava announced that like their first Lava model was commercially permissive licensing. And I think that's super cool because finally folks will be able to build this. And as a reminder, Lama, the LLM was released without commercial license.

[00:28:19] Alex Volkov: And then Lama 2 released with commercial license and then incredible amount of stuff started happening because companies who wanted to use this in production actually started like looking into this and using Lama 2. And so hopefully the same will start happening with FireLava. I actually am not sure if they released the weights.

[00:28:36] Alex Volkov: I think they did. Yes, they released the weights on Fireworks AI, FireLava 13B on HugInFace. And yeah, listen, go ahead. You guys trained stuff on top of Lava. So please, first of all, introduce the stuff that you've trained on and then also like comment on the ability to use this now in production.

[00:28:56] Nisten Tahiraj: Yeah, I just want to say that The entire vision open source vision field, and non open source, it is extremely competitive right now. For example, here, we've released Baklava, which is bak lava. Again with the naming. So that that was three months ago. Also LDJ here made the obsidian, which is like the three B one, and then they made A seven B as well.

[00:29:22] Nisten Tahiraj: We also have the dev lead of Quinn. He was in the audience as well, so they made the Quin 14 b vl. And this part is, oh, and we have Vic as well, who also made a very fast. And a small model recently. And Valkylava was being used as a benchmark, which was pretty interesting, actually. Yeah, the Vision LLMs are extremely competitive right now, and I think it's one part where open source can really surpass what you get from from any from any API, because it's something you can run local on the device and you have full control over.

[00:30:01] Nisten Tahiraj: So the interesting thing yeah, as for Fireworks 13b, that's still Lama 13b base, as far as I saw, and I tried to use their inference on their site, but it wasn't working, and I can't complain too much about it, because ours is not working either. That's why I wasn't using WSGULAG yeah, also to comment a little bit on Fuyu, because I do like their trying a completely new approach. They don't use stuff that's similar to clip image models, which is what everybody else uses. They do something where they take, I think, groups of pixels or stuff. They serialize it, so the image is just being represented as just another string of text or a string of tokens. So they can scale.

[00:30:48] Nisten Tahiraj: To 8k, 16k, whatever you have, they don't have, they don't have that limitation that others have in, in terms of architecture. So it is good to see that approach is working overall, whether it will be competitive we'll see. So yeah, I wanted to comment on that. But yeah, I haven't actually tried the Fireworks model itself, but I did see, again, the architecture is similar to also Lava 13b. Yeah, that's about all the comments I have on that.

[00:31:22] Alex Volkov: And like you said, interestingly, it's still based on Lama, right? And it's time for, it's time for new things. And I think this takes us to the next topic of conversation. And again, Vic, I want to introduce you properly this time, or at least let you introduce yourself.

[00:31:35] Moondream1 from Vik Hyatk - 1.8B VLM

[00:31:35] Alex Volkov: But the next kind of iteration or of our conversation about multimodality, like we said, today is a multimodal space is the existence of like very tiny vision models, vision, large language models, or a large multimodal model, it's really hard to like, name these things. Vic, welcome to the space, this is your first time, please introduce yourself and then let's talk about Moondream a little bit.

[00:31:57] Vik Hyatk: Hey folks hey Alex, thanks for having me. Super excited. My name is Vik. I'm pretty new to the AI space, I think. Like a lot of people, I got into it when that big stable diffusion moment happened. And I was like, this is what I need to spend my life working on. So I went out, bought a workstation with 3090 and started playing around with stuff.

[00:32:15] Alex Volkov: You and me both brother, you and me both. And, okay. So the reason why you're here and the reason why I'm , calling on you in the vision and video area is because of Moon Dream one. You, can you introduce Moon Dream one a little bit to the audience?

[00:32:29] Vik Hyatk: Yeah so it's a small language model. It's about 1. 6 billion parameters. It's built on top of Siglip from Google or DeepMind. I forget which one of the two. Trimil, because that's the vision encoder and it uses 5. 1. 5 as the text model, and then it's trained using the standard lava. So super thankful for the folks that worked on these projects amazing models they've put together.

[00:32:52] Vik Hyatk: It works. I'm tooting my own horn a little bit here, but it's surprising. I see people post screenshots of them asking questions and it still blows my mind that it works that well.

[00:33:03] Alex Volkov: I let me talk the horn a little bit because I definitely tried out. Thank you for the hugging face. How can I say, space that you put up like super quick, and the next follow up is going to be about how to actually use this, but this is based on Lava, so the same non commercial license, correct?

[00:33:19] Vik Hyatk: [00:33:20] Correct. The top piece of feedback I've gotten from people is that they want to see this with a commercially permissive license. I'm working with, working on that. The FireLava folks didn't release the dataset, but thankfully they did talk about their process to create the the non encumbered version of the dataset.

[00:33:37] Vik Hyatk: So I'm working on it. I'll have that out in a couple of days, the dataset at least, and then we can start training models that aren't encumbered like that.

[00:33:44] Alex Volkov: Incredible. And so the next thing that I wanted to talk to you about is PHY 1. So PHY is from Microsoft. PHY 1 was not released with a commercial license. We remember it was trained on synthetic data in tiny stories, like a tiny 1. 6 model. So we saw a few releases since then. So obviously we talked just now about StableLM.

[00:34:01] Alex Volkov: Semi commercial, if you're a part of their membership, and also Phi2 was MIT license. It's a little bit bigger. It's three, I think, billion parameters. Have you tried with Phi2 and could you speak about that experience?

[00:34:14] Vik Hyatk: Yeah, I I did actually. So I was initially working on training Moondream 1 with PHY 2 once it came out. There are some issues with fine tuning it when you have flash attention on I believe. And so it just takes a lot longer to train. So I went back and looked at PHY 1. 5 and I saw that they updated the license for 1.

[00:34:32] Vik Hyatk: 5 to MIT as well.

[00:34:33] Alex Volkov: Oh, really?

[00:34:35] Vik Hyatk: stick with what works. Yeah.

[00:34:37] Alex Volkov: Wow. I did not know this. So it actually updated the license backwards.

[00:34:42] Vik Hyatk: Yeah, on the Hugging Face page, at least it says MIT now.

[00:34:45] Alex Volkov: I love it. Like it would make sense, right? But folks, I don't think we've talked about this. So like breaking news here. Thanks, Vic. Phi 1 is also, we'll check this. We'll double check,

[00:34:55] Nisten Tahiraj: Also three. They're both MIT licensed now. So whatever pressure we put on Microsoft's Azure side, it worked.

[00:35:03] Alex Volkov: nice. That's incredible. All so now, so this part of your stack of Moonbeam is now MIT licensed. So Lava is the only thing that's holding this back from being used in

[00:35:14] Vik Hyatk: Just the

[00:35:14] Unkown: data set, yeah.

[00:35:16] Alex Volkov: The dataset. Okay. Okay. So definitely there's work being done there. I will just pay send folks attention to the nest, to the top of the space where I had my tests.

[00:35:25] Alex Volkov: I literally just pasted an image. And again, thank you for the demo, Vic. Folks will get the demo in show notes as well. I pasted an image of two of my friends just sitting and talking across like a TV with some things. Literally the model said, image features two men sitting in chairs engaging in conversation.

[00:35:42] Alex Volkov: One man sitting on left side, one other on the right side. That's obvious, but still cool. They're both looking at a laptop placed on the table in front of them. The laptop is open and displaying a presentation. Possibly related to their discussion. So this feels like hallucination a little bit because the model does not know what it displays, but fine.

[00:35:57] Alex Volkov: And so in the background, there's a TV mounted on the wall, a cup that can be placed on the surface nearby. The scene suggests a casual collaborative environment. This is ridiculous. This is like a super tiny model and it outputs this scene almost perfectly. And. I've tested like the same image in different, like a bigger, GPT 4, it pretty much gives me the same information.

[00:36:17] Alex Volkov: So I was really impressed. So Turing the Horn, for sure, because the tinier the model is, the better the utilization. And we've talked about different vision enabled hardwares that are possible or not possible based on whether or not they're going to be able to run stuff on like Raspberry Pi. And, the smaller these models, the smarter they are, the better we'd be able to use them in cheaper hardware.

[00:36:40] Alex Volkov: Really impressive. What are you planning to do with this? Like, how has the community accepted this? What type of conversations did you get into? And what are you planning to do next here? Besides training the

[00:36:51] Vik Hyatk: I was blown away by the reception to this. I've, when I put it up, I thought like maybe it might get like a hundred likes or something and then I'd move on to my next project. But I've seen a bunch of super cool demos. Come out of this, I think the fact that it is small and it runs inference so fast makes a lot of use cases that were previously not possible, a lot more viable, like captioning a video in real time or recaptioning a billion images and whatnot.

[00:37:15] Vik Hyatk: There's a couple of things I'm working on. Obviously the top thing is like getting it to a permissive license. I also, I could use some help on a couple of fronts. So I do want to make it easier to run, get gguf, olama integration and whatnot.

[00:37:30] Alex Volkov: Definitely LM Studio integration. I would love To play around with this with Elm Studio, just to see how fast this is, this runs on my software. MLX would be a cool suggestion as well the community is very excited about MLX, I don't know if you saw. But Elm Studio is a friend of the pod, definitely it's connected to YouTube.

[00:37:46] Alex Volkov: I think it's super easy to just add it there. Right? Listen it's not difficult.

[00:37:51] Nisten Tahiraj: You just gotta add a Jason file to, to, to your model and that's it. Or just message him 'cause he's very responsive to this stuff. And might even write the Jason for you. And then it will be immediately available for everyone running LM Studio.

[00:38:06] Vik Hyatk: Amazing. Another thing we have going on, by the way, is we're building an agent version of this with Open Interpreter in mind.

[00:38:13] Vik Hyatk: A version of this that's excellent at identifying UI elements because we want Open Interpreter to have the ability to operate purely off of a local model. Open Interpreter, by the way super cool project. Check it out, folks, if you haven't already, is is a way to have the LLM use your computer.

[00:38:31] Vik Hyatk: So you can do stuff like. Just tell LLM here I want to turn dark mode on and it'll figure out what buttons to click to enable dark mode for

[00:38:40] Alex Volkov: for folks who follow ThursdAI closely, they remember Kilian came on the pod like a week after Open Interpreter was released, and this was, I think, in 2023, our most famous or received episode back then. It was a super cool conversation, so shout out Kilian Lukas, and definitely Open Interpreter since then has been very huge community of people building very cool things.

[00:39:00] Alex Volkov: Recently released the kind of the browsing area, where it can Controls the computer for you. And it definitely needs eyes for that. And so I think it used GPT 4 vision and now you're saying that Open Interpreter will get open source eyes. Is that what I'm hearing?

[00:39:15] Vik Hyatk: Exactly. That's a goal. CogAgent is super promising in this space. They didn't release their datasets, so we're working on replicating that. CogAgent is just too big for most people to run on their computers. It's I forget, 17 billion parameters or something.

[00:39:29] Alex Volkov: Is that CogAgent and CogVLM, right? I think we, yeah, I think we talked about this. Yeah. It's really good

[00:39:35] Vik Hyatk: but yeah, that's another place where if folks want to get involved the link in my bio as a Discord would love to collaborate with folks on getting that dataset together and training that version of the model.

[00:39:44] Alex Volkov: So I think the kind of the thing I'm hearing from Fuyu, from you as well, the data set for vision stuff are the bottleneck to create like incredible things, right? Like data sets for images, data sets for how people use different UIs, for example, like all these data sets are the kind of the bottleneck for us to get to the next hurdle of getting these models even smaller, even faster performing.

[00:40:04] Alex Volkov: So what are we doing folks? Let's start building multimodal data sets.

[00:40:09] Nisten Tahiraj: Yeah, and at first for Baklava, we were going to have the dataset also open source because we are, the code for us is also open source as well. So it's not just open wave. It is fully open. However, the data we couldn't because of So that's not available and yeah, it's pretty hard to make datasets for vision because with text is very, it's very easy now to manipulate, modify, do whatever you want to, to the data and you can do that at large scale with images, just aren't that many tools, that many ready to go datasets and the open source models just started getting good at them.

[00:40:52] Nisten Tahiraj: So yeah, that's going to remain. A challenge for the time being, but again if anybody here is like a grad student or you're at a company or something in academia, the biggest contribution you can make probably is in the data sets, because the models will get replaced. You'll always have better models coming and going, but the data sets are forever.

[00:41:15] Nisten Tahiraj: If you want to make an impact in this field, get your professor, university, whatever to, to put some money for datasets. We need datasets. For images. With images. Yeah.

[00:41:27] Alex Volkov: And we need them like bigger and bigger ever increasingly bigger scale. All right, Vic, so thank you so much for joining us. Thank you for talking, taking us through how you created Moonbeam. And thanks for telling us like what's next, how [00:41:40] the community can help besides, besides just, data sets provided and testing.

[00:41:45] Alex Volkov: What else would you need?

[00:41:48] Nisten Tahiraj: I I have a

[00:41:49] Vik Hyatk: list of issues on GitHub where I'm looking for help with various But besides that, Compute always helps. I'm currently I'm limited on how many things I can do because my 4090s can only do so many matrix multiplications at a given time. So if anyone has Compute that they can give me access to run these, that would be super appreciated.

[00:42:09] Alex Volkov: Yes, I I've seen this time and time again on ThursdAI on stage, folks ask for sponsorship for compute. I'm actually getting I'm actually getting like DMs from different companies like, Hey Alex, the space is super cool. Can we sponsor someone? Can we? And I'm like, no, I already work with Let's Ambassadors, I don't need sponsorship.

[00:42:25] Alex Volkov: I would want to connect guys that work on super cool things. We need compute to keep going with different companies around like compute specifically. So I'll definitely keep you in mind. And go ahead, Nissan. You had a thing you want to say?

[00:42:38] Nisten Tahiraj: Yeah, just really quickly, this is a very effective way to make projects that are impactful. For example, with Balclava, Pharrell here, and Suntex, they just put out a readme, and tweeted something out, and we got compute. And we got it from Together Computer. So they, they sponsored that, that project and they ended up being a very impactful project that a lot of people use.

[00:43:05] Nisten Tahiraj: That, that works pretty well. I just say be careful with conditional stuff. If they're gonna start talking about an NDA, just Ignore them because that's not really, then you're doing an exchange, you're basically doing work for that person, so that's just a job contract, that's not a sponsor, if someone's sponsoring an open source model

[00:43:27] Alex Volkov: Better be.

[00:43:28] Nisten Tahiraj: not be like an NDA, that's not, that's no longer a

[00:43:32] Alex Volkov: Better be open source after that. Yes, absolutely. So Vic, I'll keep you in mind when people reach out to me. Folks in the audience, if you work at a company that wants to be featured forever in the, in the open source community, definitely reach out to Vic and we want more of this.

[00:43:47] Alex Volkov: We want more of like tiny models that perform incredibly well. We want them to be built into different Tools that we can all use without relying or paying by just using our machines. So definitely we'll keep in mind. Vic, welcome and welcome to the community of ThursdAI. More than welcome to keep joining and participating in this.

[00:44:06] Alex Volkov: I think it's time for us to move on, folks. It's been around 40 minutes. I think we're actually good on time. I think it's time for us to move on to this week's buzz. I wish I had a I really want to do like a music transition here for the, with this week's buzz, with like bees buzzing, etc.

[00:44:20] Alex Volkov: But maybe for next week. Let me just play the regular music and we'll transition and talk with Jason a little bit.

[00:44:24] This weeks buzz - Jason Liu launches a new course with Weights & Biases for free

[00:44:24] Alex Volkov: All right, welcome to this week's buzz, where I talk about some cool things that happened or I learned in Weights Biases. Weights Biases is, ooh, that was an abrupt music stop. Weights Biases is the system of records for all your LM needs. So pretty much like most of the folks up on stage who use who train models use Weights Biases.

[00:44:52] Alex Volkov: It's incredible. The ubiquity , where bias pretty much prevented everywhere. I just saw a stable Kwan, one of our friends of the pod just train something and just post like words and biases, like a snapshot of his last curve going down and literally just asked Hey, do you mind putting a link to the dashboard?

[00:45:08] Alex Volkov: And he did. So you wanna check out how his model is going? I think he's training. I don't think I saw, he's training something like super cool, like a Oh, he's training a mixture. Four 400 million parameters. So he's training like a tiny MOE of mixed role. StableKwan is, he just posted like a chart with the train loss from Weights Biases and I just asked, Hey. Can we follow along with the training? And he posted a link to the Weights Biases dashboard, which is super cool.

[00:45:34] Alex Volkov: Which got a reaction from Weights Biases CEO. . And so I love seeing this in the wild. So folks, if you're training models, please put those dashboards up so people can follow along. It's super it's really nice. But on the other news from Weights Biases this week I want to say hi to Jason Liu.

[00:45:47] Jason Liu: Yeah, Jason Liu.

[00:45:48] Alex Volkov: Jason Liu. Welcome, Jason. I've seen you around. I've seen you, I think at AI engineer event from SWIX. I don't know if we like ran into each other there, but you had a talk there as well. Yeah.

[00:45:58] Jason Liu: Yeah, it was Paidandic is all you need. It did pretty well on YouTube, so I'm pretty

[00:46:02] Alex Volkov: It did great. I also talked with a bunch of people. I think I was interviewing folks, outside of the stage while we were giving the talk, but then it was very well received. And this is on the same similar topic that we're going to talk about now. So please feel free to introduce yourself briefly.

[00:46:15] Alex Volkov: And then we're going to talk about the stuff that we did together.

[00:46:19] Jason Liu: Great. Yeah. So I'm Jason. In the past year and a half, I've been mostly doing a lot of applied AI consulting. Before that, I spent the past like eight years just doing like machine learning. So I did the big data wave, the machine learning wave, the neural networks and deep learning wave.

[00:46:32] Jason Liu: And now we get generative AI. So it's been a lot of fun. And in my spare time I work on a library called Instructor. So now. We have Instructor in, I think, JavaScript, Python, and Elixir. And the general idea is that we want to bring just functions and structs into LLMs and make LLMs feel a lot more backwards compatible with existing code rather than creating new abstractions to handle some of these things.

[00:46:55] Jason Liu: And I think that's been pretty well received in the community.

[00:46:57] Alex Volkov: Absolutely. So Instructor is definitely where I know you from. And today we have an announcement together. So feel free to. Feel free to announce the cool thing that we did and that you worked on really hard.

[00:47:09] Jason Liu: Yeah, so we're starting a new series around, the idea of using like schemas and structures to prompt language models. And I think. At the day or end of this week, we're going to release the first part of a LLM engineering series. And the first part really is just an introduction on how we can use things like structure to prompt LLMs a lot better, right?

[00:47:30] Jason Liu: In the past, we just like beg for the language model to give us JSON. Now we have things like JSON mode and function calling and tools, which gives us the ability to get more structure. But we still need a lot more tools and ways of thinking about how we can reason about these structures. And so part one is going to be around justifying and motivating why we wanna, why we might want to do this.

[00:47:54] Jason Liu: And then I think in February or March we'll start working on part two that uses a lot of the new ways and biases, observability tools to look at how I've solved a lot of LLM problems in production with a lot of my consulting clients.

[00:48:07] Alex Volkov: So just to highlight for folks, Weissenbeisser has a free courses area, Weissenbeisser Academy. And some like very prominent folks in the industry have collaborated with Weissenbeisser to like just basically teach. So we teach you for free how to do these things. So we have courses from like training, LLM from scratch, fine tuning, et cetera.

[00:48:24] Alex Volkov: And then Jason is announcing a new course today that he wrote and and recorded and we helped edit a little bit and publish and also obviously talk and promote this a little bit about how to actually ask your model to give you what you need as a developer, as a AI developer in the structured output, which uses the instructor library.

[00:48:42] Alex Volkov: Correct, Jason?

[00:48:43] Jason Liu: Yeah, these ideas can be used in other libraries as well, right? So for the Python community, we're really using a library called Bydantic, and so this is supported in things like Langchain and Marvin. And so even if you don't use a library like Instructor, learning how to think about prompt infrastructure is still something that's going to be really applicable and valuable for everyone listening.

[00:49:05] Alex Volkov: And you mentioned before, there's like a bunch of stuff that open the icons up with, like JSON mode, in example, etc. There is functions back in June. But also The other LLMs, they don't necessarily follow the same kind of new abstractions that OpenAI releases. I think Anthropic just recently announced that they're moving to function system messages or moving to just messages, things.

[00:49:27] Function calling in OpenSource LLMS

[00:49:27] Alex Volkov: And also we have open source, which is like all over the place. So I guess my question is, with these libraries, with these Paidantic approach and Instructor, would that apply to other LLMs? Does this apply to open source, which we talk a lot about?

[00:49:40] Jason Liu: Yeah, so right now there's only a few open source models that support function calling. So if you've looked at some of the work from the functionary team, they have been training I think mixed role now with function calling, same with the guys that like news research with Technium. There's been a lot of progress in the open source world and getting things like function calling.

[00:49:58] Jason Liu: If you want more structured outputs [00:50:00] too, there's a great library called outlines. That can use something like the Hugging Face Transformers library to also do structure extraction. And again, they also support things like Pytantic. And the goal of the course really is to show you how to think about and how to model these problems in a particular way.

[00:50:15] Alex Volkov: Absolutely. And I think John Durbin in the audience I think Ouroboros was trained on function calling as well, if I'm not mistaken, John. So folks who haven't heard our conversation with John, definitely go and check out where the deep dive with John about Bagel, which now includes the Ouroboros dataset, which now includes function calling as well.

[00:50:33] Alex Volkov: So that's awesome. The open source also moves there. Go ahead, Nissan.

[00:50:37] Nisten Tahiraj: Also really quick the news vision model ended up being good at at function calling, although it had other drawbacks. It was good at function calling because of the Arrow Boro's like thousand something functions dataset. And as far as I saw the newer bagel models, so Bagel seven B are also good at at that, at at function calling.

[00:50:57] Alex Volkov: So big old model series from Maxim Le Bon. Again, shout out Maxim Lebon, who came on the pod last week, and then the full deep dive with him will be released this Sunday, so make sure you're subscribed. We talk about, we don't talk about FunctionCall, we talk about NeuroBeagle. NeuroBeagle is like one of the top performing 7 billion parameters, it's a merge, it's a cool conversation about merging.

[00:51:16] Alex Volkov: But let me back, let me get back to Jason just real quick. Jason, you're also like doing independent consulting, you said, in multiple places, and you're like helping them build. I got to like tap into your experience from like actually like hands on AI building and companies. Could you give us like a little bit of what do companies struggle with?

[00:51:32] Alex Volkov: Like with the first obvious thing that comes to mind that people like AI builders probably like already solved in their minds. What do you have to go through to not only build to them, but also educate them on as you join the company, it starts like helping them out with AI stuff.

[00:51:47] Jason Liu: Yeah. So one of the biggest things I noticed is that when we look at something like a RAG application, really what it looks like is a recommendation system. If you went on Netflix, for example, and you watch a bunch of movies and the recommendations don't get better, it would be a really terrible experience and you probably lose a lot of customers.

[00:52:03] Jason Liu: But for a lot of companies these days that are using things like agents or retrieval, We are in a situation where, you know, no matter how many users you get, if you don't improve your language model, if you don't improve your embeddings, the product doesn't really get any better. And so one of the big things I'm focusing on this year is helping these companies build a better feedback loop and a data flywheel.

[00:52:22] Jason Liu: And so we can know for sure that as we get more users, there's these network effects that improve the models that we want to train. And so I think step one is, being able to fine tune your own embedding models and your re rankers and go from there and then, see what comes up in the future.

[00:52:39] Alex Volkov: Awesome. So definitely folks, give Jason a follow. The course, I think we're releasing it today, but I haven't seen any social mentions, but it's really worth watching. I watched a few of this and we'll follow as well. And this is a course series now. So we're going to start with this, and then we're going to continue with the monitoring tools that Waze Ambassadors have.

[00:52:56] Alex Volkov: Correct?

[00:52:58] Nisten Tahiraj: Yeah, the first course is like 30 minutes. It's super quick. The real goal is to show you what's possible and get you thinking about some new ideas. And then the next course will be deeply integrated with the more visibility tools from Wisdom Biases and specifically around the experiences I've gotten from consulting production clients.

[00:53:13] Alex Volkov: Incredible. Thank you, Jason. Thank you for joining us. And thank you folks who worked on the course together with you. I'm excited to see this. And again, the reminder, there's a bunch of free stuff there. There's a bunch of like knowledge just drops here. And hopefully I will be able to tap into this community and also build more things.

[00:53:29] Alex Volkov: Go ahead, Nistan, and then we'll move on.

[00:53:31] Nisten Tahiraj: Yeah, I just want to say that a lot of us here that got good at machine learning were from just a random YouTube series. So the Karpathy series on Building one from scratch. The full stack is just pronounced like that. Their LLM one from way back in April and March. So I'm really looking forward to this one because doing YouTube tutorials is actually extremely efficient.

[00:53:53] Breaking News - HuggingFace announces a collaboration with Google

[00:53:53] Nisten Tahiraj: But on that note, we have breaking news.

[00:53:56] Alex Volkov: Wait, we have breaking news. Hold up. You know what this means.

[00:54:11] Alex Volkov: Yes, Nistan, go ahead now.

[00:54:14] Nisten Tahiraj: Phil Schmidt, who is a friend of the pod and has been here.

[00:54:18] Alex Volkov: Here, yes.

[00:54:18] Nisten Tahiraj: definitely. Yeah, Devleet at, At Hugging Face, he's also the one that did the integrations, if I might be wrong, but the integrations for with AWS Bedrock and also with CloudFlare workers. Yeah, so now it looks like he's been working on doing an integration.

[00:54:35] Nisten Tahiraj: with Google, where you'll be able to just take whatever models or fine tunes and stuff you have on HuggingFace and then use Google's infrastructure, use both their TPUs and NVIDIA H100s, they're advertising this, that Google owns, to continue training, fine tuning, serving, deploying stuff via HuggingFace.

[00:54:55] Nisten Tahiraj: Google. Is this is a very interesting move. Google's jumping in more on the open source side there. I don't know what this means, but this is a very interesting development.

[00:55:06] Alex Volkov: I know what this means. This means that, if Hug Face becomes public ever, buy their stock. That's what this means. Hug Face like literally embedded into the, like the infrastructure of AI and definitely worth following. And the more integrations they have, the better, like it is for the open source community as well.

[00:55:25] Alex Volkov: All right, folks. Thanks Nissan

[00:55:26] Nisten Tahiraj: This is not financial. By the

[00:55:28] Alex Volkov: financial advice, but they're also not public yet. Look, I don't think this move. Yeah, I don't think this moves the needle for, in terms of Google investing,

[00:55:36] Hourglass Diffusion Transformers deep dive with Tanishq Abraham

[00:55:36] Alex Volkov: Alright folks, we're moving forward and the way, where we're moving forward is also like into kind of like diffusion mode, and I'm very excited to introduce Tanishq.

[00:55:45] Alex Volkov: Tanishq, have you been here before? Remind me, please. I don't think you've been here on stage before.

[00:55:50] Tanishq Abraham: I, I don't think I've been on stage

[00:55:52] Alex Volkov: No. All right. So I'm very excited to have you here. Thanks. Thank you for joining us. So folks, one of the coolest things that came out in at least the research area from this week was this paper from.

[00:56:03] Alex Volkov: From multiple authors, some of them friends of the pod, like Enrico, if you remember the chat with Enrico we did with rope scaling is on the paper as well. Katherine Crowson who we should mention, I don't think she's been here or, but we've talked about some stuff that she did. Stefan Baumann, Alex Birch, Tanishq, you're on there, Daniel Kaplan, and then Enrico, a friend of our Nico.

[00:56:23] Alex Volkov: Tanishq has been the friend of the pod behind the scenes, you guys didn't know this, but we've met in NeurIps so we've met before. Tanishq, do you mind introducing yourself just briefly for the audience who haven't met you or followed you so far?

[00:56:34] Tanishq Abraham: Yeah, sure. My name is Tanish. I am a research director at Stability ai and also CEO of MedAR, which is a medical AI research organization. I've also been involved with fast ai, been working on, diffusion models for

[00:56:48] Tanishq Abraham: I guess past year and a half or so. Yeah, so I do all kinds of stuff.

[00:56:53] Tanishq Abraham: Generative ai,

[00:56:53] Tanishq Abraham: medical ai. Yeah.

[00:56:55] Alex Volkov: You also just like a briefly skipped over the fact that you got your PhD at 19, right? Is that correct?

[00:57:01] Tanishq Abraham: Yes, that's correct. I got

[00:57:02] Tanishq Abraham: it. That was last year. Yes,

[00:57:03] Alex Volkov: So if folks in the audience don't know what this means that there's not many like 19 year old PhDs and Tanishq is one of them. And also we met once. I think a year and a half ago. And then the next time we met in Europe, I just remember every detail of our conversation. But that's beside the point.

[00:57:17] Tanishq Abraham: yes.

[00:57:19] Alex Volkov: Thanks

[00:57:19] Tanishq Abraham: met at the Stability AI

[00:57:21] Alex Volkov: Lunch party. That was super cool. And since then, many things have changed. And I really want to talk to you in that area, right? So this paper, shout out to all the authors because I'm looking at this. I've seen like multiple folks share this paper. Paper is talking about high resolution image synthesis.

[00:57:39] Alex Volkov: With something called Hourglass Diffusion Transformers. And I will pin your great thread about this here on top of the space, and it will be in the show notes. Could you briefly tell us the problem this tries to solve? And then we're going to go into actually how this kind of approaches how to solve this.

[00:57:57] Tanishq Abraham: Yeah, definitely.

[00:57:58] Nisten Tahiraj: Yeah. So first of all, of course preface this by saying it's mostly, of course

[00:58:01] Tanishq Abraham: Kat's genius work here. And we were just lucky to be able to help her on this project. But yeah, just to get her started.

[00:58:06] Alex Volkov: just one tiny second because it's worth a shout out. So Kat, by Kat you refer to Katherine Carlson, right? And if folks ever used Stable Diffusion before, either in Automatic 1. 1. 1 or whatever, and you [00:58:20] choose anything with K dash that's, this is the Katherine, right?

[00:58:24] Alex Volkov: This is, K Diffusion is like her area. Very incredibly prolific person in this area I don't know many facts about her, but like everybody who I talked to from this paper, including Enrico, everybody's like referring to Kat, that's her work. So maybe a huge shout out to Kat and yeah, go ahead, please.

[00:58:40] Tanishq Abraham: Yeah yeah, she's like a, she was like one of the original AIR people, so yeah, I had, she helped start the field in a way, anyway,

[00:58:46] Tanishq Abraham: To To provide some context of

[00:58:48] Tanishq Abraham: what this paper is about the idea is that, if you want to do like high resolution generation, so think like 1024 by 1024 the typical approaches these days utilize some sort of multi stage approach, like the most common one, like stable diffusion, is this sort of latent diffusion where you have to encode it in with some sort of auto encoder into some latent space and you're doing diffusion on the latent space and you're not actually doing it on the actual pixels.

[00:59:15] Tanishq Abraham: And so that comes with some disadvantages. For example, if I don't know if people who are like doing things like image editing with stable diffusion, you realize you don't have a whole lot of fine grained level of control in terms of the actual, at the pixel level.

[00:59:30] Tanishq Abraham: It's difficult to do that because it's happening in the latent space rather than at the pixel space. So there are various different things where like it has its own challenges. Of course, like latent diffusion has a lot of different advantages too, but you know for some applications it may not be ideal.

[00:59:44] Tanishq Abraham: And then on top of that the other aspect that, we wanted to like, look into basically was the fact that we're seeing People move towards towards transformer models for diffusion as well. And of course, in the past, most of the diffusion models have been with, a U net architecture, a convolutional U net.

[01:00:02] Tanishq Abraham: Also stable diffusion uses a convolutional U net. But, there have been a lot of papers examining the use of transformers. And, of course, the nice thing about transformers is, people know how to train them, they're quite scalable, so people would rather use transformers for diffusion over over something like a U net.

[01:00:18] Tanishq Abraham: But again, the problem is that So far, it's mostly only been applied to the Latent Diffusion Scenario, mainly because it would be very hard to do this at pixel scale because of the quadratic complexity of attention. So if you wanted to scale up to higher resolution, you know that, it would be, at the number of pixels, you're going to have quadratic scaling with that.

[01:00:40] Tanishq Abraham: So it would be very difficult to train this with, I guess enough resources or whatever. So that's the problem that we're trying to solve is what can we do to resolve the quadratic complexity of the transformer architecture that allows us to then train a diffusion transformer in pixel space.

[01:00:58] Tanishq Abraham: So that's what the hourglass diffusion transformer tries to address.

[01:01:02] Alex Volkov: Thank you for the brief introduction. For I will try to recap as a way I understand this. So folks who are not machine learning scientists in the audience would be able to follow along. But basically Gen AI, this whole wave of Gen AI has two, two big infrastructures so far, right?

[01:01:15] Alex Volkov: The diffusion, the stability AI and of the image models and video models. They're based on like diffusion or you said latent diffusion, correct? And then there's the LLM area with basically based on transformers. And we've seen a bunch of stuff going back and forth in tech, like in techniques between them, right?

[01:01:31] Alex Volkov: So Laura, I think is a thing that like many people in the diffusion area, like trained Laura's on different concepts. And then obviously like fine tuning with Laura's then became a thing and back and forth. We've seen like back and forth different approaches. I think you said like The open source area in LLMs in Transformers specifically has a bunch of like super cool tricks and optimization techniques and flash attention different things, right?

[01:01:54] Alex Volkov: There's a bunch of stuff that people developed in one area that wasn't necessarily applyable to to, to diffusion models. And so you guys set out to try and unify those two, or at least use some of the tricks and looks

[01:02:09] Alex Volkov: succeeded to an extent. Yeah. Go ahead please.

[01:02:12] Tanishq Abraham: yeah, I think it's, yeah, about Now that we have this transformer architecture, we can try to apply some of the tricks that people have been using, things like, rope embeddings there are other tricks like RMS norm, these are the sorts of tricks, for example, that are used in the Lama architecture these sort of similar architectural decisions and you could take those sorts of best practices and try to see if they help with diffusion now.

[01:02:33] Tanishq Abraham: So yeah, I think that's the idea. And like people were exploring yeah, that's like another interesting thing about our papers. Like people were exploring diffusion transformers, but they were using very kind of old architectures for diffusion transformers. And here we're trying to also apply all these tricks that we see.

[01:02:47] Tanishq Abraham: People are applying the LLM space and trying to apply that to to, to diffusion. Yeah, that was also an important part of our paper as well.

[01:02:54] Alex Volkov: And of course, you mentioned Rope, and I want to shout out a friend of the pod, Enrico, from News Research, Enrico. Wait, I don't actually remember if Anuka is part of News Research. Maybe, so he and News Research worked on the Rope paper together. And for folks who are interested in hearing about Rope, we had a deep dive during the summer, one of the coolest episodes.

[01:03:12] Alex Volkov: Most of it back then went above my head, but it was super cool going back there and saying, Hey, oh, I learned this. Rope is basically a way to extend context windows and do a bunch of other things for Transformer based large language models. And I wonder how does Ropen play here? And Enrico is part of the authors here on, on the paper.

[01:03:29] Alex Volkov: So he contributed at least part of that work, I assume. Enrico?

[01:03:34] Tanishq Abraham: Yeah. I think the rope stuff is like something that We even, we haven't like fully explored the full potential there, I think. But at least for what we were doing, we saw improvements in, in performance, just using rope over other sorts of, these sorts of position embeddings.

[01:03:50] Tanishq Abraham: But yeah, I think there's definitely potential for allowing the model to handle larger resolutions or do things like this because of the rope embeddings that we have in the model. Yeah it's, I think, also meant for future work.

[01:04:02] Alex Volkov: Incredible. You guys use all these techniques. You introduce, or I guess start formally announcing this concept of the fusion transformers, which is the mixture of these two things. And what are some of the results that you get? You've trained a few models to test.

[01:04:15] Alex Volkov: How do you even, measure that you're getting performance or you're just looking at algorithms or you're actually generating images. Can you talk us through the process of validating this like theories and papers?

[01:04:26] Tanishq Abraham: Yeah, but I just want to yeah, I guess to take a step back to clarify we didn't necessarily invent the concept of diffusion transformers. That is something that people have already developed but the idea that we focus here is the problem is in the past, diffusion, Transformers were done with the latent space because of this quadratic complexity.

[01:04:45] Tanishq Abraham: So we basically have a different type of transformer architecture, which is this hourglass transformer that enables for Like O of N scaling, so like a linear complexity. So it, it will scale with the number of pixels much better than it won't blow up like, like you, you have with with the attention quadratic complexity.

[01:05:07] Tanishq Abraham: So that was the main trick that we're using. So we have some tricks in there. That allow it to have that property. And that's what enables us to do it on the pixel space, as opposed to the latent space that the previous diffusion transformers were doing. And then on top of that, we are adding all these additional transformer tricks, which no one had tried out before with diffusion transformers.

[01:05:27] Tanishq Abraham: So those are the main sort of contributions of this paper in terms of in terms of, and yeah, I guess one thing, the, yeah, the other thing worth mentioning is that the way that this architecture is able to do this is partly because it's, it the architecture is a very hierarchical architecture.

[01:05:45] Tanishq Abraham: So it's actually able to process at different image resolutions. And for example at the high resolutions, we use a sort of the, this sort of local attention, which is what is. Having this linear scaling, but then at the low resolutions, we were able to do the regular attention.

[01:06:01] Tanishq Abraham: Yeah, there's also this hierarchical processing of the image resolution. That's also, I think, an important point, which enables also for higher fidelity as for generation. And yeah, in terms of testing the

[01:06:13] Alex Volkov: Yeah. And so the next question is how do you actually like test the architecture? How do you validate these like approaches that you tried actually better than what the field has previously been at?

[01:06:26] Tanishq Abraham: Yeah. We looked at two datasets. One, we did ImageNet generation. So can conditional, class conditional ImageNet generation. So that is, passing in an ImageNet class, you generate images of that class. So if you pass in a zebra [01:06:40] class, you're generating zebras, or you're in some sort of dog class, you generate the dogs.

[01:06:43] Tanishq Abraham: That's, we train a model for that. We train it at a resolution of 256 by 256 and that, that's one of the experiments where we compare to other architectures. And so we we're, the interesting thing is that, of course, we're comparing to other architectures that are using, for example Latent Diffusion, that they're, using the latent space there the architecture is functioning on the latent space and not on the pixel space, but we have our architecture that's functioning on the pixel space and using this hourglass transformer and it's getting better results than with the with the latent space.

[01:07:19] Tanishq Abraham: We're beating, for example, the previous Diffusion Transformer model which was using the latent space. And then another interesting data set that we used was the FFHQ. Data set which is this sort of data set of high yeah like high resolution faces and so this is at this is at a 1024 by 1024 resolution and so this is like you know very difficult to be able to train especially in a pixel space you know at Scale of 1024 by 1024.

[01:07:47] Tanishq Abraham: And actually there are not many other diffusion models that are trained on this model. There are a bunch of GAN models, for example, but not really many diffusion models. There's like only one or two that we actually found in the literature because it is, it can be a bit difficult because of this, because of the.

[01:08:01] Tanishq Abraham: The pixel scale or the, the resolution of the images, but yeah we were managed to train a model with our architecture. It can, it trains quite fast. And yeah we are able to we're basically like, I guess at this point now we would be the best diffusion model for that for that data set.

[01:08:18] Tanishq Abraham: And we are measuring with FID. But of course, like FID, as a metric also has its problems it does have some bias towards like towards GANs and so GANs tend to have a lower FID kind of in terms of the bias of the FID. So like when we look at it qualitatively, honestly, we think like it's quite comparable to the GANs, might be better than the GANs, honestly.

[01:08:41] Tanishq Abraham: So we may do more evaluations and study that further. But honestly, this may be like. One of the state of the art models for this FFHQ dataset but it's a bit hard when you're using as a metric, but that's of course the problem with, everyone's using that metric in the literature, but yes, but yeah, I think that, again, that's another really interesting result that we observed.

[01:09:01] Tanishq Abraham: And then, of course, we do

[01:09:02] Alex Volkov: I want to follow up with a question here real quick. For folks like, hard for them to follow like much of this, but they've used something like Stable

[01:09:09] Tanishq Abraham: oh, sorry.

[01:09:10] Alex Volkov: No, that's all great. This is all recorded. Folks can like pause and go to, and go research and come back and listen to you.

[01:09:15] Alex Volkov: This is great. Like you did the deep dive. I really appreciate it. I just want to bring this back a little bit upwards towards like

[01:09:21] Unkown: Sure.

[01:09:22] Effects on the industry from Hourglass Diffusion Transformers

[01:09:22] Alex Volkov: affect the industry, given that we have stuff like Stable Diffusion out, and that keeps getting better, Mid Journey is getting like reality adjacent to the point where like it's really hard to distinguish, there's like different upscalers that take the outputs and then run some upscaling how does this affect the industry to, in your mind?

[01:09:40] Alex Volkov: Will this Accelerate some stuff. Will this be applied to different areas that like the fusion models have not been traditionally in? What is the kind of the, let's say this is a building block that you've created. How does this affect us in three, six months?

[01:09:54] Tanishq Abraham: Yeah, I think this is just a kind of a new unique direction to explore. Of course, I think latent diffusion is still a very interesting, invaluable direction, but this is just it's always good to have different directions to explore. And I think And honestly, like this architecture can be applied to latent diffusion as well, and maybe we get even better results, for example, we can do maybe like, multi megapixel level synthesis by combining, this method with latent diffusion or something like this as well.

[01:10:23] Tanishq Abraham: So it's not even like it's. Limited to just the pixel space. That's what we're showing that, that's something that is interesting about this. But again, it can also be applied to agent diffusion and can even, of course, these models could be scaled up further. There's a whole lot of future work to explore here, I think.

[01:10:39] Tanishq Abraham: And yeah, I think and of course it's computationally efficient. And yeah, I think the nice thing is yeah, moving towards the transformer architecture when, it's, people understand the transformer architecture at this point. I think, there's people understand how to scale it and different tricks.

[01:10:55] Tanishq Abraham: And it's, I think this is a good, by introducing this architecture, this is a good way for. As to try to bring some of those advances in transformers into the diffusion model field as well. So I think that's the other interesting aspect of this.

[01:11:12] Alex Volkov: for me reading this is not a machine learning scientist. Reading this was like the highlight of interesting things were like The open source community moves in, in, in different areas, but also like bringing over some of the learnings, bringing over some of the talent the tooling around, like making things available.

[01:11:28] Alex Volkov: And I think that's like very exciting. We also have Alex Birch, is that correct? The name also in the audience. So shout out Alex. And then what else do we not cover this stage? What is the last thing that you want to say? Or maybe shout out some of the co authors feel free, the stage is yours.

[01:11:44] Tanishq Abraham: Yeah, I'm just looking at some comments that I, Alex also has some comments that he said. So he thinks, for example, that with this model, that there's potential to. Achieve more realistic textures than even mid journey. So I think, we have observed, like with the model, like the, because that's the thing about using, when you're using a latent diffusion where, it's not, you're not doing, when you're not doing it at the pixel level, it's a bit.

[01:12:07] Tanishq Abraham: Difficult to get those get those, textures accurately, but if you're doing it the pixel level I think you're able to get those textures yeah, it can do that much better. And we've observed that with the models that we've been training. And yeah, I definitely agree with Alex there.

[01:12:22] Tanishq Abraham: Yeah, I think also like it may have potential to achieve like really realistic textures and that, that's something. That I guess we could look forward to hopefully. Yeah.

[01:12:31] Alex Volkov: that's incredible cause I think the realism comes from the imperfections, especially like textures and skin, et cetera. And like diffusion models have, at least for many folks are easier identifiable by the kind of the smoothness of edges and different things. So definitely like more more textures are there for humans in real pictures.

[01:12:50] Alex Volkov: And then we're looking forward to more of that in diffusion models. That's incredible. So definitely, thank you for breaking this down for us, Dinesh. Thank you and Catherine and Alex and everybody else in Rico who worked on this. I think we have some questions from folks on stage here. Vic, go ahead, please.

[01:13:05] Vik Hyatk: Yeah, another question.

[01:13:06] Vik Hyatk: I just wanted to see I played around with the repository a bit. It's a great way for anyone interested in getting into diffusion models to get started. It's not your typical research code base. It's super clean.

[01:13:19] Vik Hyatk: You're not going to run into a bunch of dependency issues and whatnot.

[01:13:22] Vik Hyatk: So that

[01:13:23] Vik Hyatk: was amazing. It's also super compute efficient, so you don't need a ton of compute. To start to see good results. I'd strongly recommend checking it out if anyone was feeling intimidated

[01:13:32] Vik Hyatk: before,

[01:13:32] Vik Hyatk: don't be.

[01:13:34] Alex Volkov: Incredible.

[01:13:35] Tanishq Abraham: Yeah. That, that, that comes down to Kat's again, Kat's genius. I think this is a code base that she's been working on for quite some time and I also really enjoy working with it.

[01:13:42] Tanishq Abraham: It's like one of my favorite diffusion model code bases. So I definitely agree that anyone who's interested in playing around with diffusion models should check it out.

[01:13:49] Alex Volkov: So that, that's on Cat's GitHub. We're going to add this in shell notes called KDiffusion, correct? It's now

[01:13:55] Alex Volkov: part of that existing code base, but now like this, the Hourglass Diffusion Transformer. Get used to say Diffusion Transformers from now on, folks. Hourglass Diffusion Transformers, HDITs, are now a thing.

[01:14:06] Alex Volkov: And Tanish, thank you so much. And Alex for joining in from the comment area. And thank you for working on this work. Hopefully this will get the recognition it deserves and definitely as a foundational block to get us Higher performance, lower, hardware requirement models that look way better.

[01:14:22] Alex Volkov: Incredible.

[01:14:23] Open source models in medical fields

[01:14:23] Alex Volkov: Tanishq I wanted to follow up with you, because MedArk is something that you're now CEO of medical things, and then you had a tweet today that I really wanted to talk to you about specifically because Quyen was involved, and we have like folks from Quyen, usually like friends of the path as well, they join us could you,

[01:14:37] Alex Volkov: let's talk through this please, let's talk through How open source is catching up to medical space.

[01:14:42] Alex Volkov: Could you briefly summarize what we've talked, recent work from you guys?

[01:14:46] Nisten Tahiraj: Yeah. Sure. Yeah. I've been

[01:14:48] Tanishq Abraham: quite busy with all kinds of different research projects. So that was another. Ongoing research project that we're working on at MedArc and that I'm shared some progress of that today morning. So basically, at MedArc, we're of course interested in [01:15:00] developing open source medical language models.

[01:15:03] Tanishq Abraham: So that, that's something that we're heavily interested in. And of course, in order to be able to do we wanted to understand what The current capabilities of these language models look like the open source language models and no one had done like a very proper analysis of this as far as I could tell and yeah, basically we, what we did is we added this suite of tasks known as the Multimed QA.

[01:15:24] Tanishq Abraham: Sweet of tasks. So this is a kind of a bunch of tasks, a total of nine tasks that were they came from different other papers and stuff, but Google put them together as this is their sort of evaluation bench, this is the evaluation benchmark that This is what Google was using to evaluate their MedPAL models and, whatever models they had.

[01:15:44] Tanishq Abraham: And then, the community, the medical AI community been using that. It's been used to evaluate GPT 4

[01:15:49] Unkown: and all kinds of

[01:15:50] Tanishq Abraham: other models as well. And yeah, I, we, at MedArf, we added it to the LM eval harness. So that's like the common sort of for open source language models.

[01:15:59] Tanishq Abraham: Everyone I think uses LM eval harness to evaluate the models on various tasks. So now it's in there. So people can easily also evaluate their, whatever the models they have on these medical tasks. And so once we added it into LM eval harness, we just wanted to just. Do a comprehensive like analysis of a whole bunch of models in the open source space, just to see like these sorts of generalist models.

[01:16:21] Tanishq Abraham: Like they're not necessarily particularly trained on medical data. Of course they've probably seen some in, in, in their pre training or whatever, but that's not their main purpose and that's not their main focus in their pre training. And I'm, I was just curious what their performance would look like and, how it compares to other models like GPT 4.

[01:16:36] Tanishq Abraham: GPT 4 is also a generalist. It's a generalist language model as well. It's not also necessarily trained on medical, but, it's really good at that. In fact Prompt Engineer GPT 4 is state of the art on this benchmark, actually.

[01:16:48] Alex Volkov: I remember this. I remember where Google came up with a specific medical device and then GPT 4 just like basically with prompt engineering on that benchmark became the top one, right? This was quite incredible that the most generic

[01:17:00] Alex Volkov: model we have. Yeah,

[01:17:02] Tanishq Abraham: that's the it's called MedPrompt. That's the state of the art, this prompt engineering, prompt engineered GPT 4, it's called MedPrompt. And so they do a whole bunch of tricks like, dynamic few shot and GPT 4 written chain of thought and all kinds of tricks that they throw at GPT 4 and they got state of the art.

[01:17:18] Tanishq Abraham: And then of course they use the same tricks to then, later claimed that GPT 4 is better than Gemini as well. It's not just for medicine that you can use it. They use it for just general prompt engineering as well. But yeah, anyway so yeah, this is, so overall the point is I wanted to evaluate how these models do in the how the open source models do in this, on this benchmark.

[01:17:38] Tanishq Abraham: And so I evaluated a whole bunch of models. I evaluated Lama, Mistral, Mixtral. I evaluated the Yi series of models. I evaluated Quinn. Yeah, so I evaluated a whole bunch of models here and basically what I found out is first of all, Lama 2 is not that great compared to all these other models, actually, and it's, It's interesting because in the literature people are still fine tuning Lama 2 for medical purposes but, it actually doesn't have a very good base capability of for medical knowledge.

[01:18:09] Tanishq Abraham: So Lama 2 is not very good at medical stuff, but the models that are quite good are basically the Yi series of models, so Yi 34b is really good, as well as the Quen series of models. So Quen 72b is The state of the art open source model and it's, and this is not like doing any sort of prompt engineering or anything like this.

[01:18:28] Tanishq Abraham: This is just like five shot prompting and it's beating MedPalm version 1. So MedPalm version 1 was released in November of 2022 and that was like the first sort of yeah, that was Google's model that they had. And this Quenz72B is beating MedPom1 without any sort of prompt engineering or any of these tricks.

[01:18:50] Tanishq Abraham: And yeah, I think that's really, honestly, quite impressive because

[01:18:54] Alex Volkov: Yes.

[01:18:55] Alex Volkov: I want to shout out Jun Yang or Justin Lin a friend of the pod, the head of technical, working on Quen for such like incredible achievement. And thank you for testing this. Because we and Nistan, like you worked on AI in medicine as well. Like we've waiting, this is going to happen.

[01:19:11] Alex Volkov: Want it or not, there's like several doomers that say, Hey, never trust an AI doctor, but, many people already go to JGPT to, maybe get a second opinion. And Google has obviously been working on this MetPalm and MetPalm2.

[01:19:22] Alex Volkov: I think for many people it's going to be easier to digest this idea if the model that talks to them is like fully runs on their computer, open source, no internet, like no data sharing.

[01:19:33] Alex Volkov: I think that's a very important piece of this as well. And it's great to see that, we're now getting like some cool comparison, but definitely open source is coming strong on this one.

[01:19:42] Unkown: Yeah.

[01:19:43] Nisten Tahiraj: Yeah. I had the same thing as, Astonish with the Lama models, you can train them on good medical data, but they don't have a, they don't perform great at the base. I'll tell you, it's still, GPT 4 is king when it comes to it. And the product I worked on last year in March, it's still going, Dr.

[01:20:04] Nisten Tahiraj: Gupta. ai is, it is still going. It's just a very well prompted, engineered product. Doctor with with a good RAG system too, that was one of the first, but I will say the thing about the main concern now, and why I think open source will basically completely dominate medical AI, is that their main concern is If they're dependent on some kind of API endpoint that makes the hospital and people's medical data really vulnerable to malware and foreign intelligence groups, which have been wrecking havoc with with medical data and ransomware.

[01:20:42] Nisten Tahiraj: So that's their main concern. And the only way we're going to solve that is by having models that they run locally. So I'm really glad Tanishq actually took the task on. Benchmarking some of these, because you have the entire medical safety field and all the funding and all the people and I have yet to meet an AI safety person that even knows how to rename a file in Linux, let alone actually write some kind of benchmark.

[01:21:07] Nisten Tahiraj: So I'm glad someone's actually taken on the challenge of making open medical yeah, medical LM benchmarks.

[01:21:19] Tanishq Abraham: Yeah, I completely agree in terms of yeah, I definitely think open source is definitely the feature for medical AI and medical LLMs. And I think hospitals and doctors will be more comfortable when they know they have access to the model and this is the model that they're using rather than when it's behind some API also where not only like in the case of like malware or things like this, but open eye.

[01:21:40] Tanishq Abraham: AI will just change the model or something like this too, or, these are all concerns that we see this already happening with the models that OpenAI has. And, these are all like concerns that, there needs to be complete transparency when working with with these kind of more crucial applications.

[01:21:55] Tanishq Abraham: And, by doing all this open source I think that that provides that transparency that doctors and hospitals and healthcare systems will be comfortable with that. That's why I'm really excited about working in this area. And I think there's really a lot of potential here.

[01:22:09] Alex Volkov: incredible. Thank you for this work, Dinesh. Thank you for bringing us kind of the idea of which of the models. Surprisingly, Quen. Like I wouldn't assume if you gave me all the models that we've talked about I wouldn't assume that Quen was the most performing, but hey, we'll take what we can get.

[01:22:22] Alex Volkov: Quen72b, the best open source doctor, folks. You hear, you heard it here based on this research.

[01:22:30] Tanishq Abraham: Yeah. Thank you for letting me share all this work.

[01:22:32] Alex Volkov: That's incredible. And as a friend behind the scenes, but now friend of the path, you're always welcome. Thank you for the deep dive on the Hourglass Diffusion Transformers. Thank you for the authors as well. Alex, like still, I think is in the audience and Catherine and Rico and some other folks, and definitely for MedArk, keep us up to date.

[01:22:48] Alex Volkov: We'll keep reporting and the stage is yours whenever you want it. I think folks we're moving forward. I think Nissan, unless you have, or sorry, Tanish, you have the one last thing you want to

[01:22:57] Tanishq Abraham: I would just say please follow first of all, follow all of our Hourglass Diffusion authors. They all deserve your support and also please follow MedArk as well.

[01:23:06] Alex Volkov: 100 percent worth following and definitely will be in the show notes for folks who are listening to this while driving and cannot like click that follow button. I think we're moving to as we're in the hour and a half into the space, let me reset [01:23:20] this a little bit for folks. If you just recently joined us, you're listening to ThursdAI where we talk about everything.

[01:23:26] Alex Volkov: And everything incredible and interesting in the world of AI and open source, LLMs, big companies we cover. And we also had a deep dive today about vision video. My name is Alex Volkov. I'm with Weights Biases. I'm an AI evangelist. And yeah, we're here every week and we keep up to date. So you don't have to, so if you were out of Twitter or if you don't even participate in Twitter and you're just listening to this on the podcast, we got you we're going to cover everything that's most important and then send you this, so definitely check out.

[01:23:52] Alex Volkov: There's the i. news for that. And I think we're moving towards the big companies area, which we haven't touched. We briefly covered in the breaking news where Hug Face just announced a partnership with Google. So you'd be able to very easily run the models from Hug Face on TPUs and the Thingisneyosa GPUs, which is incredible because Google has those, but they don't even give them away.

[01:24:15] Alex Volkov: I think they're all reserved for collab or something. But also. Everything that I have today in the big company LLMs and APIs and everything is from Google.

[01:24:25] Google teases LUMIERE, SOTA video generation models

[01:24:25] Alex Volkov: So the next thing that we're going to talk about is Lumiere. And I don't know if you guys saw the video, but I definitely saw the video. I think Pharrell, you sent this in our group chat first, but by that time there was already spreading around.

[01:24:37] Alex Volkov: . So there's obviously the whole area that we've talked about. Sable Diffusion Video releases like very short videos image to video and text to video. And then there's the front runners in the closed source, which is Runway and Pika. And there's like another one Firework. Oh, Leonardo is doing some incredible things.

[01:24:54] Alex Volkov: All of them have very short videos and the consistency between the frames is not like incredible. And Lumiere. Has shown a video and this, like for all, sorry, you're saying this could be like very cherry picked, et cetera. But it feels like this is like another step in this direction that's significant.

[01:25:13] Alex Volkov: And for folks who are not like watch the video yet, definitely worth watching. I'm going to add this it's already on the top of the space, but basically you see they announced a bunch of stuff that Lumiere can do besides just generation. So video in painting is one that they've announced.

[01:25:28] Alex Volkov: They announced like a text to video text to video, image to video in painting. And they have something like they say, realistic, diverse, and coherent motion specifically around the motion of kind of the characters, which has been lacking in all these like video synthesis. I will say it's.

[01:25:44] Alex Volkov: It's pretty remarkable to even discuss that oh, this vision text to video image is not as good as that one. It's really incredible that we're, like, at this point where we can say, a highbrow, Oh, yeah, I prefer this output. We're, like, we're typing text and getting a video back.

[01:25:59] Alex Volkov: It's ridiculous on the surface of even saying this to us. Like a year and a half ago that this would even be possible. But with that said, we're moving forward. We're like hedonistic adaptation is a thing. We're getting used to these tools and we're getting them like day to day. And then we're like, okay, yeah, this tool is better.

[01:26:15] Alex Volkov: They said the existing video malware synthesized distant keyframes, followed by temporal super resolution and then that's probably it makes it temporal consistency difficult to achieve. Temporal consistency basically says where like characters throughout the video, what they do.

[01:26:30] Alex Volkov: And so you've all seen these videos where like the face changes from frame to frame, et cetera. And this. This series of videos from New Year looks very consistent, like spatially and temporally. Like definitely where the characters are in the video, but also like throughout time. And they Attribute this to different methods that they've used I will not go into this, but I think the tasks are very interesting.

[01:26:53] Alex Volkov: They have video editing applications image to video and painting and stylized generation. Something I also liked. You'd be able to take like an image and then generate videos based on that style, not necessarily that image. So very impressive from folks from Google, as always from Google.

[01:27:08] Alex Volkov: I haven't played with this. I don't think there's a way for us to play with this yet. So there's a paper maybe some of the ideas in the paper could be reproduced in open source. But it's like a model show in the paper from folks quite a lot of folks, Omar Bartal, Hila, Omar, Charles Herman, and there's like a bunch of folks there on the paper.

[01:27:25] Alex Volkov: Very like visually appealing demo as well. So definitely we'll add this video in the show notes. And I think we have. One more thing here in Diffusion stuff. Yes, the one last thing that I wanted to talk about is Instant ID. Where so we moved off from Lumiere, Lumiere is like super, super cool, but we haven't seen this work.

[01:27:43] Alex Volkov: Hopefully the releases as Google has a back, they have an example of like when they released stuff, like Dreambooth was released and everybody was using this. And. I think that's pretty much it in the big companies and open source.

[01:27:55] InstandID - 0 Shot face transfer diffusion models

[01:27:55] Alex Volkov: The other thing that I wanted to mention is instant ID. We've mentioned this briefly before, but it's been pretty much everywhere on my timeline. If you haven't played with this, I very strongly encourage you to play with this. Because instant ID is a technique to transfer to create diffusion models with your face.

[01:28:11] Alex Volkov: And we've all probably tried this at once. With, like I said, like a dream booth from. Nathaniel Ruiz, who's a dear friend of the pod, who's been here a couple of times. There's like other techniques also to transfer your face into a latent diffusion model. And they all used to take multiple images of your face and some amount of training.

[01:28:32] Alex Volkov: And Instant ID is basically a technique that you can try right now, super quick. With zero shot, one image. You can generate images with your face, or with your kid's face, or whatever. And literally I just want to highlight how impressively fast we're moving towards these type of tools. This used to take fine tuning.

[01:28:52] Alex Volkov: This used to take GPU and knowledge, and there's, like Kokhya, and like this used to take Loras and before Loras, Dreambooths. So actually there's a couple of companies that I know that built on top of providing the fine tuning experience around this, where you upload images, you get like this huge, like four gigabit, like stable diffusion file specifically trained on you as a concept.

[01:29:13] Alex Volkov: And now there's like a zero shot transfer thing called Instant ID. Where a hug and face demo is included here. I will attach this now soon. Where you just upload one image of yourself. Literally for me and Nishtha and Tanishq, for the non on, Umesh, for the non anons here on stage, we'd be able to use our profile picture here and just generate us with a cowboy hat in, in noir style and it will look like us.

[01:29:36] Alex Volkov: For most of the time. I've tested this Instant ID on my kids. And, I'm not going to post this because of privacy. But my kid loved it incredibly so much. He was a superman. It looked like him. It's unbelievable. That it was, like, able to transfer this with one image. It's quite incredible how fast we moved here.

[01:29:52] Alex Volkov: Definitely, if you haven't tried Instant ID but you have tried avatars before, Try Instant ID, you'll be blown away. It runs on your Mac as well, not that great, but it runs through Pinocchio computer. Definitely worth noticing how fast we're moving in this generation. And shout out to whoever built this.

[01:30:08] Alex Volkov: And there's quite a few technologies like this now. Highlighting how fast we're moving, and I think that's pretty much it.

[01:30:15] Voice and Audio - New tech challenges Whisper

[01:30:15] Alex Volkov: So we've covered our diffusion. We've covered yeah, let's move to voice and audio Nistan, you brought us this new, so I definitely want you to pull up the tweet and let's talk about the faster encoder ASR.

[01:30:25] Alex Volkov: And then we can also, while maybe you pull this up, I will say that this week I've 11Labs announced like a big funding rise, but 11Labs also released their dubbing studio. And if you followed Twitter at all, not even the I Twitter for the past like week and a half, two weeks, you maybe have seen the dubbed video of the Argentinian prime minister, or I don't know if he's a prime minister or president, probably president, right?

[01:30:55] Alex Volkov: Yes, president. Millet something he went to the World Economic Forum and gave a speech in Spanish. And then there was a dubbed version, as like these meetings of global summits of leaders, et cetera, they have. Instant translation in their ear to any language, and that's a human that knows both languages.

[01:31:14] Alex Volkov: And then, somebody said, hey, okay, this is one example, and they posted a Heijan. If you remember Heijan, we've talked about Heijan, quite incredibly translation, dubbing, and leap modulation service, where you can upload yourself and get an instant avatar. Somebody used Heijan on the whole speech.

[01:31:29] Alex Volkov: And that went ridiculously viral. I think there was like 50 million views on it, on X. And that was like mostly a combination of [01:31:40] Millet being like very viral in his opinions, being like, stoking some controversy. But also because you literally hear the person. Speak in English with a Spanish accent where this didn't happen, like literally he spoke in Spanish.

[01:31:52] Alex Volkov: Quite incredible technology and people have been shocked and said, Oh my God, this is coming for all of us in DeepFakes. Fine, we've talked about this multiple times. So Eleven Labs now has a, like a alternative to this, called Eleven Labs Dubbing Studio. And I've actually used this on a piece of Like on a trailer for ThursdAI, of me speaking in English, and they asked to dub me in Russian, the language that I do speak, and my mother tongue from Ukraine, and it sounded ridiculously cool.

[01:32:18] Alex Volkov: Here's a quick snippet of me from a Thursday I show with you three weeks ago that I dubbed into Russian for your entertainment.

[01:32:28] Gadget for children, for parents who have children who do not want to buy iPhones. Because then Instagram will destroy their brains. This is the perfect device for this.

[01:32:36] It looks like a language. In fact, you can talk to a rabbit, it is very cute, there is one simple interface, this is a voice.

[01:32:43] Alex Volkov: It sounded like, so far, How should I say, these models that emulate voice did not work on me. Specifically, my accent is not that great, but because my accent is probably Russian, the Russian version of me sounded really close to me.

[01:32:54] Alex Volkov: For the first time, I was like, Oh, okay. All right. And Eleven Labzner released this dubbing studio and hopefully these models are now coming to open source.

[01:33:04] AI deepfake of Biden caused controversy on mass media about AI

[01:33:04] Alex Volkov: Because there's also a thing where I think there's a recording of Biden saying something like stay home going around and everybody in the media making the big fuss about, Oh my God, AI is coming for all of us.

[01:33:15] Alex Volkov: And there's a big cry for folks to say, we should build tools to detect against this, et cetera. And my stance remains the same. Listen, I think we've talked about this multiple times. The only way through these woods is for everybody to know that their voice is very easily be fakable with three seconds or 10 seconds of their voice.

[01:33:31] Alex Volkov: It's time for the it's time for humanity to adapt to the situation where there's no panacea here. You should just know that just trusting voice blindly without knowing the source just don't do that because it might as well be fake. I don't know if you want to add anything.

[01:33:44] Alex Volkov: Yeah, go ahead.

[01:33:45] Nisten Tahiraj: really quick, I want to say, we already have laws to deal with this. More law is not necessarily going to fix the issue because, fraud is illegal in a free market. And if you want. Or at least people that are more in politics and stuff. If you want to solve the issue, do the job you already have.

[01:34:05] Nisten Tahiraj: You already have a list of spam callers, which you have been identified without an AI. And can you shut them down? So People love to imagine problems and love to think of doom or whatever in the future and then they completely ignore the stuff in front of them. All of us do this, but yeah, again, fraud is illegal.

[01:34:27] Nisten Tahiraj: Can you shut it down as a job, as a government? You don't need a new law, you don't need to be make speeches about AI. You need, just need to shut down fraud when it's identified. Otherwise, all of these tools and conferences and stuff are pointless.

[01:34:42] Alex Volkov: As predicted.

[01:34:43] Nisten Tahiraj: that's what I'm gonna

[01:34:44] Alex Volkov: Yeah, no, that's great. As predicted, the first. Election related deepfake type thing. The media was all over this and the doomers were like, here we go. And people were like it came sooner than we thought. And no, we've literally been talking about this for the past year.

[01:34:57] Alex Volkov: That like elections are coming. These things are going to happen. The technology was there even before. Now it's just like a little bit more accessible. The laws are in place, make it more difficult for grandmas to get spam calls, not make it difficult for the open source stuff. So hopefully like the more prevalent these technologies are, this is my stance, the better the chance that, people will just get used to this being everywhere.

[01:35:19] Alex Volkov: And definitely for folks of us who have our audio out there, we're doomed, right? So come up, like my usual suggestion here is come up with your loved ones with a key phrase that only you to know like. The Terminator scene with the dog come up with this and make sure that if you get a call in 3 a.

[01:35:34] Alex Volkov: m. at night, it sounds like a bad quality version of you, of your relative from somewhere, from an unknown phone. Make sure it's them by asking like, Hey, remember we went to Hawaii and you never went to Hawaii? And they say, Oh yeah, of course. But also you can probably most of those will be LLMs, so you can probably like.

[01:35:53] Alex Volkov: Don't prompt trick them, the spammy LLM calls that sound like you're a relative.

[01:35:57] W2V BERT ASR gets whisper quality with significantly less parameters

[01:35:57] Alex Volkov: Alright, moving for unless, listen, you want to add some stuff about this W2V BERT speech encoder? I've added it to the top of the space.

[01:36:07] Nisten Tahiraj: Yeah, just really quickly, I'm gonna do the paper reading on it 'cause

[01:36:10] Alex Volkov: Oh, hell yeah!

[01:36:11] Nisten Tahiraj: It's a pretty nice paper, so stay tuned from that at some point when we announce it and it's from MITs and and Google and some people from Google. So it's a, another really nice encoder only it has potentially seems to be up to 30 times faster.

[01:36:29] Nisten Tahiraj: So this could

[01:36:30] Alex Volkov: then whisper,

[01:36:31] Nisten Tahiraj: quite useful. It could be quite useful for those making assistance that run on local devices or on low resource devices. But also, For stuff on the web. Now it is officially supported by the Transformers library. We'll wait on Zenova to I think probably it's going to be available via WebGPU and stuff, I'm guessing.

[01:36:55] Nisten Tahiraj: Yeah it's very, it's nice to see that that field also going forward. Because we already have excellent speech recognition. We know it works really well. We just needed to work on more low power devices and mobile and

[01:37:08] Alex Volkov: Absolutely. And looks like some stats here. A bunch of languages are more than the Stan Whisperer, 143 languages. And you can like fine tune this on specific languages as well to make it like better. And VB benchmarked it on Mongolian, and beat Whisperer in less than 1200 steps. So smaller model, like fine tunable, super, super cool, and the best part of it is MIT license.

[01:37:29] Alex Volkov: So there have been other ASRs. They're not in this license. And now we're getting like a state of the art tiny model in this license. I think that's most of the stuff that I wanted to cover.

[01:37:39] NSF announces a new initiative called NAIRR

[01:37:39] Alex Volkov: No, I wanted to cover one last thing. One last thing. National Artificial Intelligence Research Resource. N A I R R.

[01:37:47] Alex Volkov: Which is coming to us from National Science Foundation, United States National Science Foundation collaborating with agencies and different so All of these incredible three letter agencies are collaborations in this foundation now. NSF is the science foundation, both DARPA and NASA, and NIST, which is the Institute of Standards and Technology, and DOD and DOE, and, like, all these things.

[01:38:11] Alex Volkov: But also, the private sector is joining this companies like Entropic and OpenAI. And Palantir, and Google, and Luther, and HugInFace, and Weights Biases. Obviously, I saw this oh, that's cool. We're, like, Weights Biases are participating in this incredible effort. Are all joining together in this initiative to, to promote, support AI research and advancing like safe and secure and trustworthy AI.

[01:38:33] Alex Volkov: And it's also great to see like folks like Hug Face here and Meta as well is represented folks who push open source as well, because, these government affiliations, government organizations, they have to have folks who promote open source as well. And they've organized them to. Four focus areas open enable AI research to access into diverse AI resources via the NAIRR pilot portal.

[01:38:56] Alex Volkov: So definitely expect there to be government grants for GPUs for different things, I don't know how easily those will be obtainable, but we had some folks in Canada from Canada before talked about you could ask for grants. to train or fine tune like the stuff that Tanish was talking about research which open source is better medical in QA could be happening through the government they also focus on security and And I think something called NARR classroom, which I have no idea.

[01:39:22] Alex Volkov: Oh, which new communities for education, training and user support. Like very government like approached. However, this is definitely like good to see the companies that participate in this. It's not only government, it's also open, like a private sector as well. NVIDIA is there, AMD is there, Eleuther, like we said, open source folks are represented as well.

[01:39:43] Alex Volkov: A huge kind of chunk of companies, it's good to see that the government is like actually moving towards some standardization which may be needed hopefully less regulation, more standardization. And I think with that, we are pretty much all over the news that we had for [01:40:00] this week. Which was great.

[01:40:01] Alex Volkov: I want to say thank you. A huge thank you again for, first of all, the listeners who come here and listen, and the folks on stage who help me from week to bring you the latest and greatest in the iNews.

[01:40:11] Alex Volkov: Thank you so much, and we'll let you go on this Thursday, and we'll see you next week.

[01:40:14] Alex Volkov: Take care, everyone. Bye bye.

ThursdAI - Recaps of the most high signal AI weekly spaces
ThursdAI - The top AI news from the past week
Every ThursdAI, Alex Volkov hosts a panel of experts, ai engineers, data scientists and prompt spellcasters on twitter spaces, as we discuss everything major and important that happened in the world of AI for the past week.
Topics include LLMs, Open source, New capabilities, OpenAI, competitors in AI space, new LLM models, AI art and diffusion aspects and much more.
Listen on
Substack App
Apple Podcasts
Pocket Casts
RSS Feed
Appears in episode
Alex Volkov
Eric Wollberg
Wes Louis