Welcome Friends, to the first episode of ThursdAI recap.
Every week since the day GPT-4 released, we've been meeting in Twitter Spaces to talk about AI developments, and it slowly but surely created a community that's thirsty to learn, connect and discuss information.
Overwhelmed by daily newsletters about tools, folks wanted someone else to do the legwork, prioritize and condense the most important information about what is shaping the future of AI, today!
Hosted by AI consultant Alex Volkov (available for hire), CEO of Targum.video, this information-packed edition covered groundbreaking new releases like GPT-4.5 (a.k.a. Code Interpreter), Claude 2, and SDXL 1.0. We learned how Code Interpreter is pushing boundaries in computer vision, creative writing, and software development. Expert guests dove into the implications of Elon Musk's new xAI startup, the debate around Twitter's data, and pioneering techniques in prompt engineering. If you want to stay on top of the innovations shaping our AI-powered tomorrow, join Alex and the ThursdAI community.
Since the audio was recorded from a Twitter Space, it has quite a lot of overlaps (I think it's due to the export), so sometimes it sounds like folks are talking over each other, most of all me (Alex). This was not the case; I will have to figure out a fix.
Topics we covered in the July 13 ThursdAI:
GPT 4.5/Code Interpreter:
00:02:37 - 05:55 - General availability of ChatGPT with Code Interpreter announced. 8K context window, faster than GPT-4.
05:56 - 08:36 - Code interpreter use cases, uploading files, executing code, skills and techniques.
08:36 - 10:11 - Uploading large files, executing code, downloading files.
Claude V2:
20:11 - 21:25 - Anthropic releases Claude V2, considered #2 after OpenAI.
21:25 - 23:31 - Claude V2 UI allows uploading files, refreshed UI.
23:31 - 24:30 - Claude V2 product experience beats GPT-3.5.
24:31 - 27:25 - Claude V2 fine-tuned on code, 100k context window, trained on longer outputs.
27:26 - 30:16 - Claude V2 good at comparing essays, creative writing.
30:17 - 32:57 - Claude V2 allows multiple file uploads to context window.
32:57 - 39:10 - Claude V2 better at languages than GPT-4.
39:10 - 40:30 - Claude V2 allows multiple file uploads to context window.
xAI:
46:22 - 49:29 - Elon Musk announces xAI to compete with OpenAI. Has access to Twitter data.
49:30 - 51:26 - Discussion on whether Twitter data is useful for training.
51:27 - 52:45 - Twitter data can be transformed into other forms.
52:45 - 58:32 - Twitter spaces could provide useful training data.
58:33 - 59:26 - Speculation on whether xAI will open source their models.
59:26 - 61:54 - Twitter data has some advantages over other social media data.
Stable Diffusion:
89:41 - 91:17 - Stability AI releases SDXL 1.0 in Discord, plans to open source it.
91:17 - 92:08 - Stability AI releases Stable Doodle.
GPT Prompt Engineering:
61:54 - 64:18 - Intro to Other Side AI and prompt engineering.
64:18 - 71:50 - GPT Prompt Engineer project explained.
71:50 - 72:54 - GPT Prompt Engineer results, potential to improve prompts.
72:54 - 73:41 - Prompts may work better on same model they were generated for.
73:41 - 77:07 - GPT Prompt Engineer is open source, looking for contributions.
Related tweets shared:
https://twitter.com/altryne/status/1677951313156636672
https://twitter.com/altryne/status/1677951330462371840
@Surya - Running GPT2 inside code interpreter
tomviner - scraped all the internal knowledge about the env
Peter got all pypi packages and their description
Added Claude to the smol menubar (which we also discussed)
SkalskiP's awesome code interpreter experiments repo
See the rest of the tweets shared and listen to the original space here:
https://spacesdashboard.com/space/1YpKkggrRgPKj/thursdai-space-code-interpreter-claude-v2-xai-sdxl-more
Full Transcript:
00:02 (Speaker A) First of all, welcome to ThursdAI. We stay up to date so you
don't have to. There's a panel of experts up top here that
discusses everything.
00:11 (Speaker A) If we've tried something, we'll talk about this. If we haven't, and
somebody in the audience tried that specific new AI stuff, feel free
to raise your hand, give us your comment. This is not the space for
long debates.
00:25 (Speaker A) We actually had a great place for that yesterday. NISten and Roy
from Pine, some other folks, we'll probably do a different one.
This should be information dense for folks, and this will be
recorded and likely posted at some point.
00:38 (Speaker A) So no debate, just let's drop an opinion and discuss the new stuff
and kind of continue. And the goal is to stay up to date so you don't
have to in the audience. And I think with that, I will say hi to Alan
Janae and we will get started.
00:58 (Speaker B) Hi everyone, I'm NISten Tahira. I worked on, well, released one
of the first doctor chat bots on the market, for Dr. Gupta, and
scaled it, and now we're working on getting the therapist bot out
once we can pass more testing and get voice to work in a
profitable manner, because we don't really have VC. So at the
scale of a few hundred thousand users, the API bills matter quite
a bit.
01:31 (Speaker B) So, yeah, these spaces have been pretty helpful, because I had
some trouble running a voice transformer, trying to run it in the
browser on WebGPU, and then the person that wrote Transformers.js
comes in here and just says, oh yeah, that backend is messed up,
just try BLAS and such and stuff. So these have been very
interesting and technical spaces.
01:54 (Speaker A) Yeah, we need to get Xenova in here. Xenova is the guy NISten
was referring to. Al, Janae, do you want to give a few words of
intro and say hi, and then we'll start? Just briefly, please,
because I think we need to get going.
02:09 (Speaker C) Sure. Hi, I'm Janae. I'm the resident noob, I started messing
around with AI at the beginning of the year, and I also host the
Denver AI Tinkerers coming up next week.
02:20 (Speaker A) And if you're in Colorado area, greater Denver, please join us. It's
going to be a blast.
02:27 (Speaker F) Hi, I'm Al Chang. I'm kind of an old school technologist, just
getting started with AI again, and just here to help.
02:36 (Speaker A) Yeah. All right, folks, so I think we've had a whole space on
this. Simon Willison and me and many, many other folks chimed in
the second this was released.
02:50 (Speaker A) Was that six? Was that Sunday? It's hard to keep track of actual
days. Saturday, Saturday, last week. Exactly during those spaces,
by the way, as we were talking, Logan and everybody else from
OpenAI announced general availability of ChatGPT with Code
Interpreter. So GPT-4 with Code Interpreter.
03:12 (Speaker A) And I think we just heard from Matt that even some folks who got
access to it slept on it a little bit, maybe potentially because
of its very horrible name that's really hard to type (Code
Interpreter, you get lost in the R's). But it's an extremely
powerful new superpower that we've got. And we had a whole space
talking about use cases that people already had.
03:37 (Speaker A) It was like three days into it, and since then I bet that many
more people have tried it. I think swyx said 20,000 listens to
that space, plus the pod. At least people definitely want to hear
new use cases, right?
03:53 (Speaker G) Yeah, not much else to add about it. I think it's the feature,
per swyx.
03:59 (Speaker A) swyx posted a whole deep dive essay and coined it GPT 4.5
between us friends. And one of the interesting things about it,
we think, at least that's where we are currently after playing
around with this, is that it's a fine-tuned model. So they kept
training this on actually running code and executing code.
04:21 (Speaker A) That's what we believe. We don't know, nobody confirmed this.
And then that it's fine-tuned from an earlier checkpoint of
GPT-4. And so we actually had some folks on spaces talking about
it being less restricted and better than previous times.
04:36 (Speaker A) So it's interesting. I think, NISten, right? We have some folks
who tell us they're using Code Interpreter without the code part.
They just swapped out GPT-4, just because it's that model.
04:48 (Speaker A) And I think they also took down the 25 messages per hour
restriction on Code Interpreter. I've had like four-hour sessions
and it never stopped. Like, I haven't seen complaints.
05:03 (Speaker G) So it's just better.
05:06 (Speaker A) It's also fast. I think it's fast because maybe not many people
use this by default, and this could be the reason for the speed,
but it's definitely faster for sure. I think also the context
window, was it Yam? Somebody summarized the context window and
they told us the context window for Code Interpreter is 8K versus
4K for the regular GPT-4. Actually, that could also be a kicker.
05:29 (Speaker G) You mean Yam copied and pasted.
05:34 (Speaker A) I would encourage you and Yam to kiss and make up, because Yam
is doing a lot of legwork to track down the stuff that he posts,
and it's very visible. There you go, Yam, you need to clear the
air. However, I'll bring Pharrell and Gabriel up as well. And
we're going to keep talking about Code Interpreter, because
that's what we're here to do. NISten and a few other folks and I
started cooking with Code Interpreter.
05:59 (Speaker A) And by cooking I mean we started stretching the complete
boundaries of what's possible there. And I think Simon Willison
kick-started this with the Latent Space pod. So for folks who are
not following the Latent Space pod, feel free to follow swyx, his
main account, not this hidden one.
05:59 (Speaker A) And swyx reposted the spaces we had. Simon Willison was able to
run Node.js and Deno within Code Interpreter, even though OpenAI
didn't allow for that, by uploading a binary and asking Code
Interpreter to run it. Simon then promptly said they fine-tuned
the model away from that, and we found ways anyway to ask it to
do some stuff. I have a thread on how I was able to run a vector
DB, Chroma, inside Code Interpreter.
06:10 (Speaker A) I ran whisper.cpp. We saw some folks running GPT-2 inside Code
Interpreter, right? So imagine an LLM, GPT-4, running another one
and talking to it. It's like a little brother inside.
06:10 (Speaker A) I personally love that inception. I don't know if the person who
ran GPT-2 is in the audience. Surya, I think, was the nickname.
NISten, I don't know.
07:22 (Speaker A) Surya.
07:23 (Speaker B) Surya. He also wrote the search-the-PDF plugin for GPT-4
plugins, and he wrote that in like two days, and it's more used
than any other enterprise thing, which is pretty hilarious.
07:36 (Speaker A) We need to get Surya.
07:38 (Speaker B) Yeah, he just did that as, I'm just going to do a search plugin
for PDFs. And it's like the most used.
07:45 (Speaker A) So dope, pretty amazing. Again, in that space we've talked about
having like a living manual, so to speak, for Code Interpreter
use cases, because it's coding, so it covers pretty much
everything that we can think of as coders, maybe just in Python,
maybe restricted to an environment. And I've been trying to do
that with the #CodeInterpreterCan hashtag, and I encourage all of
you, let me pin this to the top of the space, to the jumbotron,
if you have an interesting Code Interpreter thing. And I'll bring
SkalskiP up to the stage as well.
08:03 (Speaker A) And Lantos, so many good friends. If you have a very interesting
Code Interpreter technique or skill or new thing that people can
do without coding skills, please tag it with this hashtag so
folks can find it. Otherwise, I will cover the three main things
Code Interpreter gave us besides the new model.
08:42 (Speaker A) One of them is uploading files. And since we've talked, we've
noticed that you can upload up to 250-megabyte files, and those
can be zips of other files. So we've uploaded full model weights.
08:55 (Speaker A) We've uploaded bin files. It's incredible that you can now drag
and drop a whole directory and have GPT just know about it and
read it. We've uploaded weights and embeddings.
09:08 (Speaker A) You can then obviously execute code in a secure environment,
which is again incredible, and you can download files; you can
ask it to actually generate a download for you, which is also
super, super cool. Maybe one last thing I'll say before I give it
to the audience for a few more cool use cases. And folks on the
stage, please feel free to raise your hand.
09:21 (Speaker A) I'll get to you in the order that you raise your hand if you
have a use case. Some folks built like a built-in memory, a
built-in brain, within Code Interpreter, just by saving to a
file. That's what I tried to do with my vector DB. And then they
download that memory at the end of every session, then upload it
to the next one, and have like a prompt that reminds ChatGPT to
start from that point.
09:50 (Speaker A) So in addition to the context window, they're also having a
separate, offloaded, file-persisted memory. So, Code Interpreter:
incredible. Again.
10:00 (Speaker A) Potentially GPT 4.5. And if you haven't played with this, feel
free to. If you don't know what to play with, follow the
#CodeInterpreterCan hashtag. And let's get to SkalskiP.
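The save-download-reupload memory trick described above can be sketched roughly like this. The `memory.json` name and the wording of the reminder prompt are illustrative assumptions, not anything Code Interpreter prescribes:

```python
import json
from pathlib import Path

# Hypothetical file name; any file Code Interpreter can write works.
MEMORY_FILE = Path("memory.json")

def save_memory(facts: dict) -> Path:
    """Persist session state to a file; in Code Interpreter you'd then
    ask for a download link at the end of the session."""
    MEMORY_FILE.write_text(json.dumps(facts, indent=2))
    return MEMORY_FILE

def resume_prompt(memory_path: Path) -> str:
    """Prompt to paste at the start of the next session, after
    re-uploading the memory file."""
    facts = json.loads(memory_path.read_text())
    return (
        "I've uploaded memory.json from our last session. "
        f"Please load it and continue from these {len(facts)} saved facts."
    )
```

At the end of a session you would ask the model to run `save_memory(...)` and give you the file; next session, drag the file back in and paste `resume_prompt(...)`'s output.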
10:11 (Speaker A) What's up, man?
10:14 (Speaker H) Hi, hello. Do you hear me?
10:15 (Speaker A) Yeah, we can hear you fine.
10:19 (Speaker H) Yeah, I've been playing a lot with Code Interpreter over the
past five days, mostly with computer vision use cases, because
that's what I do. I haven't introduced myself: I've been doing
computer vision full time for the past five years. When I saw
that you can input images and video, what I was immediately
thinking was, we need to make it do computer vision. So I went
through some low-effort tasks.
10:46 (Speaker H) So I managed to run old school computer vision algorithms, face
detection, tracking of objects, stuff like that. But I also
managed to exploit it a little bit, so you can add YOLO object
detection models to the list of models that run in Code
Interpreter.
11:15 (Speaker H) There are some problems with memory management, so I'm not yet
fully happy with the result. But yeah, I managed to run it on
images and on videos. And the thing that is super cool and kind
of underrated right now: false positives. So when the model
detects something that shouldn't be detected, you can really use
text to ask Code Interpreter to filter out false detections.
11:48 (Speaker H) You can just give it your feeling of why that stuff is
happening, or when or where, and it's very good at cleaning the
detections, which was kind of mind-blowing for me. And one thing
that I noticed it sucks at: I managed to create an application
that counts objects moving in the video when they cross a line.
11:55 (Speaker H) And I didn't use any off-the-shelf libraries. I just had the
detector and said, okay, now draw a line and count objects when
they cross the line. It's terrible at that, writing math logic to
figure out that something crossed something. We had like a ten or
twelve prompt exchange and I basically bailed out on that, forget
it. So there are some things that blow my mind, but there are
some things that probably don't.
12:49 (Speaker A) So folks, feel free to follow SkalskiP. And also, I just pinned
to the top of the space his brand new awesome code interpreter
use cases Git repo, and there's a list, there's a bunch of use
cases there. This could also serve as a de facto manual. So feel
free to go there, add PRs, and follow it for updates.
12:52 (Speaker A) And I want to get to Lantos because he seems to be unmuting.
What's up, Lantos?
13:12 (Speaker H) I was just going to say I can't follow him because he's blocked me.
13:15 (Speaker C) Sad face.
13:16 (Speaker H) Oh, no, I noticed that, but I'm not sure why. I will undo that.
13:20 (Speaker A) All right, I'm the peacemaker in this space. Please kiss and
make up, you two as well. Everybody should get along.
13:26 (Speaker A) Yay. I want to get to some other folks who came up on stage recently.
And Gabriel, welcome to talk about code interpreter and your use
cases.
13:35 (Speaker A) Janae, if you've played with this, I would like to hear two more
opinions before we move on to the next incredible thing. Yeah.
Oh, you guys are talking, let's get Gabriel and then Janae.
Sorry, I should have been explicit about the order.
13:54 (Speaker E) No worries. So I just posted a comment on this space about the
message cap on a conversation. So even though in the UI it still
says 25 messages per 3 hours, if you look at the network request,
you can see (and I posted this) it's actually 100 messages per 3
hours now.
14:12 (Speaker E) And I don't know if they're scaling that up and down as demand
increases and decreases, or they're just trying to trick people
into conserving their messages, but it's definitely been at 100
for a little while now. Same thing, you can see it in the network
requests.
14:32 (Speaker A) Can you confirm the same for the regular mode, or do you think the
regular mode is still restricted? Well.
14:41 (Speaker E) Based on just the fact that there's only one message cap, they
don't have a message cap per model. So I think it's just
consistent across all the GPT-4 models. And that's also my
experience; it's probably been at least a couple of weeks that
it's been higher.
14:51 (Speaker E) And same thing we discussed, I think, on Saturday about the
context window. And you can also see it in the API that the
context window is 8K for plugins and Code Interpreter, and it's
4K for the base GPT-4 model.
15:16 (Speaker A) That's awesome. Like swyx said, better in every single way.
15:22 (Speaker D) Yeah.
15:23 (Speaker A) Awesome. Thanks.
15:24 (Speaker E) Yeah. In terms of use cases I can share, I've been digging
around a lot in the Code Interpreter, and I was really trying to
hone in on why the packages that are installed there, the Python
packages in the environment, are there. Some of them seem really
random, and some of them make a lot of sense. And they released
it saying it's for, basically, data analysis. And a lot of them
make sense for that, but some of them are just really wild, like
the ML packages.
15:54 (Speaker A) And Gabriel, folks in the audience: if you look up at the
jumbotron where we pin tweets, two tweets before there's a tweet
by Peter Zero Zero G, who actually printed all the packages and
asked GPT-4 to kind of summarize what they do. So if you have no
idea about the potential capabilities of what it can do, feel
free to pin that tweet for yourself. And then it has a bunch of
descriptions of what's possible.
16:11 (Speaker A) So go ahead, Gabriel. Yeah, cool.
16:28 (Speaker E) Yeah, I've done the same kind of thing, just shorter. I got it
to do a four-word description for each one. So if you're looking
for a really short description of each package, I'll post that
tweet. And if you're looking for a long one, I think Peter's is
great. And what you can see there is that there are packages for
web development, right? There's FastAPI, there's Flask, there's a
bunch of other packages for web development.
16:40 (Speaker E) And besides the fact that there's no network access (which, for
the people using it internally, might be turned on), it was just
interesting to me. My perspective is that OpenAI has been using
this internally throughout all their teams for development, and
testing it internally, but probably also using it pretty
consistently. They probably have access to the Internet.
17:14 (Speaker A) Yeah, I'm sure they have access to.
17:15 (Speaker E) The Internet, and they can install new packages. But I think
they also have the ability, instead of uploading files and
downloading files, to just mount... not persist memory, I don't
think. I think they just mount their local working directory on
their computer, wherever they're working. So they have their
active directory where they have their project, and they just
mount that and give the Code Interpreter access to the whole
directory with the whole repo of their project.
17:48 (Speaker C) Yeah.
17:48 (Speaker E) And then ChatGPT is just writing code to the working directory
and reading from there, and it can explore their whole project.
We can do that now by uploading: you can zip your whole project,
upload the whole thing zipped, and have it unzipped, and then it
can kind of explore your whole project. But then once it makes
some changes and you want to commit them, you have to ask it to
zip the whole thing back, download it, and upload it.
17:48 (Speaker E) And then I think what they're able to do is more of a kind of
pair programming thing, where the developer makes some changes
and then ChatGPT makes some changes, and they're kind of working
together. This is taking it one step further. I don't know if
they have this or not, but it would be super cool.
18:29 (Speaker A) In the realm of updates, let's keep it to no speculation. But I
would love to explore this more with you in the next space,
because this applies to open source, and people already saw it:
somebody tagged us after the last space and said, hey, I'll build
this open source. I would love to pin this to the top of the
space. However, I want to move on to the next topic and then move
on to other updates.
18:51 (Speaker A) Sorry to interrupt, but thanks. I think that the collaborative,
persistent code superpower, which probably at some point will
come to us as well, plus the Internet access, is like another
10x. I want to get to SkalskiP and Lantos, and I think we'll move
on to Claude.
19:08 (Speaker A) Thanks, Gabriel.
19:11 (Speaker H) Yeah, I have a question. I'm not really sure, guys, if you
noticed: I was obviously experimenting with PyTorch because I
needed it for computer vision. I noticed that the PyTorch version
that is installed in the environment is actually precompiled to
work with CUDA. So it's a GPU version of PyTorch.
19:31 (Speaker H) Even though in the environment you don't have access to a GPU,
you only have CPU. So I'm curious, guys, what you think about
that. Why is that? Any ideas?
19:42 (Speaker A) An idea that just comes from what Gabriel just said: likely
we're getting the same Kubernetes container. However, the OpenAI
folks have like unlimited stuff; they probably also have CUDA.
That would make sense, right? Theirs is probably connected to a
GPU as well, but that's just an idea. Lantos, I want to get to
you, and then we'll move on to Claude.
20:02 (Speaker A) Folks in the audience, feel free to hit the little write button
on the bottom left (looks like a little message) and leave
comments; keep the conversation going through comments as well.
Moving on to Claude V2. Folks in the audience and folks on stage,
feel free to hit up the emojis, plus one.
20:19 (Speaker A) Minus one if you have tried Claude V2 and you haven't liked it.
I'm going to cover this anyway, because I think somebody, I think
Roy from Pine, called me a Claude V2 fanboy yesterday, and I
first got offended, and then I told him that I'm just a fanboy
for 24 hours. Before that I was a Code Interpreter fanboy. And
then I figured out with myself whether or not I am a fanboy of
Claude V2.
20:43 (Speaker A) And yeah, I am. And swyx told me to relax, and in fact I invited
him here to be the wet blanket on the other side of the aisle.
Anthropic, the company that we can definitely consider number two
after OpenAI, I think that's fair in terms of quality.
21:02 (Speaker A) Have now released a new Claude version. And they made some waves
when they released Claude, a.k.a. 'clong', with the 100K context
window. They have released Claude V2, and let me paste some
Claude, sorry, pin some Claude thingies in the jumbotron.
However, Claude V2 released with multiple things, and I want to
focus on two, and I think we'll cover the UI first and then we're
going to talk about the model itself. UI-wise and product-wise,
my hot take, and I'll pin this to the top.
21:38 (Speaker A) Hopefully we'll not debate this, but I love you, all of you: as
products, Claude V2 right now beats ChatGPT as a product. My mom
can go into the two websites and she'll prefer one versus the
other one.
21:51 (Speaker A) Or my friends that don't know AI, that aren't as plugged in as
we are. Theirs is free, and I think Claude V2 beats GPT-3.5,
which is also free. And the 100K context window, with the model
being trained on 200K, unleashes a bunch of use cases that were
not possible before.
22:12 (Speaker A) It just frees you up. You heard SkalskiP just describe the
limitations of Code Interpreter; a bunch of those limitations
stem from the 8K context window.
22:13 (Speaker A) If you print a bunch within the code that you're running, Code
Interpreter sometimes forgets what you guys talked about 20
minutes ago. And the 100K context window also means a long, long
conversation history with the model. And I think it's really
great.
22:37 (Speaker A) Not to mention that you can drag and drop full books in there.
Those books need to be in like one or two files; they still don't
accept zip files. And I'm planning to release an extension soon
that does this for us and unifies them into single files.
22:51 (Speaker A) So hopefully by next week we'll have some updates. However, once
you upload that much, or you can upload like a transcript of a
podcast, you can do a bunch of stuff, because Claude V2 is also
better trained on code, and we saw a significant jump in... wait,
I'm switching to the model, so let me get back to the UI. The UI
allows you to upload files.
23:09 (Speaker A) The UI has a Command-K interface, which I personally love. I hit
Command-K on every website and see if they support it. You can
just start a new chat real quick.
23:21 (Speaker A) It doesn't have Share, but it's definitely a refreshed and free
UI. It's called claude.ai, and that's the URL; if you haven't
tried it, definitely try it. Comments about just the product side
and the UI side before we move to the model? Anybody play with
this? Anybody like it? Anybody love the upload files feature? I
would love to hear hands and comments.
23:42 (Speaker A) Go ahead, Matt.
23:44 (Speaker D) A bit of a weird thing, but what I've noticed is it's actually
quite frustrating if you want to paste text in: if it's over a
certain length, it will paste in as a file. Little small thing;
hopefully they'll change it, but it is really annoying because
then you can't edit it. ChatGPT does do that much better, but I
generally agree with you that overall the product experience on
Claude is.
24:03 (Speaker A) Significantly better with the new one, the fresh coat of paint
they released for us. I will say that Claude so far was kind of a
hidden gem: only folks who got access to the API actually got
access to their UI, and that UI was very restricted. Folks who
have access to the Claude API know what I'm talking about. I
think that UI is still around.
24:22 (Speaker A) It still shows your history. It's very restrictive. It's not as
cool as this, it's not as sleek as this.
24:27 (Speaker A) So we like claude.ai, definitely a plus. Check it out. Now,
let's talk about the model behind this UI, because that model
also changed, and several incredible things changed with it.
24:38 (Speaker A) First of all, they released a new model at the same price as the
previous one. We love to see this. Please, everybody, including
OpenAI, continue giving the same price, and cheaper and cheaper
down the line.
24:41 (Speaker A) We love to see this. Second of all, they claim it's been
fine-tuned on several things. One of them is code.
24:54 (Speaker A) And we actually saw a bump in the evaluation called HumanEval,
which is a set of questions that OpenAI released, and I think the
bump was from like 55% to 78%, which I think beats 3.5 and is not
there compared to GPT-4. Correct?
25:14 (Speaker C) Yeah, and GPT-4 on pass@1, on the first try. Not GPT-4 that is
allowed to refine and fix it, but on the first trial. Yeah, by a
little bit.
25:33 (Speaker A) So, news to me, and thank you for joining in. The pass numbers
are how many times it's able to reflect upon its answers and
improve them.
25:43 (Speaker C) The pass@k is kind of what I meant. By reflection, it's even
stronger, GPT-4: if GPT-4 sees the exception, it can come up with
a solution. So this is not in the HumanEval test, but if you use
GPT-4 this way, you get to 90-something percent, which I think is
more realistic if you think about it. No programmer writes the
whole code in one go.
26:10 (Speaker C) You write it iteratively, fix bugs and so on. And also in Code
Interpreter, you see it. But it is remarkable to see state of the
art.
26:19 (Speaker A) State of the art on the first try, and it's significantly better
in code. And I suggest folks who previously tried Claude and
weren't impressed to try it as well. An additional crazy thing
that they've trained on is the 100K context window; they've
actually trained, they claim, on a 200K context window, so twice
as much as the previous one. And we follow this one guy, Ofir
Press, the guy behind Self-Ask with Search and the guy behind
ALiBi, the ability to extend context windows.
26:55 (Speaker A) He just defended his PhD, and he talked about context windows,
and he was impressed with the way they presented and the way they
showed their loss curve. We saw the paper maybe this week, folks
saw the paper, where attention dips in the middle: there's less
attention in the middle than at the beginning and the end.
27:03 (Speaker A) And it looks like that's not the case for Claude. So I suggest
you try the huge context window. And Al, you have your hand
raised, and then we'll talk about some other model changes.
27:26 (Speaker F) Yeah, I'll talk a little bit about that. I used Claude about a
month and a half ago to win Best Solo Hacker at the Craft
Ventures hackathon, David Sacks' one. Yeah, it had like 200
entries. But it's exceptionally good at creative writing and also
comparing and contrasting. I don't think people have really taken
advantage of what the context window is capable of doing. It's
more than just loading single files in.
27:53 (Speaker F) So what I did for the project was I loaded these large
legislative bills, these like 50-page unreadable bills, and
turned them into relatable narratives. So one of the things that
Claude can do is adopt a persona. A lot of times with summaries,
summaries just compress the text that you see, but you can tell
it to, say, write 1000 words from a social conservative point of
view, or a bus driver's point of view, or a social liberal point
of view.
28:21 (Speaker F) And what that does is it takes all of its knowledge about the outside
world and gives you not a summary, but it gives you essentially an
essay about the practical effects of something like a bill. I've
actually been working with the idea of reading a book and having it
tell you what I would have learned from this, because that's actually
probably what you're more interested in. What it can do in terms of
comparing and contrasting large essays is exceptional.
28:51 (Speaker F) So you could have it say, write 2000 words from a social conservative
point of view, 2000 words from a social liberal point of view, and
then have it contrast the essays, which is something that would be
very difficult for a human to do. So you get to give it multiple
files and have it just give you a more balanced approach so you get
rid of some of the bias that comes in.
29:18 (Speaker A) My go-to dream project that I never get to is to create this for
Twitter, like a Chrome extension where I can select a bunch of
tweets and then say, remove the bias from this and just give me
the debiased version of all of it. Yeah, completely. The
cross-referencing ability of Claude, because of this context
window, is incredible for many, many use cases.
29:41 (Speaker F) Yeah, I would say it's not as good as GPT-4 for certain things. But that context window is fantastic. And I would say to a lot of people that are using embeddings and retrieval: you can actually just put the whole thing in the context window and ask questions against that, and then you have a baseline to compare your results to. Most people, if they're chatting to a website or something like that, can just put the whole thing in there as opposed to trying to chunk it up and do questions, and you'll see that your results are much better that way.
29:51 (Speaker F) And for most people, that would be good enough.
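The "just put the whole thing in the context window" approach can be sketched with a rough size check; the ~4-characters-per-token estimate and the helper names are assumptions for illustration, not a real tokenizer or API:

```python
def fits_in_context(text: str, context_tokens: int = 100_000,
                    reserve_tokens: int = 5_000) -> bool:
    """Rough check whether a document fits in a long context window.

    Uses the common ~4 characters per token heuristic (an assumption;
    a real tokenizer would be exact) and reserves room for the question
    and the answer.
    """
    est_tokens = len(text) / 4
    return est_tokens <= context_tokens - reserve_tokens

def build_qa_prompt(document: str, question: str) -> str:
    """Stuff the whole document into the prompt instead of chunking it."""
    return f"Document:\n{document}\n\nQuestion: {question}\nAnswer:"

doc = "word " * 20_000  # ~100k characters, roughly 25k tokens
if fits_in_context(doc):
    prompt = build_qa_prompt(doc, "What is this about?")
```

When the document fits, this gives the baseline the speaker mentions; chunked retrieval is only needed once the check fails.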
30:17 (Speaker A) So an additional thing that Claude was trained on: they've talked about the output tokens, just the number of output tokens Claude is able to generate. And they've said that previous Claude models were focused on shorter outputs, just as they were trained (I don't know if the same is true of GPT-4, I haven't seen numbers), and this latest model was trained to output up to 4,000 tokens.
30:47 (Speaker A) This is added to the fact that they also fine-tuned it to output JSON files, complete JSON files, as responses, which we as engineers have waited for. OpenAI gave us functions, kind of a here-you-go, there's the function interface. And we love the function interface, but it kind of locks us down to the OpenAI ecosystem.
31:04 (Speaker A) And it's great to see another model that's very close to state of the art on HumanEval that is also now fine-tuned to respond in full, intact JSONs. And those JSONs can be 4,000 tokens in length. Any thoughts on these?
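Even with a model fine-tuned to emit complete JSON, it's worth guarding the parse on the client side; a hedged sketch (the fence-stripping heuristic is an assumption, not anything Anthropic documents):

```python
import json

def parse_json_response(raw: str):
    """Parse a model response that is supposed to be a complete JSON object.

    Strips common wrapping (markdown fences, leading prose before the
    first brace) before parsing, and raises if no object is found or the
    JSON is malformed/truncated.
    """
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in response")
    return json.loads(raw[start:end + 1])

# A well-formed response parses cleanly even with fences around it:
out = parse_json_response('```json\n{"entities": ["Claude"], "ok": true}\n```')
```

On a `ValueError` or `json.JSONDecodeError`, a caller would typically retry the request, which is exactly what the fine-tuning on complete JSONs is meant to make rare.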
31:28 (Speaker F) Yeah, I can confirm on it being able to write large amounts of
output. I mean, I was having it write like 2000, 3000 word like sort
of essays and outputs and it was fine with that.
31:40 (Speaker A) Yes. And I think it's I'm going to.
31:45 (Speaker B) Stick with GPT Four myself. But this might be pretty useful for just
dumping in an entire code base, given the 100k context window and
then getting some reviews and stuff, and then maybe moving some of
the stuff.
32:02 (Speaker A) Once I stop posting statuses and build that Chrome extension where you upload the zip and it flattens it to one file and then uploads it, then we'd be able to do a proper comparison, because code interpreter can take zip files and then extract them. Oh, one difference that I want to flag for folks in the audience: GPT-4 with code interpreter allows you to upload zip files, et cetera. We talked about this. It does not load them into the context window, right? There's like an 8k context window.
32:30 (Speaker A) The files that you upload are not automatically in the context window. The model has to write Python code that actually prints the files. And it usually does just the first few lines, hint, hint.
32:30 (Speaker A) The folks in the audience who get my drift. But it doesn't usually read all of it unless you specifically ask it to, and Claude does. So everything you upload to Claude goes directly into the immediate working memory of the context window.
32:38 (Speaker A) And that's a major difference to watch out for and also take care of. Go ahead.
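The zip-flattening extension idea mentioned above is a few lines of stdlib Python; the delimiter format is an illustrative assumption:

```python
import io
import zipfile

def flatten_zip(zip_bytes: bytes) -> str:
    """Concatenate every file in a zip into one labelled string, so a
    whole archive can be pasted into a single context window at once."""
    parts = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            text = zf.read(name).decode("utf-8", errors="replace")
            parts.append(f"===== {name} =====\n{text}")
    return "\n\n".join(parts)
```

The per-file headers let the model refer back to individual files even though everything arrives as one blob.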
33:00 (Speaker C) I would like to ask everyone, before I say my opinion: what do you think about it in comparison to GPT-4, performance-wise? What do you think?
33:10 (Speaker A) I would like comments from folks who actually used both and did the comparison. And before I get to folks, please raise your hand to answer. I want to call out Swyx's smol menubar, which allows you to actually... Swyx, can you give us like a brief two minutes on the menubar thing?
33:28 (Speaker G) Yeah, well, you don't have to choose. Just run it all the time on every single chat. So it's a little Electron app that runs in the menu bar. And I've been maintaining it, and I just added Claude 2 this week.
33:42 (Speaker G) Claude 2 is not super stable yet. Sometimes it will fail to hit the submit button, so you just have to retry manually.
33:50 (Speaker G) But yeah, it's a great way to A/B test models, but then also just amplify every question across four to five different chat models and compare the answers. So I've been trying it. It's up to you if you want to find it.
34:14 (Speaker A) Post it with the announcements, if you can. Yeah, awesome. Just basically, you don't have to stop using one, you don't have to choose. So I think the last thing that we need to acknowledge about Claude is the multilinguality.
34:28 (Speaker A) So they actually focused on showing us how much better the new one is than the previous ones, and they posted BLEU scores. Claude 2 is significantly better at languages than the previous versions. I think, to answer your question, it's close to GPT-4, if not better at some things. Hebrew goes fluently, and usually Hebrew is not that great.
34:57 (Speaker A) Russian and Ukrainian that I use also go fluently. And that part is
really good with a lot of context because you sometimes need to do a
lot of translation, or at least I need to do a lot of translation.
35:11 (Speaker C) Yeah, multilinguality works great. I was surprised. Absolutely. What I think: if you just compare the two on the same prompt, the same question, I have a feeling that GPT-4 is slightly better, but I just don't have an example to show you.
35:31 (Speaker C) Okay, I don't know, it's a strange situation, but I really wanted to ask you: what did you try, and what worked better here and there?
35:38 (Speaker A) So here's my use case that GPT-4 currently cannot do. Yesterday, Lex Fridman interviewed Israel's Prime Minister Benjamin Netanyahu, in one of the weirdest turns of history this podcast has taken. And given that I kind of know who Benjamin Netanyahu is from before, I decided to not listen to it and instead use the tools that we have at our disposal. So I ran it through Whisper with diarization, so I have a very nice transcript of who's talking.
36:10 (Speaker A) When I took that, I just dumped it as a text file. And I agree with Matt, it's a little bit annoying that Claude turns whatever you paste into a little text file upload, because you can't edit it.
36:21 (Speaker A) However, I uploaded that transcript directly to Claude, and then I asked it to do sentiment analysis and entity extraction. That's something that, if I'd asked GPT-4 code interpreter, it would probably write some Python code to do, and Claude just kind of did it. And I haven't seen GPT-4 being able to do this for bigger files.
36:38 (Speaker A) And once I could... let me just finish this point. I continued by saying, hey, because of the new coding abilities of Claude, I asked it: print me a Python file that dumps whatever table of topics he mentioned, plus sentiment, negative or positive, into a word cloud. That's something the code interpreters can actually do and show you.
37:03 (Speaker A) But I asked it from Claude, because previously Claude was shit at coding, and it gave me Python files that ran on the first try. I didn't have to change anything, there were no bugs. And then it showed me a word cloud of everything that was mentioned by Bibi in that podcast, and it all took maybe seven minutes.
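The topic-and-sentiment tally behind that word cloud can be approximated with the standard library; this sketch assumes the model has already returned (topic, sentiment) pairs, which is an illustrative format, not the actual output:

```python
from collections import Counter

def topic_weights(rows):
    """Turn (topic, sentiment) pairs into word-cloud weights.

    `rows` is assumed to be what the model extracted from the transcript:
    a list of (topic, "positive" | "negative") tuples. Each mention adds
    weight; sentiment is tallied separately so the cloud can be colored.
    """
    counts = Counter(topic for topic, _ in rows)
    sentiment = {}
    for topic, s in rows:
        sentiment.setdefault(topic, Counter())[s] += 1
    return counts, sentiment

rows = [("judicial reform", "negative"), ("AI", "positive"),
        ("judicial reform", "negative")]
counts, sentiment = topic_weights(rows)
# A library like `wordcloud` could then size each word by its count.
```

The counting is the part a plain chat model can hand you data for; rendering the actual image is where a code interpreter comes in.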
37:11 (Speaker A) And I don't know if, for bigger context windows, GPT-4 can currently do this. Go ahead, Al.
37:28 (Speaker F) Yeah, I've actually been putting a lot of transcripts of podcasts in there, and because it knows so much about the speakers, you can actually have them continue a discussion about things that they didn't actually discuss. Yeah, so you can have it say, okay, what are some topics they disagreed on, and some things that they didn't cover? Tangentially, you can just have it give you another two minutes of interview, and it does a pretty reasonable job, especially with public figures where it actually has a lot of their background. So it's pretty interesting.
38:01 (Speaker A) And not to mention free. GPT-4 needs a $20 a month payment, and Claude is free.
38:08 (Speaker F) That's a good point, too. For those of you that have eval keys,
you'll notice that they're actually not charging you for them, so you
can actually go on as long as you want. The limitation is that you
can only do one request per organization. So if it's just a single
person, they only charge you basically when you start deploying for
commercial purposes.
38:21 (Speaker F) So that's something that people may not have realized.
38:32 (Speaker A) So I think we've covered everything, right? Trained on 200K context, which they could enable tomorrow for us, and we'd get like two x. It's going to be insane. There is some stuff that they have in Claude, at Anthropic, called Constitutional AI, so they have a mix of RLHF and Constitutional AI. So they're working on their model to actually be more helpful, but also more safe and less jailbreakable.
38:57 (Speaker A) They talked at length about this. We talked about HumanEval being better, and same price, and the free playground. I think we've covered most of it.
39:03 (Speaker A) So anything else about Claude that we haven't covered, feel free to raise your hand and tell us, and if not, I think we can move on. What do you guys think?
39:17 (Speaker G) I'll mention briefly, did you talk about the multiple file uploads?
39:21 (Speaker A) No, go ahead.
39:24 (Speaker G) So I think it's just an interesting difference between code interpreter and Claude. In code interpreter, you can only upload one file, right? But it can be a zip file with multiple files in it, so it's de facto multiple files, but then you can only run code on that. Whereas what Claude is doing here is something slightly different, which to me is interesting: you can upload multiple files, it just reads the files straight into the context, and it's using that 100K context to synthesize answers.
39:24 (Speaker G) So you can do, for example, PDF A and PDF B and give me a comparison
between the two of them or synthesize knowledge across them. And I
think that is something that code interpreter cannot do because code
interpreter will only run code across files. So I think that's
noteworthy.
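The multi-file synthesis Swyx describes amounts to concatenating labelled documents into one long prompt; a minimal sketch (the document-tag format is an assumption, not Claude's required input format):

```python
def multi_doc_prompt(docs: dict, instruction: str) -> str:
    """Pack several documents into one long-context prompt.

    `docs` maps a label (e.g. a PDF's name) to its extracted text; the
    labels let the model refer back to each source when synthesizing a
    comparison across them.
    """
    sections = [f"<document name={name!r}>\n{text}\n</document>"
                for name, text in docs.items()]
    return "\n\n".join(sections) + f"\n\n{instruction}"

prompt = multi_doc_prompt(
    {"a.pdf": "First essay...", "b.pdf": "Second essay..."},
    "Give me a comparison between the two documents above.",
)
```

This is the piece code interpreter can't replicate: it can run code across the files, but it never holds all their text in working memory at once.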
40:15 (Speaker G) It's Claude genuinely coming up with one new thing that is not copying ChatGPT, and good for them.
40:23 (Speaker A) Yeah. And unfortunately no zips allowed. But we're going to fix this with an extension and hopefully talk about it next week. I want to say hi to Weather Report.
40:33 (Speaker A) Feel free to chime in. Sorry, you raised your hand to come up before. So if you have a comment about code interpreter, we've moved past it, but if you have a comment about Claude, feel free to tell us. What's up, Weather Report?
40:46 (Speaker A) Actually, I had only one thing about code interpreter: in the previous space I talked about a hypothesis I had about code interpreter, which...
40:56 (Speaker B) Is to use it as a huddle, because it's recorded.
40:59 (Speaker A) We'll move on and let's talk about code interpreter next time. I think that some folks are saying their audio is glitching and so they're not able to... and I want to see... I think Joseph has a comment about code interpreter. Joseph Polak. We'll give him a second to log in, and then I think we'll move on to other updates, because we have many other things to talk about.
41:29 (Speaker A) What's up, Joseph? Welcome to stage.
41:31 (Speaker G) Hi there, folks.
41:33 (Speaker A) Thanks for taking my question. I didn't even know all about that code
interpreter stuff with the file.
41:40 (Speaker G) So I'm really happy to have heard it. About Cloud, though.
41:46 (Speaker A) For Claude? Well, I'm still on the waitlist. First of all, it's free now. You can access it right now.
41:53 (Speaker A) claude.ai. There's no waitlist anymore, unless you live outside the States, and then you'll have to get a VPN. Okay, I'll definitely check that out.
42:03 (Speaker A) My question was about using Claude, and actually code interpreter, through the API. Do you think that's ever going to exist, or is it coming? So, the Claude API, I think that's waitlisted. I have talked with the Claude folks and they said the waitlist is now going faster.
42:24 (Speaker A) So they are ready to get more people in, I think because of the new safety updates they're less afraid. So definitely apply for the Claude API waitlist.
42:35 (Speaker A) Code interpreter is not available via API, and we've seen some folks who hack it together with, I think, a browser plugin that proxies something. Swyx, I don't know if you remember the unofficial, quote unquote, code interpreter API and how to access this, but it's not available in the official OpenAI APIs as of yet. We haven't seen them.
42:56 (Speaker G) No. For the record, there's no unofficial code interpreter API. There's the browser-side thing that we are trying, but nobody's made any adapter for it yet.
43:08 (Speaker G) I think you can, if you want, using Puppeteer.
43:12 (Speaker A) I would not recommend it, definitely. If anything, there were some folks that tagged us, and I need to go and find this, who are working on an open source version of code interpreter that uses LLMs and stuff. And that one will likely be the way forward if you do want something programmatic that has code interpreter capabilities. Go ahead, NISten.
43:35 (Speaker B) There's also Chatbot UI on GitHub. So yeah, for the other people that are hacking something together, I'll wait until there is something public, because then...
43:45 (Speaker D) We don't know everything.
43:47 (Speaker G) The open source one is going to be worse, because you are missing the model.
43:51 (Speaker A) Yeah, because we think that it's fine-tuned on actually knowing how to run code, right? That's kind of the highlight that we got from the last space. We think it's smarter because of that.
44:01 (Speaker A) And one of the main things again, sorry, going back to code number
just real quick, it is able to then fix itself and ask itself, oh,
oops, I made a mistake. Let me try again. Matt, I saw you unmute
yourself.
44:13 (Speaker A) Feel free to go ahead.
44:16 (Speaker D) Well, yeah, just a quick thing. So from what I know, OpenAI will be offering fine-tuning relatively soon. So at that point, you theoretically could go and fine-tune your own code-interpreter-like model, even if they don't offer it.
44:31 (Speaker A) You can also theoretically, not that we would recommend it, but theoretically, right now start distilling some stuff from code interpreter by asking it questions: generate code and store it to a file, ask it to download, and then, quote unquote, generate the data set. Not that you should, but you theoretically can, so that when it's time to fine-tune, you have a data set.
44:52 (Speaker D) Yeah, theoretically. I don't know if ShareGPT currently supports those types of conversations, but if it does, I'm sure that's going to happen really soon.
45:00 (Speaker G) I don't think it's maintained, because ChatGPT itself... well, I don't want to speak for ShareGPT, I know Steven, but I can help you move the conversation back to Claude.
45:11 (Speaker A) Yes, please. Let's move back to Claude. Thank you.
45:14 (Speaker G) So, just between us, how many people are listening to this chat anyway? I think it's like 60 people. Email support@anthropic.com for the Claude API.
45:26 (Speaker A) Yes, email them, state your use case, and they'll likely get you in, and you can use Swyx's menubar to actually run them in parallel with the megaprompt feature. Megaprompt, superprompt, what is it called? I think Swyx dropped it. There's like one prompt that you type, and then it all goes to all the models. I want to recognize some folks in the audience.
45:50 (Speaker A) Hey, feel free to raise your hand if you want to come up. I saw some others in the audience. Max AI, welcome. Dexter. There's a bunch of folks who are usually here, and it's great to see, and I think we're moving on to a very spicy one.
46:06 (Speaker A) What do you guys think about xAI? So I'm pasting the summary and the people. Elon Musk and a bunch of other folks have announced xAI, essentially their answer to OpenAI.
46:22 (Speaker A) We've all seen Elon talk about safety, and talk about helping start OpenAI, which then could not stay open. He talked about TruthGPT at some point. And finally they announced xAI as we were talking.
46:37 (Speaker A) By the way, I have a notification from xAI that they're going to have a space tomorrow to go deeper into xAI. But so far there's not a lot of detail. There are some details about the folks who work there.
46:50 (Speaker A) So they have folks who wrote the Adam optimizer, among others. Thoughts about xAI before we get to hear what they do? Obviously, there's no product yet.
46:59 (Speaker A) I don't think they've started training. The one thing that I will say is that they will have premium access to Twitter, obviously, because Twitter is now rebranded under X. After closing down the APIs and closing down scraping for Twitter, xAI will now have a data set that's insane to train on.
47:21 (Speaker A) And we wish them, quote unquote, good luck. I would love to hear from folks on stage: what do you think about the announcement, the direction, the people? And we're going to wait for tomorrow to actually hear them talk.
47:24 (Speaker A) I know NISten, you have some ideas, if you want to share to get us started.
47:40 (Speaker B) Well, this is more of an old-lady-babushka opinion, just talking about stuff. I found it interesting that they went from, what was it, BasedGPT, to straight-up taking on GPT-4 and this entire competition, to doing something more noble, like dedicating it to being better at math and discovering new things in physics. The way I see it, that's pretty noble. But at the same time, I feel like that's a result of having problems hiring in order to be competitive with the other ones.
48:26 (Speaker B) So, yeah, this will be interesting. But the way I see the whole setup right now is, as the kids say, pretty mid, in my opinion.
48:39 (Speaker A) "As the kids say," indeed. I will say that we will see tomorrow from their space. They're probably going to use Elon's clout to try to hire, and it's probably harder now to hire, because everybody knows how quickly people get fired and how much it's not, like, super fun to work for X. But we're in for a nice ride, because they do have access to cross-pollination from Tesla as well, right? So if they have big questions, Tesla does have a few good folks still, even after Andrej Karpathy left, and so they'd be able to ask them for assistance.
49:20 (Speaker A) There's obviously the whole Dojo thing in play, which... I don't think we have time to talk about Dojo, and it's not new, but there could be something there. Gabriel, you wanted to come up? Maybe you have thoughts. Yeah, go ahead.
49:33 (Speaker A) Gabriel.
49:34 (Speaker E) Yeah, I was just going to say, about xAI: you mentioned Twitter's data, and I'd be interested in hearing other people on the stage's opinions on this, because recently there's been a lot of work done on quality of data over quantity of data. And of course, Elon also has a ton of GPUs; reportedly he's bought tens of thousands of GPUs. So that's definitely important in building these big models.
49:58 (Speaker E) But I'd be interested in hearing from people on the stage if they
think Twitter's data and the kind of data that Twitter has is
actually going to be really powerful for training good models.
50:11 (Speaker A) Anybody wants to take this?
50:13 (Speaker F) Yeah, I'll take a little of it. One of the things that Twitter has
that other people don't is that people are actually debating issues.
So I think that's one of the reasons why he's really focused on the
idea of Twitter being a source of truth and being sort of
unrestricted so that you're not just following like, one thread, you
watch the narratives being debated and he has access to all that.
50:35 (Speaker A) Data and community notes. And it's really hard to scrape. Like, I
don't think it's API ball at all. It's not super simple to scrape at
all.
50:42 (Speaker A) I want to get yum before I think Matt wanted to unmute and go and
then yum. If Matt, you still want to chime in and then yum.
50:53 (Speaker D) Yeah, I mean, nothing too much to add here. I think the community
notes are very interesting as a way to sort of like, reduce
hallucinations. I think one of the things that they're going to want
to do heavily is invest in sort of filtering that data set because
there's a lot of great stuff on Twitter. There's a lot of crap on
Twitter.
51:07 (Speaker A) A lot of yeah.
51:09 (Speaker D) And the more of that that seeps in, the worse the model is going to
perform. Obviously, scale is important, but data quality is
incredibly, incredibly important and the scale kind of doesn't negate
bad data quality. So I think if they do one thing right, it's going
to have to be getting the sort of filtering of the data set down. But
they do have a ton of incredibly high quality data.
51:27 (Speaker A) Yes, I think Yam was next, and then we have a few folks who wanted to come in. I think Farrell wanted to come up. So Yam, and then Farrell.
51:34 (Speaker A) And then Gabriel.
51:37 (Speaker C) I just want to say, of course, if you just take Twitter data and start training your model, you can expect it to be average Twitter, which is not what you want. What you can do, which is a gold mine, is transform this data, or just rephrase it into other forms. And that makes the data a gold mine, because Twitter does have very high quality content here and there. Absolutely.
52:05 (Speaker C) If you can take it and transform it and rephrase it into a different form... if you want an example, see the paper "Textbooks Are All You Need". Basically, they just take data and make it into a tutorial, make it into a textbook: perfect, clean and everything.
52:22 (Speaker C) It is very easy to do, and you don't need a powerful model to do that. You don't need ChatGPT; you can do it with a small model.
52:30 (Speaker C) Off the record, I'm currently doing it myself on a large model I'm training. It doesn't matter anyway. It's a gold mine.
52:43 (Speaker C) What I'm saying is, it's a gold mine.
52:45 (Speaker D) About Twitter.
52:46 (Speaker A) An additional thing, before I get to Farrell and then Gabriel. NISten and I talked about this yesterday at length in our late-night line cook space, the one that's not going to be scheduled; if you guys are on, feel free to join that one.
53:00 (Speaker A) Twitter Spaces is also a gold mine: transcribing Twitter Spaces and seeing all the reaction emojis they have in real time. Take the space that Elon ran with RFK Jr., for example. If you know who in the audience are actual people instead of bots, and you're able to get emoji reactions in real time, that's a definite, very high signal kind of training set that they have and almost nobody else has.
53:25 (Speaker A) Farrell, you are next, I think. And then Gabriel.
53:30 (Speaker D) Yeah, I wonder what the relation is, and how useful the Twitter data will be, for their goal of building a sort of math reasoning machine, right? Also, do we know if they're open source, as in truly open source, or not?
53:49 (Speaker A) No, we don't know yet. Hopefully tomorrow we'll be able to answer questions. However, we've seen Elon take Twitter's algorithm open source, and now he's boasting about that as a competitive advantage versus something like Threads. He's saying, hey, we're open source.
54:07 (Speaker A) If you go to Threads, you're under Zuck's algorithmic influence. So there is definitely an attempt to open source from their side, but we don't know anything beyond that. Gabriel.
54:17 (Speaker A) And then Johnny.
54:20 (Speaker C) Yeah.
54:22 (Speaker E) First of all, I think it's funny that Elon's shitposting is polluting his data set. I would say that...
54:34 (Speaker A) By the way, if there's anybody with the ability to detect shitposting, it's them, right? They're going to be able to build a model that understands: this is a shitpost, this is somebody who made an effort to give us clean information. But sorry, go ahead.
54:49 (Speaker E) Yeah, that's exactly the point I was going to make: Elon was on this crusade before he bought Twitter. And this is kind of why he got forced into buying Twitter, because he was going after the bots and he made a big deal about the bots. And I think they spent a lot of resources on figuring out what's good content and what's bot content. And another thing is that we each are kind of experiencing a different Twitter, right? Whether it's ML Twitter or Israel-based Twitter, there are many different communities, and Twitter is very good at segmenting those communities and figuring out which content belongs to what community.
54:55 (Speaker E) And they'll have the ability, I think, to segment this data and train many different models that are good at different things, because they're in a literature community or an ML community or an MMA community or whatever.
55:37 (Speaker A) I actually saw a map of like 5 to 7 million tweets all embedded in Nomic AI's Atlas. I don't know if you guys follow Nomic; they just recently announced a $17 million Series A, by the way. So kudos to Nomic, good friends, Andriy and the GPT4All team. And they have an embedded map from before the API was shut down, while they were still able to siphon it, et cetera.
56:00 (Speaker A) And Gabriel, what you're saying is actually visible in the embedding map. You can actually see those tweets, and the different areas of political Twitter. There was a journalist Twitter, until all of the journalists started leaving. There's a bunch of different pockets of Twitter that we don't get exposed to, not to mention the different languages.
56:20 (Speaker A) There's a whole Japanese Twitter that's like insane, and people go super, super hard. And translating is easy.
56:26 (Speaker A) We talked about Claude being able to translate. So they have a bunch of very interesting data. And I think Zuck is also going after that data with Threads.
56:31 (Speaker A) And I think this is the reason why we'll see Threads getting continued work, and we'll see a lot of investment from their side. But compared to Threads, and we talked about this yesterday, Twitter has back history and a lot of historical data that they can train on. Threads is fairly new.
56:54 (Speaker A) So definitely a bunch of interesting data sets. Johnny, and then Lentil. Hey.
57:00 (Speaker H) So one thing I think about, with the data from Twitter, that is potentially lacking in some of the other data sets, is colloquial language. Because what Twitter has that Facebook doesn't have, and a lot of other things don't have, especially given what you're talking about, the history, is the way that people actually interact with each other. You know what I mean?
57:26 (Speaker A) Not only that, but how it evolved as well, right, throughout? Exactly.
57:35 (Speaker H) To be honest, I think the data sets from earlier are probably better and stronger, because it's just gotten out of hand. But I agree with, I'm not sure if it was Yam or who, about the filtering. Because, all right, this is a black box, it's not open source. Elon has not been shy about his kind of response to what he perceives as wokeism and all of that stuff. I'll be super curious.
57:36 (Speaker H) I mean, there's a big team on this, but I will be super curious to see how that bears out in the actual model. Because, God, there's equal parts or more parts disinformation on Twitter than there is information. So if we're talking about a source of truth, that rings some alarm bells for me personally.
58:21 (Speaker H) So those are just my thoughts.
58:29 (Speaker A) Yeah. Thanks, Johnny. Lentil, go ahead. And then Gabriel.
58:33 (Speaker A) Let's finish on Gabriel, and then we'll move on to the next topic.
58:36 (Speaker H) Cool.
58:37 (Speaker A) Yes.
58:37 (Speaker H) So I think it's going to be hugely bullish for this data. And from the perspective of relating idea space and people, and the relations between those, I think that's probably going to be more valuable information than the conversation, because you can build so much from that. Like dating, that's just one example, a dating thing. Or finding people, finding brain power, compute. That's going to be huge.
58:40 (Speaker H) And to touch on the open-sourceness of the data, I think not open sourcing it at some point is going to be hugely politically bad for Elon to do.
59:23 (Speaker H) That's my thoughts on that.
59:24 (Speaker A) Awesome. Thanks, Lentil. Gabriel, let's wrap up, and then, Matt, we're going to talk about some interesting stuff.
59:31 (Speaker E) Yeah, just on the kind of data. I think for those of us who ran,
like, the early versions of Llama before they got fine tuned in all
kinds of ways, and you run it, and especially the smaller models, you
put in a prompt and it spits out some generic Facebook type of
content. It sounds like a Facebook post of like a 15 year old or
something like that. That shows what you get when you use all this
kind of unfiltered data.
59:59 (Speaker E) But I think the interesting thing is that Llama was then fine tuned
in many different ways and some really powerful models are built on
top of it. So I think in some sense, almost any data is valuable in
the sort of pretraining stages and maybe you need really high quality
for the fine tuning, but I think that big volume might be really
useful, maybe not the most economical.
60:21 (Speaker A) So I want to wrap up with why they potentially have a leg up. We definitely know that Twitter was used to train other models that we currently use. We know this for a fact. This was the reason why Elon and Sam Altman, who used to be friends, are no longer friends, shitposting about each other.
60:40 (Speaker A) And the current models we use do use this data set, but it's old; for them it's no longer recent and relevant.
60:40 (Speaker A) And we know for a fact that Twitter is significantly biased, and probably the best place in the world for uncovering news as it happens: before the bias sets in, before the narrative sets in, before folks get their marching orders from MSNBC or from the other side on how to think about things. Twitter is really good at talking about issues as they arise, the second they arise. And I think that on its own is going to teach the models a very great deal.
61:16 (Speaker A) Naval Ravikant, if you guys follow Naval, he always said Twitter makes him a better writer. So we definitely know also that tweets, in short form, condense information better. And if their model trains on that, obviously taking all the precautions we talked about before, bots and shitposting, et cetera, if they're able to actually get this into the model, likely their model will be more up to date and more fine-tuned to reactions.
61:20 (Speaker A) So with that, I want to close. We'll see about xAI. It's definitely
exciting, right? We're potentially getting another big one,
potentially an open source one.
61:20 (Speaker A) So we'll see. I'm going to wrap up this update, and with the next one
I want to move on. Matt, let me know if you're still around and if
you want to cover it.
61:20 (Speaker A) So we have Matt, who introduced himself in the beginning. I'll let
you do this quickly again, and then we're going to talk about the
project whose GitHub stars are rising, which I think is super cool.
And I invite you to give us a little bit of an interview about this.
62:16 (Speaker A) Go ahead, Matt.
62:17 (Speaker D) Yeah, sure. So I'll try to summarize it a bit better than the last
time. A lot of practice. But very long story short: co-founder and
CEO of OthersideAI, creator of HyperWrite, and a number of other
things. Basically, we've been around for a number of years now.
62:30 (Speaker D) We're one of the first companies in the space working with LLMs. The
goal always has been to build a personal assistant that scales to
everybody, just like a real human personal assistant, but at scale,
way cheaper, digital. The tech wasn't there at the beginning. So we
built other products to sort of learn and gather resources, whether
that's users, revenue, bunch of other things that we can do.
62:50 (Speaker D) What we do today: today we are actually building that personal
assistant. So, an AI that can operate a computer, any software, to do
what a human can do on pretty much anything.
62:53 (Speaker D) So it'll help you with your tasks. It's very simple. Today it's a
Chrome extension that lets you sort of like control Chrome just by
sort of talking to it.
62:53 (Speaker D) So you could say, go order me a pizza, or go send this person an
email or go filter my email, or anything else it works okay today.
The idea is that over time, it's going to get a lot better, a lot
cheaper, a lot faster, to the point where six months from now, a year
from now, it might actually be as good as, if not better than a human
on many tasks. But that being said, while I work on this, I also like
to learn about getting the most out of these technologies because
they're so fast moving and you really have to stay on top of it to be
effective, or you.
63:34 (Speaker A) Can come every week and then stay up to date with us together. But
yeah, go ahead.
63:40 (Speaker D) Exactly. I mean, a lot of what I do to learn, really, is just build
things that I find interesting, and I find that often, even if I'm
not expecting it, a lot of those learnings do translate to stuff
we're doing at Otherside. So this sort of just came out of that.
Happy to sort of dive into the project, or if you want to sort.
63:56 (Speaker A) Of stop me. Let's pause here for a second, and I'll just tell folks
that I pinned Matt's tweet from a couple of days ago with the
introduction. Since then you got a few thousand stars, I think, on
GitHub, and we're going to talk about the GPT Prompt Engineer project
and the different reasons why Matt and folks wrote this and what it's
here to serve. So maybe give us an introduction to GPT Prompt
Engineer, what made you come up with this, and how it works. Yeah, go
deep, man.
64:29 (Speaker A) Sure. Yeah.
64:30 (Speaker D) So forgive the rambling in advance. Essentially, I find prompt
engineering so fun. I've been doing it pretty much every day for
everything, honestly, to the point of excess, from what I would do
for work to having it decide what I'm making for dinner for years
now. And as I've gone through this process, sort of like learning how
to use these models, it's become very clear that especially as these
models evolve, there's no best practice for anything.
64:54 (Speaker D) Prompts change, ways to prompt change. Something that works for one
task might not work for a very similar task. And the only way to get
out of that is to sort of build an intuition for the model and try a
lot of things, but that doesn't always work perfectly.
65:01 (Speaker D) And also you don't really know kind of what works and what doesn't.
Even when you're trying things right, you have to do it sort of like
in a very scientific way, but there's no real right answer to
anything. It's kind of like alchemy.
65:18 (Speaker D) So I started thinking, I think this was right when GPT-4 came out. I
was using GPT-4 pretty often to just ideate prompts. I would say,
here's what I'm trying to do.
65:20 (Speaker D) I would say, write a prompt for me, and I would use the ideas from
that to help me improve my own prompts, and that actually got a lot
of interest. We ended up building something similar to that into the
HyperWrite platform. At the time it was really cool, but it really
wasn't something that would replace what I do every day, which is
really hardcore prompting.
65:43 (Speaker D) Eventually I was just sort of thinking about it, and I think this was
on July 4th, I was just sitting there kind of thinking, what if we
tried it? And I started thinking about how you could design a system
that actually comes up with good prompts. Not just a prompt that does
the job, but something that's actually optimal. Because as humans,
right, we can only try so many things at once. But the magic of these
LLMs is they're creative and they think faster than we do. In the
time that I could write half a prompt, LLMs could write 50 or 100.
65:48 (Speaker D) And what if you could leverage that? Because even if the average
prompt isn't very good, you're going to luck into one or two that
happen to be exceptional for your task. So I started by doing it
actually with a classifier. I only released this notebook yesterday
just because it's like a step on the road.
65:48 (Speaker D) And what we ended up using it for was actually something at Otherside
where we needed to build a classifier for something with the personal
assistant. And I just wasn't getting good performance out of
the prompts that I was writing. So I said fuck it, what if we have
the AI try to do this? And I built this so that essentially I
describe the task, I give it some test cases, so I'll give it some
true false test cases.
66:11 (Speaker D) Because the classifier was classifying things as true or false. It
was like: classify the statement as true or false. And if it was "New
York is in America", it would be true.
66:54 (Speaker D) If it was "New York is in Paris", it would be false. And I basically
created like ten or 20 of these test cases. I described the task and
I had GPT generate something like, I think, 20 or so prompts.
66:57 (Speaker D) And surprisingly, the quality of them just at first glance was pretty
good, right? It was kind of shocking considering I spent so much time
trying to do this manually. Then what I did was I just basically had
each of these prompts tested against each of these test cases. And I
plotted the success of each, and it turns out some of them actually
outperformed what I did.
66:57 (Speaker D) I was kind of shocked, right? Like you wouldn't expect that,
especially doing this for years.
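To make the loop Matt describes concrete, here is a minimal sketch of the scoring half: run each candidate prompt against labeled true/false test cases and keep the best performer. `call_model` is a hypothetical placeholder for a real LLM API call, not code from the actual notebook:

```python
# Sketch of scoring candidate prompts against true/false test cases.
# `call_model` is a placeholder for a real LLM API call; here it fakes a
# model that answers simple geography statements.

def call_model(prompt: str, statement: str) -> str:
    # Placeholder "model": a real version would send prompt + statement
    # to GPT and read back "true" or "false".
    return "true" if "is in America" in statement else "false"

def score_prompt(prompt: str, test_cases) -> float:
    """Fraction of labeled test cases the prompt classifies correctly."""
    hits = sum(call_model(prompt, s) == label for s, label in test_cases)
    return hits / len(test_cases)

test_cases = [
    ("New York is in America", "true"),
    ("New York is in Paris", "false"),
]
candidates = [
    "Classify the statement as true or false:",
    "Answer only true or false:",
]
best = max(candidates, key=lambda p: score_prompt(p, test_cases))
```

In the real notebook the candidates themselves come from a GPT-4 generation step; only the ranking logic is sketched here.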
67:30 (Speaker A) Just to recap real quick on this: the GPT-4, I assume that's what
you're using, generated prompts that actually performed better than
Matt Shumer's prompts. And Matt Shumer is the founder of a prompt
company, with a lot of prompt use cases for a long time, from GPT-3
to 4, et cetera. And some of the ones that it came up with performed
better than yours.
67:52 (Speaker D) Yeah, it was kind of scary. Some of them performed way worse. But the
idea is that you're going to sort of luck into something that is
better. Maybe two out of 20 will be better, but they're great.
68:02 (Speaker D) So I was just so fascinated by this, I was like, how do you take this
further? Because classification is one thing, but real prompts, where
you're actually having it generate text, those are harder. How do you
judge that? You could use GPT-4 to judge them, right? If you have two
prompts and you say, each of them, generate me something, and they
give you their responses, and you want to know which is better, you
can ask GPT-4. And so I figured we could apply that.
68:29 (Speaker D) Turns out there are some issues with that, and there are some papers
written about this, where essentially it'll sort of favor the one
that's on the bottom. So you just do it twice, flip the order, and
see if one wins. And I took that approach and combined it with sort
of an Elo-style tournament, where essentially you have each of them
go head to head, like one on one, and each of them gets their Elo
score either bumped up or down based on whether they win, lose or
draw.
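The order-flip debiasing step can be sketched like this; `judge` here is a deliberately biased placeholder standing in for the GPT-4 comparison call:

```python
# Sketch of judging a pair twice with the order flipped, so a positional
# bias in the judge can't decide the outcome on its own.

def judge(output_a: str, output_b: str) -> str:
    # Placeholder judge with a built-in position bias: on a tie it
    # favors slot "b", mimicking the bias described in the papers.
    if len(output_a) > len(output_b):
        return "a"
    if len(output_b) > len(output_a):
        return "b"
    return "b"  # biased tie-break

def debiased_winner(x: str, y: str):
    first = judge(x, y)   # x sits in slot a
    second = judge(y, x)  # flipped: y sits in slot a
    if first == "a" and second == "b":
        return "x"        # x won from both positions
    if first == "b" and second == "a":
        return "y"
    return None           # the judge disagreed with itself: call it a draw
```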
68:53 (Speaker A) Can you give two sentences on Elo scores as a concept? Yeah.
68:57 (Speaker D) I'm actually not super familiar with it. Funny enough, I had GPT
write the code for that part. But basically, think of it like a
ranking system in chess or a video game, where you have two people
competing, and the one that wins gets their score increased by x, and
the one that loses gets their score decreased by x.
69:18 (Speaker D) And it's also sort of weighted based on the previous scores. So if
somebody that has a high score beats somebody with a very low score,
their score won't increase that much, because they're very likely
going to win. So it's sort of just a weighting system to help figure
out what's best, instead of just getting a clear cut "yes, this is
right" or "no, this isn't", which is what you can do with
classifiers, because there is a right and a wrong ground truth
answer.
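For reference, the standard Elo update Matt is gesturing at looks like this (chess-style formula; the K-factor of 32 is an assumption, not a value from the notebook):

```python
# Minimal Elo rating update: winners gain points, losers lose them, and
# an upset moves the scores more than an expected result does.

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32):
    """score_a is 1.0 for an A win, 0.0 for a loss, 0.5 for a draw."""
    e = expected(r_a, r_b)
    new_a = r_a + k * (score_a - e)
    new_b = r_b + k * ((1 - score_a) - (1 - e))
    return new_a, new_b
```

As described above, a high-rated prompt beating a low-rated one barely moves the numbers, because `expected` is already close to 1.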
69:39 (Speaker D) I just had each prompt generate for a test case, and the opposite
prompt, the competition prompt, would generate for that test case. So
it was a little bit complex, and then I'd have the model judge which
one was better. And it's expensive, right? It might cost like $20 in
GPT calls to get to an answer. But it turns out, at the end, the
prompts again were just kind of blowing me away.
70:04 (Speaker D) Awesome creativity in them. Like the words it used, the trigger
words, it didn't do what I would do. And in a really good way.
70:10 (Speaker D) And it also opened up my eyes to sort of like new ways of prompting
that I never would have thought of and just sort of like aren't
standard. And that's kind of the magic of all this. I think that this
sort of abstracts away the sort of atomic level of prompts, right?
You talk about prompts as sort of a prompt in and of itself and then
a system built around the prompts with many prompts kind of working
together.
70:31 (Speaker D) This makes it so that you don't have to guess about, do I have the
best prompts for this single atomic part of our system? Where the
magic really comes in then, is how do you string these amazing
individually crafted by AI prompts together to make something that
actually works really well.
70:46 (Speaker A) And how you robustly build the evaluation system, right? Because the
classifier is a simple example of evaluating, because maybe you know
this, et cetera, but how do you actually scale up the evaluation
system such that this could potentially run in loops and then
generate the best of the best prompts for a task?
71:03 (Speaker D) Exactly.
71:03 (Speaker A) That's also like a very interesting piece. How do you think about
evaluation going forward?
71:08 (Speaker D) Yeah, so I think it's sort of like that, where you could have this
thing run in the loop three times and take the three winners and then
have GPT read those winners right, and be like, here are prompts that
worked really, really well. Here are the test cases where they
failed. Now I want you to write new prompts that take what's good
about these but also mitigate the failure cases and generate a whole
new set of prompts. It's sort of like evolution; it really doesn't
just have to stop at one point in time after the first run.
71:37 (Speaker D) It's like, let's learn from what these amazing ones still did wrong
and continue to make this better and better and better. Obviously,
this relies on a relatively large test set. I'm also experimenting
with ways where you can have the test set autogenerate, but that's a
little bit finicky.
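The evolutionary loop described above can be sketched as follows; `score` and `generate_variants` are placeholders for the test-set evaluation and the "write new prompts informed by the failures" generation step:

```python
# Toy evolutionary loop: score the pool, keep the winners, generate new
# variants from them, repeat. The placeholder fitness counts unique
# words so the example runs without an API key.

def score(prompt: str) -> float:
    return len(set(prompt.split())) / 10  # placeholder fitness

def generate_variants(winners, n=4):
    # Placeholder: a real version asks GPT-4 for prompts that keep what
    # worked and fix the failing test cases.
    return [w + f" variant{i}" for w in winners for i in range(n)]

def evolve(seed_prompts, rounds=3, keep=2):
    pool = list(seed_prompts)
    for _ in range(rounds):
        pool.sort(key=score, reverse=True)
        winners = pool[:keep]
        pool = winners + generate_variants(winners)
    return max(pool, key=score)

best = evolve(["Classify the statement", "Answer true or false"])
```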
71:50 (Speaker D) But I do think that sort of like evolution of this could lead to some
really exceptional prompts. But what I found was even on the first
run I was seeing it outperform myself. For example, there was a
classifier we were using GPT-4 with logit bias for, because it was
such a hard challenge, and we were getting something like 90%
accuracy. I had it do these prompts with GPT-4, but then I had it run
them using GPT-3.5 and it got 96%.
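The logit bias trick mentioned here relies on the OpenAI chat API's `logit_bias` parameter, a map of token id to a bias between -100 and 100: pushing the two answer tokens to +100 and capping `max_tokens` at 1 forces a one-token verdict. A sketch of how such a request might be assembled; the token ids are hypothetical placeholders (look the real ones up with tiktoken), and nothing is actually sent:

```python
# Build (but don't send) a chat request that can only answer "true" or
# "false". TRUE_ID / FALSE_ID are hypothetical token ids.

TRUE_ID, FALSE_ID = 1000, 2000  # placeholders; resolve with tiktoken

def build_request(statement: str) -> dict:
    return {
        "model": "gpt-4",
        "messages": [
            {"role": "system",
             "content": "Classify the statement as true or false."},
            {"role": "user", "content": statement},
        ],
        # +100 bias makes these two tokens overwhelmingly likely;
        # max_tokens=1 stops the model after its one-token verdict.
        "logit_bias": {TRUE_ID: 100, FALSE_ID: 100},
        "max_tokens": 1,
    }
```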
72:19 (Speaker A) We've talked about this pattern before, where you can outsource the
hard work to GPT-4, but then once you get really good at prompting,
GPT-3.5 is actually very decent at many things, and it's way faster,
cheaper, and has a 16K context now that you can use. And so we've
seen this pattern with many folks: if you don't need the full power
of GPT-4, HumanEval for coding, et cetera, you can go with GPT-3.5
and get very far along, especially as you're getting better prompts.
And now, Matt, you have like a recursive prompt-crafter helper here.
And my next question for you is, have you used anything else? You
mentioned GPT-3.5, where you run the prompts. Have you tried them on
different models, like Claude maybe, or the open source Llama ones?
73:07 (Speaker D) I actually haven't, just because I wanted to see if this worked. It
was sort of just an interesting thing for me, and my time is really
focused on Otherside and the personal assistant, but it wouldn't be
hard to get Claude in. I suspect Claude prompts would perform better
on Claude, and OpenAI prompts would perform better on OpenAI, just
because the models respond to prompts very differently.
73:18 (Speaker D) Claude is sort of like a more emotional thinker. OpenAI is more of
like a logical thinker. It's a very simple, not perfect analogy, but
I suspect you'd want to sort of stick within the.
73:36 (Speaker A) Ecosystems, maybe. Not to mention Inflection's Pi, which is like a
whole different beast.
73:41 (Speaker D) Yeah, that's an interesting one.
73:44 (Speaker A) We discussed Pi a couple of times and I've seen some reactions, but I
don't think... maybe at the end of this, if we have time. Matt, one
question I will have for you on this, and then I think we'll move on:
where can folks find more of this work? Is it open source? Are you
looking for contributions? If you are. And yeah, just give us a
wrap-up of this project.
74:07 (Speaker D) Yeah, so you can find it on GitHub. It's called gpt-prompt-engineer.
Currently there are two notebooks. It's all done in Jupyter notebook
format, so it's pretty easy to edit. One is for the classification
system, the other is for the generation system.
74:20 (Speaker D) We're honestly sort of at a point where it works well, so it's like,
what do you build around it? One thing that's missing is that the
classification version only supports true and false labels, but it's
not hard to use tiktoken, or whatever it is, to allow it to support
arbitrary labels like happy, sad, angry, whatever. That's probably
like a 20-minute add that, if somebody goes in and does it, opens up
a whole new set of use cases. The evolution idea that I mentioned
before, right? Taking the best prompts and then saying, here's where
they went wrong on these test cases, and then throwing it back to GPT
and having it generate more and rerunning it, that's interesting.
74:45 (Speaker D) The ability to use Claude would be awesome if anybody wants to add
that. I could even see it evaluating each prompt on each model,
right? Because right now we only generate with GPT-4. We only
evaluate with GPT-3.5.
75:19 (Speaker D) But imagine if you generate half of them with GPT-4 and half of them
with Claude, and then you evaluate each prompt on GPT-4, GPT-3.5, and
Claude.
75:27 (Speaker D) And you can see sort of the latency success rates for each along with
scores. I think all that would be super interesting. Also sort of
like just open to ideas.
75:40 (Speaker D) I'm not really sort of supporting this at all. So if anybody wants to
kind of take it and run with it, I am all for that. Also sort of just
like a shameless plug right now or thing that we're looking for just
because I have an audience here.
We at Otherside and HyperWrite are really looking for somebody to
help on back end, hopefully with security expertise. And then also,
if anybody is experienced in training machine learning models, I
would love some help there, because we're doing a lot of LLM
training.
75:55 (Speaker A) So just a quick thing, and also to add that now, with the prompt
engineer that's automated, the results of this would likely generate
a great data set that you can keep and continue fine tuning on,
especially as GPT-4 fine tuning is coming soon. So Matt, definitely
store everything you generate, with the Elo score and everything,
from each gpt-prompt-engineer run that doesn't know about the rest;
maybe there's going to be a path forward to actually fine tuning a
prompting model, which could be... exactly. Well, yeah, exactly.
76:28 (Speaker D) Imagine taking a prompt and taking one that has a slightly higher
score and fine tuning a model to take the initial prompt and then
sort of output the one that has a higher score and you can do that
evolutionarily continue to get better prompts in theory.
76:40 (Speaker A) Awesome. So folks, if you want to work in a cool place like
HyperWrite, hit Matt up, and also check out gpt-prompt-engineer on
GitHub. Thanks for coming. Feel free to stay and kind of continue
commenting and talking with us as we go through a bunch of other
updates that we have.
76:57 (Speaker A) Just a quick check with NISten, who promised me to follow Twitter and
see if anything new comes up, breaking news as we talk. I haven't
seen anything besides the space on xAI.
77:04 (Speaker A) I will draw people's attention to the last pinned tweet, from Dr. Jim
Fan, that talks about the context length dip. Matt, you also touched
on this context length dip. It's basically a paper, I think
77:22 (Speaker A) from Stanford, I'm not sure, that figured out that even longer
context windows have a dip in the middle, which means that at the
beginning of the prompt and at the end of the prompt, the model pays
more attention to what you actually asked it or the details that you
provide, and in the middle there's like a dip.
77:39 (Speaker A) And this was also released this week. However, the one thing I said
previously I will repeat here: Claude, and some folks who know about
context windows way more than me, they say that Claude is actually
really good at this, without the dip.
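A rough way to probe this "lost in the middle" effect yourself: bury one fact at different depths of a long filler context and ask about it. `ask_model` is a placeholder; with a real model, the mid-depth questions are where accuracy dips:

```python
# Needle-in-a-haystack probe: place one relevant sentence at a chosen
# relative depth inside filler text, then ask about it.

FILLER = "The sky was gray and nothing in particular happened. "
NEEDLE = "The secret code is 7421."

def build_context(depth: float, n_sentences: int = 200) -> str:
    """depth 0.0 puts the needle at the start, 1.0 at the end."""
    sentences = [FILLER] * n_sentences
    sentences.insert(int(depth * n_sentences), NEEDLE + " ")
    return "".join(sentences)

def ask_model(context: str, question: str) -> str:
    # Placeholder: a real probe sends context + question to the model
    # and checks whether the answer contains the needle's fact.
    return "7421" if NEEDLE in context else "unknown"

results = {d: ask_model(build_context(d), "What is the secret code?")
           for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
```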
77:54 (Speaker D) Yeah. It's an interesting paper, but I feel like it's sort of saying,
hey, if you train on marketing copy, then it's going to be worse at
coding, obviously. Right.
78:03 (Speaker D) We do a lot of long context stuff at Otherside. That's actually what
I'm focused on right now, training really long context massive
models. And if you train it on data where there's context in the
middle that matters, it is going to be good at that.
78:16 (Speaker A) Interesting. So what you're saying, I think I've seen this kind of
opinion before as well. It's just the outcome of the data that was
fed in and for blog posts and other places, people want to hook your
attention in the beginning and then kind of finish strong. Basically
you're saying that this is potentially an outcome of that and not
necessarily the tech behind it.
78:38 (Speaker D) Yeah, I believe so. I mean, who knows, maybe I'm wrong, but from my
experience, right, why I gave that analogy before is: if you train it
up to do one thing and then you're asking it to do another, it's not
going to do that other thing as well. And I'm guessing the
data set that they sort of did this evaluation on was something that
didn't have a ton of information at all. Part of the reason that so
few of the language model companies have super long context length
models and why it was such a big deal that Anthropic did is because a
lot of the challenge in training them isn't actually in training
them, it's in the data.
79:08 (Speaker D) Obviously, inference becomes a challenge. It's the cost and the
overhead there. But the data to sort of do this is really sparse.
79:10 (Speaker D) It's not very available. Right. So that's I think part of it right
there's not just like a sort of standard data set that has super long
context link, that has information in the middle.
79:25 (Speaker D) We do actually, we've been building one at Otherside, and that's sort
of given me some of the ideas that I'm spouting here. But my guess is
that part of the reason Anthropic's works is because they focused on
the data. The data is really important.
79:38 (Speaker A) Right.
79:39 (Speaker D) I will say, it's not the model, it's just fine tuning.
79:41 (Speaker A) Yeah. I will say, when I got access to Claude's window, I did a bunch
of tests with my Twitter data. I just pasted a bunch of JSON with
Twitter IDs, numbers. And the smaller model, the not-100K one, gave
me back results that actually didn't invent those numbers.
79:57 (Speaker A) The 100K model lost it in the middle and started inventing those
numbers. I literally saw this difference between the longer context
one and the previous one, and I thought it's because it loses some
context in the middle. And I need to retry this on the new ones,
because with the new ones, they claim this doesn't happen.
80:01 (Speaker A) I want to go to Al, and yeah, one of you, I think, raised your hand
first, to talk about the context length dip and that paper: if you
have read it, if you have thoughts, and if you have noticed this as
well.
80:29 (Speaker F) I just had a quick question for Matt about the differences that he
found in prompting between say, Claude and GPT Four. I noticed like,
the prompts aren't really reusable and maybe you could speak to that
in the general case.
80:42 (Speaker A) Yeah, let's end with maybe this question and move on to other updates
as we have. Go ahead, Matt.
80:48 (Speaker D) Yeah, sure. So it's like talking to two people with two different
personalities, right? They're both people, but they respond
differently to the different ways you're sort of prompting them, if
you will. Claude is sort of more emotional, I guess, where OpenAI is
sort of more logical.
81:03 (Speaker D) And it's hard to pin that down to any one thing, and it's hard to
give you techniques based on that, because, again, every use case is
very different. But very clearly, you have to prompt them
differently. I think also, talking about the idea of fine tuning a
prompting model, what would be very interesting is fine tuning a
model that takes an OpenAI prompt and converts it to the idealized
version of a Claude prompt, and vice versa. I mean, I think that
could be very powerful, because there are ways to sort of intuit your
way there.
81:29 (Speaker D) It's just hard to sort of distill into a set of rules. One thing I
found actually quite interestingly with Quad two is that it is
insanely resistant to sort of like jailbreak attacks. So I was able
to get it to do it.
81:44 (Speaker D) Turns out the stupidest method worked. It was sort of like modifying
that DAN prompt that's been going around Reddit. But the more
nuanced, complex methods that typically work with OpenAI, they
didn't. So I think the model is just qualitatively different.
81:56 (Speaker D) I think it's going to take some time to fully explore it and
understand why and how still super early days.
82:06 (Speaker A) I love the fact that all of us are getting an intuition about
different models and how to approach them, right? Swyx was here
before; this is like a specialization of what I think he talked about
as an AI engineer. We're starting to understand the differences
between those models, the fine little things that you can say.
82:11 (Speaker A) And I think it will be very interesting if you have a model that's
trained to actually convert them, or translate them between the
models, to work the same. I have an idea for how not to get locked
into the GPT-4 ecosystem with the functions. I have an idea of
wrapping the GPT-4 API package with something.
82:47 (Speaker A) That will actually kind of print the functions into the context,
because Claude now has a huge context window, and then try to see
whether or not Claude is able, without additional tech, without
additional changes to the API, to replicate the outputs of how GPT
with functions would do. And that's going to be an idea I'll be
testing, hopefully, and talking about next week.
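A sketch of the idea being floated here: emulate native function calling on a model without it by printing the schemas into the context and parsing a JSON "call" back out of the completion. All names are illustrative; this is not a real Anthropic or OpenAI interface:

```python
import json

# Hypothetical function schema we want the model to be able to "call".
FUNCTIONS = [{
    "name": "get_weather",
    "description": "Get the weather for a city",
    "parameters": {"city": "string"},
}]

def build_prompt(user_message: str) -> str:
    """Print the schemas into the context and ask for a JSON reply."""
    schemas = json.dumps(FUNCTIONS, indent=2)
    return (
        "You may call one of these functions by replying with JSON of "
        'the form {"function": ..., "arguments": {...}} and nothing '
        f"else.\nAvailable functions:\n{schemas}\n\nUser: {user_message}"
    )

def parse_call(completion: str) -> dict:
    """Parse the model's JSON reply back into a function call."""
    return json.loads(completion)

# Pretend the model replied with a well-formed call:
call = parse_call('{"function": "get_weather", "arguments": {"city": "Paris"}}')
```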
83:08 (Speaker A) Thanks, Matt.
83:10 (Speaker C) Today there has been a thing, today, maybe yesterday, but anyway,
today there was a model released that generates prompts. By giving it
the data, you generate the prompt. I've written about it today on
Twitter. It is so powerful, it is such a cool method, that you can
take whatever you have, like, I don't know, scientific papers, and
generate instructions for them.
83:32 (Speaker C) Now you can fine tune a model that generates scientific papers. You
got jokes? Now you can train a model that becomes funny.
83:35 (Speaker C) You can generate the instruction, convert whatever you want into
instructions. It's amazing. One more thing about the dip in the
middle thing.
83:51 (Speaker C) I don't know why it happens. I have no idea how OpenAI trained their
models. But if you think about it, many instructions are: a
paragraph, and before the paragraph you tell the model, please
summarize the following; or, on the contrary, a paragraph, and at the
end, "what was that?" or something.
84:10 (Speaker C) So it makes a lot of sense that a model pays a lot of attention to
the beginning and the end, because of this. And on the same note,
it's very easy to fix, so I wouldn't just point fingers.
84:21 (Speaker C) It's good that they pointed it out, but I think it's like, I don't
know, a couple of minutes of training; OpenAI could fine tune for a
minute and fix it.
84:28 (Speaker A) I just want to ask, Yam: the tweet that I just pinned on top, this
was the one that you talked about, the instruction generation and the
prompt generation?
84:38 (Speaker C) Yeah.
84:39 (Speaker A) Awesome. So folks, definitely feel free to check this out. I haven't
seen this. You want to give a couple more words about that one.
84:44 (Speaker A) It looks like you wrote, like, a very deep dive. What's the model
like eleven B, three B?
84:54 (Speaker C) Sure. There are two models; put into the models whatever you want.
Okay, let's go back. You got a data set of something, emails from
your company, for example, and you want a model that will help you
write emails.
85:01 (Speaker C) Okay, you can start thinking about how to train this model, or you
can use this and now generate a text that basically says, help me
write the following email to this following person of something
something, and the actual email. And all of a sudden, you have a data
set to train a model, or to fine tune or whatever, that is extremely
tuned to this. So I think it's a very cool technique.
85:40 (Speaker C) It's very powerful, has a lot of potential. And the trick, in simple
words, is training the model what not to say. That's the missing
piece here, the trick that they added.
85:51 (Speaker C) They took instructions and outputs that do not fit, just a different
random output from the data, and trained with a different loss: that
the model should not say this, because this input with that
instruction does not result in this output. That's it.
86:11 (Speaker C) That's the trick. And it works perfectly, and it's really cool.
86:17 (Speaker A) Awesome. I have some folks who want to come up and ask questions. I
think we're almost there in terms of the updates. I will just briefly
run to some updates.
86:18 (Speaker A) I don't even have time to go and look for the threads, but if you're
not following llama.cpp, follow Georgi Gerganov; he's one of the
greats that we have in the space. I think he single-handedly is
responsible for so many folks trying to get a MacBook, because it's
incredible how much performance they've been able to squeeze out of
Llama, comparatively.
86:49 (Speaker A) And many people just, like, quantize their models, basically make
them smaller to run on this GGML platform that they have. The recent
news that I have from over there, there's like two pieces of news.
Last week, for those of us who were here last week, we talked about
CFG.
86:58 (Speaker A) I forgot something, I forgot: the guidance scale. And we talked about
the CFG parameter moving over from diffusion models that we know.
87:17 (Speaker A) Like, in Stable Diffusion, you can define how close to your prompt
the model should generate the image. Somebody decided to try it;
somebody said, hey, can we have this control of CFG for our LLM
generation? CFG is classifier-free guidance, something like that.
87:37 (Speaker A) And they did it. Georgi merged this into llama.cpp. And so now you
can actually kind of pass a CFG control and fine tune.
87:48 (Speaker A) It's almost like a running fine tune, to an extent. You can push the
model to be closer to, or farther away from, the prompt that you
have. Contrast this with the controls that we have on the GPT-4 API,
which is temperature.
88:01 (Speaker A) And I think, Matt, you mentioned something, logit bias, something
like that, right? Where you can ask it not to say certain things.
Contrasting with that, CFG is like a different beast, a different
control that we now have. And GGML just merged it into their
platform.
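The CFG idea carried over from diffusion comes down to one line of arithmetic: run the model once with the prompt and once without (or with a negative prompt), then mix the logits, with the guidance scale controlling how hard the output is pushed toward the prompt. A toy version, not the actual llama.cpp implementation:

```python
# Classifier-free guidance over logits:
#   guided = uncond + scale * (cond - uncond)
# scale = 1.0 reproduces the conditioned logits; larger values push
# the distribution further toward the prompt.

def cfg_logits(cond, uncond, scale):
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

cond = [2.0, 0.5, -1.0]    # logits computed with the prompt
uncond = [1.0, 1.0, 1.0]   # logits computed without it
assert cfg_logits(cond, uncond, 1.0) == cond
boosted = cfg_logits(cond, uncond, 2.0)  # stronger pull toward the prompt
```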
88:18 (Speaker A) Definitely worth checking out. And the second thing is, I need to
find the tweet. Yesterday, Georgi was like, oh yeah, by the way,
here's the 48% inference speed improvement that somebody just merged
in.
88:30 (Speaker A) Have you guys played and tried this? For the 33 billion parameter
model of Llama, somebody just merged in a 50% increase in inference
speed, just like that. And I find this incredible, because GGML
already runs many things on Raspberry Pi or whatever, iPhones, and
now somebody's like, oh yeah, here's a 50% increase in inference
speed.
88:41 (Speaker A) And then I think NISten was here before; he was talking about GGML
running on the iPhone, because iPhones, even from three years ago,
have the same neural engine chip as, like, the latest Macs or some
such, and this performance boost on GGML also applies to iPhones as
well. So, incredible stuff. And as we hear every week, we keep seeing
leaps, incredible leaps, in speed and performance.
89:15 (Speaker A) Definitely worth checking out GGML and the fine folks that work on
this stuff. GGML community folks who use llama.cpp, feel free to hop
up and raise your hand and give us more updates from that end.
89:28 (Speaker A) NISten, you are usually at the spaces, but sometimes as a guest as
well. Other than that, I think we'll move on to some more updates,
and then we'll just take questions. No? Cool.
89:41 (Speaker A) So the next update that I have is from the diffusion side, which we
sometimes cover. We don't cover it often, but we do cover it from
time to time. So, two things from Stability, Stable Diffusion.
89:46 (Speaker A) We talked about SDXL, the new XL model that can generate 1024 by 1024
images. We talked last week about the 0.9 weights dropping.
90:01 (Speaker A) SDXL 1.0 is now available in the Stable Diffusion discord. If you've
played with Midjourney before and then looked at Stable Diffusion, it
was like, it's not that great.
90:05 (Speaker A) Stable Diffusion SDXL 1.0 is really impressive. And besides being
really impressive, they plan to release it open source. So we're
going to see a bunch of folks fine-tune LoRAs and specific versions
for specific things.
90:16 (Speaker A) And I think it's incredible. If you want to play with those models
and you haven't yet, go to the Stable Diffusion Discord, hit up that
bot, and let us know how incredibly different it is. And we're
waiting for the SDXL 1.0 weights to drop.
90:47 (Speaker A) And I will mention this every week until the year mark: it's been
less than a year since Stable Diffusion.
90:57 (Speaker A) It's been less than a year. I remember, I think it was August '22
when they actually dropped the full open source model. Less than a year.
91:12 (Speaker A) And we've seen just such incredible progress. So, like Matt said
before, it's really hard to keep up, but it's also really hard to
internalize just how far we've come, with those incredible leaps and
changes every week. And again, to just plug this ThursdAI space.
91:21 (Speaker A) This is why we're here every Thursday, talking about everything
that's changed and updated. The other thing that I want to mention:
I see Art in the audience.
91:28 (Speaker A) If you've played with SDXL, feel free to raise your hand to come up.
The other thing that they released... I don't know if you guys are
familiar with ClipDrop. Stability bought ClipDrop as a company and
started implementing that interface alongside their Dream Studio
interface.
91:49 (Speaker A) So ClipDrop is a way simpler interface, and today they released
something called Stable Doodle. Stable Doodle is... I don't know if
folks in the audience remember this meme: how to draw an owl.
91:51 (Speaker A) Step one, draw a circle. Step two, draw some eyes. And step three is
like, draw the rest of the fucking owl.
92:06 (Speaker A) And then you have, like, a beautiful owl painting at the end of it.
This is now the go-to test for how the doodle models work. And I
pinned my attempt at this, but definitely check out the ClipDrop
Doodle thing.
It's really fun to play with. So those are, like, the updates from
the diffusion world.
92:10 (Speaker D) Hey, real quick. I was just looking at the repository for ComfyUI,
and then I saw that... I don't know how to say his name. Scousekip is
in here. So I just wanted to come on and say, like, hey, this is
incredible.
92:24 (Speaker D) This is what we've been talking about for months now, right? This
node-based character codex, if you will; there are just infinite
possibilities. I just want to listen, but thanks.
92:35 (Speaker A) For bringing me up.
92:36 (Speaker D) This is really cool, man. I was just... thanks for bringing up ComfyUI.
92:42 (Speaker A) I feel guilty at not being up to date on every single possible thing.
I know it's impossible. I really try, and ComfyUI has been on my list
to try, but then Claude was released and Code Interpreter was
released. ComfyUI seems like the thing we want, man.
92:42 (Speaker A) I think Stability, when they tried to bring up Dream Studio, talked
about, like, a node-based thing where you can pipe models into other
models, you can add filters, et cetera. ComfyUI, for folks who have
tested it out, it looks like that's it. And I definitely want to
agree with Art.
93:16 (Speaker A) It's something to watch and maybe try, because AUTOMATIC1111, even
though it's, like, super advanced and has been there from the
beginning, since Stable Diffusion, it's just, like, a shit show of a
UX. Just horrible, horrible. I'm sorry, guys.
93:30 (Speaker A) I've built a web UI before Automatic. It's really hard to get Gradio
to play as much as you want. It's really hard to maintain a good UX
product with many, many people contributing, with many, many things
changing under your feet.
93:45 (Speaker A) So it's really not their fault, but it's a shit show to get started
with. And ComfyUI seems like a fresh, clean start. So definitely, if
you're playing with this, test it out and let us know.
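For folks who haven't seen a node-based UI, the core idea being described, wiring models and filters together as a graph of nodes, can be sketched as a tiny graph executor. This is a toy stand-in to illustrate the concept, not ComfyUI's actual API; the node functions here are made up:

```python
# Toy node-graph executor illustrating the "pipe models into models"
# idea behind node-based UIs like ComfyUI. Not ComfyUI's real API.

def run_graph(nodes, edges, inputs):
    """Execute nodes in insertion order (assumed topological),
    feeding each node the outputs of the nodes wired into it.
    `nodes` maps name -> function; `edges` maps name -> list of
    upstream node names (or input keys)."""
    results = dict(inputs)
    for name, fn in nodes.items():
        args = [results[src] for src in edges.get(name, [])]
        results[name] = fn(*args)
    return results

# Stand-in "models": a prompt expander wired into an image generator.
nodes = {
    "expand": lambda p: p + ", highly detailed, studio lighting",
    "generate": lambda p: f"<image for: {p}>",
}
edges = {"expand": ["prompt"], "generate": ["expand"]}
out = run_graph(nodes, edges, {"prompt": "an owl"})
```

The point is that each node only declares its inputs; rewiring the graph changes the pipeline without touching the models themselves, which is what makes the node-based approach so flexible.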
93:55 (Speaker A) Max, you have your hand raised and you've played with SDXL. Give us
some of your thoughts.
94:01 (Speaker I) Yeah, I have played through the website, in Dream Studio. So I'm
lately working with a company that makes toys for kids. They want to
start incorporating AI. And one of my concerns, working with them, is:
okay, we want to generate images for kids, without something that is
going to freak them out. That relates to two things that diffusion
models have been lacking.
94:27 (Speaker I) One is the ability to paint complicated or intricate shapes, like
hands. SDXL is not better at it.
94:40 (Speaker I) Another one is this concept called concept bleeding, which is that
diffusion models tend to mix objects that are similar in shape or
form. It's not good at that either. Now, I was reading the paper from
Stability, or the report. They claim they are outperforming Midjourney
in five of seven categories. Midjourney 5.1, right?
95:12 (Speaker A) Just to make sure: Midjourney has since released a new version also,
because we're all moving at the same pace, but yeah, they compared to
Midjourney 5.1. Yeah.
95:20 (Speaker I) Well, this is a report released internally by Stability. It's a
paper, it might have some credibility, I don't know. I like the
results. It's very close to Midjourney, but I think it is still one
or two steps behind, in my opinion.
95:36 (Speaker I) What is different is what you have mentioned, Alex. Once they release
the weights and we can see LoRAs built on this, I'm expecting to see
the results we can get, because that is probably what is going to
position this model a step above Midjourney. But not yet; this is my
opinion.
95:58 (Speaker A) Yeah, definitely. And thanks for that. And I love folks coming up and
sharing their opinion about these things.
96:05 (Speaker A) Thanks, Max. Or, I guess... I know your new name, but I'm not sure if
I should use it.
96:10 (Speaker I) Yeah, totally, totally, you have it. I'm Juan, Spanish, living in
Mexico, and I like these things.
96:17 (Speaker A) We appreciate you coming up. On the topic of UIs that we've
mentioned: somebody released Pinocchio. They call it the AI browser.
And I want to highlight this because I want to give you practical
tips. Junaid, I think, is coming in with some breaking news.
96:28 (Speaker A) I don't know if Junaid wants to come up or can, but if you can, feel
free to come up and tell us; there's some news from Bard. Until we
talk about Bard, on the topic of UIs for those things: you guys know
we're mostly focused on the LLM side and the engineering side, less
on the diffusion side, but we sometimes have love for both. Pinocchio
is a tool that you can download so you don't have to deal with the
terminal or deal with a bunch of stuff; it unifies all of them.
97:08 (Speaker A) It's really nice. Check out the Pinocchio AI browser. I think it's
open source.
97:12 (Speaker A) You download this once, it's cross-platform, Mac, PC, et cetera, and
then you're able to download llama.cpp, and you're also able to
download Stable Diffusion. And then fairly quickly, without knowing
how to code, without going through the terminal, without installing
packages (folks here know that installing packages is a whole pain we
all share and we all hate), without doing all of that, that's the
promise they have: you are able to pipe Llama outputs into Stable
Diffusion.
97:38 (Speaker A) So Yam previously mentioned this kind of model, and Yam and Matt were
talking about a method of generating prompts for LLMs. But we also
know that there are models fine-tuned in different ways to actually
generate prompts for diffusion models. Right? And this Pinocchio
browser is actually allowing you to run an LLM and then pipe its
output into a Stable Diffusion model and then see the output of that.
I think it's incredible that this exists and is downloadable.
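The pipe described here, an LLM writing a detailed prompt that then drives a diffusion model, can be sketched with stand-in functions. Both "models" below are stubs standing in for llama.cpp and Stable Diffusion; the function names and the style suffix are made up for illustration:

```python
# Sketch of the LLM -> diffusion-prompt pipe described above.
# Both models are stubs; real use would invoke llama.cpp and
# Stable Diffusion.

def prompt_expander_llm(short_prompt: str) -> str:
    """Stand-in for an LLM fine-tuned to write diffusion prompts."""
    style = "cinematic lighting, 8k, highly detailed"
    return f"{short_prompt}, {style}"

def diffusion_model(prompt: str) -> dict:
    """Stand-in for Stable Diffusion; returns a fake image handle."""
    return {"prompt": prompt, "image": "<1024x1024 pixels>"}

def pipe(short_prompt: str) -> dict:
    # The whole pipeline: expand the prompt, then generate from it.
    return diffusion_model(prompt_expander_llm(short_prompt))

result = pipe("an owl on a branch")
```

What tools like Pinocchio promise is exactly this composition, but with the real models downloaded and wired up behind a GUI instead of code.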
98:07 (Speaker A) I haven't tried it yet. If you in the audience or somebody on stage
has tried Pinocchio, please raise your hand. I want to bring you up
to talk about Pinocchio and your experience with it.
98:19 (Speaker A) And if we haven't, I want to bring this to your attention so that
next week we're able to talk about it. This is added to my list of
things, like ComfyUI, that I haven't tried yet.
98:29 (Speaker A) Anybody use Pinocchio yet? No? Cool. I wanted to get Cocktail Peanut,
the guy who wrote Pinocchio.
98:36 (Speaker A) If you're in the audience, feel free to raise your hand. I don't
think you are, but feel free to follow the thread. He goes fairly
deep.
98:44 (Speaker A) And feel free to use and try Pinocchio by next week, and then come up
next week and talk about the differences between this and running
AUTOMATIC1111. All right, folks, thanks everyone for coming to
another ThursdAI space.
98:58 (Speaker A) Hope this has been helpful for a bunch of you. We tried a few new
things here. We tried to give updates, but also deep dive into a
conversation with Matt, and it looks from the reactions here like
maybe this is worth putting down on paper and sending out as an
email, for those of you who want to sign up and don't have the time
to listen to two-hour spaces. So I'll definitely try at least to do
that.
99:19 (Speaker A) I want to thank a few folks on stage who have joined consistently
and provided a lot of signal. Follow Yam; he has great insights into
models and training and different things. Al in the audience, thanks
always for coming up.
99:33 (Speaker A) Junaid is running the Denver meetup, and if you're in the Denver
area, feel free to join us next week. Thanks for coming. Haven't seen
you in a while, buddy.
99:45 (Speaker A) Juan, sorry. Yeah, I think, Juan, great. Maxi and Lentos have
recently been joining us.
99:51 (Speaker A) It's been great. We have some more folks in the audience who are
regulars, and we invite you to also be regulars and come up and talk
on ThursdAI. I will say this one thing: tag me in anything that's
new.
100:01 (Speaker A) I would love that. And help promote the message for other folks. If
you did like the space, this also really helps more folks get to it.
100:01 (Speaker A) For those folks whose questions I didn't get to, I apologize. I'm
trying to keep this as a balance between a high-signal thing and
letting everybody ask questions as well.
100:22 (Speaker A) Last thing I'll say is a little bit about myself. I'm a consultant.
I stay up to date so you don't have to; that's my tagline.
100:29 (Speaker A) If you're at a company that needs consultancy from somebody who's up
to date on everything, I try to be that guy. Feel free to tap me in
the DMs. And yeah, ThursdAI folks, keep tagging us in everything
that's new. We're going to try to cover it all next week.
100:34 (Speaker A) I thank all of you. Thanks for coming. Thanks for giving us two and a
half hours of your attention.
100:34 (Speaker A) I really appreciate it. Attention is scarce and very important, and I
really thank everybody who gave us, like, two and a half hours. Thank
you, folks.
101:00 (Speaker A) Hey, Alex, we really appreciate you.
101:04 (Speaker B) Thanks, Alex.
101:05 (Speaker H) Thanks for doing a good space and keeping us on track, actually.
101:09 (Speaker A) Yeah, thank you.
101:10 (Speaker D) Yeah, alex definitely want to kind of.
101:13 (Speaker A) Give our thanks to you as well.
101:15 (Speaker E) For curating an awesome space.
101:17 (Speaker D) I think I'm definitely not the only one that gets a lot of good
signal out of this. And I know a lot of hard work goes into keeping
yourself up to.
101:27 (Speaker A) Date so that you can share it.
101:28 (Speaker E) With all of us.
101:29 (Speaker D) So just on my own behalf, thank you. And I'm sure that is echoed by.
101:34 (Speaker E) A lot of people on stage and in the audience.
101:36 (Speaker A) Humble man thank you. I appreciate you. Thank you, folks. Have a nice
Thursday and bye next week.