Welcome Friends, to the first episode of ThursdAI recap.
Every week since the day GPT-4 released, we've been meeting in Twitter Spaces to talk about AI developments, and it slowly but surely created a community that's thirsty to learn, connect and discuss information.
Overwhelmed by daily newsletters about tools, folks wanted someone else to do the legwork, prioritize and condense the most important information about what is shaping the future of AI, today!
Hosted by AI consultant Alex Volkov (available for hire), CEO of Targum.video, this information-packed edition covered groundbreaking new releases like GPT-4.5 (a.k.a. Code Interpreter), Claude 2, and SDXL 1.0. We learned how Code Interpreter is pushing boundaries in computer vision, creative writing, and software development. Expert guests dove into the implications of Elon Musk's new xAI startup, the debate around Twitter's data, and pioneering techniques in prompt engineering. If you want to stay on top of the innovations shaping our AI-powered tomorrow, join Alex and the ThursdAI community.
Since the audio was recorded from a Twitter Space, it has quite a lot of overlaps (I think it's due to the export), so sometimes it sounds like folks are talking over each other, most of all me (Alex). This was not the case; I will have to figure out a fix.
Topics we covered in the July 13 ThursdAI:
GPT 4.5/Code Interpreter:
00:02:37 - 05:55 - General availability of ChatGPT with Code Interpreter announced. 8K context window, faster than GPT-4.
05:56 - 08:36 - Code interpreter use cases, uploading files, executing code, skills and techniques.
08:36 - 10:11 - Uploading large files, executing code, downloading files.
Claude V2:
20:11 - 21:25 - Anthropic releases Claude V2, considered #2 after OpenAI.
21:25 - 23:31 - Claude V2 UI allows uploading files, refreshed UI.
23:31 - 24:30 - Claude V2 product experience beats GPT-3.5.
24:31 - 27:25 - Claude V2 fine-tuned on code, 100k context window, trained on longer outputs.
27:26 - 30:16 - Claude V2 good at comparing essays, creative writing.
30:17 - 32:57 - Claude V2 allows multiple file uploads to context window.
32:57 - 39:10 - Claude V2 better at languages than GPT-4.
39:10 - 40:30 - Claude V2 allows multiple file uploads to context window.
xAI:
46:22 - 49:29 - Elon Musk announces xAI to compete with OpenAI. Has access to Twitter data.
49:30 - 51:26 - Discussion on whether Twitter data is useful for training.
51:27 - 52:45 - Twitter data can be transformed into other forms.
52:45 - 58:32 - Twitter spaces could provide useful training data.
58:33 - 59:26 - Speculation on whether xAI will open source their models.
59:26 - 61:54 - Twitter data has some advantages over other social media data.
Stable Diffusion:
89:41 - 91:17 - Stability AI releases SDXL 1.0 in Discord, plans to open source it.
91:17 - 92:08 - Stability AI releases Stable Doodle.
GPT Prompt Engineering:
61:54 - 64:18 - Intro to Other Side AI and prompt engineering.
64:18 - 71:50 - GPT Prompt Engineer project explained.
71:50 - 72:54 - GPT Prompt Engineer results, potential to improve prompts.
72:54 - 73:41 - Prompts may work better on same model they were generated for.
73:41 - 77:07 - GPT Prompt Engineer is open source, looking for contributions.
Related tweets shared:
https://twitter.com/altryne/status/1677951313156636672
https://twitter.com/altryne/status/1677951330462371840
@Surya - Running GPT2 inside code interpreter
tomviner - scraped all the internal knowledge about the env
Peter got all pypi packages and their description
Added Claude to the smol menubar (which we also discussed)
SkalskiP's awesome code interpreter experiments repo
See the rest of the tweets shared and listen to the original space here:
https://spacesdashboard.com/space/1YpKkggrRgPKj/thursdai-space-code-interpreter-claude-v2-xai-sdxl-more
Full Transcript:
00:02 (Speaker A) First of all, welcome to ThursdAI. We stay up to date so you
don't have to. There's a panel of experts up top here that
discusses everything.
00:11 (Speaker A) If we've tried something, we'll talk about this. If we haven't, and
somebody in the audience tried that specific new AI stuff, feel free
to raise your hand, give us your comment. This is not the space for
long debates.
00:25 (Speaker A) We actually had a great place for that yesterday. NISten and Roy
from Pine, some other folks, we'll probably do a different one.
This should be information dense for folks, and this will be
recorded and likely posted at some point.
00:38 (Speaker A) So no debate, just let's drop an opinion and discuss the new stuff
and kind of continue. And the goal is to stay up to date so you don't
have to in the audience. And I think with that, I will say hi to Alan
Janae and we will get started.
00:58 (Speaker B) Hi everyone, I'm NISten Tahira. I worked on, well, released one
of the first doctor chat bots on the market, for Dr. Gupta, and
scaled it, and now we're working on getting the therapist bot out
once we can pass more testing and get voice to work in a
profitable manner, because we don't really have VC. So at the
scale of a few hundred thousand users, the API bills matter quite
a bit.
01:31 (Speaker B) So, yeah, these spaces have been pretty helpful, because I had
some trouble running a voice transformer, trying to run it in the
browser on WebGPU, and then the person that wrote Transformers.js
comes in here and just says, oh yeah, that backend is messed up,
just try BLAS and such and stuff. So these have been very
interesting and technical spaces.
01:54 (Speaker A) Yeah, we need to get Xenova in here. Xenova is the guy NISten
was referring to. Al, Janae, do you want to give a few words of
intro and say hi, and then we'll start? Just briefly, please,
because I think we need to get going.
02:09 (Speaker C) Sure. Hi, I'm Janae. I'm the resident noob, I started messing
around with AI at the beginning of the year, and I also host the
Denver AI Tinkerers coming up next week.
02:20 (Speaker A) And if you're in Colorado area, greater Denver, please join us. It's
going to be a blast.
02:27 (Speaker F) Hi, I'm Al Chang. I'm kind of an old school technologist, just
getting started with AI again, and just here to help.
02:36 (Speaker A) Yeah. All right, folks, so I think we've had a whole space on
this. Simon Willison and me and many, many other folks chimed in
the second this was released.
02:50 (Speaker A) Was that six? Was that Sunday? It's hard to keep track of actual
days. Saturday, Saturday, last week. Exactly during those spaces,
by the way, as we were talking, Logan and everybody else from
OpenAI announced general availability of ChatGPT with Code
Interpreter. So GPT-4 with Code Interpreter.
03:12 (Speaker A) And I think we just heard from Matt that even some folks who got
access to it slept on it a little bit, maybe potentially because
of its very horrible name that's really hard to type (Code
Interpreter, you get lost in the R's). But it's an extremely
powerful new superpower that we've got. And we had a whole space
talking about use cases that people already had.
03:37 (Speaker A) It was like three days into it, and since then I bet that many
more people have tried it. I think swyx said 20,000 listens to
that space, plus the pod. At least people definitely want to hear
new use cases, right?
03:53 (Speaker G) Yeah, not much else to add about it. I think it's the feature,
per swyx.
03:59 (Speaker A) swyx posted a whole deep dive essay and coined it GPT 4.5
between us friends. And one of the interesting things about it,
we think, at least that's where we are currently after playing
around with this, is that it's a fine-tuned model. So they kept
training this on actually running code and executing code.
04:21 (Speaker A) That's what we believe. We don't know, nobody confirmed this.
And then that it's fine-tuned from an earlier checkpoint of
GPT-4. And so we actually had some folks on spaces talking about
it being less restricted and better than previous times.
04:36 (Speaker A) So it's interesting. I think, NISten, right? We have some folks
who tell us they're using Code Interpreter without the code part.
They just swapped out GPT-4, just because it's that model.
04:48 (Speaker A) And I think they also took down the 25 messages per hour
restriction on Code Interpreter. I've had like four-hour sessions
and it never stopped. Like, I haven't seen complaints.
05:03 (Speaker G) So it's just better.
05:06 (Speaker A) It's also fast. I think it's fast because maybe not many people
use this by default, and this could be the reason for the speed,
but it's definitely faster for sure. I think also the context
window, was it Yam? Somebody summarized the context window and
they told us the context window for Code Interpreter is 8K versus
4K for the regular GPT-4. Actually, that could also be a kicker.
05:29 (Speaker G) You mean Yam copied and pasted.
05:34 (Speaker A) I would encourage you and Yam to kiss and make up, because Yam
is doing a lot of legwork to track down the stuff that he posts,
and it's very visible. There you go, Yam, you need to clear the
air. However, I'll bring Pharrell and Gabriel up as well. And
we're going to keep talking about Code Interpreter, because
that's what we're here to do. NISten and a few other folks and I
started cooking with Code Interpreter.
05:59 (Speaker A) And by cooking I mean we started stretching the complete
boundaries of what's possible there. And I think Simon Willison
kick-started this with the Latent Space pod. So for folks who are
not following the Latent Space pod, feel free to follow swyx, his
main account, not this hidden one.
05:59 (Speaker A) And swyx reposted the spaces we had. Simon Willison was able to
run Node.js and Deno within Code Interpreter, even though OpenAI
didn't allow for that, by uploading a binary and asking Code
Interpreter to run it. Simon then promptly said they fine-tuned
the model away from that, and we found ways anyway to ask it to
do some stuff. I have a thread on how I was able to run a vector
DB, Chroma, inside Code Interpreter.
06:10 (Speaker A) I ran whisper.cpp. We saw some folks running GPT-2 inside Code
Interpreter, right? So imagine an LLM, GPT-4, running another one
and talking to it. It's like a little brother inside.
06:10 (Speaker A) I personally love that inception. I don't know if the person who
ran GPT-2 is in the audience. Surya, I think, was the nickname.
NISten, I don't know.
07:22 (Speaker A) Surya.
07:23 (Speaker B) Surya. He also wrote the search-the-PDF plugin for GPT-4
plugins, and he wrote that in like two days, and it's more used
than any other enterprise thing, which is pretty hilarious.
07:36 (Speaker A) We need to get Surya.
07:38 (Speaker B) Yeah, he just did that as, I'm just going to do a search plugin
for PDFs. And it's like the most used.
07:45 (Speaker A) So dope, pretty amazing. Again, in that space we've talked about
having like a living manual, so to speak, for Code Interpreter
use cases, because it's coding, so it covers pretty much
everything that we can think of as coders, maybe just in Python,
maybe restricted to an environment. And I've been trying to do
that with the #CodeInterpreterCan hashtag, and I encourage all of
you, let me pin this to the top of the space, to the jumbotron,
if you have an interesting Code Interpreter thing. And I'll bring
SkalskiP up to the stage as well.
08:03 (Speaker A) And Lantos, so many good friends. If you have a very interesting
Code Interpreter technique or skill or new thing that people can
do without coding skills, please tag it with this hashtag so
folks can find it. Otherwise, I will cover the three main things
Code Interpreter gave us besides the new model.
08:42 (Speaker A) One of them is uploading files. And since we've talked, we've
noticed that you can upload up to 250-megabyte files, and those
can be zips of other files. So we've uploaded full model weights.
08:55 (Speaker A) We've uploaded bin files. It's incredible that you can now drag
and drop a whole directory and have GPT just know about it and
read it. We've uploaded weights and embeddings.
09:08 (Speaker A) You can then obviously execute code in a secure environment,
which is again incredible, and you can download files; you can
ask it to actually generate a download for you, which is also
super, super cool. Maybe one last thing I'll say before I give it
to the audience for a few more cool use cases. And folks on the
stage, please feel free to raise your hand.
09:21 (Speaker A) I'll get to you in the order that you raise your hand if you
have a use case. Some folks built like a built-in memory, a
built-in brain, within Code Interpreter, just by saving to a
file. That's what I tried to do with my vector DB. And then they
download that memory at the end of every session, then upload it
to the next one, and have like a prompt that reminds ChatGPT to
start from that point.
09:50 (Speaker A) So in addition to the context window, they're also having a
separate, offloaded, file-persisted memory. So, Code Interpreter:
incredible. Again.
10:00 (Speaker A) Potentially GPT 4.5. And if you haven't played with this, feel
free to. If you don't know what to play with, follow the
#CodeInterpreterCan hashtag. And let's get to SkalskiP.
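The save-download-reupload memory trick described above can be sketched roughly like this. The `memory.json` name and the wording of the reminder prompt are illustrative assumptions, not anything Code Interpreter prescribes:

```python
import json
from pathlib import Path

# Hypothetical file name; any file Code Interpreter can write works.
MEMORY_FILE = Path("memory.json")

def save_memory(facts: dict) -> Path:
    """Persist session state to a file; in Code Interpreter you'd then
    ask for a download link at the end of the session."""
    MEMORY_FILE.write_text(json.dumps(facts, indent=2))
    return MEMORY_FILE

def resume_prompt(memory_path: Path) -> str:
    """Prompt to paste at the start of the next session, after
    re-uploading the memory file."""
    facts = json.loads(memory_path.read_text())
    return (
        "I've uploaded memory.json from our last session. "
        f"Please load it and continue from these {len(facts)} saved facts."
    )
```

At the end of a session you would ask the model to run `save_memory(...)` and give you the file; next session, drag the file back in and paste `resume_prompt(...)`'s output.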
10:11 (Speaker A) What's up, man?
10:14 (Speaker H) Hi, hello. Do you hear me?
10:15 (Speaker A) Yeah, we can hear you fine.
10:19 (Speaker H) Yeah, I've been playing a lot with Code Interpreter over the
past five days, mostly with computer vision use cases, because
that's what I do. I haven't introduced myself: I've been doing
computer vision full time for the past five years. When I saw
that you can input images and video, what I was immediately
thinking was, we need to make it do computer vision. So I went
through some low-effort tasks.
10:46 (Speaker H) So I managed to run old school computer vision algorithms, face
detection, tracking of objects, stuff like that. But I also
managed to exploit it a little bit, so you can add YOLO object
detection models to the list of models that run in Code
Interpreter.
11:15 (Speaker H) There are some problems with memory management, so I'm not yet
fully happy with the result. But yeah, I managed to run it on
images and on videos. And the thing that is super cool and kind
of underrated right now: false positives. So when the model
detects something that shouldn't be detected, you can really use
text to ask Code Interpreter to filter out false detections.
11:48 (Speaker H) You can just give it your feeling of why that stuff is
happening, or when or where, and it's very good at cleaning the
detections, which was kind of mind-blowing for me. And one thing
that I noticed it sucks at: I managed to create an application
that counts objects moving in the video when they cross a line.
11:55 (Speaker H) And I didn't use any off-the-shelf libraries. I just had the
detector and said, okay, now draw a line and count objects when
they cross the line. It's terrible at that, writing math logic to
figure out that something crossed something. We had like a ten or
twelve prompt exchange and I basically bailed out on that, forget
it. So there are some things that blow my mind, but there are
some things that probably don't.
12:49 (Speaker A) So folks, feel free to follow SkalskiP. And also, I just pinned
to the top of the space his brand new awesome code interpreter
use cases Git repo, and there's a list, there's a bunch of use
cases there. This could also serve as a de facto manual. So feel
free to go there, add PRs, and follow it for updates.
12:52 (Speaker A) And I want to get to Lantos because he seems to be unmuting.
What's up, Lantos?
13:12 (Speaker H) I was just going to say I can't follow him because he's blocked me.
13:15 (Speaker C) Sad face.
13:16 (Speaker H) Oh, no, I noticed that, but I'm not sure why. I will undo that.
13:20 (Speaker A) All right, I'm the peacemaker in this space. Please kiss and
make up, you two as well. Everybody should get along.
13:26 (Speaker A) Yay. I want to get to some other folks who came up on stage recently.
And Gabriel, welcome to talk about code interpreter and your use
cases.
13:35 (Speaker A) Janae, if you've played with this, I would like to hear two more
opinions before we move on to the next incredible thing. Yeah.
Oh, you guys are talking, let's get Gabriel and then Janae.
Sorry, I should have been explicit about the order.
13:54 (Speaker E) No worries. So I just posted a comment on this space about the
message cap on a conversation. So even though in the UI it still
says 25 messages per 3 hours, if you look at the network request,
you can see (and I posted this) it's actually 100 messages per 3
hours now.
14:12 (Speaker E) And I don't know if they're scaling that up and down as demand
increases and decreases, or they're just trying to trick people
into conserving their messages, but it's definitely been at 100
for a little while now. Same thing, you can see it in the network
requests.
14:32 (Speaker A) Can you confirm the same for the regular mode, or do you think the
regular mode is still restricted? Well.
14:41 (Speaker E) Based on just the fact that there's only one message cap, they
don't have a message cap per model. So I think it's just
consistent across all the GPT-4 models. And that's also my
experience; it's probably been at least a couple of weeks that
it's been higher.
14:51 (Speaker E) And same thing we discussed, I think, on Saturday about the
context window. And you can also see it in the API that the
context window is 8K for plugins and Code Interpreter, and it's
4K for the base GPT-4 model.
15:16 (Speaker A) That's awesome. Like swyx said, better in every single way.
15:22 (Speaker D) Yeah.
15:23 (Speaker A) Awesome. Thanks.
15:24 (Speaker E) Yeah. In terms of use cases I can share, I've been digging
around a lot in the Code Interpreter, and I was really trying to
hone in on why the packages that are installed there, the Python
packages in the environment, are there. Some of them seem really
random, and some of them make a lot of sense. And they released
it saying it's for, basically, data analysis. And a lot of them
make sense for that, but some of them are just really wild, like
the ML packages.
15:54 (Speaker A) And Gabriel, folks in the audience: if you look up at the
jumbotron where we pin tweets, two tweets before there's a tweet
by Peter Zero Zero G, who actually printed all the packages and
asked GPT-4 to kind of summarize what they do. So if you have no
idea about the potential capabilities of what it can do, feel
free to pin that tweet for yourself. And then it has a bunch of
descriptions of what's possible.
16:11 (Speaker A) So go ahead, Gabriel. Yeah, cool.
16:28 (Speaker E) Yeah, I've done the same kind of thing, just shorter. I got it
to do a four-word description for each one. So if you're looking
for a really short description of each package, I'll post that
tweet. And if you're looking for a long one, I think Peter's is
great. And what you can see there is that there are packages for
web development, right? There's FastAPI, there's Flask, there's a
bunch of other packages for web development.
16:40 (Speaker E) And besides the fact that there's no network access (which, for
the people using it internally, might be turned on), it was just
interesting to me. My perspective is that OpenAI has been using
this internally throughout all their teams for development, and
testing it internally, but probably also using it pretty
consistently. They probably have access to the Internet.
17:14 (Speaker A) Yeah, I'm sure they have access to.
17:15 (Speaker E) The Internet, and they can install new packages. But I think
they also have the ability, instead of uploading files and
downloading files, to just mount... not persist memory, I don't
think. I think they just mount their local working directory on
their computer, wherever they're working. So they have their
active directory where they have their project, and they just
mount that and give the Code Interpreter access to the whole
directory with the whole repo of their project.
17:48 (Speaker C) Yeah.
17:48 (Speaker E) And then ChatGPT is just writing code to the working directory
and reading from there, and it can explore their whole project.
We can do that now by uploading: you can zip your whole project,
upload the whole thing zipped, and have it unzipped, and then it
can kind of explore your whole project. But then once it makes
some changes and you want to commit them, you have to ask it to
zip the whole thing back, download it, and upload it.
17:48 (Speaker E) And then I think what they're able to do is more of a kind of
pair programming thing, where the developer makes some changes
and then ChatGPT makes some changes, and they're kind of working
together. This is taking it one step further. I don't know if
they have this or not, but it would be super cool.
18:29 (Speaker A) In the realm of updates, let's keep it to no speculation. But I
would love to explore this more with you in the next space,
because this applies to open source, and people already saw it:
somebody tagged us after the last space and said, hey, I'll build
this open source. I would love to pin this to the top of the
space. However, I want to move on to the next topic and then move
on to other updates.
18:51 (Speaker A) Sorry to interrupt, but thanks. I think that the collaborative,
persistent code superpower, which probably at some point will
come to us as well, plus the Internet access, is like another
10x. I want to get to SkalskiP and Lantos, and I think we'll move
on to Claude.
19:08 (Speaker A) Thanks, Gabriel.
19:11 (Speaker H) Yeah, I have a question. I'm not really sure, guys, if you
noticed: I was obviously experimenting with PyTorch because I
needed it for computer vision. I noticed that the PyTorch version
that is installed in the environment is actually precompiled to
work with CUDA. So it's a GPU version of PyTorch.
19:31 (Speaker H) Even though in the environment you don't have access to a GPU,
you only have CPU. So I'm curious, guys, what you think about
that. Why is that? Any ideas?
19:42 (Speaker A) An idea that just comes from what Gabriel just said: likely
we're getting the same Kubernetes container. However, the OpenAI
folks have like unlimited stuff; they probably also have CUDA.
That would make sense, right? Theirs is probably connected to a
GPU as well, but that's just an idea. Lantos, I want to get to
you, and then we'll move on to Claude.
20:02 (Speaker A) Folks in the audience, feel free to hit the little write button
on the bottom left (looks like a little message) and leave
comments; keep the conversation going through comments as well.
Moving on to Claude V2. Folks in the audience and folks on stage,
feel free to hit up the emojis, plus one.
20:19 (Speaker A) Minus one if you have tried Claude V2 and you haven't liked it.
I'm going to cover this anyway, because I think somebody, I think
Roy from Pine, called me a Claude V2 fanboy yesterday, and I
first got offended, and then I told him that I'm just a fanboy
for 24 hours. Before that I was a Code Interpreter fanboy. And
then I figured out with myself whether or not I am a fanboy of
Claude V2.
20:43 (Speaker A) And yeah, I am. And swyx told me to relax, and in fact I invited
him here to be the wet blanket on the other side of the aisle.
Anthropic, the company that we can definitely consider number two
after OpenAI, I think that's fair in terms of quality.
21:02 (Speaker A) Have now released a new Claude version. And they made some waves
when they released Claude, a.k.a. 'clong', with the 100K context
window. They have released Claude V2, and let me paste some
Claude, sorry, pin some Claude thingies in the jumbotron.
However, Claude V2 released with multiple things, and I want to
focus on two, and I think we'll cover the UI first and then we're
going to talk about the model itself. UI-wise and product-wise,
my hot take, and I'll pin this to the top.
21:38 (Speaker A) Hopefully we'll not debate this, but I love you, all of you: as
products, Claude V2 right now beats ChatGPT as a product. My mom
can go into the two websites and she'll prefer one versus the
other one.
21:51 (Speaker A) Or my friends that don't know AI, that aren't as plugged in as
we are. Theirs is free, and I think Claude V2 beats GPT-3.5,
which is also free. And the 100K context window, with the model
being trained on 200K, unleashes a bunch of use cases that were
not possible before.
22:12 (Speaker A) It just frees you up. You heard SkalskiP just describe the
limitations of Code Interpreter; a bunch of those limitations
stem from the 8K context window.
22:13 (Speaker A) If you print a bunch within the code that you're running, Code
Interpreter sometimes forgets what you guys talked about 20
minutes ago. And the 100K context window also means a long, long
conversation history with the model. And I think it's really
great.
22:37 (Speaker A) Not to mention that you can drag and drop full books in there.
Those books need to be in like one or two files; they still don't
accept zip files. And I'm planning to release an extension soon
that does this for us and unifies them into single files.
22:51 (Speaker A) So hopefully by next week we'll have some updates. However, once
you upload that much, or you can upload like a transcript of a
podcast, you can do a bunch of stuff, because Claude V2 is also
better trained on code, and we saw a significant jump in... wait,
I'm switching to the model, so let me get back to the UI. The UI
allows you to upload files.
23:09 (Speaker A) The UI has a Command-K interface, which I personally love. I hit
Command-K on every website and see if they support it. You can
just start a new chat real quick.
23:21 (Speaker A) It doesn't have Share, but it's definitely a refreshed and free
UI. It's called claude.ai, and that's the URL; if you haven't
tried it, definitely try it. Comments about just the product side
and the UI side before we move to the model? Anybody play with
this? Anybody like it? Anybody love the upload files feature? I
would love to hear hands and comments.
23:42 (Speaker A) Go ahead, Matt.
23:44 (Speaker D) A bit of a weird thing, but what I've noticed is it's actually
quite frustrating if you want to paste text in: if it's over a
certain length, it will paste in as a file. Little small thing;
hopefully they'll change it, but it is really annoying because
then you can't edit it. ChatGPT does do that much better, but I
generally agree with you that overall the product experience on
Claude is.
24:03 (Speaker A) Significantly better with the new one, the fresh coat of paint
they released for us. I will say that Claude so far was kind of a
hidden gem: only folks who got access to the API actually got
access to their UI, and that UI was very restricted. Folks who
have access to the Claude API know what I'm talking about. I
think that UI is still around.
24:22 (Speaker A) It still shows your history. It's very restrictive. It's not as
cool as this, it's not as sleek as this.
24:27 (Speaker A) So we like claude.ai, definitely a plus. Check it out. Now,
let's talk about the model behind this UI, because that model
also changed, and several incredible things changed with it.
24:38 (Speaker A) First of all, they released a new model at the same price as the
previous one. We love to see this. Please, everybody, including
OpenAI, continue giving the same price, and cheaper and cheaper
down the line.
24:41 (Speaker A) We love to see this. Second of all, they claim it's been
fine-tuned on several things. One of them is code.
24:54 (Speaker A) And we actually saw a bump in the evaluation called HumanEval,
which is a set of questions that OpenAI released, and I think the
bump was from like 55% to 78%, which I think beats 3.5 and is not
there compared to GPT-4. Correct?
25:14 (Speaker C) Yeah, and GPT-4 on pass@1, on the first try. Not GPT-4 that is
allowed to refine and fix it, but on the first trial. Yeah, by a
little bit.
25:33 (Speaker A) So, news to me, and thank you for joining in. The pass numbers
are how many times it's able to reflect upon its answers and
improve them.
25:43 (Speaker C) The pass@k is kind of what I meant. By reflection, it's even
stronger, GPT-4: if GPT-4 sees the exception, it can come up with
a solution. So this is not in the HumanEval test, but if you use
GPT-4 this way, you get to 90-something percent, which I think is
more realistic if you think about it. No programmer writes the
whole code in one go.
26:10 (Speaker C) You write it iteratively, fix bugs and so on. And also in Code
Interpreter, you see it. But it is remarkable to see state of the
art.
26:19 (Speaker A) State of the art on the first try, and it's significantly better
in code. And I suggest folks who previously tried Claude and
weren't impressed to try it as well. An additional crazy thing
that they've trained on is the 100K context window; they've
actually trained, they claim, on a 200K context window, so twice
as much as the previous one. And we follow this one guy, Ofir
Press, the guy behind Self-Ask with Search and the guy behind
ALiBi, the ability to extend context windows.
26:55 (Speaker A) He just defended his PhD, and he talked about context windows,
and he was impressed with the way they presented and the way they
showed their loss curve. We saw the paper maybe this week, folks
saw the paper, where attention dips in the middle: there's less
attention in the middle than at the beginning and the end.
27:03 (Speaker A) And it looks like that's not the case for Claude. So I suggest
you try the huge context window. And Al, you have your hand
raised, and then we'll talk about some other model changes.
27:26 (Speaker F) Yeah, I'll talk a little bit about that. I used Claude about a
month and a half ago to win Best Solo Hacker at the Craft
Ventures hackathon, David Sacks' one. Yeah, it had like 200
entries. But it's exceptionally good at creative writing and also
comparing and contrasting. I don't think people have really taken
advantage of what the context window is capable of doing. It's
more than just loading single files in.
27:53 (Speaker F) So what I did for the project was I loaded these large
legislative bills, these like 50-page unreadable bills, and
turned them into relatable narratives. So one of the things that
Claude can do is adopt a persona. A lot of times with summaries,
summaries just compress the text that you see, but you can tell
it to, say, write 1000 words from a social conservative point of
view, or a bus driver's point of view, or a social liberal point
of view.
28:21 (Speaker F) And what that does is it takes all of its knowledge about the outside
world and gives you not a summary, but it gives you essentially an
essay about the practical effects of something like a bill. I've
actually been working with the idea of reading a book and having it
tell you what I would have learned from this, because that's actually
probably what you're more interested in. What it can do in terms of
comparing and contrasting large essays is exceptional.
28:51 (Speaker F) So you could have it say, write 2000 words from a social conservative
point of view, 2000 words from a social liberal point of view, and
then have it contrast the essays, which is something that would be
very difficult for a human to do. So you get to give it multiple
files and have it just give you a more balanced approach so you get
rid of some of the bias that comes in.
29:18 (Speaker A) My go-to dream project that I never get to is to create this for
Twitter, like a Chrome extension where I can select a bunch of
tweets and then say, remove the bias from this and just give me
the debiased version of all of it. Yeah, completely. The
cross-referencing ability of Claude, because of this context
window, is incredible for many, many use cases.
29:41 (Speaker F) Yeah, I would say it's not as good as GPT-4 for certain things. But that context window is fantastic. And I would say to a lot of people that are using embeddings and retrieval: you can actually just put the whole thing in the context window and ask questions against that, and then you have a baseline to compare your results to. Most people, if they're chatting to a website or something like that, can just put the whole thing in there as opposed to trying to chunk it up and do questions, and you'll see that your results are much better that way.
29:51 (Speaker F) And for most people, that would be good enough.
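The "just put the whole thing in the context window" approach can be sketched with a rough size check; the ~4-characters-per-token estimate and the helper names are assumptions for illustration, not a real tokenizer or API:

```python
def fits_in_context(text: str, context_tokens: int = 100_000,
                    reserve_tokens: int = 5_000) -> bool:
    """Rough check whether a document fits in a long context window.

    Uses the common ~4 characters per token heuristic (an assumption;
    a real tokenizer would be exact) and reserves room for the question
    and the answer.
    """
    est_tokens = len(text) / 4
    return est_tokens <= context_tokens - reserve_tokens

def build_qa_prompt(document: str, question: str) -> str:
    """Stuff the whole document into the prompt instead of chunking it."""
    return f"Document:\n{document}\n\nQuestion: {question}\nAnswer:"

doc = "word " * 20_000  # ~100k characters, roughly 25k tokens
if fits_in_context(doc):
    prompt = build_qa_prompt(doc, "What is this about?")
```

When the document fits, this gives the baseline the speaker mentions; chunked retrieval is only needed once the check fails.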
30:17 (Speaker A) So an additional thing that Claude was trained on: they've talked about the output tokens, just the number of output tokens Claude is able to generate. And they've said that previous Claude models were focused on shorter outputs, just as they were trained (I don't know if the same is true of GPT-4, I haven't seen numbers), and this latest model was trained to output up to 4,000 tokens.
30:47 (Speaker A) This is added to the fact that they also fine-tuned it to output JSON files, complete JSON files, as responses, which we as engineers have waited for. OpenAI gave us functions, kind of a here-you-go, there's the function interface. And we love the function interface, but it kind of locks us down to the OpenAI ecosystem.
31:04 (Speaker A) And it's great to see another model that's very close to state of the art on HumanEval that is also now fine-tuned to respond in full, intact JSONs. And those JSONs can be 4,000 tokens in length. Any thoughts on these?
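Even with a model fine-tuned to emit complete JSON, it's worth guarding the parse on the client side; a hedged sketch (the fence-stripping heuristic is an assumption, not anything Anthropic documents):

```python
import json

def parse_json_response(raw: str):
    """Parse a model response that is supposed to be a complete JSON object.

    Strips common wrapping (markdown fences, leading prose before the
    first brace) before parsing, and raises if no object is found or the
    JSON is malformed/truncated.
    """
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in response")
    return json.loads(raw[start:end + 1])

# A well-formed response parses cleanly even with fences around it:
out = parse_json_response('```json\n{"entities": ["Claude"], "ok": true}\n```')
```

On a `ValueError` or `json.JSONDecodeError`, a caller would typically retry the request, which is exactly what the fine-tuning on complete JSONs is meant to make rare.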
31:28 (Speaker F) Yeah, I can confirm on it being able to write large amounts of
output. I mean, I was having it write like 2000, 3000 word like sort
of essays and outputs and it was fine with that.
31:40 (Speaker A) Yes. And I think it's I'm going to.
31:45 (Speaker B) Stick with GPT Four myself. But this might be pretty useful for just
dumping in an entire code base, given the 100k context window and
then getting some reviews and stuff, and then maybe moving some of
the stuff.
32:02 (Speaker A) Once I stop posting statuses and build that Chrome extension where you upload the zip and it flattens it to one file and then uploads it, then we'd be able to do a proper comparison, because code interpreter can take zip files and then extract them. Oh, one difference that I want to flag for folks in the audience: GPT-4 with code interpreter allows you to upload zip files, et cetera. We talked about this. It does not load them into the context window, right? There's like an 8k context window.
32:30 (Speaker A) The files that you upload are not automatically in the context window. The model has to write Python code that actually prints the files. And it usually does just the first few lines, hint, hint.
32:30 (Speaker A) The folks in the audience who get my drift. But it doesn't usually read all of it unless you specifically ask it to, and Claude does. So everything you upload to Claude goes directly into the immediate working memory of the context window.
32:38 (Speaker A) And that's a major difference to watch out for and also take care of. Go ahead.
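The zip-flattening extension idea mentioned above is a few lines of stdlib Python; the delimiter format is an illustrative assumption:

```python
import io
import zipfile

def flatten_zip(zip_bytes: bytes) -> str:
    """Concatenate every file in a zip into one labelled string, so a
    whole archive can be pasted into a single context window at once."""
    parts = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            text = zf.read(name).decode("utf-8", errors="replace")
            parts.append(f"===== {name} =====\n{text}")
    return "\n\n".join(parts)
```

The per-file headers let the model refer back to individual files even though everything arrives as one blob.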
33:00 (Speaker C) I would like to ask everyone, before I say my opinion: what do you think about it in comparison to GPT-4, performance-wise? What do you think?
33:10 (Speaker A) I would like comments from folks who actually used both and did the comparison. And before I get to folks, please raise your hand to answer. I want to call out Swyx's smol menubar, which allows you to actually... Swyx, can you give us like a brief two minutes on the menubar thing?
33:28 (Speaker G) Yeah, well, you don't have to choose. Just run it all the time on every single chat. So it's a little Electron app that runs in the menu bar. And I've been maintaining it, and I just added Claude 2 this week.
33:42 (Speaker G) Claude 2 is not super stable yet. Sometimes it will fail to hit the submit button, so you just have to retry manually.
33:50 (Speaker G) But yeah, it's a great way to A/B test models, but then also just amplify every question across four to five different chat models and compare the answers. So I've been trying it. It's up to you if you want to find it.
34:14 (Speaker A) Post it with the announcements, if you can. Yeah, awesome. Just basically, you don't have to stop using one, you don't have to choose. So I think the last thing that we need to acknowledge about Claude is the multilinguality.
34:28 (Speaker A) So they actually focused on showing us how much better the new one is than the previous ones, and they posted BLEU scores. Claude 2 is significantly better at languages than the previous versions. I think, to answer your question, it's close to GPT-4, if not better at some things. Hebrew goes fluently, and usually Hebrew is not that great.
34:57 (Speaker A) Russian and Ukrainian that I use also go fluently. And that part is
really good with a lot of context because you sometimes need to do a
lot of translation, or at least I need to do a lot of translation.
35:11 (Speaker C) Yeah, multilinguality works great. I was surprised. Absolutely. What I think: if you just compare the two on the same prompt, the same question, I have a feeling that GPT-4 is slightly better, but I just don't have an example to show you.
35:31 (Speaker C) Okay, I don't know, it's a strange situation, but I really wanted to ask you: what did you try, and what worked better here and there?
35:38 (Speaker A) So here's my use case that GPT-4 currently cannot do. Yesterday, Lex Fridman interviewed Israel's Prime Minister Benjamin Netanyahu, in one of the weirdest turns of history this podcast has taken. And given that I kind of know who Benjamin Netanyahu is from before, I decided to not listen to it and instead use the tools that we have at our disposal. So I ran it through Whisper with diarization, so I have a very nice transcript of who's talking.
36:10 (Speaker A) When I took that, I just dumped it as a text file. And I agree with Matt, it's a little bit annoying that Claude turns whatever you paste into a little text file upload, because you can't edit it.
36:21 (Speaker A) However, I uploaded that transcript directly to Claude, and then I asked it to do sentiment analysis and entity extraction. That's something that, if I'd asked GPT-4 code interpreter, it would probably write some Python code to do, and Claude just kind of did it. And I haven't seen GPT-4 being able to do this for bigger files.
36:38 (Speaker A) And once I could... let me just finish this point. I continued by saying, hey, because of the new coding abilities of Claude, I asked it: print me a Python file that dumps whatever table of topics he mentioned, plus sentiment, negative or positive, into a word cloud. That's something the code interpreters can actually do and show you.
37:03 (Speaker A) But I asked it from Claude, because previously Claude was shit at coding, and it gave me Python files that ran on the first try. I didn't have to change anything, there were no bugs. And then it showed me a word cloud of everything that was mentioned by Bibi in that podcast, and it all took maybe seven minutes.
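The topic-and-sentiment tally behind that word cloud can be approximated with the standard library; this sketch assumes the model has already returned (topic, sentiment) pairs, which is an illustrative format, not the actual output:

```python
from collections import Counter

def topic_weights(rows):
    """Turn (topic, sentiment) pairs into word-cloud weights.

    `rows` is assumed to be what the model extracted from the transcript:
    a list of (topic, "positive" | "negative") tuples. Each mention adds
    weight; sentiment is tallied separately so the cloud can be colored.
    """
    counts = Counter(topic for topic, _ in rows)
    sentiment = {}
    for topic, s in rows:
        sentiment.setdefault(topic, Counter())[s] += 1
    return counts, sentiment

rows = [("judicial reform", "negative"), ("AI", "positive"),
        ("judicial reform", "negative")]
counts, sentiment = topic_weights(rows)
# A library like `wordcloud` could then size each word by its count.
```

The counting is the part a plain chat model can hand you data for; rendering the actual image is where a code interpreter comes in.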
37:11 (Speaker A) And I don't know if, for bigger context windows, GPT-4 can currently do this. Go ahead, Al.
37:28 (Speaker F) Yeah, I've actually been putting a lot of transcripts of podcasts in there, and because it knows so much about the speakers, you can actually have them continue a discussion about things that they didn't actually discuss. Yeah, so you can have it say, okay, what are some topics they disagreed on, and some things that they didn't cover? Tangentially, you can just have it give you another two minutes of interview, and it does a pretty reasonable job, especially with public figures where it actually has a lot of their background. So it's pretty interesting.
38:01 (Speaker A) And not to mention free. GPT-4 needs a $20 a month payment, and Claude is free.
38:08 (Speaker F) That's a good point, too. For those of you that have eval keys,
you'll notice that they're actually not charging you for them, so you
can actually go on as long as you want. The limitation is that you
can only do one request per organization. So if it's just a single
person, they only charge you basically when you start deploying for
commercial purposes.
38:21 (Speaker F) So that's something that people may not have realized.
38:32 (Speaker A) So I think we've covered everything, right? Trained on 200K context, which they could enable tomorrow for us, and we'd get like two x. It's going to be insane. There is some stuff that they have in Claude, at Anthropic, called Constitutional AI, so they have a mix of RLHF and Constitutional AI. So they're working on their model to actually be more helpful, but also more safe and less jailbreakable.
38:57 (Speaker A) They talked at length about this. We talked about HumanEval being better, and same price, and the free playground. I think we've covered most of it.
39:03 (Speaker A) So anything else about Claude that we haven't covered, feel free to raise your hand and tell us, and if not, I think we can move on. What do you guys think?
39:17 (Speaker G) I'll mention briefly, did you talk about the multiple file uploads?
39:21 (Speaker A) No, go ahead.
39:24 (Speaker G) So I think it's just an interesting difference between code interpreter and Claude. In code interpreter, you can only upload one file, right? But it can be a zip file with multiple files in it, so it's de facto multiple files, but then you can only run code on that. Whereas what Claude is doing here is something slightly different, which to me is interesting: you can upload multiple files, it just reads the files straight into the context, and it's using that 100K context to synthesize answers.
39:24 (Speaker G) So you can do, for example, PDF A and PDF B and give me a comparison
between the two of them or synthesize knowledge across them. And I
think that is something that code interpreter cannot do because code
interpreter will only run code across files. So I think that's
noteworthy.
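The multi-file synthesis Swyx describes amounts to concatenating labelled documents into one long prompt; a minimal sketch (the document-tag format is an assumption, not Claude's required input format):

```python
def multi_doc_prompt(docs: dict, instruction: str) -> str:
    """Pack several documents into one long-context prompt.

    `docs` maps a label (e.g. a PDF's name) to its extracted text; the
    labels let the model refer back to each source when synthesizing a
    comparison across them.
    """
    sections = [f"<document name={name!r}>\n{text}\n</document>"
                for name, text in docs.items()]
    return "\n\n".join(sections) + f"\n\n{instruction}"

prompt = multi_doc_prompt(
    {"a.pdf": "First essay...", "b.pdf": "Second essay..."},
    "Give me a comparison between the two documents above.",
)
```

This is the piece code interpreter can't replicate: it can run code across the files, but it never holds all their text in working memory at once.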
40:15 (Speaker G) It's Claude genuinely coming up with one new thing that is not copying ChatGPT, and good for them.
40:23 (Speaker A) Yeah. And unfortunately no zips allowed. But we're going to fix this with an extension and hopefully talk about it next week. I want to say hi to Weather Report.
40:33 (Speaker A) Feel free to chime in. Sorry, you raised your hand to come up before. So if you have a comment about code interpreter, we've moved past it, but if you have a comment about Claude, feel free to tell us. What's up, Weather Report?
40:46 (Speaker A) Actually, I had only one thing about code interpreter: in the previous space I talked about a hypothesis I had about code interpreter, which...
40:56 (Speaker B) Is to use it as a huddle, because it's recorded.
40:59 (Speaker A) We'll move on and let's talk about code interpreter next time. I think that some folks are saying their audio is glitching and so they're not able to... and I want to see... I think Joseph has a comment about code interpreter. Joseph Polak. We'll give him a second to log in, and then I think we'll move on to other updates, because we have many other things to talk about.
41:29 (Speaker A) What's up, Joseph? Welcome to stage.
41:31 (Speaker G) Hi there, folks.
41:33 (Speaker A) Thanks for taking my question. I didn't even know all about that code
interpreter stuff with the file.
41:40 (Speaker G) So I'm really happy to have heard it. About Cloud, though.
41:46 (Speaker A) For Claude? Well, I'm still on the waitlist. First of all, it's free now. You can access it right now.
41:53 (Speaker A) claude.ai. There's no waitlist anymore, unless you live outside the States, and then you'll have to get a VPN. Okay, I'll definitely check that out.
42:03 (Speaker A) My question was about using Claude, and actually code interpreter, through the API. Do you think that's ever going to exist, or is it coming? So, the Claude API, I think that's waitlisted. I have talked with the Claude folks and they said the waitlist is now going faster.
42:24 (Speaker A) So they are ready to get more people in, I think because of the new safety updates they're less afraid. So definitely apply for the Claude API waitlist.
42:35 (Speaker A) Code interpreter is not available via API, and we've seen some folks who hack it together with, I think, a browser plugin that proxies something. Swyx, I don't know if you remember the unofficial, quote unquote, code interpreter API and how to access this, but it's not available in the official OpenAI APIs as of yet. We haven't seen them.
42:56 (Speaker G) No. For the record, there's no unofficial code interpreter API. There's the browser-side thing that we are trying, but nobody's made any adapter for it yet.
43:08 (Speaker G) I think you can, if you want, using Puppeteer.
43:12 (Speaker A) I would not recommend it, definitely. If anything, there were some folks that tagged us, and I need to go and find this, who are working on an open source version of code interpreter that uses LLMs and stuff. And that one will likely be the way forward if you do want something programmatic that has code interpreter capabilities. Go ahead, NISten.
43:35 (Speaker B) There's also Chatbot UI on GitHub. So yeah, for the other people that are hacking something together, I'll wait until there is something public, because then...
43:45 (Speaker D) We don't know everything.
43:47 (Speaker G) The open source one is going to be worse, because you are missing the model.
43:51 (Speaker A) Yeah, because we think that it's fine-tuned on actually knowing how to run code, right? That's kind of the highlight that we got from the last space. We think it's smarter because of that.
44:01 (Speaker A) And one of the main things again, sorry, going back to code number
just real quick, it is able to then fix itself and ask itself, oh,
oops, I made a mistake. Let me try again. Matt, I saw you unmute
yourself.
44:13 (Speaker A) Feel free to go ahead.
44:16 (Speaker D) Well, yeah, just a quick thing. So from what I know, OpenAI will be offering fine-tuning relatively soon. So at that point, you theoretically could go and fine-tune your own code-interpreter-like model, even if they don't offer it.
44:31 (Speaker A) You can also theoretically, not that we would recommend it, but theoretically, right now start distilling some stuff from code interpreter by asking it questions: generate code and store it to a file, ask it to download, and then, quote unquote, generate the data set. Not that you should, but you theoretically can, so that when it's time to fine-tune, you have a data set.
44:52 (Speaker D) Yeah, theoretically. I don't know if ShareGPT currently supports those types of conversations, but if it does, I'm sure that's going to happen really soon.
45:00 (Speaker G) I don't think it's maintained, because ChatGPT itself... well, I don't want to speak for ShareGPT, I know Steven, but I can help you move the conversation back to Claude.
45:11 (Speaker A) Yes, please. Let's move back to Claude. Thank you.
45:14 (Speaker G) So, just between us, how many people are listening to this chat anyway? I think it's like 60 people. Email support@anthropic.com for the Claude API.
45:26 (Speaker A) Yes, email them, state your use case, and they'll likely get you in, and you can use Swyx's menubar to actually run them in parallel with the megaprompt feature. Megaprompt, superprompt, what is it called? I think Swyx dropped it. There's like one prompt that you type, and then it all goes to all the models. I want to recognize some folks in the audience.
45:50 (Speaker A) Hey, feel free to raise your hand if you want to come up. I saw some others in the audience. Max AI, welcome. Dexter. There's a bunch of folks who are usually here, and it's great to see, and I think we're moving on to a very spicy one.
46:06 (Speaker A) What do you guys think about xAI? So I'm pasting the summary and the people. Elon Musk and a bunch of other folks have announced xAI, essentially their answer to OpenAI.
46:22 (Speaker A) We've all seen Elon talk about safety, and talk about helping start OpenAI, which then could not stay open. He talked about TruthGPT at some point. And finally they announced xAI as we were talking.
46:37 (Speaker A) By the way, I have a notification from xAI that they're going to have a space tomorrow to go deeper into xAI. But so far there's not a lot of detail. There are some details about the folks who work there.
46:50 (Speaker A) So they have folks who wrote the Adam optimizer, among others. Thoughts about xAI before we get to hear what they do? Obviously, there's no product yet.
46:59 (Speaker A) I don't think they've started training. The one thing that I will say is that they will have premium access to Twitter, obviously, because Twitter is now rebranded under X. After closing down the APIs and closing down scraping for Twitter, xAI will now have a data set that's insane to train on.
47:21 (Speaker A) And we wish them, quote unquote, good luck. I would love to hear from folks on stage: what do you think about the announcement, the direction, the people? And we're going to wait for tomorrow to actually hear them talk.
47:24 (Speaker A) I know NISten, you have some ideas, if you want to share to get us started.
47:40 (Speaker B) Well, this is more of an old-lady-babushka opinion, just talking about stuff. I found it interesting that they went from, what was it, BasedGPT, to straight-up taking on GPT-4 and this entire competition, to doing something more noble, like dedicating it to being better at math and discovering new things in physics. The way I see it, that's pretty noble. But at the same time, I feel like that's a result of having problems hiring in order to be competitive with the other ones.
48:26 (Speaker B) So, yeah, this will be interesting. But the way I see the whole setup right now is, as the kids say, pretty mid, in my opinion.
48:39 (Speaker A) "As the kids say," indeed. I will say that we will see tomorrow from their space. They're probably going to use Elon's clout to try to hire, and it's probably harder now to hire, because everybody knows how quickly people get fired and how much it's not, like, super fun to work for X. But we're in for a nice ride, because they do have access to cross-pollination from Tesla as well, right? So if they have big questions, Tesla does have a few good folks still, even after Andrej Karpathy left, and so they'd be able to ask them for assistance.
49:20 (Speaker A) There's obviously the whole Dojo thing in play, which... I don't think we have time to talk about Dojo, and it's not new, but there could be something there. Gabriel, you wanted to come up? Maybe you have thoughts. Yeah, go ahead.
49:33 (Speaker A) Gabriel.
49:34 (Speaker E) Yeah, I was just going to say, about xAI: you mentioned Twitter's data, and I'd be interested in hearing other people on the stage's opinions on this, because recently there's been a lot of work done on quality of data over quantity of data. And of course, Elon also has a ton of GPUs; reportedly he's bought tens of thousands of GPUs. So that's definitely important in building these big models.
49:58 (Speaker E) But I'd be interested in hearing from people on the stage if they
think Twitter's data and the kind of data that Twitter has is
actually going to be really powerful for training good models.
50:11 (Speaker A) Anybody wants to take this?
50:13 (Speaker F) Yeah, I'll take a little of it. One of the things that Twitter has
that other people don't is that people are actually debating issues.
So I think that's one of the reasons why he's really focused on the
idea of Twitter being a source of truth and being sort of
unrestricted so that you're not just following like, one thread, you
watch the narratives being debated and he has access to all that.
50:35 (Speaker A) Data and community notes. And it's really hard to scrape. Like, I
don't think it's API ball at all. It's not super simple to scrape at
all.
50:42 (Speaker A) I want to get yum before I think Matt wanted to unmute and go and
then yum. If Matt, you still want to chime in and then yum.
50:53 (Speaker D) Yeah, I mean, nothing too much to add here. I think the community
notes are very interesting as a way to sort of like, reduce
hallucinations. I think one of the things that they're going to want
to do heavily is invest in sort of filtering that data set because
there's a lot of great stuff on Twitter. There's a lot of crap on
Twitter.
51:07 (Speaker A) A lot of yeah.
51:09 (Speaker D) And the more of that that seeps in, the worse the model is going to
perform. Obviously, scale is important, but data quality is
incredibly, incredibly important and the scale kind of doesn't negate
bad data quality. So I think if they do one thing right, it's going
to have to be getting the sort of filtering of the data set down. But
they do have a ton of incredibly high quality data.
51:27 (Speaker A) Yes, I think Yam was next, and then we have a few folks who wanted to come in. I think Farrell wanted to come up. So Yam, and then Farrell.
51:34 (Speaker A) And then Gabriel.
51:37 (Speaker C) I just want to say, of course, if you just take Twitter data and start training your model, you can expect it to be average Twitter, which is not what you want. What you can do, which is a gold mine, is transform this data, or just rephrase it into other forms. And that makes the data a gold mine, because Twitter does have very high quality content here and there. Absolutely.
52:05 (Speaker C) If you can take it and transform it and rephrase it into a different form... if you want an example, see the paper "Textbooks Are All You Need". Basically, they just take data and make it into a tutorial, make it into a textbook: perfect, clean and everything.
52:22 (Speaker C) It is very easy to do, and you don't need a powerful model to do that. You don't need ChatGPT; you can do it with a small model.
52:30 (Speaker C) Off the record, I'm currently doing it myself on a large model I'm training. It doesn't matter anyway. It's a gold mine.
52:43 (Speaker C) What I'm saying is, it's a gold mine.
52:45 (Speaker D) About Twitter.
52:46 (Speaker A) An additional thing, before I get to Farrell and then Gabriel. NISten and I talked about this yesterday at length in our late-night line cook space, the one that's not going to be scheduled; if you guys are on, feel free to join that one.
53:00 (Speaker A) Twitter Spaces is also a gold mine: transcribing Twitter Spaces and seeing all the reaction emojis they have in real time. Take the space that Elon ran with RFK Jr., for example. If you know who in the audience are actual people instead of bots, and you're able to get emoji reactions in real time, that's a definite, very high signal kind of training set that they have and almost nobody else has.
53:25 (Speaker A) Farrell, you are next, I think. And then Gabriel.
53:30 (Speaker D) Yeah, I wonder what the relation is, and how useful the Twitter data will be, for their goal of building a sort of math reasoning machine, right? Also, do we know if they're open source, as in truly open source, or not?
53:49 (Speaker A) No, we don't know yet. Hopefully tomorrow we'll be able to answer questions. However, we've seen Elon take Twitter's algorithm open source, and now he's boasting about that as a competitive advantage versus something like Threads. He's saying, hey, we're open source.
54:07 (Speaker A) If you go to Threads, you're under Zuck's algorithmic influence. So there is definitely an attempt to open source from their side, but we don't know anything beyond that. Gabriel.
54:17 (Speaker A) And then Johnny.
54:20 (Speaker C) Yeah.
54:22 (Speaker E) First of all, I think it's funny that Elon's shitposting is polluting his data set. I would say that...
54:34 (Speaker A) By the way, if there's anybody with the ability to detect shitposting, it's them, right? They're going to be able to build a model that understands: this is a shitpost, this is somebody who made an effort to give us clean information. But sorry, go ahead.
54:49 (Speaker E) Yeah, that's exactly the point I was going to make: Elon was on this crusade before he bought Twitter. And this is kind of why he got forced into buying Twitter, because he was going after the bots and he made a big deal about the bots. And I think they spent a lot of resources on figuring out what's good content and what's bot content. And another thing is that we each are kind of experiencing a different Twitter, right? Whether it's ML Twitter or Israel-based Twitter, there are many different communities, and Twitter is very good at segmenting those communities and figuring out which content belongs to what community.
54:55 (Speaker E) And they'll have the ability, I think, to segment this data and train many different models that are good at different things, because they're in a literature community or an ML community or an MMA community or whatever.
55:37 (Speaker A) I actually saw a map of like 5 to 7 million tweets all embedded in Nomic AI's Atlas. I don't know if you guys follow Nomic; they just recently announced a $17 million Series A, by the way. So kudos to Nomic, good friends, Andriy and the GPT4All team. And they have an embedded map from before the API was shut down, while they were still able to siphon it, et cetera.
56:00 (Speaker A) And Gabriel, what you're saying is actually visible in the embedding map. You can actually see those tweets, and the different areas of political Twitter. There was a journalist Twitter, until all of the journalists started leaving. There's a bunch of different pockets of Twitter that we don't get exposed to, not to mention the different languages.
56:20 (Speaker A) There's a whole Japanese Twitter that's like insane, and people go super, super hard. And translating is easy.
56:26 (Speaker A) We talked about Claude being able to translate. So they have a bunch of very interesting data. And I think Zuck is also going after that data with Threads.
56:31 (Speaker A) And I think this is the reason why we'll see Threads getting continued work, and we'll see a lot of investment from their side. But compared to Threads, and we talked about this yesterday, Twitter has back history and a lot of historical data that they can train on. Threads is fairly new.
56:54 (Speaker A) So definitely a bunch of interesting data sets. Johnny, and then Lentil. Hey.
57:00 (Speaker H) So one thing I think about, with the data from Twitter, that is potentially lacking in some of the other data sets, is colloquial language. Because what Twitter has that Facebook doesn't have, and a lot of other things don't have, especially given what you're talking about, the history, is the way that people actually interact with each other. You know what I mean?
57:26 (Speaker A) Not only that, but how it evolved as well, right, throughout? Exactly.
57:35 (Speaker H) To be honest, I think the data sets from earlier are probably better and stronger, because it's just gotten out of hand. But I agree with, I'm not sure if it was Yam or who, about the filtering. Because, all right, this is a black box, it's not open source. Elon has not been shy about his kind of response to what he perceives as wokeism and all of that stuff. I'll be super curious.
57:36 (Speaker H) I mean, there's a big team on this, but I will be super curious to see how that bears out in the actual model. Because, God, there's equal parts or more parts disinformation on Twitter than there is information. So if we're talking about a source of truth, that rings some alarm bells for me personally.
58:21 (Speaker H) So those are just my thoughts.
58:29 (Speaker A) Yeah. Thanks, Johnny. Lentil, go ahead. And then Gabriel.
58:33 (Speaker A) Let's finish on Gabriel, and then we'll move on to the next topic.
58:36 (Speaker H) Cool.
58:37 (Speaker A) Yes.
58:37 (Speaker H) So I think it's going to be hugely bullish for this data. And from the perspective of relating idea space and people, and the relations between those, I think that's probably going to be more valuable information than the conversation, because you can build so much from that. Like dating, that's just one example, a dating thing. Or finding people, finding brain power, compute. That's going to be huge.
58:40 (Speaker H) And to touch on the open-sourceness of the data, I think not open sourcing it at some point is going to be hugely politically bad for Elon to do.
59:23 (Speaker H) That's my thoughts on that.
59:24 (Speaker A) Awesome. Thanks, Lentil. Gabriel, let's wrap up, and then, Matt, we're going to talk about some interesting stuff.
59:31 (Speaker E) Yeah, just on the kind of data. I think for those of us who ran,
like, the early versions of Llama before they got fine tuned in all
kinds of ways, and you run it, and especially the smaller models, you
put in a prompt and it spits out some generic Facebook type of
content. It sounds like a Facebook post of like a 15 year old or
something like that. That shows what you get when you use all this
kind of unfiltered data.
59:59 (Speaker E) But I think the interesting thing is that Llama was then fine tuned
in many different ways and some really powerful models are built on
top of it. So I think in some sense, almost any data is valuable in
the sort of pretraining stages and maybe you need really high quality
for the fine tuning, but I think that big volume might be really
useful, maybe not the most economical.
60:21 (Speaker A) So I want to wrap up with why they potentially have a leg up. We definitely know that Twitter was used to train other models that we currently use. We know this for a fact. This was the reason why Elon and Sam Altman, who used to be friends, are no longer friends, shitposting about each other.
60:40 (Speaker A) And the current models we use do use this data set, but it's old; for them it's no longer recent and relevant.
60:40 (Speaker A) And we know for a fact that Twitter is significantly biased, and probably the best place in the world for uncovering news as it happens: before the bias sets in, before the narrative sets in, before folks get their marching orders from MSNBC or from the other side on how to think about things. Twitter is really good at talking about issues as they arise, the second they arise. And I think that on its own is going to teach the models a very great deal.
61:16 (Speaker A) Naval Ravikant, if you guys follow Naval, he always said Twitter makes him a better writer. So we definitely know also that tweets, in short form, condense information better. And if their model trains on that, obviously taking all the precautions we talked about before, bots and shitposting, et cetera, if they're able to actually get this into the model, likely their model will be more up to date and more fine-tuned to reactions.
61:20 (Speaker A) So with that, I want to close. We'll see about xAI. It's definitely
exciting, right? We're potentially getting another big one,
potentially an open source one.
61:20 (Speaker A) So we'll see. I'm going to wrap up this update, and with the next one
I want to move on. Matt, let me know if you're still around and if
you want to cover it.
61:20 (Speaker A) So we have Matt, who introduced himself in the beginning. I'll let
you do this quickly again, and then we're going to talk about the
project whose GitHub stars are rising, which I think is super cool.
And I invite you to give us a little bit of an interview about this.
62:16 (Speaker A) Go ahead, Matt.
62:17 (Speaker D) Yeah, sure. So I'll try to summarize it a bit better than the last
time. A lot of practice. But very long story short: co-founder and
CEO of OthersideAI, creator of HyperWrite, and a number of other
things. Basically, we've been around for a number of years now.
62:30 (Speaker D) We're one of the first companies in the space working with LLMs. The
goal always has been to build a personal assistant that scales to
everybody, just like a real human personal assistant, but at scale,
way cheaper, digital. The tech wasn't there at the beginning. So we
built other products to sort of learn and gather resources, whether
that's users, revenue, bunch of other things that we can do.
62:50 (Speaker D) What we do today: today we are actually building that personal
assistant. So, an AI that can operate a computer, any software, to do
what a human can do on pretty much anything.
62:53 (Speaker D) So it'll help you with your tasks. It's very simple. Today it's a
Chrome extension that lets you sort of like control Chrome just by
sort of talking to it.
62:53 (Speaker D) So you could say, go order me a pizza, or go send this person an
email or go filter my email, or anything else it works okay today.
The idea is that over time, it's going to get a lot better, a lot
cheaper, a lot faster, to the point where six months from now, a year
from now, it might actually be as good as, if not better than a human
on many tasks. But that being said, while I work on this, I also like
to learn about getting the most out of these technologies because
they're so fast moving and you really have to stay on top of it to be
effective, or you.
63:34 (Speaker A) Can come every week and then stay up to date with us together. But
yeah, go ahead.
63:40 (Speaker D) Exactly. I mean, a lot of what I do to learn, really, is just build
things that I find interesting, and I find that often, even if I'm
not expecting it, a lot of those learnings do translate to stuff
we're doing at Otherside. So this sort of just came out of that.
Happy to sort of dive into the project, or if you want to sort.
63:56 (Speaker A) Of stop me. Let's pause here for a second, and I'll just tell folks
that I pinned Matt's tweet from a couple of days ago with the
introduction. Since then you got a few thousand stars, I think, on
GitHub, and we're going to talk about the GPT Prompt Engineer project
and the different reasons why Matt and folks wrote this and what it's
here to serve. So maybe give us an introduction to GPT Prompt
Engineer, what made you come up with this, and how it works. Yeah, go
deep, man.
64:29 (Speaker A) Sure. Yeah.
64:30 (Speaker D) So forgive the rambling in advance. Essentially, I find prompt
engineering so fun. I've been doing it pretty much every day for
everything, honestly, to the point of excess, from what I would do
for work to having it decide what I'm making for dinner for years
now. And as I've gone through this process, sort of like learning how
to use these models, it's become very clear that especially as these
models evolve, there's no best practice for anything.
64:54 (Speaker D) Prompts change, ways to prompt change. Something that works for one
task might not work for a very similar task. And the only way to get
out of that is to sort of build an intuition for the model and try a
lot of things, but that doesn't always work perfectly.
65:01 (Speaker D) And also you don't really know kind of what works and what doesn't.
Even when you're trying things right, you have to do it sort of like
in a very scientific way, but there's no real right answer to
anything. It's kind of like alchemy.
65:18 (Speaker D) So I started thinking, I think this was right when GPT-4 came out. I
was using GPT-4 pretty often to just ideate prompts. I would say,
here's what I'm trying to do.
65:20 (Speaker D) I would say, write a prompt for me, and I would use the ideas from
that to help me improve my own prompts, and that actually got a lot
of interest. We ended up building something similar to that into the
HyperWrite platform. At the time it was really cool, but it really
wasn't something that would replace what I do every day, which is
really hardcore prompting.
65:43 (Speaker D) Eventually I was just sort of thinking about it, and I think this was
on July 4th, I was just sitting there kind of thinking, what if we
tried it? And I started thinking about how you could design a system
that actually comes up with good prompts. Not just a prompt that does
the job, but something that's actually optimal. Because as humans,
right, we can only try so many things at once. But the magic of these
LLMs is they're creative and they think faster than we do. In the
time that I could write half a prompt, LLMs could write 50 or 100.
65:48 (Speaker D) And what if you could leverage that? Because even if the average
prompt isn't very good, you're going to luck into one or two that
happen to be exceptional for your task. So I started by doing it
actually with a classifier. I only released this notebook yesterday
just because it's like a step on the road.
65:48 (Speaker D) And what we ended up using it for was actually something at Otherside
where we needed to build a classifier for something with the personal
assistant. And I just wasn't getting good performance out of
the prompts that I was writing. So I said fuck it, what if we have
the AI try to do this? And I built this so that essentially I
describe the task, I give it some test cases, so I'll give it some
true false test cases.
66:11 (Speaker D) Because the classifier was classifying things as true or false. It
was like: classify the statement as true or false. And if it was "New
York is in America", it would be true.
66:54 (Speaker D) If it was "New York is in Paris", it would be false. And I basically
created like ten or 20 of these test cases. I described the task and
I had GPT generate something like, I think, 20 or so prompts.
66:57 (Speaker D) And surprisingly, the quality of them just at first glance was pretty
good, right? It was kind of shocking considering I spent so much time
trying to do this manually. Then what I did was I just basically had
each of these prompts tested against each of these test cases. And I
plotted the success of each, and it turns out some of them actually
outperformed what I did.
66:57 (Speaker D) I was kind of shocked, right? Like you wouldn't expect that,
especially doing this for years.
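To make the loop Matt describes concrete, here is a minimal sketch of the scoring half: run each candidate prompt against labeled true/false test cases and keep the best performer. `call_model` is a hypothetical placeholder for a real LLM API call, not code from the actual notebook:

```python
# Sketch of scoring candidate prompts against true/false test cases.
# `call_model` is a placeholder for a real LLM API call; here it fakes a
# model that answers simple geography statements.

def call_model(prompt: str, statement: str) -> str:
    # Placeholder "model": a real version would send prompt + statement
    # to GPT and read back "true" or "false".
    return "true" if "is in America" in statement else "false"

def score_prompt(prompt: str, test_cases) -> float:
    """Fraction of labeled test cases the prompt classifies correctly."""
    hits = sum(call_model(prompt, s) == label for s, label in test_cases)
    return hits / len(test_cases)

test_cases = [
    ("New York is in America", "true"),
    ("New York is in Paris", "false"),
]
candidates = [
    "Classify the statement as true or false:",
    "Answer only true or false:",
]
best = max(candidates, key=lambda p: score_prompt(p, test_cases))
```

In the real notebook the candidates themselves come from a GPT-4 generation step; only the ranking logic is sketched here.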
67:30 (Speaker A) Just to recap real quick on this: the GPT-4, I assume that's what
you're using, generated prompts that actually performed better than
Matt Shumer's prompts. And Matt Shumer is the founder of a prompt
company, with a lot of prompt use cases for a long time, from GPT-3
to 4, et cetera. And some of the ones that it came up with performed
better than yours.
67:52 (Speaker D) Yeah, it was kind of scary. Some of them performed way worse. But the
idea is that you're going to sort of luck into something that is
better. Maybe two out of 20 will be better, but they're great.
68:02 (Speaker D) So I was just so fascinated by this, I was like, how do you take this
further? Because classification is one thing, but real prompts, where
you're actually having it generate text, those are harder. How do you
judge that? You could use GPT-4 to judge them, right? If you have two
prompts and you say, each of them, generate me something, and they
give you their responses, and you want to know which is better, you
can ask GPT-4. And so I figured we could apply that.
68:29 (Speaker D) Turns out there are some issues with that, and there are some papers
written about this, where essentially it'll sort of favor the one
that's on the bottom. So you just do it twice, flip the order, and
see if one wins. And I took that approach and combined it with sort
of an Elo-style tournament, where essentially you have each of them
go head to head, like one on one, and each of them gets their Elo
score either bumped up or down based on whether they win, lose or
draw.
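The order-flip debiasing step can be sketched like this; `judge` here is a deliberately biased placeholder standing in for the GPT-4 comparison call:

```python
# Sketch of judging a pair twice with the order flipped, so a positional
# bias in the judge can't decide the outcome on its own.

def judge(output_a: str, output_b: str) -> str:
    # Placeholder judge with a built-in position bias: on a tie it
    # favors slot "b", mimicking the bias described in the papers.
    if len(output_a) > len(output_b):
        return "a"
    if len(output_b) > len(output_a):
        return "b"
    return "b"  # biased tie-break

def debiased_winner(x: str, y: str):
    first = judge(x, y)   # x sits in slot a
    second = judge(y, x)  # flipped: y sits in slot a
    if first == "a" and second == "b":
        return "x"        # x won from both positions
    if first == "b" and second == "a":
        return "y"
    return None           # the judge disagreed with itself: call it a draw
```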
68:53 (Speaker A) Can you give two sentences on Elo scores as a concept? Yeah.
68:57 (Speaker D) I'm actually not super familiar with it. Funny enough, I had GPT
write the code for that part. But basically, think of it like a
ranking system in chess or a video game, where you have two people
competing, and the one that wins gets their score increased by x, and
the one that loses gets their score decreased by x.
69:18 (Speaker D) And it's also sort of weighted based on the previous scores. So if
somebody that has a high score beats somebody with a very low score,
their score won't increase that much, because they're very likely
going to win. So it's sort of just a weighting system to help figure
out what's best, instead of just getting a clear cut "yes, this is
right" or "no, this isn't", which is what you can do with
classifiers, because there is a right and a wrong ground truth
answer.
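For reference, the standard Elo update Matt is gesturing at looks like this (chess-style formula; the K-factor of 32 is an assumption, not a value from the notebook):

```python
# Minimal Elo rating update: winners gain points, losers lose them, and
# an upset moves the scores more than an expected result does.

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32):
    """score_a is 1.0 for an A win, 0.0 for a loss, 0.5 for a draw."""
    e = expected(r_a, r_b)
    new_a = r_a + k * (score_a - e)
    new_b = r_b + k * ((1 - score_a) - (1 - e))
    return new_a, new_b
```

As described above, a high-rated prompt beating a low-rated one barely moves the numbers, because `expected` is already close to 1.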
69:39 (Speaker D) I just had each prompt generate for a test case, and the opposite
prompt, the competition prompt, would generate for that test case. So
it was a little bit complex, and then I'd have the model judge which
one was better. And it's expensive, right? It might cost like $20 in
GPT calls to get to an answer. But it turns out, at the end, the
prompts again were just kind of blowing me away.
70:04 (Speaker D) Awesome creativity in them. Like the words it used, the trigger
words, it didn't do what I would do. And in a really good way.
70:10 (Speaker D) And it also opened up my eyes to sort of like new ways of prompting
that I never would have thought of and just sort of like aren't
standard. And that's kind of the magic of all this. I think that this
sort of abstracts away the sort of atomic level of prompts, right?
You talk about prompts as sort of a prompt in and of itself and then
a system built around the prompts with many prompts kind of working
together.
70:31 (Speaker D) This makes it so that you don't have to guess about, do I have the
best prompts for this single atomic part of our system? Where the
magic really comes in then, is how do you string these amazing
individually crafted by AI prompts together to make something that
actually works really well.
70:46 (Speaker A) And how you robustly build the evaluation system, right? Because the
classifier is a simple example of evaluating, because maybe you know
this, et cetera, but how do you actually scale up the evaluation
system such that this could potentially run in loops and then
generate the best of the best prompts for a task?
71:03 (Speaker D) Exactly.
71:03 (Speaker A) That's also like a very interesting piece. How do you think about
evaluation going forward?
71:08 (Speaker D) Yeah, so I think it's sort of like that, where you could have this
thing run in the loop three times and take the three winners and then
have GPT read those winners right, and be like, here are prompts that
worked really, really well. Here are the test cases where they
failed. Now I want you to write new prompts that take what's good
about these but also mitigate the failure cases and generate a whole
new set of prompts. It's sort of like evolution; it really doesn't
just have to stop at one point in time after the first run.
71:37 (Speaker D) It's like, let's learn from what these amazing ones still did wrong
and continue to make this better and better and better. Obviously,
this relies on a relatively large test set. I'm also experimenting
with ways where you can have the test set autogenerate, but that's a
little bit finicky.
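The evolutionary loop described above can be sketched as follows; `score` and `generate_variants` are placeholders for the test-set evaluation and the "write new prompts informed by the failures" generation step:

```python
# Toy evolutionary loop: score the pool, keep the winners, generate new
# variants from them, repeat. The placeholder fitness counts unique
# words so the example runs without an API key.

def score(prompt: str) -> float:
    return len(set(prompt.split())) / 10  # placeholder fitness

def generate_variants(winners, n=4):
    # Placeholder: a real version asks GPT-4 for prompts that keep what
    # worked and fix the failing test cases.
    return [w + f" variant{i}" for w in winners for i in range(n)]

def evolve(seed_prompts, rounds=3, keep=2):
    pool = list(seed_prompts)
    for _ in range(rounds):
        pool.sort(key=score, reverse=True)
        winners = pool[:keep]
        pool = winners + generate_variants(winners)
    return max(pool, key=score)

best = evolve(["Classify the statement", "Answer true or false"])
```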
71:50 (Speaker D) But I do think that sort of like evolution of this could lead to some
really exceptional prompts. But what I found was even on the first
run I was seeing it outperform myself. For example, there was a
classifier we were using GPT-4 with logit bias for, because it was
such a hard challenge, and we were getting something like 90%
accuracy. I had it do these prompts with GPT-4, but then I had it run
them using GPT-3.5 and it got 96%.
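The logit bias trick mentioned here relies on the OpenAI chat API's `logit_bias` parameter, a map of token id to a bias between -100 and 100: pushing the two answer tokens to +100 and capping `max_tokens` at 1 forces a one-token verdict. A sketch of how such a request might be assembled; the token ids are hypothetical placeholders (look the real ones up with tiktoken), and nothing is actually sent:

```python
# Build (but don't send) a chat request that can only answer "true" or
# "false". TRUE_ID / FALSE_ID are hypothetical token ids.

TRUE_ID, FALSE_ID = 1000, 2000  # placeholders; resolve with tiktoken

def build_request(statement: str) -> dict:
    return {
        "model": "gpt-4",
        "messages": [
            {"role": "system",
             "content": "Classify the statement as true or false."},
            {"role": "user", "content": statement},
        ],
        # +100 bias makes these two tokens overwhelmingly likely;
        # max_tokens=1 stops the model after its one-token verdict.
        "logit_bias": {TRUE_ID: 100, FALSE_ID: 100},
        "max_tokens": 1,
    }
```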
72:19 (Speaker A) We've talked about this pattern before, where you can outsource the
hard work to GPT-4, but then once you get really good at prompting,
GPT-3.5 is actually very decent at many things, and it's way faster,
cheaper, and has a 16K context now that you can use. And so we've
seen this pattern with many folks: if you don't need the full power
of GPT-4, HumanEval for coding, et cetera, you can go with GPT-3.5
and get very far along, especially as you're getting better prompts.
And now, Matt, you have like a recursive prompt-crafter helper here.
And my next question for you is, have you used anything else? You
mentioned GPT-3.5, where you run the prompts. Have you tried them on
different models, like Claude maybe, or the open source Llama ones?
73:07 (Speaker D) I actually haven't, just because I wanted to see if this worked. It
was sort of just an interesting thing for me, and my time is really
focused on Otherside and the personal assistant, but it wouldn't be
hard to get Claude in. I suspect Claude prompts would perform better
on Claude, and OpenAI prompts would perform better on OpenAI, just
because the models respond to prompts very differently.
73:18 (Speaker D) Claude is sort of like a more emotional thinker. OpenAI is more of
like a logical thinker. It's a very simple, not perfect analogy, but
I suspect you'd want to sort of stick within the.
73:36 (Speaker A) Ecosystems, maybe. Not to mention Inflection's Pi, which is like a
whole different beast.
73:41 (Speaker D) Yeah, that's an interesting one.
73:44 (Speaker A) We discussed Pi a couple of times and I've seen some reactions, but I
don't think... maybe at the end of this, if we have time. Matt, one
question I will have for you on this, and then I think we'll move on:
where can folks find more of this work? Is it open source? Are you
looking for contributions? If you are. And yeah, just give us a
wrap-up of this project.
74:07 (Speaker D) Yeah, so you can find it on GitHub. It's called gpt-prompt-engineer.
Currently there are two notebooks. It's all done in Jupyter notebook
format, so it's pretty easy to edit. One is for the classification
system, the other is for the generation system.
74:20 (Speaker D) We're honestly sort of at a point where it works well, so it's like,
what do you build around it? One thing that's missing is that the
classification version only supports true and false labels, but it's
not hard to use tiktoken, or whatever it is, to allow it to support
arbitrary labels like happy, sad, angry, whatever. That's probably
like a 20-minute add that, if somebody goes in and does it, opens up
a whole new set of use cases. The evolution idea that I mentioned
before, right? Taking the best prompts and then saying, here's where
they went wrong on these test cases, and then throwing it back to GPT
and having it generate more and rerunning it, that's interesting.
74:45 (Speaker D) The ability to use Claude would be awesome if anybody wants to add
that. I could even see it evaluating each prompt on each model,
right? Because right now we only generate with GPT-4. We only
evaluate with GPT-3.5.
75:19 (Speaker D) But imagine if you generate half of them with GPT-4 and half of them
with Claude, and then you evaluate each prompt on GPT-4, GPT-3.5, and
Claude.
75:27 (Speaker D) And you can see sort of the latency success rates for each along with
scores. I think all that would be super interesting. Also sort of
like just open to ideas.
75:40 (Speaker D) I'm not really sort of supporting this at all. So if anybody wants to
kind of take it and run with it, I am all for that. Also sort of just
like a shameless plug right now or thing that we're looking for just
because I have an audience here.
We at Otherside and HyperWrite are really looking for somebody to
help on back end, hopefully with security expertise. And then also,
if anybody is experienced in training machine learning models, I
would love some help there, because we're doing a lot of LLM
training.
75:55 (Speaker A) So just a quick thing, and also to add that now, with the prompt
engineer that's automated, the results of this would likely generate
a great data set that you can keep and continue fine tuning on,
especially as GPT-4 fine tuning is coming soon. So Matt, definitely
store everything you generate, with the Elo score and everything,
from each gpt-prompt-engineer run that doesn't know about the rest;
maybe there's going to be a path forward to actually fine tuning a
prompting model, which could be... exactly. Well, yeah, exactly.
76:28 (Speaker D) Imagine taking a prompt and taking one that has a slightly higher
score and fine tuning a model to take the initial prompt and then
sort of output the one that has a higher score and you can do that
evolutionarily continue to get better prompts in theory.
76:40 (Speaker A) Awesome. So folks, if you want to work in a cool place like
HyperWrite, hit Matt up, and also check out gpt-prompt-engineer on
GitHub. Thanks for coming. Feel free to stay and kind of continue
commenting and talking with us as we go through a bunch of other
updates that we have.
76:57 (Speaker A) Just a quick check with NISten, who promised me to follow Twitter and
see if anything new comes up, breaking news as we talk. I haven't
seen anything besides the space on xAI.
77:04 (Speaker A) I will draw people's attention to the last pinned tweet, from Dr. Jim
Fan, that talks about the context length dip. Matt, you also touched
on this context length dip. It's basically a paper, I think
77:22 (Speaker A) from Stanford, I'm not sure, that figured out that even longer
context windows have a dip in the middle, which means that at the
beginning of the prompt and at the end of the prompt, the model pays
more attention to what you actually asked it or the details that you
provide, and in the middle there's like a dip.
77:39 (Speaker A) And this was also released this week. However, the one thing I said
previously I will repeat here: Claude, and some folks who know about
context windows way more than me, they say that Claude is actually
really good at this, without the dip.
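A rough way to probe this "lost in the middle" effect yourself: bury one fact at different depths of a long filler context and ask about it. `ask_model` is a placeholder; with a real model, the mid-depth questions are where accuracy dips:

```python
# Needle-in-a-haystack probe: place one relevant sentence at a chosen
# relative depth inside filler text, then ask about it.

FILLER = "The sky was gray and nothing in particular happened. "
NEEDLE = "The secret code is 7421."

def build_context(depth: float, n_sentences: int = 200) -> str:
    """depth 0.0 puts the needle at the start, 1.0 at the end."""
    sentences = [FILLER] * n_sentences
    sentences.insert(int(depth * n_sentences), NEEDLE + " ")
    return "".join(sentences)

def ask_model(context: str, question: str) -> str:
    # Placeholder: a real probe sends context + question to the model
    # and checks whether the answer contains the needle's fact.
    return "7421" if NEEDLE in context else "unknown"

results = {d: ask_model(build_context(d), "What is the secret code?")
           for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
```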
77:54 (Speaker D) Yeah. It's an interesting paper, but I feel like it's sort of saying,
hey, if you train on marketing copy, then it's going to be worse at
coding, obviously. Right.
78:03 (Speaker D) We do a lot of long context stuff at Otherside. That's actually what
I'm focused on right now, training really long context massive
models. And if you train it on data where there's context in the
middle that matters, it is going to be good at that.
78:16 (Speaker A) Interesting. So what you're saying, I think I've seen this kind of
opinion before as well. It's just the outcome of the data that was
fed in and for blog posts and other places, people want to hook your
attention in the beginning and then kind of finish strong. Basically
you're saying that this is potentially an outcome of that and not
necessarily the tech behind it.
78:38 (Speaker D) Yeah, I believe so. I mean, who knows, maybe I'm wrong, but from my
experience, right, why I gave that analogy before is: if you train it
up to do one thing and then you're asking it to do another, it's not
going to do that other thing as well. And I'm guessing the
data set that they sort of did this evaluation on was something that
didn't have a ton of information at all. Part of the reason that so
few of the language model companies have super long context length
models and why it was such a big deal that Anthropic did is because a
lot of the challenge in training them isn't actually in training
them, it's in the data.
79:08 (Speaker D) Obviously, inference becomes a challenge. It's the cost and the
overhead there. But the data to sort of do this is really sparse.
79:10 (Speaker D) It's not very available. Right. So that's I think part of it right
there's not just like a sort of standard data set that has super long
context link, that has information in the middle.
79:25 (Speaker D) We do actually, we've been building one at Otherside, and that's sort
of given me some of the ideas that I'm spouting here. But my guess is
that part of the reason Anthropic's works is because they focused on
the data. The data is really important.
79:38 (Speaker A) Right.
79:39 (Speaker D) I will say, it's not the model, it's just fine tuning.
79:41 (Speaker A) Yeah. I will say, when I got access to Claude's window, I did a bunch
of tests with my Twitter data. I just pasted a bunch of JSON with
Twitter IDs, numbers. And the smaller model, the not-100K one, gave
me back results that actually didn't invent those numbers.
79:57 (Speaker A) The 100K model lost it in the middle and started inventing those
numbers. I literally saw this difference between the longer context
one and the previous one, and I thought it's because it loses some
context in the middle. And I need to retry this on the new ones,
because with the new ones, they claim this doesn't happen.
80:01 (Speaker A) I want to go to Al, and yeah, one of you, I think, raised your hand
first, to talk about the context length dip and that paper: if you
have read it, if you have thoughts, and if you have noticed this as
well.
80:29 (Speaker F) I just had a quick question for Matt about the differences that he
found in prompting between say, Claude and GPT Four. I noticed like,
the prompts aren't really reusable and maybe you could speak to that
in the general case.
80:42 (Speaker A) Yeah, let's end with maybe this question and move on to other updates
as we have. Go ahead, Matt.
80:48 (Speaker D) Yeah, sure. So it's like talking to two people with two different
personalities, right? They're both people, but they respond
differently to the different ways you're sort of prompting them, if
you will. Claude is sort of more emotional, I guess, where OpenAI is
sort of more logical.
81:03 (Speaker D) And it's hard to pin that down to any one thing, and it's hard to
give you techniques based on that, because, again, every use case is
very different. But very clearly, you have to prompt them
differently. I think also, talking about the idea of fine tuning a
prompting model, what would be very interesting is fine tuning a
model that takes an OpenAI prompt and converts it to the idealized
version of a Claude prompt, and vice versa. I mean, I think that
could be very powerful, because there are ways to sort of intuit your
way there.
81:29 (Speaker D) It's just hard to sort of distill into a set of rules. One thing I
found actually quite interestingly with Quad two is that it is
insanely resistant to sort of like jailbreak attacks. So I was able
to get it to do it.
81:44 (Speaker D) Turns out the stupidest method worked. It was sort of like modifying
that DAN prompt that's been going around Reddit. But the more
nuanced, complex methods that typically work with OpenAI, they
didn't. So I think the model is just qualitatively different.
81:56 (Speaker D) I think it's going to take some time to fully explore it and
understand why and how still super early days.
82:06 (Speaker A) I love the fact that all of us are getting an intuition about
different models and how to approach them, right? Swyx was here
before; this is like a specialization of what I think he talked about
as an AI engineer. We're starting to understand the differences
between those models, the fine little things that you can say.
82:11 (Speaker A) And I think it will be very interesting if you have a model that's
trained to actually convert them, or translate them between the
models, to work the same. I have an idea for how not to get locked
into the GPT-4 ecosystem with the functions. I have an idea of
wrapping the GPT-4 API package with something.
82:47 (Speaker A) That will actually kind of print the functions into the context,
because Claude now has a huge context window, and then try to see
whether or not Claude is able, without additional tech, without
additional changes to the API, to replicate the outputs of how GPT
with functions would do. And that's going to be an idea I'll be
testing, hopefully, and talking about next week.
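A sketch of the idea being floated here: emulate native function calling on a model without it by printing the schemas into the context and parsing a JSON "call" back out of the completion. All names are illustrative; this is not a real Anthropic or OpenAI interface:

```python
import json

# Hypothetical function schema we want the model to be able to "call".
FUNCTIONS = [{
    "name": "get_weather",
    "description": "Get the weather for a city",
    "parameters": {"city": "string"},
}]

def build_prompt(user_message: str) -> str:
    """Print the schemas into the context and ask for a JSON reply."""
    schemas = json.dumps(FUNCTIONS, indent=2)
    return (
        "You may call one of these functions by replying with JSON of "
        'the form {"function": ..., "arguments": {...}} and nothing '
        f"else.\nAvailable functions:\n{schemas}\n\nUser: {user_message}"
    )

def parse_call(completion: str) -> dict:
    """Parse the model's JSON reply back into a function call."""
    return json.loads(completion)

# Pretend the model replied with a well-formed call:
call = parse_call('{"function": "get_weather", "arguments": {"city": "Paris"}}')
```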
83:08 (Speaker A) Thanks, Matt.
83:10 (Speaker C) Today there has been a thing, today, maybe yesterday, but anyway,
today there was a model released that generates prompts. By giving it
the data, you generate the prompt. I've written about it today on
Twitter. It is so powerful, it is such a cool method, that you can
take whatever you have, like, I don't know, scientific papers, and
generate instructions for them.
83:32 (Speaker C) Now you can fine tune a model that generates scientific papers. You
got jokes? Now you can train a model that becomes funny.
83:35 (Speaker C) You can generate the instruction, convert whatever you want into
instructions. It's amazing. One more thing about the dip in the
middle thing.
83:51 (Speaker C) I don't know why it happens. I have no idea how OpenAI trained their
models. But if you think about it, many instructions are: a
paragraph, and before the paragraph you tell the model, please
summarize the following; or, on the contrary, a paragraph, and at the
end, "what was that?" or something.
84:10 (Speaker C) So it makes a lot of sense that a model pays a lot of attention to
the beginning and the end, because of this. And on the same note,
it's very easy to fix, so I wouldn't just point fingers.
84:21 (Speaker C) It's good that they pointed it out, but I think it's like, I don't
know, a couple of minutes of training; OpenAI could fine tune for a
minute and fix it.
84:28 (Speaker A) I just want to ask, Yam: the tweet that I just pinned on top, this
was the one that you talked about, the instruction generation and the
prompt generation?
84:38 (Speaker C) Yeah.
84:39 (Speaker A) Awesome. So folks, definitely feel free to check this out. I haven't
seen this. You want to give a couple more words about that one.
84:44 (Speaker A) It looks like you wrote, like, a very deep dive. What's the model
like eleven B, three B?
84:54 (Speaker C) Sure. There are two models; put into the models whatever you want.
Okay, let's go back. You got a data set of something, emails from
your company, for example, and you want a model that will help you
write emails.
85:01 (Speaker C) Okay, you can start thinking about how to train this model, or you
can use this and now generate a text that basically says, help me
write the following email to this following person of something
something, and the actual email. And all of a sudden, you have a data
set to train a model, or to fine tune or whatever, that is extremely
tuned to this. So I think it's a very cool technique.
85:40 (Speaker C) It's very powerful, has a lot of potential. And the trick, in simple
words, is training the model what not to say. That's the missing
piece here, the trick that they added.
85:51 (Speaker C) They took instructions and outputs that do not fit, just a different
random output from the data, and trained with a different loss: that
the model should not say this, because this input with that
instruction does not result in this output. That's it.
86:11 (Speaker C) That's the trick. And it works perfectly, and it's really cool.
86:17 (Speaker A) Awesome. I have some folks who want to come up and ask questions. I
think we're almost there in terms of the updates. I will just briefly
run to some updates.
86:18 (Speaker A) I don't even have time to go and look for the threads, but if you're
not following llama.cpp, follow Georgi Gerganov; he's one of the
greats that we have in the space. I think he single-handedly is
responsible for so many folks trying to get a MacBook, because it's
incredible how much performance they've been able to squeeze out of
Llama, comparatively.
86:49 (Speaker A) And many people just, like, quantize their models, basically make
them smaller to run on this GGML platform that they have. The recent
news that I have from over there, there's like two pieces of news.
Last week, for those of us who were here last week, we talked about
CFG.
86:58 (Speaker A) I forgot something, I forgot: the guidance scale. And we talked about
the CFG parameter moving over from diffusion models that we know.
87:17 (Speaker A) Like, in Stable Diffusion, you can define how close to your prompt
the model should generate the image. Somebody decided to try it;
somebody said, hey, can we have this control of CFG for our LLM
generation? CFG is classifier-free guidance, something like that.
87:37 (Speaker A) And they did it. Georgi merged this into llama.cpp. And so now you
can actually kind of pass a CFG control and fine tune.
87:48 (Speaker A) It's almost like a running fine tune, to an extent. You can push the
model to be closer to, or farther away from, the prompt that you
have. Contrast this with the controls that we have on the GPT-4 API,
which is temperature.
88:01 (Speaker A) And I think, Matt, you mentioned something, logit bias, something
like that, right? Where you can ask it not to say certain things.
Contrasting with that, CFG is like a different beast, a different
control that we now have. And GGML just merged it into their
platform.
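The CFG idea carried over from diffusion comes down to one line of arithmetic: run the model once with the prompt and once without (or with a negative prompt), then mix the logits, with the guidance scale controlling how hard the output is pushed toward the prompt. A toy version, not the actual llama.cpp implementation:

```python
# Classifier-free guidance over logits:
#   guided = uncond + scale * (cond - uncond)
# scale = 1.0 reproduces the conditioned logits; larger values push
# the distribution further toward the prompt.

def cfg_logits(cond, uncond, scale):
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

cond = [2.0, 0.5, -1.0]    # logits computed with the prompt
uncond = [1.0, 1.0, 1.0]   # logits computed without it
assert cfg_logits(cond, uncond, 1.0) == cond
boosted = cfg_logits(cond, uncond, 2.0)  # stronger pull toward the prompt
```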
88:18 (Speaker A) Definitely worth checking out. And the second thing is, I need to
find the tweet. Yesterday, Georgi was like, oh yeah, by the way,
here's the 48% inference speed improvement that somebody just merged
in.
88:30 (Speaker A) Have you guys played and tried this? For the 33 billion parameter
model of Llama, somebody just merged in a 50% increase in inference
speed, just like that. And I find this incredible, because GGML
already runs many things on Raspberry Pi or whatever, iPhones, and
now somebody's like, oh yeah, here's a 50% increase in inference
speed.
88:41 (Speaker A) And then I think NISten was here before; he was talking about GGML
running on the iPhone, because iPhones, even from three years ago,
have the same neural engine chip as, like, the latest Macs or some
such, and this performance boost on GGML also applies to iPhones as
well. So, incredible stuff. And as we hear every week, we keep seeing
leaps, incredible leaps, in speed and performance.
89:15 (Speaker A) Definitely worth checking out GGML and the fine folks that work on
this stuff. GGML community folks who use llama.cpp, feel free to hop
up and raise your hand and give us more updates from that end.
89:28 (Speaker A) NISten, you are usually at the spaces, but sometimes as a guest as
well. Other than that, I think we'll move on to some more updates,
and then we'll just take questions. No? Cool.
89:41 (Speaker A) So the next update that I have is from the diffusion side, which we
sometimes cover. We don't cover it often, but we do cover it from
time to time. So, two things from Stability, Stable Diffusion.
89:46 (Speaker A) We talked about SDXL, the new XL model that can generate 1024 by 1024
images. We talked last week about the 0.9 weights dropping.
90:01 (Speaker A) SDXL 1.0 is now available in the Stable Diffusion discord. If you've
played with Midjourney before and then looked at Stable Diffusion, it
was like, it's not that great.
90:05 (Speaker A) Stable Diffusion SDXL 1.0 is really impressive. And besides being
really impressive, they plan to release it open source. So we're
going to see a bunch of folks fine-tune LoRAs and specific versions
for specific things.
90:16 (Speaker A) And I think it's incredible. If you want to play with those models
and you haven't yet, go to the Stable Diffusion Discord, hit up that
bot, and let us know how incredibly different it is. And we're
waiting for the SDXL 1.0 weights to drop.
90:47 (Speaker A) And I will mention this every week until the year mark: it's been
less than a year since Stable Diffusion.
90:57 (Speaker A) It's been less than a year. I remember, I think it was August '22
when they actually dropped the full open source model. Less than a year.
91:12 (Speaker A) And we've seen just such incredible progress. So, like Matt said
before, it's really hard to keep up, but it's also really hard to
internalize just how far we've come, with those incredible leaps and
changes every week. And again, to just plug this ThursdAI space.
91:21 (Speaker A) This is why we're here every Thursday, talking about everything
that's changed and updated. The other thing that I want to mention:
I see Art in the audience.
91:28 (Speaker A) If you've played with SDXL, feel free to raise your hand to come up.
The other thing that they released... I don't know if you guys are
familiar with ClipDrop. Stability bought ClipDrop as a company and
started implementing that interface alongside their Dream Studio
interface.
91:49 (Speaker A) So ClipDrop is a way simpler interface, and today they released
something called Stable Doodle. Stable Doodle is... I don't know if
folks in the audience remember this meme: how to draw an owl.
91:51 (Speaker A) Step one, draw a circle. Step two, draw some eyes. And step three is
like, draw the rest of the fucking owl.
92:06 (Speaker A) And then you have, like, a beautiful owl painting at the end of it.
This is now the go-to test for how the doodle models work. And I
pinned my attempt at this, but definitely check out the ClipDrop
Doodle thing.
It's really fun to play with. So those are, like, the updates from
the diffusion world.
92:10 (Speaker D) Hey, real quick. I was just looking at the repository for ComfyUI,
and then I saw that... I don't know how to say his name. Scousekip is
in here. So I just wanted to come on and say, like, hey, this is
incredible.
92:24 (Speaker D) This is what we've been talking about for months now, right? This
node-based character codex, if you will; there are just infinite
possibilities. I just want to listen, but thanks.
92:35 (Speaker A) For bringing me up.
92:36 (Speaker D) This is really cool, man. I was just... thanks for bringing up ComfyUI.
92:42 (Speaker A) I feel guilty at not being up to date on every single possible thing.
I know it's impossible. I really try, and ComfyUI has been on my list
to try, but then Claude was released and Code Interpreter was
released. ComfyUI seems like the thing we want, man.
92:42 (Speaker A) I think Stability, when they tried to bring up Dream Studio, talked
about, like, a node-based thing where you can pipe models into other
models, you can add filters, et cetera. ComfyUI, for folks who have
tested it out, it looks like that's it. And I definitely want to
agree with Art.
93:16 (Speaker A) It's something to watch and maybe try, because AUTOMATIC1111, even
though it's, like, super advanced and has been there from the
beginning, since Stable Diffusion, it's just, like, a shit show of a
UX. Just horrible, horrible. I'm sorry, guys.
93:30 (Speaker A) I've built a web UI before Automatic. It's really hard to get Gradio
to play as much as you want. It's really hard to maintain a good UX
product with many, many people contributing, with many, many things
changing under your feet.
93:45 (Speaker A) So it's really not their fault, but it's a shit show to get started
with. And ComfyUI seems like a fresh, clean start. So definitely, if
you're playing with this, test it out and let us know.
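For folks who haven't seen a node-based UI, the core idea being described, wiring models and filters together as a graph of nodes, can be sketched as a tiny graph executor. This is a toy stand-in to illustrate the concept, not ComfyUI's actual API; the node functions here are made up:

```python
# Toy node-graph executor illustrating the "pipe models into models"
# idea behind node-based UIs like ComfyUI. Not ComfyUI's real API.

def run_graph(nodes, edges, inputs):
    """Execute nodes in insertion order (assumed topological),
    feeding each node the outputs of the nodes wired into it.
    `nodes` maps name -> function; `edges` maps name -> list of
    upstream node names (or input keys)."""
    results = dict(inputs)
    for name, fn in nodes.items():
        args = [results[src] for src in edges.get(name, [])]
        results[name] = fn(*args)
    return results

# Stand-in "models": a prompt expander wired into an image generator.
nodes = {
    "expand": lambda p: p + ", highly detailed, studio lighting",
    "generate": lambda p: f"<image for: {p}>",
}
edges = {"expand": ["prompt"], "generate": ["expand"]}
out = run_graph(nodes, edges, {"prompt": "an owl"})
```

The point is that each node only declares its inputs; rewiring the graph changes the pipeline without touching the models themselves, which is what makes the node-based approach so flexible.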
93:55 (Speaker A) Max, you have your hand raised and you've played with SDXL. Give us
some of your thoughts.
94:01 (Speaker I) Yeah, I have played through the website, in Dream Studio. So I'm
lately working with a company that makes toys for kids. They want to
start incorporating AI. And one of my concerns, working with them, is:
okay, we want to generate images for kids, without something that is
going to freak them out. That relates to two things that diffusion
models have been lacking.
94:27 (Speaker I) One is the ability to paint complicated or intricate shapes, like
hands. SDXL is not better at it.
94:40 (Speaker I) Another one is this concept called concept bleeding, which is that
diffusion models tend to mix objects that are similar in shape or
form. It's not good at that either. Now, I was reading the paper from
Stability, or the report. They claim they are outperforming Midjourney
in five of seven categories. Midjourney 5.1, right?
95:12 (Speaker A) Just to make sure: Midjourney has since released a new version also,
because we're all moving at the same pace, but yeah, they compared to
Midjourney 5.1. Yeah.
95:20 (Speaker I) Well, this is a report released internally by Stability. It's a
paper, it might have some credibility, I don't know. I like the
results. It's very close to Midjourney, but I think it is still one
or two steps behind, in my opinion.
95:36 (Speaker I) What is different is what you have mentioned, Alex. Once they release
the weights and we can see LoRAs built on this, I'm expecting to see
the results we can get, because that is probably what is going to
position this model a step above Midjourney. But not yet; this is my
opinion.
95:58 (Speaker A) Yeah, definitely. And thanks for that. And I love folks coming up and
sharing their opinion about these things.
96:05 (Speaker A) Thanks, Max. Or, I guess... I know your new name, but I'm not sure if
I should use it.
96:10 (Speaker I) Yeah, totally, totally, you have it. I'm Juan, Spanish, living in
Mexico, and I like these things.
96:17 (Speaker A) We appreciate you coming up. On the topic of UIs that we've
mentioned: somebody released Pinocchio. They call it the AI browser.
And I want to highlight this because I want to give you practical
tips. Junaid, I think, is coming in with some breaking news.
96:28 (Speaker A) I don't know if Junaid wants to come up or can, but if you can, feel
free to come up and tell us; there's some news from Bard. Until we
talk about Bard, on the topic of UIs for those things: you guys know
we're mostly focused on the LLM side and the engineering side, less
on the diffusion side, but we sometimes have love for both. Pinocchio
is a tool that you can download so you don't have to deal with the
terminal or deal with a bunch of stuff; it unifies all of them.
97:08 (Speaker A) It's really nice. Check out the Pinocchio AI browser. I think it's
open source.
97:12 (Speaker A) You download this once, it's cross-platform, Mac, PC, et cetera, and
then you're able to download llama.cpp, and you're also able to
download Stable Diffusion. And then fairly quickly, without knowing
how to code, without going through the terminal, without installing
packages (folks here know that installing packages is a whole pain we
all share and we all hate), without doing all of that, that's the
promise they have: you are able to pipe Llama outputs into Stable
Diffusion.
97:38 (Speaker A) So Yam previously mentioned this kind of model, and Yam and Matt were
talking about a method of generating prompts for LLMs. But we also
know that there are models fine-tuned in different ways to actually
generate prompts for diffusion models. Right? And this Pinocchio
browser is actually allowing you to run an LLM and then pipe its
output into a Stable Diffusion model and then see the output of that.
I think it's incredible that this exists and is downloadable.
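The pipe described here, an LLM writing a detailed prompt that then drives a diffusion model, can be sketched with stand-in functions. Both "models" below are stubs standing in for llama.cpp and Stable Diffusion; the function names and the style suffix are made up for illustration:

```python
# Sketch of the LLM -> diffusion-prompt pipe described above.
# Both models are stubs; real use would invoke llama.cpp and
# Stable Diffusion.

def prompt_expander_llm(short_prompt: str) -> str:
    """Stand-in for an LLM fine-tuned to write diffusion prompts."""
    style = "cinematic lighting, 8k, highly detailed"
    return f"{short_prompt}, {style}"

def diffusion_model(prompt: str) -> dict:
    """Stand-in for Stable Diffusion; returns a fake image handle."""
    return {"prompt": prompt, "image": "<1024x1024 pixels>"}

def pipe(short_prompt: str) -> dict:
    # The whole pipeline: expand the prompt, then generate from it.
    return diffusion_model(prompt_expander_llm(short_prompt))

result = pipe("an owl on a branch")
```

What tools like Pinocchio promise is exactly this composition, but with the real models downloaded and wired up behind a GUI instead of code.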
98:07 (Speaker A) I haven't tried it yet. If you in the audience or somebody on stage
has tried Pinocchio, please raise your hand. I want to bring you up
to talk about Pinocchio and your experience with it.
98:19 (Speaker A) And if we haven't, I want to bring this to your attention so that
next week we're able to talk about it. This is added to my list of
things, like ComfyUI, that I haven't tried yet.
98:29 (Speaker A) Anybody use Pinocchio yet? No? Cool. I wanted to get Cocktail Peanut,
the guy who wrote Pinocchio.
98:36 (Speaker A) If you're in the audience, feel free to raise your hand. I don't
think you are, but feel free to follow the thread. He goes fairly
deep.
98:44 (Speaker A) And feel free to use and try Pinocchio by next week, and then come up
next week and talk about the differences between this and running
AUTOMATIC1111. All right, folks, thanks everyone for coming to
another ThursdAI space.
98:58 (Speaker A) Hope this has been helpful for a bunch of you. We tried a few new
things here. We tried to give updates, but also deep dive into a
conversation with Matt, and it looks from the reactions here like
maybe this is worth putting down on paper and sending out as an
email, for those of you who want to sign up and don't have the time
to listen to two-hour spaces. So I'll definitely try at least to do
that.
99:19 (Speaker A) I want to thank a few folks on stage who have joined consistently
and provided a lot of signal. Follow Yam; he has great insights into
models and training and different things. Al in the audience, thanks
always for coming up.
99:33 (Speaker A) Junaid is running the Denver meetup, and if you're in the Denver
area, feel free to join us next week. Thanks for coming. Haven't seen
you in a while, buddy.
99:45 (Speaker A) Juan, sorry. Yeah, I think, Juan, great. Maxi and Lentos have
recently been joining us.
99:51 (Speaker A) It's been great. We have some more folks in the audience who are
regulars, and we invite you to also be regulars and come up and talk
on ThursdAI. I will say this one thing: tag me in anything that's
new.
100:01 (Speaker A) I would love that. And help promote the message for other folks. If
you did like the space, this also really helps more folks get to it.
100:01 (Speaker A) For those folks whose questions I didn't get to, I apologize. I'm
trying to keep this as a balance between a high-signal thing and
letting everybody ask questions as well.
100:22 (Speaker A) Last thing I'll say is a little bit about myself. I'm a consultant.
I stay up to date so you don't have to; that's my tagline.
100:29 (Speaker A) If you're at a company that needs consultancy from somebody who's up
to date on everything, I try to be that guy. Feel free to tap me in
the DMs. And yeah, ThursdAI folks, keep tagging us in everything
that's new. We're going to try to cover it all next week.
100:34 (Speaker A) I thank all of you. Thanks for coming. Thanks for giving us two and a
half hours of your attention.
100:34 (Speaker A) I really appreciate it. Attention is scarce and very important, and I
really thank everybody who gave us, like, two and a half hours. Thank
you, folks.
101:00 (Speaker A) Hey, Alex, we really appreciate you.
101:04 (Speaker B) Thanks, Alex.
101:05 (Speaker H) Thanks for doing a good space and keeping us on track, actually.
101:09 (Speaker A) Yeah, thank you.
101:10 (Speaker D) Yeah, alex definitely want to kind of.
101:13 (Speaker A) Give our thanks to you as well.
101:15 (Speaker E) For curating an awesome space.
101:17 (Speaker D) I think I'm definitely not the only one that gets a lot of good
signal out of this. And I know a lot of hard work goes into keeping
yourself up to.
101:27 (Speaker A) Date so that you can share it.
101:28 (Speaker E) With all of us.
101:29 (Speaker D) So just on my own behalf, thank you. And I'm sure that is echoed by.
101:34 (Speaker E) A lot of people on stage and in the audience.
101:36 (Speaker A) Humble man thank you. I appreciate you. Thank you, folks. Have a nice
Thursday and bye next week.