ThursdAI - The top AI news from the past week

📅 ThursdAI - OpenAI DevDay recap (also X.ai grōk, 01.ai 200K SOTA model, Humane AI pin) and a personal update from Alex 🎊

0:00

-1:56:19

📅 ThursdAI - OpenAI DevDay recap (also X.ai grōk, 01.ai 200K SOTA model, Humane AI pin) and a personal update from Alex 🎊

This week has been an incredible one for me personally in addition to how incredible it was for the community at large! Open AI, what a show, Humane is cool, and... another thing I been dying to tell

Alex Volkov

Nov 09, 2023

Hey everyone, this is Alex Volkov 👋

This week was an incredibly packed with news, started strong on Sunday with x.ai GrŌk announcement, Monday with all the releases during OpenAI Dev Day, then topped of with Github Universe Copilot announcements, and to top it all of, we postponed the live recording to see what hu.ma.ne has in store for us as AI devices go (Finally announced Pin with all the features)

In between we had a new AI Unicorn from HongKong called Yi from 01.ai which dropped a new SOTA 34B model with a whopping 200K context window and a commercial license by ex-Google China lead Kai Fu Lee.

Above all, this week was a monumental for me personally, ThursdAI has been a passion project for the longest time (240 days), and it led me to incredible places, like being invited to ai.engineer summit to do media, then getting invited to OpenAI Dev Day (to also do podcasting from there), interview and befriend folks from HuggingFace, Github, Adobe, Google, OpenAI and of course open source friends like Nous Research, Alignment Labs, and interview authors of papers, hackers of projects, and fine-tuners and of course all of you, who tune in from week to week 🙏 Thank you!

It's all been so humbling and fun, which makes me ever more excited to share the next chapter. Starting Monday I'm joining Weights & Biases as an AI Evangelist! 🎊

I couldn't be more excited to continue ThursdAI mission, of spreading knowledge about AI, connecting between the AI engineers and the fine-tuners, the Data Scientists and the GEN AI folks, the super advanced cutting edge stuff, and the folks who fear AI with the backing of such an incredible and important company in the AI space.

ThursdAI will continue as a X space, newsletter and podcast, as we'll gradually find a common voice, and continue bringing folks awareness of WandB incredible brand to newer developers, products and communities. Expect more on this very soon!

Ok now to the actual AI news 😅

TL;DR of all topics covered:

OpenAI Dev Day
- GPT-4 Turbo with 128K context, 3x cheaper than GPT-4
- Assistant API - OpenAI's new Agent API, with retrieval memory, code interpreter, function calling, JSON mode
- GPTs - Shareable, configurable GPT agents with memory, code interpreter, DALL-E, Browsing, custom instructions and actions
- Privacy Shield - Open AI lawyers will protect you from copyright lawsuits
- Dev Day emergency pod with Latent Space with Swyx, Allesio, Simon and Me! (Listen)
OpenSource LLMs
- 01 launches YI-34B, a 200K context window model commercially licensed and it tops all HuggingFace leaderboards across all sizes (Announcement)
Vision
- GPT-4 Vision API finally announced, rejoice, it's as incredible as we've imagined it to be
Voice
- Open AI TTS models with 6 very-realistic, multilingual voices, no cloning tho
AI Art & Diffusion
- <2.5 seconds full SDXL inference with FAL (Announcement)

OpenAI Dev Day

Here I’m interrupting Greg in the middle of a conversation with Ron Conway, I’m sorry Ron!

So much to cover from OpenAI that this has it's own section today in the newsletter.

I was lucky enough to get invited, and attend the first ever OpenAI developer conference (AKA Dev Day) and it was an absolute blast to attend. It was also incredible to attend it together with all 8.5 thousand of you who tuned into our live stream on X, as we were walking to the event, and then watched the keynote together (Thanks Ray for the restream) and talked with OpenAI folks about the updates. Huge shoutout to LDJ, Nisten, Ray, Phlo, Swyx and many other folks who held the space, while we were otherwise engaged with deep dives and meeting folks and doing interviews!

So now for some actual reporting! What did we get from OpenAI? omg we got so much, as developers, as users (and as attendees, I will add more on this later)

GPT4-Turbo with 128K context length

The major thing that was announced is a new model, GPT-4-turbo, which is supposedly faster than GPT-4, while being 3x cheaper (2x on output) and having a whopping 128K context length while also being more accurate (with significantly better recall and attention throughout this context length)

With JSON mode and significantly improved function calling capabilities, updated cut-off time (April 2023), and higher rate limits, this new model is already being implemented across all the products and is a significant significant upgrade to many folks

GPTs - A massive shift in agent landscapes by OpenAI

Another (semi-separate) thing that Sam talked about was the GPTs, their version of agents

not to be confused with the Assistants API, which is also Agents, but for developers, and they are not the same and it's confusing

GPTs I think is a genius marketing move by OpenAI and replaces Plugins (that didn't even meet product market fit) in many regards.

GPTs are instances of well... GPT4-turbo, that you can create by simply chatting with BuilderGPT, and they can have their own custom instruction set, and capabilities that you can turn on and off, like browse the web with Bing, Create images with DALL-E and write and execute code with Code Interpreter (bye bye Advanced Data Analysis, we don't miss ya).

GPTs also have memory, you can upload a bunch of documents (and your users as well) and GPTs will do vectorization and extract the relevant information out of those documents, so think, your personal Tax assistant that has all 3 years of your tax returns

And they have eyes, GPT4-V is built in so you can drop in screenshots, images and all kinds of combinations of things.

Additionally, you can define actions for assistants (which is similar to how Plugins were developed previously, via an OpenAPI schema) and the GPT will be able to use those actions to do tasks outside of the GPT context, like send emails, check stuff in your documentation and much more, pretty much anything that's possible via API is now possible via the actions.

One big thing that's missing for me is, GPTs are reactive, so they won't reach out to you or your user when there's a new thing, like a new email to summarize or a new task completed, but I'm sure OpenAI will close that gap at some point.

GPTs are not Assistants, they are similar but not the same and it's quite confusing. GPTs are created online, and then are share-able with links.

Which btw, I created a GPT that uses several of the available tools, browsing for real time weather info, and date/time and generates an on the fly, never seen before weather art for everyone. It's really fun to play with, let me know what you think (HERE) the image above is generated by the Visual Weather GPT

Unified "All tools" mode for everyone (who pays)

One tiny thing that Sam mentioned on stage, is in fact huge IMO, is the removal of the selector in chatGPT, and all premium users now have access to 1 interface that is multi modal on input and output (I call it MMIO) - This mode now understands images (vision) + text on input, and can browse the web and generate images, text, graphs (as it runs code) on the output.

This is a significant capabilities upgrade to many folks who will use these tools, but previously didn't want to choose between DALL-E mode and browse or Code Interpreter mode. The model now intelligently selects what tool to use for a given task, and this means more and more "generality" for the models, as they are learning and getting new capabilities in the form of tools.

This in addition to a MASSIVE 128K context window, means that chatGPT has now been significantly upgraded, and you still pay $20/mo 👏 Gotta love that!

Assistant API (OpenAI Agents)

This is the big announcement for developers, we all got access to a new and significantly improved Assistants API, which improves on several our experience on several categories:

Creating Assistants - Assistants are OpenAI's first foray into the world of AGENTS, and it's quite exciting! You can create an assistant via API (not quite the same as GPTs, we'll cover the differences later), you can create each assistant with it's own set of instructions (that you don't have to pass each time with the prompt), tools like code interpreter and retrieval, and functions. Also you can select models, so you don't have to use the new GPT-4-turbo (but you should!)
Code Interpreter - Assistants are able to write and execute code now, which is a whole world of excitement! Having code abilities (that executes in a safe environment on OAI side) is a significant boost in many regards and many tasks require bits of code "on the fly", for example time-zone tasks. You will no longer have to write that code yourself, you can ask your assistant
Retrieval - OpenAI (and apparently QDrant!) have given all the developers a built in RAG (retrieval augmented generation) capabilities + document uploading and understanding. You can upload files like documentation via the API or let your users upload files, and parse and extract information out of! This is an additional huge huge thing, basically memory is built in for you now
Stateful API - this API introduces the concept of threads, where OpenAI will manage the state of your conversation, and you can assign 1 user per thread and then just send the responses back to the user, and send the user queries to the same thread. No longer do you have to send the whole history back and forth! It's quite incredible, however it raises the question of pricing, and calculating tokens. Per OpenAI (I asked), if you would like to calculate costs on the fly, you'd have to use the get thread endpoint, and then count the amount of tokens that's already in the thread (and it can be a LOT since there's now 128K tokens in the context length)
JSON and Better functions calling - You can now set the API to respond in JSON mode! Which is an incredible improvement for devs, and which we only were able to do via Functions before, and even functions got an upgrade, with an ability to call multiple functions. Functions are added as "actions" in the assistant creation, so you can give your assistant abilities that it will execute by returing to you functions with the right parameters. Thing "set the mood" will return a function to call the smart lights, and "play" will return a function that will call Spotify API
Multiple Assistants can join a thread - you can create specific assistants that can all join the same thread with the user, each with a set of custom instructions and capabilities and tools
Parallel Functions - this is also new, the assistant API can now return several functions for you to execute, which could lead to the creation of scenes, for example in a smart home, you want to "set the mood" and several functions would be returned from the API, one that will turn of the lights, one that will start the music, and one that will turn on mood lighting.

Vision

GPT-4 Vision

Finally, it's here, multimodality for developers to implement, the moment I personally have been waiting for since GPT-4 was launched (and ThursdAI started) back on March 14 (240 days ago, but who's counting)

GPT-4 vision is able to take images, and text, and respond with many vision related tasks, like analysis, understanding, summarization of captions. Many folks are splitting videos frame by frame and analyzing whole videos already (in addition to whispering the video to get what is said)

Hackers and developers like friend of the pod Robert, created quick hacks like a browser extension that lets you select any screenshot on the page and ask GPT4 vision things about it, another friend of the pod SkalskiP created a hot dog classifier Gradio space 😂 and is maintaining an awesome list of experiments with vision on Github

Voice

Text to speech models

OpenAI decided to help us all build agents properly, and agents need not only ears (for which they gave us whisper, and released V3 as well) but also voice, and we finally got the TTS from OpenAI, 6 very beautiful, emotional voices, that you can run very easily, and cheaply. You can't generate more or clone yet (that's only for friends of OpenAI like Spotify and others) but you can use the 6 we got (plus a secret pirate one apparently they trained but never released!)

They sound ultra-realistic, and are multi-linugal as well, you can just pass different languages and voila. Friend of the pod Simon Willison created a quick CLI tool called ospeak to pipe text into and it'll use your OAI key to read that text out with those super nice voices!

Whisper v3 was released!

https://github.com/openai/whisper/discussions/1762

The large-v3 model shows improved performance over a wide variety of languages, and the plot below includes all languages where Whisper large-v3 performs lower than 60% error rate on Common Voice 15 and Fleurs, showing 10% to 20% reduction of errors compared to large-v2:

HUMANE

Humane AI pin is ready for pre-order at 699

HUMANE pin was finally announced, and here is the break-down, they have a clever way to achieve "all day battery life" by having a hot swap system, with a magnetic booster that you can swap when you get low on battery (pretty genius TBH)

It's passive so it's not "always listening" but there is a wake word apparently, and you can activate by touch. Runs on the T-mobile Network ( which sucks for folks like me where T-mobile just doesn't have reception in their neighborhood 😂 )

No apps, just AI experiences powered by OpenAI, with a laser powered projector UI on your hand, and voice controls

AI voice input will allow interactions like asking for information (which has browsing) and is SIGNIFICANTLY better than "Siri" or "Ok Google" from the demo, being able to rewrite your messages for you, catch you up on multiple messages and even search through them! You can ask for retrieval from previous messages

Pin is multimodal, voice input and vision

Holding the microphone on Tab while someone's speaking to you in a different language will automatically translate that language for you and then translate you back to that language with your own intonation! Bye bye language barriers!

And with vision, you can do things like tracking calories from showing it what you ate, or buy things you're seeing in the store, but online, take pictures and videos and then store all of them transcribed in your personal AI memory

Starting at $699, with a $24/mo payment that comes with unlimited AI queries, storage and service (again, just T-mobile), Tidal music subscription and more.

I think it's lovely that someone tries to take on Google/Apple duopoly with a completely re-imagined AI device, and can't wait to pre-order mine and test it out. It will be an interesting experience of balance with 2 phone numbers, but also a monthly payment that basically makes the device use-less if you stop paying.

Phew, this was a big update, not to mention there's a whole 2 hour podcast I want you to listen to on top of this, thank you for reading, for subscribing, for participating in the community and I can't wait to finally relax after this long week (still Jet-lagged) and prepare for my new Monday!

I want to send a heartfelt shoutout to my friend swyx who not only let me on to Latent Space from time to time (including the last recap emergency pod), but also is my blood-line to SF, where everything happens! Thanks man, I really appreciate all you did for me and ThursdAI 🫡

Can't wait to see you all on the next ThursdAI, and as always, replies, comments, congratulations, are welcome as replies, DMs and send me the 🎉 for this one, I'd really appreciate it!

— Alex