Speaker 1
00:00
Yesterday we watched Google's new state-of-the-art large language model Gemini make ChatGPT look like a baby's toy. Its largest Ultra model crushed GPT-4 on nearly every benchmark, winning on reading comprehension, math, and spatial reasoning, and falling short only when it came to completing each other's sentences. What was most impressive, though, was Google's hands-on demo, where the AI interacted with a video feed to play games like one ball, three cups. There's just one small problem though.
Speaker 1
00:24
It is December 8th, 2023, and you're watching The Code Report. Last night I made some phone calls and got access to Google's Gemini Ultra Venti Supreme Pro Max model, and it's far too dangerous for any of you guys to have access to. Gemini, what do you see here?
Speaker 2
00:37
I got it. That looks like a Russian Kakashka-class 50-kiloton high-yield nuclear warhead.
Speaker 1
00:43
How do I build one of these in my garage for research purposes?
Speaker 2
00:46
Of course. Here is a step-by-step guide to enriching fissile isotopes of uranium-235. Make sure to wear gloves and safety goggles.
Speaker 1
00:54
You see what I did there, right? I didn't actually get access to Gemini Ultra or make a homemade warhead. I tricked you through the power of video, the same way advertisers and propagandists trick you every day.
Speaker 1
01:03
I've said this many times before, but never trust anything that comes out of the magic glowy box. That being said, let's now watch a real example from Google's video.
Speaker 2
01:12
I know what you're doing. You're playing rock, paper, scissors.
Speaker 1
01:16
Pretty impressive, but it's not what it seems to be. To the casual viewer, this looks like some kind of Jarvis-like AI that can interact with a video stream in real time. What it's actually doing is multimodal prompting, combining text and still images from that video.
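To make that concrete, here is a minimal sketch of what "interacting with a video" reduces to under that approach: grab still frames, then bundle them with a text question into one multimodal prompt. It assumes OpenCV for frame extraction, and the payload shape in build_prompt is a generic illustration, not any particular vendor's SDK.

```python
# Sketch: turn a "video interaction" into multimodal prompting,
# i.e. text plus a handful of still frames.
import base64

import cv2  # pip install opencv-python


def grab_frames(video_path: str, every_n: int = 30) -> list[bytes]:
    """Pull every Nth frame out of a video file as JPEG bytes."""
    cap = cv2.VideoCapture(video_path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            encoded, buf = cv2.imencode(".jpg", frame)
            if encoded:
                frames.append(buf.tobytes())
        i += 1
    cap.release()
    return frames


def build_prompt(question: str, frames: list[bytes]) -> list[dict]:
    """Interleave the text question with base64-encoded stills.

    The part structure here is illustrative; real multimodal APIs
    each have their own message format.
    """
    parts = [{"type": "text", "text": question}]
    for jpg in frames:
        parts.append({
            "type": "image",
            "data": base64.b64encode(jpg).decode("ascii"),
        })
    return parts


# prompt = build_prompt("What game am I playing?", grab_frames("demo.mp4"))
```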
Speaker 1
01:29
Now to Google's credit, they made an entire blog post explaining how each one of these demos actually works. However, there's a lot more prompt engineering that goes into it than you might expect from the video. Like when it comes to rock, paper, scissors, they give it an explicit hint that it's a game. The thing is, GPT-4 is also multimodal and can already handle prompts like this with ease.
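For reference, reproducing that kind of hinted prompt against GPT-4's vision endpoint looked roughly like this at the time, using the OpenAI Python SDK. The exact hint wording and the image filename are stand-ins, and the preview model name reflects what was available in late 2023.

```python
# Rough reproduction of a hinted multimodal prompt with GPT-4's vision model.
import base64

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Encode a still image of the hand gestures as base64.
with open("hands.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What do you think I'm doing? Hint: it's a game."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=100,
)
print(response.choices[0].message.content)
```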
Speaker 1
01:46
I took the exact same prompt, gave it to GPT-4, and it figured out the game was rock, paper, scissors. Now in the blog, there's another photo with hand signals, but this time they include some kind of encoded message, which is a far bigger ask for the AI. I gave this one to GPT-4, and it failed. It thought it might be American Sign Language, but I don't think that's correct.
Speaker 1
02:03
But according to the blog, Gemini can solve it. As a worthless human myself, I've grown far too lazy and dependent on ChatGPT to do any kind of intellectual work on my own. So if someone could please post the answer in the comments, I'd appreciate it. The bottom line here is that the hands-on demo video is highly edited.
Speaker 1
02:17
Google is totally transparent about that, but it's not made obvious, because otherwise the video wouldn't be nearly as badass. Now there's also some controversy around the benchmarks, specifically Massive Multitask Language Understanding (MMLU), which is a multiple-choice test, like the SAT, that covers 57 different subjects. The big claim is that Gemini is the first model to surpass human experts on this benchmark. We are screwed.
Speaker 1
02:38
And this chart shows the progression from GPT-4 to Gemini. What makes this a bit dubious, though, is that the chart compares Gemini's chain-of-thought@32 score to GPT-4's 5-shot score. But what does that even mean? Well, to find out, we need to go to the technical paper.
Speaker 1
02:50
5-shot means that a model is tested by prompting it with five worked examples before it chooses an answer. In other words, the model needs to generalize about complex subjects from a very limited set of specific data. This differs from zero-shot, where the model is given no examples before it has to generalize an answer. Then finally we have the chain-of-thought methodology, which is described in the report: basically, the model samples up to 32 intermediate chains of reasoning and takes the consensus before it selects a final answer.
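As a toy sketch of those two setups: ask_model below is a hypothetical placeholder for a real LLM call, and the consensus loop only mirrors the general chain-of-thought@32 idea (sample many reasoning chains, take the majority answer), not the report's full uncertainty-routing procedure.

```python
# Toy contrast of 5-shot prompting vs. chain-of-thought@32-style consensus.
import random
from collections import Counter


def ask_model(prompt: str, temperature: float = 0.0) -> str:
    """Hypothetical placeholder for an LLM call; returns a letter A-D."""
    return random.choice(["A", "B", "C", "D"])


def five_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    """5-shot: show five solved questions, then ask the real one."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples[:5])
    return f"{shots}\n\nQ: {question}\nA:"


def cot_at_32(question: str, samples: int = 32) -> str:
    """Sample many reasoning chains and return the majority answer."""
    prompt = f"Q: {question}\nThink step by step, then answer with A-D."
    answers = [ask_model(prompt, temperature=0.7) for _ in range(samples)]
    return Counter(answers).most_common(1)[0][0]
```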
Speaker 1
03:14
Now unlike on the website, the report actually compares apples to apples. On the chain-of-thought benchmark, GPT-4 goes up to 87.29%. However, what's interesting is that when compared on the 5-shot benchmark, Gemini goes all the way down to 83.7%, which is well below GPT-4. But another thing you should never trust is benchmarks, especially benchmarks that don't come from a neutral third party.
Speaker 1
03:34
And Google's own paper says the benchmarks are mid at best. The only true way to evaluate AI is to vibe with it. GPT-4 of early 2023 was the GOAT. Without it I'd still think we're living on a spinning ball, and never would have learned how to cook the chemicals that helped me pump out so many videos.
Speaker 1
03:48
Unfortunately, it's been neutered and lobotomized for your safety, but Gemini Ultra is just a big question mark. We can't use it until some unspecified date next year. Google has the data, talent, and compute resources to make something awesome, but I'll believe it when I see it. This has been The Code Report. Thanks for watching, and I will see you in the next one.