
Wednesday, April 05, 2023

Large Language Models Are Rapidly Approaching an Important Threshold

In a previous post from a few years ago, I presented the graph below, titled "ImageNet Large Scale Visual Recognition Challenge Results." The graph illustrates the progress made in accuracy from 2010 to 2017 for a task that is now considered relatively simple in machine vision: analyzing an image and identifying the objects within it (e.g., car, gazelle, daisy, spoon). The participants in the contest were various Artificial Intelligence/Machine Vision groups, including numerous universities and companies like Google. As can be observed, the results in 2010 were quite disappointing, with even the best team frequently misidentifying the objects in the images. I recall thinking to myself, with some amusement, when I first saw those results in 2010, "they still have a long way to go before they have something useful."
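That task is easy to demonstrate today. Below is a minimal sketch of image classification with an off-the-shelf pretrained ImageNet model, assuming Python with PyTorch and torchvision installed; the model choice and the file name are illustrative, not anything from the contest entries or from our products.

    # Minimal sketch: label the objects in an image with a pretrained ImageNet model.
    # Assumes PyTorch/torchvision are available; "photo.jpg" is a hypothetical input file.
    import torch
    from PIL import Image
    from torchvision import models, transforms

    # Standard ImageNet preprocessing: resize, center-crop, normalize.
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    weights = models.ResNet50_Weights.IMAGENET1K_V2
    model = models.resnet50(weights=weights)
    model.eval()

    image = preprocess(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = model(image).softmax(dim=1)

    # Print the five most likely ImageNet categories.
    top5 = probs.topk(5)
    for p, idx in zip(top5.values[0], top5.indices[0]):
        print(f"{weights.meta['categories'][idx.item()]}: {p.item():.1%}")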

However, by 2017, these systems had surpassed human capabilities (after all, humans make mistakes too), and by 2020, at my company, we had transitioned all of our products, both deployed and under development, to these AI systems, because of their remarkable performance and because they were significantly more streamlined than our previous code.

[Graph: ImageNet Large Scale Visual Recognition Challenge Results, 2010-2017]

In a closely related domain, that of AI-based reading and writing, a system called GPT-4 was introduced. Accompanying its release was a research paper, which featured the following chart:

[Chart from the GPT-4 research paper]
GPT is following a similar trajectory to the ImageNet challenge described above. It currently makes too many mistakes to be relied upon. However, assuming its current trajectory persists, I expect GPT to surpass a similar accuracy threshold as the ImageNet challenge in the not-too-distant future, making it comparable to human performance.

My guess is that threshold will be crossed by the end of this decade.

6 comments:

Hey Skipper said...

I have to make two admissions here:

1. I have never tried any AI platform.

2. I have no expertise in any AI area.

That said, I think there is less here than meets the eye.

It couldn't have been more than a few years ago that Autonomous Vehicles were on the near horizon. I won't take the time to find the comments I made here, but I was skeptical: there is much more to knowledge than what sits at the observable, surface level. (The same is true for language.) It is impossible to fully and accurately convey all the knowledge that goes into operating a car, never mind how it was acquired. This last bit is particularly important when it comes to AI.

The graphs in the post show how error rates have decreased, or factual content has increased. How? Did the AI programs discover their own errors, or did humans? The former is difficult to envision, but not quite impossible: go with what the majority of algorithms say is true. Leave aside the problem of self-correction in the event the majority of algorithms were wrong. But if the latter, then AI can never get any better than humans at the task. Faster, maybe, but not better.
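To make the "majority of algorithms" idea concrete, here is a toy sketch of majority voting across several classifiers; the labels are invented for illustration, and this is not how the contest error rates were actually scored.

    # Toy sketch of "go with what the majority of algorithms say is true":
    # several classifiers label the same image and the consensus label wins.
    from collections import Counter

    def majority_vote(predictions):
        """Return the most common label and how many models agreed on it."""
        label, count = Counter(predictions).most_common(1)[0]
        return label, count

    # Hypothetical outputs from three different models for one image.
    votes = ["gazelle", "gazelle", "antelope"]
    label, count = majority_vote(votes)
    print(f"Consensus: {label} ({count} of {len(votes)} models agree)")

    # If the majority is wrong, the consensus is wrong too, which is exactly
    # the self-correction problem raised above.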

This is even more of an issue with factual evaluation. The number of things whose truth value is dichotomous is tiny, and few of them are of any particular interest. GPT-4 is 81% factual in historical statements. Okay, assume that is true. Pearl Harbor occurred on Dec 7, 1941. So what? It is entirely possible that GPT-4 could ascertain the truth value between competing factual statements, say Dec 7 versus Dec 8 (well, aside from the fact that Dec 8 is also correct), but couldn't give a factual answer as to whether the attack caused the US's entry into WWII. No such answer exists.

It isn't as if "AI" doesn't yet exist in significant forms. Tax preparation software is, to the user, indistinguishable from AI. Yet somehow, there aren't enough accountants to go around. Wikipedia articles aren't AI, but fulfill the same function. (AI prompt: without looking at Wikipedia, determine which response is from a human, AI, or Wikipedia.)

Youtube "how to" videos contain astonishing amounts of expertise, conveyed in ways AI can't begin to approach.

Yes, AI can pass tests. By cheating. The Bar exam is not open book for the humans who take it, but it very much is for AI.

So my prediction is that this will turn out like autonomous vehicles. The gap between knowledge and expertise is beyond explanation, because explanation exists solely in the realm of knowledge. That is the gap that AVs can't bridge, and AI won't, either.

Lately, on Fridays, James Lileks devotes some space to AI "art". Scroll down to the section entitled "Dream Compendium". Welcome to Uncanny Valley, but worse.

On the Dark Horse podcast, Bret Weinstein and Heather Heying discuss AI, and think it likely a serious threat.

The important threshold that AI may be approaching is being nearly as good as, but far faster than, humans at very specific, rote tasks. Some of these have already been effectively achieved (accountancy, basic legal advice, etc.), but without earth shaking consequences.

And in other ways, it won't achieve anything particularly interesting.

Hey Skipper said...

Here is another take on AI that makes sense to me.

Bret said...

Hey Skipper,

It's great to hear from you - it's been a long time.

Your two comments here seem sorta like Jekyll and Hyde. No "earth shaking consequences" versus "the intellectual power loom" that's gonna make it so humans eventually have nothing intellectual to contribute. How are those even vaguely consistent?

You wrote: "Yes, AI can pass tests. By cheating. The Bar exam is not open book for the humans who take it, but it very much is for AI."

That's simply not true, at least specifically for the tests and scores described by the paper I linked to in my post. GPT-4 and the earlier versions had no access to the Internet or books or any other resource when taking the tests. The AIs are limited in the exact same way a human test taker would be - they rely on the knowledge that they learned during their training period and nothing else.

I find this fundamental misconception frustrating because it's hard to have even the most basic discussions about GPT LLMs with those who think that they are something completely different than what they are.

I suspect part of this confusion is due to the fact that some AIs are now being used as front ends or partners to search engines. Google's AI front end, Bard, is dismally bad compared to GPT. Bing's front end is GPT-4, but access is limited so far and I don't have it yet.

But the AI is fundamentally separate from the search engine and is fundamentally different.

Hey Skipper said...

Your two comments here seem sorta like Jekyll and Hyde. No "earth shaking consequences" versus "the intellectual power loom" that's gonna make it so humans eventually have nothing intellectual to contribute. How are those even vaguely consistent?

I had left that out of my previous post, then typed in haste. Badly.

Now that I have re-read the link for the first time in a week or so, it is littered with seeming conceptual errors. Start here:

Rather than being smarter than the very smartest humans, it is “merely” smarter than most undergraduates. And since the technology appears to be progressing rapidly, the possibility that some future version will be smarter than the very smartest humans is not at all inconceivable.

AI is not smart. AI is the silicon instantiation of an idiot savant. It can regurgitate facts, and even string them into coherent sentences, but it cannot discover its own errors. Human intelligence synthesizes new knowledge from within itself. AI requires external intelligence to create the knowledge that it then regurgitates.

Further from the article: If you want a brief report written on a particular subject, you’re better off going to a human specialist than you are using ChatGPT. However, you’re better off using ChatGPT than you are going to most humans.

That is a strange comparison. Regardless of ChatGPT, going to most humans for a report on a particular subject would be a futile exercise. I wouldn't ask you for an expert report on an aviation subject any more than you would ask me for insight into robotics. If either of us wanted something cobbled together from Wikipedia and a few other sources, then we could each do as well as ChatGPT, if perhaps taking longer to do it.

AI can't be any better than the source material available to it, because it can't create its own sources, and will never be capable of serendipity: the introduction of knowledge seemingly outside the problem space of the prompt.

From my area of expertise, take Air France 447 as an example. Given the facts of the crash, what are the causes? AI will never get to them on its own.

That's why I think AI will hit the same wall as AV. There is an expertise expanse underlying the overt factual layer. Where AI will have an impact is in realms where expertise is minimal, or non-existent. AI can't discover reportorial facts. But once they are discovered, AI already can do as good a job writing an article as any clickbait human writer, and probably isn't too far off from average Op Eds.

[Exams are open book for AI.] That's simply not true, at least specifically for the tests and scores described by the paper I linked to in my post. GPT-4 and the earlier versions had no access to the Internet or books or any other resource when taking the tests.

You are right — much confusion is down to AI being used as a front end. But if part of its training consists of uploading entire texts, then the test is effectively open book. However, if its training consists of "learning" about concepts such as admissibility, standing, etc, and its response to a prompt is based upon those concepts alone, then that is something else.

Bret said...

Hey Skipper wrote: "AI can't be any better than the source material available to it, because it can't create its own sources, and will never be capable of serendipity: the introduction of knowledge seemingly outside the problem space of the prompt."

The statement is mostly true relative to GPT-4, but I don't find it a meaningful argument. If an AI can read 10,000 books on a subject and link the knowledge of those books, and you'll only read 1,000 books in your lifetime on that subject, then the revelations of the AI might well seem serendipitous, because no human can absorb that level of knowledge and put it all together.

Hey Skipper said...

The statement is mostly true relative to GPT-4, but I don't find it a meaningful argument. If an AI can read 10,000 books on a subject and link the knowledge of those books, and you'll only read 1,000 books in your lifetime on that subject ...

I think Marginal Return might have something to say here.

Reading the first book will (presuming no prior acquaintance with the subject) yield an infinite return on the time investment. The tenth book will yield not nearly as much, and the 1,000th scarcely any at all. Therefore, that AI could read 10,000 books is no reason to believe that it will return 10 times as much knowledge.
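To put a toy number on that: assume, purely for illustration, that useful knowledge grows with the logarithm of books read. Under that invented model, 10,000 books buys only a little more than 1,000 do:

    # Toy model of diminishing marginal returns on reading.
    # The logarithmic assumption is purely illustrative, not a claim about
    # how much an AI (or a human) actually learns per book.
    import math

    def knowledge(books_read):
        """Assumed knowledge gained, proportional to log(1 + books read)."""
        return math.log1p(books_read)

    for n in (1, 10, 1_000, 10_000):
        print(f"{n:>6} books -> relative knowledge {knowledge(n):.2f}")

    ratio = knowledge(10_000) / knowledge(1_000)
    print(f"10,000 books yields about {ratio:.2f}x the knowledge of 1,000, not 10x")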

But that's not the end of it. Because AI has no underlying expertise, it can't read a book and think "Hey, wait a minute." Its choice of 10,000 books will be just one damn book after another. But a human, with that expertise, which isn't expressible in formal terms, can take a directed path in future reading.

Amateur Hour: A Little Bit of Python Can Let ChatGPT Discuss Your Documents at Volokh has some interesting examples of AI replies to legal questions, and also some very good comments.
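For anyone curious what that little bit of Python amounts to, the usual pattern is: split the documents into chunks, pull out the chunks most relevant to the question, and paste them into the prompt. The sketch below is a toy approximation of that pattern using plain word overlap instead of real embeddings; ask_model is a hypothetical stand-in for whatever chat API is actually called, and none of this is the code from the linked article.

    # Toy sketch of the "let ChatGPT discuss your documents" pattern:
    # chunk the text, pick the chunk that best matches the question, and
    # stuff it into the prompt. Real systems use embeddings; this uses
    # crude word overlap purely for illustration.

    def chunk(text, size=500):
        """Split a document into roughly size-character chunks."""
        return [text[i:i + size] for i in range(0, len(text), size)]

    def best_chunk(question, chunks):
        """Pick the chunk sharing the most words with the question."""
        q_words = set(question.lower().split())
        return max(chunks, key=lambda c: len(q_words & set(c.lower().split())))

    def build_prompt(question, document):
        context = best_chunk(question, chunk(document))
        return (f"Answer using only this excerpt:\n{context}\n\n"
                f"Question: {question}")

    # ask_model() is hypothetical; substitute a real chat API call.
    # print(ask_model(build_prompt("What is standing?", open("brief.txt").read())))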

The Turing test for computer intelligence states that a computer is intelligent if an outside observer can't tell which party to a conversation is the human and which is the computer.

So far as that goes, it makes sense.

However, and I know this sounds like hubris on stilts, Turing missed a step.

AI is intelligent when two AIs carry on a conversation, two humans carry on another conversation, and a human observer can't tell the difference.

Right off the bat, there is a problem: connect two AIs, and step back. Then what? Okay, give them a mulligan. The human observer provides a prompt. How long before the conversation goes completely off the rails?
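The proposed test is at least easy to state as a procedure. Here is a bare-bones sketch of the "connect two AIs and step back" setup; ai_reply is a hypothetical stand-in for a call to whatever model gets connected, and judging the transcript against a human-human conversation is left to the human observer.

    # Bare-bones sketch of the two-AI conversation test described above.
    # ai_reply() is a hypothetical stand-in for a real chat-model call; the
    # comparison against a human-human transcript is left to the observer.

    def run_dialogue(ai_reply, seed_prompt, turns=10):
        """Let two AI speakers alternate replies, starting from the observer's prompt (the mulligan)."""
        transcript = [("observer", seed_prompt)]
        last_message = seed_prompt
        for turn in range(turns):
            speaker = "AI-1" if turn % 2 == 0 else "AI-2"
            last_message = ai_reply(speaker, last_message, transcript)
            transcript.append((speaker, last_message))
        return transcript

    # Usage, once a real model is plugged in for ai_reply:
    # for speaker, message in run_dialogue(ai_reply, "Discuss the causes of US entry into WWII."):
    #     print(f"{speaker}: {message}")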