How Do We Know How Intelligent Artificial Intelligence Is?
With some "IQ tests" specially designed for AI
This is a question I should have asked myself earlier; I think I took it for granted. I talk to various AIs every day and have never really wondered how intelligent they actually are, just as I wouldn't if I were talking to a human. But in this case, understanding it seems incredibly important to me.
I've been using Claude 3.5 Sonnet (spoiler alert: it has surprised me with how good it is). However, I couldn't help feeling a bit confused when I saw some benchmarks that accompany its launch page.
According to this, Claude 3.5 Sonnet scores very highly across a whole set of benchmarks. Pretty impressive, right?
And here we can see how Anthropic's latest model compares with other LLMs like Claude 3 Opus (also from Anthropic), GPT-4 from OpenAI, Gemini 1.5 Pro from Google, and Llama-400b from Meta.
The first time I saw these graphs and tables, I thought: Wow, this must be incredibly intelligent! Five minutes later, I realized I actually had no idea what GPQA, MMLU, BIG-Bench-Hard, and the other mysterious acronyms were. I pictured them as something like IQ tests for artificial intelligence, and I'm sure I'm not the only one wondering what they really are. With that in mind, I decided to dig deeper into the topic, so now I'm going to try to give a brief explanation of each one.
By the way, you'll notice we mention Large Language Models (LLMs) a lot; if you're not familiar with them, I recommend taking a look at the article I wrote about LLMs.
How is intelligence measured?
We humans have designed tests to evaluate language models like ChatGPT, Claude, Llama, etc., and understand how good they are at answering certain types of questions or performing certain tasks. These benchmarks are what you see in the table when you read things like MGSM, HumanEval, GPQA, etc.
Besides those strange acronyms, the table also shows things like 5-shot, 0-shot CoT, and 3-shot CoT, which I plan to explain in a future post. For now, just know that they refer to how we "ask questions" to the model during testing.
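In the meantime, here's a rough sketch (in Python, with questions I made up for the occasion) of what those prompting styles look like, just so the terms don't sound so mysterious:

```python
# A rough illustration of the prompting styles mentioned above.
# The questions and answers here are invented for the example.

question = "A train travels 60 km in 1.5 hours. What is its average speed?"

# 0-shot: the model gets only the question, with no solved examples.
zero_shot = f"Q: {question}\nA:"

# 5-shot: the prompt includes 5 solved examples before the real question
# (only one is shown here to keep the sketch short).
few_shot = (
    "Q: A car travels 100 km in 2 hours. What is its average speed?\n"
    "A: 50 km/h\n"
    "...\n"  # three more solved examples would go here in a real 5-shot prompt
    f"Q: {question}\nA:"
)

# 0-shot CoT: no examples, but the model is nudged to reason step by step
# (that's the "chain of thought" part).
zero_shot_cot = f"Q: {question}\nA: Let's think step by step."

print(zero_shot)
print(few_shot)
print(zero_shot_cot)
```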
Now, let's look at each of these different "intelligence tests for AI".
GPQA Diamond
GPQA stands for "A Graduate-Level Google-Proof Q&A Benchmark," meaning it's a graduate-level question-and-answer test that's "Google-proof": you can't answer the questions just by searching the web. It's a set of 448 multiple-choice questions created by experts in biology, physics, and chemistry, specifically designed to be both extremely high quality and incredibly challenging.
Just how difficult can these questions be?
Well, so difficult that experts who have or are about to get a PhD in these areas only manage to solve 65% of the questions. So difficult that non-experts could only achieve 34% accuracy even when taking their time and being allowed to use the Internet!
Of those 448 questions, there are 198 that are the most difficult to answer without deep knowledge in these subjects. These 198 are known as GPQA Diamond.
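Since these are multiple-choice questions, scoring them is conceptually simple: count how often the model picks the right option. Here's a toy sketch of that idea (the questions below are placeholders, not real GPQA items):

```python
# Toy scoring for a multiple-choice benchmark like GPQA: the score is just
# the fraction of questions where the model picked the correct option.
# These "questions" are placeholders, not real GPQA items.

questions = [
    {"correct": "B", "model_answer": "B"},
    {"correct": "D", "model_answer": "A"},
    {"correct": "C", "model_answer": "C"},
]

hits = sum(q["model_answer"] == q["correct"] for q in questions)
accuracy = hits / len(questions)
print(f"Accuracy: {accuracy:.0%}")  # 67% in this toy example
```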
MMLU
MMLU stands for Massive Multitask Language Understanding. This test measures how well an LLM handles many different kinds of tasks, evaluating it with multiple-choice questions across 57 subjects, including elementary mathematics, history, sociology, and more.
The idea behind this test is that to pass it, the model must have comprehensive world knowledge and good problem-solving ability.
So, with questions covering 57 areas of knowledge, we can see how well the LLM performs in each one.
It's estimated that humans solve these tests with accuracy ranging from 34.5% for non-specialists up to 89.8% for experts.
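Because the questions are grouped into subjects, the results can be broken down per subject and then averaged. Here's a small sketch of that kind of aggregation (the subjects are real MMLU subject names, but the outcomes are invented for the example):

```python
from collections import defaultdict

# Sketch of how per-subject results could be aggregated for an MMLU-style
# benchmark: accuracy per subject, then a simple average across subjects.
# The outcomes below are invented for the illustration.

results = [
    ("astronomy", True), ("astronomy", False),
    ("formal_logic", True), ("formal_logic", True),
    ("virology", False),
]

per_subject = defaultdict(list)
for subject, correct in results:
    per_subject[subject].append(correct)

subject_accuracy = {s: sum(v) / len(v) for s, v in per_subject.items()}
macro_average = sum(subject_accuracy.values()) / len(subject_accuracy)

print(subject_accuracy)  # {'astronomy': 0.5, 'formal_logic': 1.0, 'virology': 0.0}
print(macro_average)     # 0.5
```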
The 57 areas of MMLU
I imagine you're curious about what subjects are covered in this test. Here are all 57 areas:
Abstract Algebra, Anatomy, Astronomy, Business Ethics, Clinical Knowledge, College Biology, College Chemistry, College Computer Science, College Mathematics, College Medicine, College Physics, Computer Security, Conceptual Physics, Econometrics, Electrical Engineering, Elementary Mathematics, Formal Logic, Global Facts, High School Biology, High School Chemistry, High School Computer Science, High School European History, High School Geography, High School Gov't and Politics, High School Macroeconomics, High School Mathematics, High School Microeconomics, High School Physics, High School Psychology, High School Statistics, High School US History, High School World History, Human Aging, Human Sexuality, International Law, Jurisprudence, Logical Fallacies, Machine Learning, Management, Marketing, Medical Genetics, Miscellaneous, Moral Disputes, Moral Scenarios, Nutrition, Philosophy, Prehistory, Professional Accounting, Professional Law, Professional Medicine, Professional Psychology, Public Relations, Security Studies, Sociology, US Foreign Policy, Virology, World Religions.
HumanEval
This test measures an LLM's ability to write code from a description given in human language. The original HumanEval consists of 164 Python programming problems whose solutions are checked by running unit tests. There's also an extended version, HumanEval-XL, which isn't limited to a single language or programming language: it covers 23 natural languages and 12 programming languages.
A few days ago I wrote about how LLMs respond to different languages; it's worth taking a look.
How does it work?
As I understand it, a prompt is first written in English and then translated into the other languages using AI. The prompt is then presented to the LLM in each language, the model generates code in the different programming languages, and the correctness of the generated code is measured.
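The core idea behind HumanEval-style scoring is that the generated code gets run against unit tests, and a problem only counts as solved if the tests pass. Here's a minimal sketch of that idea (the "generated" solution is hard-coded for the example, and real evaluation harnesses sandbox this step rather than calling exec directly):

```python
# Minimal sketch of unit-test-based scoring: a problem counts as solved
# only if the candidate code passes the tests. The "generated" solution
# below is hard-coded for illustration. Never exec untrusted code outside
# a sandbox.

generated_code = """
def add(a, b):
    return a + b
"""

def passes_tests(code: str) -> bool:
    namespace = {}
    try:
        exec(code, namespace)               # define the candidate function
        assert namespace["add"](2, 3) == 5  # the unit tests for this problem
        assert namespace["add"](-1, 1) == 0
        return True
    except Exception:
        return False

print(passes_tests(generated_code))  # True
```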
Languages and Programming Languages
Here's the list of natural languages: Arabic, Hebrew, Vietnamese, Indonesian, Malay, Tagalog, English, Dutch, German, Afrikaans, Portuguese, Spanish, French, Italian, Greek, Persian, Russian, Bulgarian, Chinese, Turkish, Estonian, Finnish and Hungarian.
And programming languages: Python, Java, Go, Kotlin, PHP, Ruby, Scala, JavaScript, C#, Perl, Swift and TypeScript.
MGSM
This test evaluates how good an LLM is at elementary school math problems. Yes, the kind that children do. MGSM stands for Multilingual Grade School Math.
The test consists of 250 problems that have been translated from English into 10 different languages, and each problem requires between 2 and 8 steps to solve. What makes it special is that, unlike the HumanEval prompts we just saw, all the translations were done by native human translators, without any AI assistance.
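Grade-school math benchmarks like this one are usually scored by comparing the final number in the model's answer with the reference answer. Here's a small sketch of that idea (the problem and the model's solution text are invented):

```python
import re

# Sketch of how grade-school math answers are often scored: extract the
# final number from the model's step-by-step solution and compare it with
# the reference answer. The problem and solution text here are invented.

model_output = (
    "Lisa buys 3 packs of 8 pencils, so she has 3 * 8 = 24 pencils. "
    "She gives away 6, leaving 24 - 6 = 18. The answer is 18."
)
reference_answer = 18

numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
predicted = float(numbers[-1]) if numbers else None

print(predicted == reference_answer)  # True
```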
Multilingual (translated by humans)
Here are the languages the questions were translated into: Bengali, Chinese, French, German, Japanese, Russian, Spanish, Swahili, Telugu, and Thai.
DROP
DROP is a reading comprehension test, but not for long, complex texts; it works specifically on paragraphs. DROP stands for Discrete Reasoning Over the content of Paragraphs.
It's a collection of more than 96,000 questions created via crowdsourcing over a set of paragraphs taken from Wikipedia. One notable feature is that some of the questions require mathematical operations (such as adding, counting, or sorting) on data present in the text.
When it first appeared in 2019, the best models reached an F1 score of only 32.7%, while humans reach about 96%. While LLMs have improved significantly since then, they still haven't reached human level.
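For reference, that F1 score is a token-level overlap measure between the predicted answer and the reference answer, which gives partial credit for partially correct answers. Here's a simplified sketch of the idea (the official DROP metric also handles numbers, dates, and sets of answers more carefully):

```python
from collections import Counter

# Simplified token-level F1 between a predicted and a reference answer,
# in the spirit of the DROP metric. The example answers are invented.

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the second quarter", "second quarter"))  # 0.8 -> partial credit
print(token_f1("1923", "1923"))                          # 1.0 -> exact match
```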
Just paragraphs... in English
Keep in mind that this is reading comprehension for paragraphs only, and there's still a long way to go before these models have a deep understanding of more complex texts. Moreover, this limitation is compounded by the fact that these paragraphs are only available in English.
BIG-Bench-Hard
This is an interesting test: it involves asking the language model to perform tasks that we know it's not really good at, or at least not better at than humans. Not yet, anyway.
This test comes from BIG-bench (Beyond the Imitation Game benchmark), a suite of 204 tasks believed to be at or beyond the capabilities of current language models. From those 204 tasks, a subset of 23 on which models performed particularly badly (they couldn't even beat the average human rater) was selected, and that subset became known as BIG-Bench-Hard: the most difficult challenges from BIG-bench.
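You can picture that selection criterion as a simple filter: keep the tasks where models score below the average human rater. The task names below are real BIG-bench tasks, but the scores are made up for the illustration:

```python
# Toy illustration of the idea behind the BIG-Bench-Hard selection: keep
# only the tasks where model performance falls short of the average human
# rater. The scores below are made up.

tasks = {
    "logical_deduction": {"model": 0.35, "avg_human": 0.60},
    "word_sorting":      {"model": 0.80, "avg_human": 0.65},
    "causal_judgement":  {"model": 0.48, "avg_human": 0.70},
}

hard_subset = [name for name, s in tasks.items() if s["model"] < s["avg_human"]]
print(hard_subset)  # ['logical_deduction', 'causal_judgement']
```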
MATH
MATH is a dataset of 12,500 high-school-level mathematics problems. The questions were taken from mathematics competitions, so we can't exactly say they're easy. Plus, they're designed to be solved in multiple steps.
When these questions were tested on language models in 2021, the accuracy of the answers was between 3% and 6.9%. Not very impressive if you're a high school student, but pretty impressive for a machine that has just learned to speak English. The models even reached about 15% accuracy on the easiest questions, and even when they got things wrong, they still managed to generate step-by-step solutions that were coherent (though incorrect).
The same evaluation was done with humans, and according to the tests, an average computer science PhD student who wasn't particularly fond of mathematics achieved 40%, while a math olympiad champion achieved 90%.
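For what it's worth, the reference solutions in MATH put the final answer inside \boxed{...}, so a common way to grade a model's step-by-step solution is to pull out that final answer and compare it with the reference. A small sketch, with an invented solution:

```python
import re

# Sketch of grading a MATH-style solution: extract the content of the
# final \boxed{...} and compare it with the reference answer. The solution
# text below is invented for the example.

model_solution = r"The roots of x^2 - 6x + 8 sum to -b/a = 6, so the answer is $\boxed{6}$."
reference_answer = "6"

match = re.search(r"\\boxed\{([^}]*)\}", model_solution)
predicted = match.group(1) if match else None

print(predicted == reference_answer)  # True
```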
GSM8K
Launched in 2021, this dataset contains 8,500 grade school math questions; its name comes from Grade School Math 8K. In fact, the 250 MGSM problems we saw earlier were taken from GSM8K.
These questions were written specifically to evaluate language models on this type of problem and were selected to have high linguistic diversity. However, following the common pattern we've seen, the dataset is available only in English.
That was quite a collection of tests!
Now that we've explored some benchmarks and intelligence tests for Artificial Intelligence, we can better understand how these models' capabilities are evaluated.
Something I found particularly striking was the deceptively simple nature of some tests, or rather, how simple they would be for a human (adult) to complete. The fact that these tests evaluate reading comprehension of a single paragraph and elementary school mathematics clearly gives us an idea of these models' capabilities. Although they are very articulate and can give the impression of being flawless, they might not be solving some math problems or reading comprehension tasks at the level of a 10-year-old.
Moreover, it's crucial to keep in mind that many of these tests are conducted in English. This means we can't assume these LLMs will be equally precise in other languages.
That's all for today! See you soon!
G
Hey! I'm Germán, and I write about AI in both English and Spanish. This article was first published in Spanish in my newsletter AprendiendoIA, and I've adapted it for my English-speaking friends at My AI Journey. My mission is simple: helping you understand and leverage AI, regardless of your technical background or preferred language. See you in the next one!