LLM reasoning, AI performance scaling, and whether inference hardware will become commodified, crushing NVIDIA's margins
Current skills of LLMs are a mirage of human projection; solving AI reasoning requires AI architecture innovation; training and inference will converge; static transformer inference will commoditize.
TL;DR: Scaling static transformers won't bring us AGI, and the current skills of LLMs are partly a mirage of human projection. AI reasoning will require new architectures, and the training-inference divide is set to converge; both imply continued demand for and dominance of NVIDIA/AMD general-purpose GPUs. Nonetheless, static transformer inference is extremely useful and will become a basic computing building block in the future, implying commoditization, a Jevons-paradox demand explosion, and a path to profitable growth, if not market capture, for at least one of the transformer-inference semiconductor startups or CSP custom silicon efforts.
Rise of the transformer and scaling laws for LLMs
The 2017 invention of the transformer replaced "hard" recurrence with "soft" positional encoding + multi-headed attention to achieve better training scalability on sequence learning tasks like language understanding via massively parallel matrix multiplication ("matmul"). It worked: by 2019, OpenAI had trained GPT-2, which they found to have developed so-called "emergent skills" of such a magnitude that they turned their back on open science for ostensibly "safety reasons", though likely just as much motivated by Sam Altman's commercial ambitions for OpenAI. Whatever the case, LLMs (large language models) were becoming really useful in text manipulation, language tasks and potentially even agentic skills. On the back of this research success, OpenAI struck a partnership with Microsoft in July 2019 to enable further scaling of their language models.

This was motivated by the so-called scaling hypothesis or "scaling law", which was later empirically established in the 2022 Chinchilla paper by DeepMind researchers who investigated the relationship between LLM performance and two measures of LLM compute input: the size of the model in parameters (aka "weights"), and the number of text tokens on which the model is trained. This firmly established the existence of, or at least the collective belief in, scaling laws for LLMs, i.e. models get better the more data, parameters and GPUs we give them. Later the community discovered that in practice, training an LLM for much longer, on many more tokens (i.e. "overtraining") than what Chinchilla and classical machine learning literature estimate as "compute optimal", still increases model performance. These scaling laws have held up so far, and although performance scales only logarithmically with compute, we have observed roughly linear improvement of LLM performance over time because compute capacity has grown exponentially, driven by exponential gains in both hardware performance and capital deployed in AI.
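To make the Chinchilla result concrete, here is a minimal sketch of the parametric loss form fitted in that paper, L(N, D) = E + A/N^alpha + B/D^beta, where N is parameter count and D is training tokens. The coefficients below are roughly the published point estimates and are used purely for illustration.

```python
# Minimal sketch of the Chinchilla-style parametric scaling law:
#   L(N, D) = E + A / N**alpha + B / D**beta
# N = parameters, D = training tokens. The coefficients below are roughly the
# point estimates fitted in the Chinchilla paper and are illustrative only.

E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss for a model of n_params trained on n_tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# A 70B-parameter model trained roughly "compute optimally" (~20 tokens per
# parameter) vs. the same model heavily overtrained on ~10x more data:
print(chinchilla_loss(70e9, 1.4e12))   # ~Chinchilla's training budget
print(chinchilla_loss(70e9, 15e12))    # Llama-3-style overtraining: loss still drops
```

The second call illustrates the "overtraining" point from the paragraph above: the predicted loss keeps falling, just with diminishing returns per additional token.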
Next, let’s understand what these LLMs seemed to become capable of and what functionally enabled those skills.
"Emergent skills" and the primacy of memorization in LLMs
When testing the performance of autoregressive transformers on a range of skills, e.g. human-like conversation, high school exams, chess puzzles, etc., for which they weren't specifically trained, researchers at OpenAI and other labs noticed something incredible: the language models were able to "perform tasks" which the researchers didn't specifically intend them to perform. These skills were interpreted to arise "naturally" and were called emergent skills, with the hypothesis being that the next-token prediction objective (i.e. loss function) of LLMs forces the transformer to abstract and encode a kind of "generalized world model" to keep improving on its learning task, which was equated with a pressure on the model to maximally compress the training data into its limited set of weights. The implications were, of course, incredible: "So you're saying we can just throw more tokens and more compute at these ever-larger models and they will learn every skill just like that with progressive scale?"
Not so fast! We cannot jump to that conclusion directly. What we do know is that LLMs achieve insane compression of huge datasets. Using 4-bit quantization you can even run a large language model on a modern MacBook and retrieve most of the internet-scale information contained in the training dataset relatively accurately. That is insane! What is less clear is whether the models achieve this by abstracting general skills from specific examples they saw in the training data and then applying those skills during inference to generate their answers, as people who are convinced of emergent skills and growing intelligence in LLMs tend to claim.
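As an aside on what 4-bit quantization actually does to a model, here is a toy sketch assuming simple symmetric per-tensor rounding; real runtimes use more sophisticated schemes (per-group scales, GPTQ/AWQ-style calibration), but the storage arithmetic is the same.

```python
import numpy as np

# Toy sketch of symmetric 4-bit weight quantization: map float weights to the
# 16 integer levels [-8, 7] with a single scale, then dequantize. Real schemes
# are more sophisticated, but the storage math is the same: ~4 bits per weight
# instead of 16 or 32.

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)  # one weight matrix

scale = np.abs(w).max() / 7.0                 # map the largest weight to integer 7
q = np.clip(np.round(w / scale), -8, 7)       # 16 levels -> 4 bits per weight
w_hat = q * scale                             # dequantized weights used at inference

print("max abs error:", np.abs(w - w_hat).max())
print("compression vs fp32: ~%.0fx" % (32 / 4))
```

The point is simply that a model's "memory" survives aggressive rounding remarkably well, which is why a laptop can hold a usable compressed snapshot of a web-scale training set.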
A direct piece of evidence against the generalized-skills interpretation is that answers to prompts that supposedly show zero-shot skills of LLMs can often be traced back to obscure websites that wound up in the training set and provide the exact data for that particular task. The truth is, nobody can possibly know everything that is inside the training dataset, which basically contains the entire internet and all books and news articles ever written. For example, the ability of LLMs to evaluate chess positions stems from websites that provide exactly that analysis. When prompting the LLMs to evaluate a position, often enough the answer is pretty much a 1:1 copy of that piece of the training set. Even more creative answers tend to be traceable back to multiple pieces of the dataset between which the model interpolates. This makes sense: transformers perform layer-by-layer transformations on high-dimensional embedding spaces that represent the training data in compressed form, and specific prompts let us query approximate interpolations between points in those embedding spaces.
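To make that picture tangible, here is a cartoon of "approximate retrieval by interpolation": a handful of memorized items are stored as vectors, a query is compared against them, and the output is a similarity-weighted blend of the closest memories. This is purely an illustration of the intuition, not a literal description of transformer internals, and every name in it is made up.

```python
import numpy as np

# Cartoon of "approximate retrieval by interpolation": a prompt is embedded,
# compared against memorized items, and the answer is a similarity-weighted
# blend of the closest memories. Illustration only, not transformer internals.

rng = np.random.default_rng(0)
dim = 64
memory = {                                  # hypothetical "memorized" training snippets
    "chess opening analysis": rng.normal(size=dim),
    "shakespearean sonnet":   rng.normal(size=dim),
    "gangster rap lyrics":    rng.normal(size=dim),
}

def answer(query_vec):
    keys, vecs = list(memory), np.stack(list(memory.values()))
    sims = vecs @ query_vec / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(query_vec))
    weights = np.exp(sims) / np.exp(sims).sum()   # softmax over similarities
    blend = weights @ vecs                        # interpolation between memories
    return dict(zip(keys, weights.round(2))), blend

# A query "between" two memories gets answered by blending both of them.
query = 0.6 * memory["shakespearean sonnet"] + 0.4 * memory["gangster rap lyrics"]
weights, _ = answer(query)
print(weights)
```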
An indirect example of why the generalized-skills hypothesis (or the hypothesis of a world model bearing any similarity to brain-produced world models) might not be quite right is that of counter-factual tasks. Another chess example: LLMs perform incredibly well when analyzing the legality of chess openings, yet they are reduced to mere 50/50 guessing if we change the situation slightly by saying "suppose the knights and bishops change places, now determine whether the openings are legal according to standard chess rules, think step by step". Any human who knows the chess rules could easily do this; state-of-the-art LLMs cannot. This shows that LLMs have indeed not learned to apply chess rules but are more or less "copy-pasting" answers from their training dataset, and they fail if we change the conditions ever so slightly such that the task falls outside their training distribution. This brittleness suggests LLMs rely more or less exclusively on memory to solve the skill tests they are given. This is supported by views in the community that the dataset essentially determines the performance of transformers on certain benchmarks.
In this context, I find the views of AI researchers Yann LeCun (Meta), Subbarao Kambhampati (Arizona State University), Francois Chollet (Google) and Tim Scarfe (Machine Learning Street Talk) convincing. The idea is that autoregressive transformers mainly do so-called "approximative retrieval": it's not exact retrieval like you would expect from SQL queries against static databases, but it is definitely not knowledge generated from scratch by reasoning over basic principles and basic knowledge either. Francois' perspective of LLM prompting as interpolating between learned mini vector-programs is insightful as well and aligns with my interpolation-between-points-in-embedding-space view from above. From that perspective, LLMs are capable of regurgitating memorized pieces of text, and they have some capacity for creativity by mixing different learned concepts, e.g. the topics of gangster rap with the poetic style of Shakespeare. Since autoregressive LLMs are effectively prompting themselves (they feed in their past output), we can achieve some limited form of reasoning with prompt strategies like "think step by step", where the model uses reasoning-like textual schemata to induce itself to produce thought-like, semi-rational text and then tries to infer a conclusion from it via summarization. This kinda, sorta, sometimes works, but it's definitely not close to what we'd like to have as reasoning machines.
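To spell out the self-prompting mechanics, here is a skeletal sketch of a "think step by step" loop in which every chunk of model output is appended to the context before the next call. The generate function is a hypothetical placeholder, not a real API.

```python
# Sketch of "think step by step" self-prompting: because decoding is
# autoregressive, everything the model emits becomes part of its own input for
# the next step, so a reasoning-like preamble can steer the final answer.
# `generate` is a placeholder for any LLM completion call, not a real API.

def generate(context: str) -> str:
    # Hypothetical stand-in: a real implementation would call an LLM here.
    return "<next chunk of model output>"

def chain_of_thought(question: str, max_steps: int = 4) -> str:
    context = f"{question}\nLet's think step by step.\n"
    for _ in range(max_steps):
        step = generate(context)       # model sees its own previous "thoughts"
        context += step + "\n"
    context += "Therefore, the final answer is:"
    return generate(context)           # conclusion inferred by summarizing the trace

print(chain_of_thought("Is 1. e4 e5 2. Ke2 a legal chess opening?"))
```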
Next, let's try to understand why expectations and interpretations of LLM performance have diverged so much in the machine learning community.
Anthropomorphization, or why we need to be careful when interpreting LLM performance
The fundamental issue we have when interpreting the true capabilities of LLMs from the results of different benchmarks and skill tests is that autoregressive transformers are structurally very different from human cognition in the brain, while at the same time humans have a very strong tendency to project human-like form and behavior onto dynamics we observe (possibly an unintended consequence of our ability to have a theory of mind). This tendency is called anthropomorphization (literally: human + form).
For that reason, when we see examples of successful task execution, even if you are the best machine learning engineer in the world and know the system inside out, we just cannot help but presume that the process behind that task execution must be similar to how we execute that task. It is just so natural for us to interpret behavior by empathizing based on our own subjective experience. Thus, even if we know the exact architecture of a transformer and how we call it autoregressively, that anthropomorphic bias slips all too easily into our interpretation of what we see. Put concretely, when we input some task x into an LLM f and observe y = f(x), it is very hard to keep our interpretation of what f does untainted by subconscious anthropomorphic bias. We as humans then abstract and generalize based on that interpretation and start projecting it onto other use cases, skills and our forecasts of the future abilities of LLMs and AI more broadly. That's why I am such a big fan of counter-factual tasks for checking our interpretations of LLM skills: they give us an epistemologically clear-headed and straightforward way to falsify our beliefs/hypotheses about LLM performance by constructing simple counter-examples which an LLM could easily solve if it had truly learned general skills the way it seems to at first glance. The results so far show that LLMs do not, or at least that their skills are brittle in ways we have a hard time predicting. Why is that the case?
There is broad consensus that "intelligence", roughly speaking, is a combination of memorization and/or intuition (system 1 type thinking) and reasoning (system 2 type thinking). The reality is that in everyday life, when conversing with people or reading an article or a tweet, we never know whether the text was freshly generated "from scratch" in this very moment via reasoning or "actual skill application", or whether it had been pre-conceived at an earlier time, possibly by someone else, and is now simply being repeated from memory. Since we only converse with human beings, though, that is not a huge issue, because we have self-experience and thus intuition about the origin of the knowledge other humans share with us. Large language models, however, are an entirely different beast. We cannot intuit what it would be like to have read 1000x more than any human being could ever read in their entire lifetime and to be able to recall practically any of it given the right stimulation (aka prompt). If a human talked like ChatGPT, we would probably have to presume that they are decently smart (though they would make a bunch of weird mistakes for a human, which is a tell). However, our intuition deceives us here, since we cannot imagine that anyone could build all those answers and conversations just from Lego pieces of the trillions of text tokens they read, and thus we are easily persuaded that LLMs are smart in the way we are. Subbarao Kambhampati made this point quite well and I encourage you to read it here.
This memorization-based interpretation of LLM performance would also explain why apparent "overtraining" doesn't lead to the diminishing and then reversing returns classical machine learning would predict, namely that training too long on the training set leads to overfitting, a deterioration of generalizability, and thus worse performance on test sets. Due to benchmark leakage (test data leaking into training data), and because the current abilities of LLMs stem from effective compression and thus memorization of the training data, overfitting to the data actually yields progressively better "performance" beyond what we would expect. This need not be bad: if we had a dataset about literally everything and anything, which is impossible since it would be infinite, we would want exactly such an overfit to that dataset. Since that is impossible, the key ingredient we are looking for in building artificial intelligence is the ability to "bridge the gap" from finite examples to infinite applications. That is true general intelligence, and we humans possess it. Our limitations are not due to a lack of generality but to limited lifetime, limited cognitive capacity and limited cooperative ability amongst humans. Over time, however, there is in principle nothing we could not solve, a thought beautifully set out in David Deutsch's book The Beginning of Infinity. ChatGPT does not share that characteristic: even with infinite time, it will not be able to solve more than a finite number of problems, and it is not capable of checking itself, of critiquing itself, to see whether it is right. So what about reasoning in AI?
Reasoning and the need for architecture innovation in AI
Due to the above criticism, Francois Chollet and Mike Knoop (co-founder of Zapier) recently re-invigorated a challenge called the "ARC Prize" (ARC for "Abstraction and Reasoning Corpus"), on which modern AI has not made much progress over the past five years, while most other benchmarks are converging towards saturation (90+% performance). Why is that? The ARC-AGI benchmark was specifically designed not to be solvable via memorization, while being relatively easy for most humans, who achieve a median accuracy of 85%. Chollet and Knoop make the explicit point that current LLMs are not good enough because they lack general reasoning capability, and they have kickstarted the machine learning community into finding new solutions to achieve true general intelligence in machines (AGI). If it wasn't clear before, it is clear by now that we need further architecture innovation in AI, despite claims by the few remaining scaling maximalists, to achieve AGI. Francois Chollet suggests the eventual solution will likely be LLM-like systems serving as system 1-type intuition machines that guide a program search process coupled with a strict solution checker, iterating through progressive solution candidates until the reasoning task is solved.
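As a rough sketch of the kind of system Chollet describes, the loop below has an LLM propose candidate programs (system 1 intuition) while a strict, deterministic checker verifies each candidate against the task's examples (system 2). The propose_programs function is a hypothetical placeholder for an LLM call, and real ARC solvers would use a far richer program space.

```python
# Skeleton of LLM-guided program search with a strict verifier, in the spirit
# of Chollet's proposal: an LLM proposes candidate programs, and a checker
# accepts only candidates that reproduce every training example of the task.
# `propose_programs` is a placeholder for an LLM call, not a real solver.

from typing import Callable, Iterable

Example = tuple[object, object]          # (input grid, output grid)
Program = Callable[[object], object]

def propose_programs(task_examples: list[Example]) -> Iterable[Program]:
    # Hypothetical stand-in: an LLM would generate candidate programs here.
    yield lambda x: x                    # e.g. "identity" as a trivial candidate

def verify(program: Program, examples: list[Example]) -> bool:
    # Strict solution checker: the candidate must reproduce every example exactly.
    return all(program(inp) == out for inp, out in examples)

def solve(task_examples: list[Example], budget: int = 1000) -> Program | None:
    for i, candidate in enumerate(propose_programs(task_examples)):
        if i >= budget:
            break
        if verify(candidate, task_examples):
            return candidate             # first candidate that passes the checker
    return None
```

The division of labor is the point: the LLM narrows an intractable search space, while the checker, not the LLM, decides what counts as a solution.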
Let's now start to draw practical implications from this for the AI hardware industry and possible trends for the future and their consequences for AI inference semiconductor startups.
AI innovation is not stopping, general-purpose GPUs will remain in demand
We have not reached "the end of history" with regards to AI innovation yet. Static, autoregressive transformers are not the be-all and end-all that people expected 1.5 years ago, when startups were founded on that hypothesis. Thus AI accelerators or GPUs from NVIDIA and AMD, which can run any new AI reasoning algorithm researchers invent, will likely retain their dominance and their pricing power: despite the usefulness of current LLMs, the usefulness of reasoning-capable AI systems will be far greater, and so economic share will largely remain with the current AI hardware juggernauts, mainly NVIDIA.
Hypothesis: training and inference are likely to converge in intelligent systems, look to library of LoRAs for inspiration
While humans certainly profit from pre-acquired knowledge, which shapes our worldview and helps us decipher uncertainties and solve complex tasks, for any problem we haven't memorized the answer to we are capable of learning and investigating in the moment via active inference, and of coming up with solutions to things we did not understand before. From these first principles it seems to me that the hard boundary people talk about between "training" and "inference" in AI, especially non-technical Wall Street types, is nonsensical and, in its current state, transitory. Any intelligent system crucially needs the capability to learn in the moment, as Francois Chollet points out as well (here and here at 0:02:59), and the "zero-shot in-prompt learning" published as a skill of state-of-the-art LLMs is not enough, because it fails out of distribution, i.e. it does not adapt to novel situations or do real learning.
For that fundamental reason I expect future intelligent systems to constantly "train" in alternation with "inference", and that separation will converge over time until the distinction becomes meaningless. More specifically, there will still be pre-training, but inference will look more like simultaneous, reciprocal training+inference. We can already observe this trend in comments by Sam Altman, Dario Amodei (Anthropic), and Mark Zuckerberg, who say that LLM releases will not necessarily be discrete checkpoints anymore going forward but will become continuous. Furthermore, synthetic data generation is seen as an opportunity to create a runaway intelligence-growth dynamic where a foundation model generates data that it can ingest for further training, which again implies reciprocity and convergence of training and inference. Similar approaches were just used for Meta's newest Llama 3.1 LLMs. We should also take note of Apple's approach, showcased during WWDC24's presentation of Apple Intelligence, of using small, on-device foundation models which dynamically apply task-specific Low-Rank Adapters ("LoRAs") that help the LLM perform particularly well on the task currently at hand, e.g. summarizing notifications, proposing a fully written email from bullet points, etc.
LoRA is a common, compute-efficient finetuning technique for pre-trained neural networks like LLMs or diffusion models (used in image generation). I'm personally betting that a library-of-LoRAs approach will deliver the medium-term solution to in-situ learning for LLMs as a milestone on the path to AGI. In this approach a large pre-trained foundation model is augmented by a library of LoRA finetunes, with LoRAs being dynamically applied at run-time, separately or in combination, possibly using a second model that routes between foundation models and gates between sets of LoRAs. Task-specific feedback could then be used to re-train some of the LoRAs in near real-time to incorporate new learnings and improve future performance, or new LoRAs could be added to the library. This could be one avenue to get light-weight in-situ learning and adaptation to novel situations, and thus a convergence of training and inference. I think we will see first glimpses of that future in 2025, when Apple Intelligence LoRAs will finetune for personalization on users' iPhones overnight (due to the large power draw of the onboard GPU).
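For concreteness, here is a minimal numpy sketch of what a LoRA adapter is (a frozen weight W augmented by a low-rank update B@A) and of the library-of-LoRAs idea of keeping one adapter pair per task. The task names, shapes and routing-by-dictionary-key are illustrative assumptions, not anyone's actual implementation.

```python
import numpy as np

# Minimal sketch of LoRA and a "library of LoRAs": the frozen base weight W is
# augmented at run time by a low-rank update scale * (B @ A), where A and B are
# tiny compared to W. Each task keeps its own (A, B) pair, and a router picks
# which adapter to apply. The dictionary routing below is purely illustrative.

rng = np.random.default_rng(0)
d, r = 1024, 8                                    # model dim, LoRA rank (r << d)
W = rng.normal(scale=0.02, size=(d, d))           # frozen pre-trained weight

lora_library = {                                  # hypothetical per-task adapters
    "summarize": (rng.normal(scale=0.01, size=(r, d)), np.zeros((d, r))),
    "email":     (rng.normal(scale=0.01, size=(r, d)), np.zeros((d, r))),
}

def forward(x, task, scale=1.0):
    A, B = lora_library[task]
    W_eff = W + scale * (B @ A)                   # adapted weight; W itself stays frozen
    return x @ W_eff.T

x = rng.normal(size=(1, d))
print(forward(x, "summarize").shape)              # (1, 1024)

# Only A and B would be re-trained from task feedback: ~2*d*r numbers per
# adapter versus d*d for the full matrix.
print(2 * d * r, "adapter params vs", d * d, "full-matrix params")
```

The attraction for in-situ learning is that re-training an adapter touches only a tiny fraction of the parameters, which is what makes near real-time, on-device updates plausible at all.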
The implication of this trend, if it actually comes to pass, is that training-capable AI hardware (read: NVIDIA GPUs) will remain dominant for general intelligence systems. However, this need not imply that vanilla, static transformer inference falls off the map. Let's look at that case next.
Groq and transformer-ASICs are useful, static transformer inference will commoditize, this could imply Jevons paradox and be profitable
Though current LLMs are not all we had hoped they would be, i.e. AGI machines, they remain extremely useful text manipulation and knowledge-intuition tools if used correctly within a framework of different tools and augmented by external data sources (RAG, web search, etc.). Though I cannot foresee the future, I'd imagine that even if AI innovation stopped in mid-2024, the diffusion of this technology across the entire economy and society, which is still left to accomplish, would likely have a huge economic impact, notwithstanding the missing reasoning piece. So there is clear value in, and clear demand for, static (i.e. no out-of-distribution, in-situ learning to adapt to novel situations) inference of transformer language models using autoregressive decoding.
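Since RAG is mentioned above as part of that tool framework, here is a minimal sketch of the pattern: embed documents once, embed the query at request time, retrieve the most similar documents, and stuff them into the prompt as grounding context. The embed and generate functions are placeholders for a real embedding model and LLM, and the documents are made up.

```python
import zlib
import numpy as np

# Minimal sketch of retrieval-augmented generation (RAG): documents are
# embedded once, the query is embedded at request time, and the top-k most
# similar documents are stuffed into the prompt as grounding context.
# `embed` and `generate` are placeholders for a real embedding model / LLM.

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(zlib.crc32(text.encode()))   # fake, deterministic embedding
    return rng.normal(size=64)

def generate(prompt: str) -> str:
    return "<LLM answer grounded in the retrieved context>"  # placeholder

documents = ["NVIDIA Q1 earnings summary ...", "Groq LPU architecture notes ...",
             "Llama 3.1 405B release post ..."]               # made-up corpus
doc_vecs = np.stack([embed(d) for d in documents])

def rag_answer(question: str, k: int = 2) -> str:
    q = embed(question)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]                               # k most similar documents
    context = "\n\n".join(documents[i] for i in top)
    return generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}")

print(rag_answer("How fast is Groq at Llama inference?"))
```

Note that nothing in this loop requires training hardware: it is exactly the kind of static, memory-plus-retrieval workload that commodity inference silicon can serve.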
I do not subscribe to the lottery hypothesis of hardware-algorithm co-evolution when applied narrowly to the transformer architecture, as the transformer chip design startup Etched (etched.ai) does, for example. But that does not mean Etched, Groq and other startups of the sort, or custom inference silicon from Cloud Service Providers (CSPs), are not going to be needed. Quite the opposite: I think they will proliferate as vanilla inference of static LLMs becomes commoditized, from the semiconductors to the hardware programming interface (using OpenAI Triton and/or AMD ROCm) to the orchestration layer and on to the API and application layer. NVIDIA's CUDA moat was never going to, and was never intended to, hold forever for this narrow use case; more on that in a future blog post.
We saw hints of commoditization everywhere this year. LLMs in 2024 have so far mostly been a story of increasing efficiency rather than increasing performance, and of open (free) models catching up to closed ones. Quantization, around since the beginning of 2023, marched forward, allowing decent-sized LLMs to run on any edge device imaginable; so did algorithm-infrastructure co-optimization, exemplified by OpenAI's release of the vastly more efficient GPT-4o and then GPT-4o mini. Groq also made a huge splash with significantly faster tokens/sec generation speeds on Llama and other open-source LLMs, opening up new use cases like natural, realistic audio-to-audio conversations between human and machine. Though there are doubts about the generality of Groq's technological approach, the general thrust is clear: static LLM inference will become cheaper, faster, and ubiquitously available. Moreover, high-performance open-source LLMs are here to stay for the foreseeable future, with Zuckerberg's open-source strategy and the release of Llama 3.1 405B, which rivals the currently available closed models from OpenAI and Anthropic.
Generation speed especially will become truly impactful once LLMs start talking to each other rather than just to humans. The medium-term result of autonomous LLM interaction is unknowable at the moment, but a complexity explosion in this domain could very well bring new capabilities to systems of static LLMs. Existing research indicates that while a single transformer is not Turing complete, two interacting transformers might be; and a "society of minds"-style application of LLMs interacting in a multi-round debate has recently been shown to dramatically improve LLM performance on some math and reasoning tasks. Thus, in the end, the above point about needing architecture innovation to solve machine reasoning might not hold, since interacting agents built from static LLMs might do the job just fine. I'm confident of the positive potential of autonomously interacting LLM agents, but to solve system 2-type reasoning, which in humans enables general intelligence, I stand by my expectation above of the need for more innovation, and thus my implication of continued demand for general-purpose GPUs.
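To illustrate the multi-round debate pattern referenced above, here is a skeletal sketch in which several agents answer, read each other's answers, and revise over a few rounds before aggregation. The generate function is a hypothetical placeholder for an LLM call, and the aggregation step is left as a comment.

```python
# Skeleton of multi-round LLM debate ("society of minds"): several agents
# answer independently, then each revises its answer after reading the others,
# and the process repeats for a few rounds before aggregation. `generate` is a
# placeholder for any LLM completion call.

def generate(prompt: str, agent_id: int) -> str:
    return f"<agent {agent_id}'s answer>"        # hypothetical stand-in

def debate(question: str, n_agents: int = 3, n_rounds: int = 2) -> list[str]:
    answers = [generate(question, i) for i in range(n_agents)]
    for _ in range(n_rounds):
        new_answers = []
        for i in range(n_agents):
            others = "\n".join(a for j, a in enumerate(answers) if j != i)
            prompt = (f"{question}\n\nOther agents answered:\n{others}\n\n"
                      f"Reconsider and give your updated answer.")
            new_answers.append(generate(prompt, i))
        answers = new_answers
    return answers                               # aggregate e.g. by majority vote

print(debate("What is 17 * 24?"))
```

The token-count arithmetic is the relevant hardware point: every extra agent and every extra round multiplies inference volume, which is exactly why faster, cheaper static inference matters.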
Putting reasoning aside, with the commoditization trend of LLM inference clear, I expect Jevons paradox to hold here too: we'll see exploding demand and usage, implying steady demand for inference hardware providers. However, how this field of GPU designers, semiconductor startups and CSP custom silicon will consolidate is unclear to me. I will go out on a limb, though, and say that due to the general usefulness of the text manipulation and other narrow skills LLMs possess today, not only will LLM API calls become as commoditized across the full stack as database SQL queries are today, but static transformer inference will evolve into a basic computing building block, almost on the level of what matmuls are today.
Exciting times!