<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Marvin Baumann's VC Blog]]></title><description><![CDATA[Bachelor Physics & AI, Master Economics. Passion for Geopolitics, Existentialism, Psychology, & VC. I care about growing European strength, wealth and beauty.]]></description><link>https://blog.baumann.vc</link><image><url>https://substackcdn.com/image/fetch/$s_!ZQMD!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8282341-e659-4f04-9720-96e2c2039459_563x562.png</url><title>Marvin Baumann&apos;s VC Blog</title><link>https://blog.baumann.vc</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 11:43:42 GMT</lastBuildDate><atom:link href="https://blog.baumann.vc/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Marvin Baumann]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[marvinbaumann@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[marvinbaumann@substack.com]]></itunes:email><itunes:name><![CDATA[Marvin Baumann]]></itunes:name></itunes:owner><itunes:author><![CDATA[Marvin Baumann]]></itunes:author><googleplay:owner><![CDATA[marvinbaumann@substack.com]]></googleplay:owner><googleplay:email><![CDATA[marvinbaumann@substack.com]]></googleplay:email><googleplay:author><![CDATA[Marvin Baumann]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Feedback Loop in Venture Capital that keeps a16z, Sequoia, and YC on top]]></title><description><![CDATA[How a VC's position in the syndicate network influences its performance, which in turn improves its network 
position.]]></description><link>https://blog.baumann.vc/p/the-feedback-loop-in-venture-capital</link><guid isPermaLink="false">https://blog.baumann.vc/p/the-feedback-loop-in-venture-capital</guid><dc:creator><![CDATA[Marvin Baumann]]></dc:creator><pubDate>Wed, 13 Nov 2024 17:30:30 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9af33bbb-d1d4-4654-b7ef-d289f814c0f8_4000x3000.png" length="0" type="image/png"/><content:encoded><![CDATA[<p><em>This post is based on my master's thesis; you can find its full treatment of venture capital&nbsp;<a href="https://drive.google.com/file/d/1hw8iYjTCckNkWfDFtWm690M7octcRwtM/view?usp=sharing">here</a>.</em></p><p>Success in venture capital (VC) is about building great products and businesses, and about understanding technology and markets deeply, but most importantly it's about knowing many of the right founders, investors and talents, and having access to the juiciest funding rounds. VC is a highly competitive industry that would simultaneously be impossible without extensive cooperation between its actors. Most investments into startups are made by syndicates of VC firms (&gt;70% in the US according to&nbsp;<a href="https://doi.org/10.1016/j.jcorpfin.2020.101851">Nguyen, Vu, 2021</a>). 
Through this extensive use of syndication, lasting relationships develop over time, forming a network of VC investors and firms that has a specific structure, deeply influences investment decisions, and is reciprocally tied to the entire process of venture capital.</p><h2>The importance of being well-connected to the right people in venture capital</h2><p>As I write in my&nbsp;<a href="https://drive.google.com/file/d/1hw8iYjTCckNkWfDFtWm690M7octcRwtM/view?usp=sharing">thesis</a>:</p><blockquote><p>"In essence, venture capital is about the clever allocation of multiple types of resources: money, experience, guidance, know-how, talent, customers, suppliers, sources, information, contacts, access and opportunities. From a functional perspective, these resources are discovered, sourced, made available, channeled and applied via a complex network of professional relationships."</p></blockquote><p>Thus, the structure of the network of syndicate relationships between VCs, and the position a general partner (GP) has in it, touches and influences every part of the process of investing in startups and helping those companies grow. Conversely, the success a GP has in that process improves their position in the network, setting them up for even better performance down the line. There is a direct relation, to the point of the two being treated almost equivalently in the literature, between the reputation of a VC firm and the centrality and strength of its network position. Influential work like "Whom You Know Matters: Venture Capital Networks and Investment Performance"&nbsp;<a href="https://doi.org/10.1111/j.1540-6261.2007.01207.x">(Hochberg et al., 2007)</a>&nbsp;has shown that VC firms with the best reputations and network positions consistently and significantly outperform competing VC firms with worse positions in the network of VC syndicate relationships. 
The literature, however, has so far failed to provide a comprehensive, integrative explanation of how reputation ends up functionally influencing VC fund performance, as it either only finds correlations between high-level aspects (e.g. reputation and performance) or focuses on in-the-weeds details.</p><p>In this post, I want to give an overview of exactly such a comprehensive, integrative explanation of the VC process and its interrelation with the syndication network, which I distilled from the current body of research on venture capital in my&nbsp;<a href="https://drive.google.com/file/d/1hw8iYjTCckNkWfDFtWm690M7octcRwtM/view?usp=sharing">thesis</a>. Given this explanatory model, it is then quite easy to understand the importance of network position in the VC process as well as the "central feedback loop of VC", as I call it, which leads top funds like a16z, Sequoia, YC, and others to consistently outperform most others, creating a high barrier to entry for new firms and investors in the market.</p><p>To understand where my model is coming from, I'd like to briefly introduce the perspective of systems dynamics to help you frame how to think about the rest of this post.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.baumann.vc/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">If you&#8217;d like to be notified once I upload my next article, enter your email below!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Primer on systems 
dynamics</h2><p>Systems dynamics, to me, is a way of thinking about and explaining the world. We want to understand the high-level outcomes and behaviors of a given topic, and to do so we recognize that they are embedded in a larger system inside of which many different aspects are connected to each other in functional, causative relationships where feedback and control loops are frequently found. For this reason, we a priori reject intellectually lazy generalizations and aggregations, which are commonly made unthinkingly, since they fail to capture the true behavior of a system: they gloss over real, even if tiny, causative relationships that, through iteration and emergence, can come to dominate the macro level in which we are ultimately interested. A natural way to visualize a process in a system such as venture capital is then to gather all the elements, aspects and micro behaviors we can think of and start relating them to each other in terms of how they affect one another. In this way, I came up with the following explanatory model of the VC process and the importance of GP network position in it, where every causative relationship (every arrow) is based on findings in the literature (for the full bibliography see&nbsp;<a href="https://drive.google.com/file/d/1hw8iYjTCckNkWfDFtWm690M7octcRwtM/view?usp=sharing">here</a>).</p><h2>The explanatory model</h2><p>This &#8220;causal loop diagram&#8221; (in systems dynamics lingo), or simply flowchart, highlights the main feedback loop: reputation leads to an information advantage, which leads to picking and building better startups, which leads to better returns and ultimately to better reputation again.</p><p>Reputable VCs gain their information advantage via a wide, strong, and diverse network that is incentivized to share information and deals with them due to their high reputation, which for network partners is associated with great VC performance. 
The information advantage improves four different factors that lead to the VC having higher quality startups in their portfolio: investment opportunity awareness, investment decision quality, investment management quality and investment access.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PYrU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9606bc4-9016-48b2-a299-5f30225aa957_2560x2996.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PYrU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9606bc4-9016-48b2-a299-5f30225aa957_2560x2996.png 424w, https://substackcdn.com/image/fetch/$s_!PYrU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9606bc4-9016-48b2-a299-5f30225aa957_2560x2996.png 848w, https://substackcdn.com/image/fetch/$s_!PYrU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9606bc4-9016-48b2-a299-5f30225aa957_2560x2996.png 1272w, https://substackcdn.com/image/fetch/$s_!PYrU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9606bc4-9016-48b2-a299-5f30225aa957_2560x2996.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PYrU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9606bc4-9016-48b2-a299-5f30225aa957_2560x2996.png" width="1456" height="1704" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9606bc4-9016-48b2-a299-5f30225aa957_2560x2996.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1704,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:987380,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PYrU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9606bc4-9016-48b2-a299-5f30225aa957_2560x2996.png 424w, https://substackcdn.com/image/fetch/$s_!PYrU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9606bc4-9016-48b2-a299-5f30225aa957_2560x2996.png 848w, https://substackcdn.com/image/fetch/$s_!PYrU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9606bc4-9016-48b2-a299-5f30225aa957_2560x2996.png 1272w, https://substackcdn.com/image/fetch/$s_!PYrU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9606bc4-9016-48b2-a299-5f30225aa957_2560x2996.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Explanatory model of the VC process. The main elements of reputation, information advantage, startup quality and VC fund returns constitute &#8220;the central feedback loop of VC&#8221;.</figcaption></figure></div><p>Take a moment to go through the major elements of the flowchart and see if the causative relations resonate with you. Comment below or on&nbsp;<a href="https://x.com/MarvinTBaumann">X</a>&nbsp;if you disagree or think I missed something important. 
For the details behind each connection, see the full thesis&nbsp;<a href="https://drive.google.com/file/d/1hw8iYjTCckNkWfDFtWm690M7octcRwtM/view?usp=sharing">here</a>.</p><p>Let me finish with some observations about VC which the above implies.</p><h2>Power law of VC fund returns</h2><p>I believe VC exhibits power law-like behavior on several levels: talent-to-problem matching; iterative preferential allocation, in capitalist systems, of capital, resources and talent to profit opportunities, which for startups either compound or fizzle out; and, as I described above, self-reinforcing feedback loops between VC firms due to the interrelation between syndicate network structure and venture performance. The only thing killing power law fund outcomes in VC, according to&nbsp;<a href="https://doi.org/10.1016/j.irfa.2022.102471">(Lahr, 2023)</a>, is buying in at too late a stage and, especially, selling too early. I generally agree with the implications of Lahr's analysis: if you truly have a power law outcome, why sell the entire stake at IPO instead of riding (some of) it for 20 more years? I explain more in the fund returns chapter of my&nbsp;<a href="https://drive.google.com/file/d/1hw8iYjTCckNkWfDFtWm690M7octcRwtM/view?usp=sharing">thesis</a>.</p><p>As I showed above, highly reputable VCs do not consistently outperform just because "they are famous" or "they are experienced" or "they have more capital than others". They outperform because their advantageous network position gives them resource access, which makes them an attractive syndication partner, which leads to the best information and the best opportunities flowing to them. 
VCs without a technical advantage, a reputational/network advantage, or a personal connection advantage simply cannot compete for access to the best founders, since those founders generally get the most value from highly reputable VCs, which, via their superior resource, network, experience and talent access, can supercharge a great founder's startup more than other VCs could, as long as they have the bandwidth, of course.</p><p>The implication for smaller and emerging VCs thus must be: get busy networking with the big boys, or have an insanely correct, wildly contrarian understanding of your specific niche that allows for crazy outperformance which will catapult you to reputable status.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.baumann.vc/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">If you found this interesting, subscribe for more:</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[NVIDIA is speedrunning accelerated computing adoption across industries to entrench itself in all economic verticals]]></title><description><![CDATA[CUDA moat will be overcome but it won't matter; Jensen uses AI-hype cycle to rapidly diffuse accelerated computing and thus build a deeper moat than CUDA ever was; margins will remain high for decades]]></description><link>https://blog.baumann.vc/p/nvidia-is-speedrunning-accelerated</link><guid 
isPermaLink="false">https://blog.baumann.vc/p/nvidia-is-speedrunning-accelerated</guid><dc:creator><![CDATA[Marvin Baumann]]></dc:creator><pubDate>Tue, 27 Aug 2024 15:52:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/57b4f4ee-d1f6-4e17-80c1-edbcbcbbee4f_1024x720.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>NVIDIA has had an incredible, rollercoastery ride in 2024 so far. Much has been said about the reasons for NVIDIA's dominance, which initially baffled many. A year ago, Wall Street saw NVIDIA's run-up as fake, hyped, and a sign of tulip-like exuberance,&nbsp;<a href="https://www.ft.com/content/24a12be1-a973-4efe-ab4f-b981aee0cd0b">some still do</a>. But the rally only started there, and every quarter P/E ratios faithfully returned to average values after each earnings beat. Thus, investors looked more closely and came to embrace NVIDIA's "CUDA moat" as the answer to its margins and unique market position. This story sustained NVIDIA's rally through its June 2024 stock split. Recently, AI chip stocks have taken a beating due to interest-rate-story-driven market rotations, AI ROI fears and questions about hyperscaler CapEx sustainability. For NVIDIA specifically, not only are its sales volumes in doubt but, even more so, the sustainability of its margins: if CUDA were the only thing justifying its green dominance, then it would only be a matter of time until AMD with its ROCm and frameworks like OpenAI's Triton open up machine learning inference to agnostic chipsets. To me, these fears are overblown. 
I won't be going into AI ROI and CapEx spend in this post, but instead focus on the more general 5-year picture of NVIDIA's strategic positioning, with implications for the endurance or erosion of their moat and margins in the market.</p><p>To understand where NVIDIA is going, we first need to understand the character and psychology driving NVIDIA and its founder-CEO Jensen Huang:</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.baumann.vc/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Marvin Baumann's VC Blog! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3>Jensen sees the future a decade out due to first-principles, in-the-limit extrapolation</h3><p>Jensen tells the story that in 2012, when&nbsp;<a href="https://doi.org/10.1145/3065386">AlexNet</a>&nbsp;broke the ImageNet glass ceiling using&nbsp;<a href="https://www.newyorker.com/magazine/2023/12/04/how-jensen-huangs-nvidia-is-powering-the-ai-revolution">two NVIDIA GPUs</a>,&nbsp;he started to see the future of AI: models would be broadly, generally useful and incredibly capable if scaled up; and his role in it: to provide the semiconductor soil on which machine learning engineers could grow their models exponentially. 
This led to human-level models in image recognition, image segmentation, recommender systems and, recently, natural language understanding and generation (read&nbsp;<a href="https://marvinbaumann.substack.com/p/llm-reasoning-ai-performance-scaling">my last post on LLMs</a>). Jensen's foresight started even before that, when in the early days of CUDA around 2007&nbsp;<a href="https://www.newyorker.com/magazine/2023/12/04/how-jensen-huangs-nvidia-is-powering-the-ai-revolution">he collaborated with scientists to help them parallelize their HPC workloads on NVIDIA GPUs</a>,&nbsp;which were purpose-built for the parallelization of graphics processing workloads. From this behavior of re-orienting the company towards AI, ten years before the release of ChatGPT, we can learn an important characteristic of Jensen and his firm:&nbsp;<strong>Huang is a true first-principles, in-the-limit thinker.</strong>&nbsp;Unfortunately, these words have become buzzwords in the VC world, but having a physics background myself, I can say that it's true for Jensen.</p><p>When Jensen saw AlexNet,&nbsp;<a href="https://www.acquired.fm/episodes/jensen-huang">he broke it down to the essentials and reasoned up from there</a>. This relatively simple model vastly outperformed all other approaches; it was trivially scalable, and that scaling was assured, as it depended on the scaling of data (which is plentiful) and of compute (which grows by orders of magnitude every couple of years). Besides that, while AlexNet was about object recognition using CNNs, the underlying principle of autonomously learning neural network parameters using gradient descent and simple loss functions meant that what AlexNet achieved was basically domain agnostic, making machine learning of the neural network kind a strong contender as a path to generally intelligent machines. 
Extrapolating out from these essential insights, the new multi-decade mission of NVIDIA became clear, and&nbsp;<a href="https://www.sequoiacap.com/podcast/crucible-moments-nvidia/#ai-an-unproven-market-in-2012">Jensen was not afraid to risk the company on it</a>, a company that is itself culturally geared to&nbsp;<a href="https://youtu.be/lXLBTBBil2U?t=1374">invest way ahead of any hope of financial returns</a>, based on first-principles analysis alone. Thus, we know Jensen has the capability to see far into the future and the courage to create exactly that future. Now let's try to look into his psyche.</p><h3>Jensen breathes Andy Grove's paranoid spirit</h3><p>There is something in Jensen's psychology that drives him to look that far into the future and move to secure his, and NVIDIA's, position in it. Jensen himself comes from a humble background, the typical American Dream story. Starting NVIDIA was incredibly hard and bumpy in the beginning, with&nbsp;<a href="https://www.acquired.fm/episodes/jensen-huang">1997</a>&nbsp;turning into a make-or-break year for them. It is impossible to locate exactly when, between being brought up in Taiwan, working at fast-food restaurants as a youth in the US, his studies, his early career and the early years of NVIDIA, but at some point Jensen Huang developed the Andy Grove'ian "only the paranoid survive" trait: the psychology of always looking over your shoulder, never becoming comfortable with your situation and working incessantly to stay ahead and stay in the clear. This guy not only loves his work, loves the technology, loves the future he is bringing about; he is also terrified about it all not working out, about NVIDIA making a mistake, about having understood a long-term trend incorrectly and about playing the wrong bets, or not enough of them.</p><p>So, yes, Jensen is a true first-principles, in-the-limit thinker, but his psychology also does not allow him to become complacent or rest on his laurels. 
Huang looks into the future as much to see and create everything he is fascinated by as to see where he might be wrong and how NVIDIA could fail. This creates a powerful mix which is culturally imprinted on NVIDIA, and all the more potent due to NVIDIA's talent draw and capital power, which bestow on the company not only the capacity to develop something out of nothing, decades in advance of it bearing fruit, but also the stamina to stick it out for that long without becoming ideological about it.</p><p>Now that we know how Jensen is equipped, let's dive into how and where he is driving his firm with Grove&#8217;ian paranoia to maintain NVIDIA's dominance for the long term.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://blog.baumann.vc/p/nvidia-is-speedrunning-accelerated?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading Marvin Baumann's VC Blog! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.baumann.vc/p/nvidia-is-speedrunning-accelerated?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.baumann.vc/p/nvidia-is-speedrunning-accelerated?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h3>NVIDIA creates markets from scratch by building bespoke SDKs to run vertical-specific applications on its GPUs</h3><blockquote><p>"It really is about the&nbsp;<strong>expansion of our accelerated computing platforms into new markets, into new companies, into new industries</strong>. 
That is probably the single best early indicator of near term future success if you will, within the next six months, within the next year." &#8211; Jensen Huang via&nbsp;<a href="https://stratechery.com/2022/an-interview-with-nvidia-ceo-jensen-huang-about-building-the-omniverse-cloud/#progress">Stratechery</a></p></blockquote><p>Jensen is a market builder, an architect of ecosystems. Most prominently, CUDA was elemental in creating a software ecosystem for deep learning. On the hardware side, NVIDIA is embedded in a deep ecosystem of server builders, hyperscalers, enterprise customers and startups. In software, CUDA is far more than an LLM accelerator: NVIDIA invested in it for over a decade, building a&nbsp;<a href="https://developer.nvidia.com/gpu-accelerated-libraries">huge range of software development kits (SDKs) on top</a>&nbsp;to enable users to accelerate their specific workloads using NVIDIA GPU parallelization, just as he did with the very first scientists NVIDIA collaborated with in 2007. That's the goal. CUDA in TensorFlow and PyTorch enabled the deep learning revolution and the ChatGPT moment. Now, Jensen wants to create such moments in practically all other industries. There are no real incumbents in this space because nothing like it had been possible before. NVIDIA is making it possible.</p><blockquote><p>"It&#8217;s really about the Nvidia&#8217;s SDKs and where it&#8217;s going, [they] open us directly into new worlds and those SDKs accelerate applications that people are already using." &#8211; Jensen Huang via&nbsp;<a href="https://stratechery.com/2022/an-interview-with-nvidia-ceo-jensen-huang-about-building-the-omniverse-cloud/#progress">Stratechery</a></p></blockquote><h3>NVIDIA and accelerated computing will become indispensable to most industries</h3><p>In this sense, ChatGPT did not change much about the overall NVIDIA strategy of conquering "new worlds" by enabling them to do accelerated computing. 
What it did do is shorten timelines drastically, because the "GenAI hype" created a sociological imperative for shareholders to pressure CEOs to pressure their CIOs to "do something with AI". The AI trade has been wobbly lately and AI ROI is as yet uncertain, but I believe the trend is now irreversible, because not only are LLMs economically useful, as seen tangibly at&nbsp;<a href="https://www.forbes.com/sites/jackkelly/2024/03/04/klarnas-ai-assistant-is-doing-the-job-of-700-workers-company-says/">Klarna</a>&nbsp;and&nbsp;<a href="https://www.ft.com/content/87af3340-2611-4650-9ae3-036927e9f65c">Perplexity</a>, but accelerated computing is sound from a first-principles perspective as well. So Jensen essentially uses the LLM hype to connect every industry he can with NVIDIA GPUs and to co-develop industry-specific accelerated computing solutions with and beyond GenAI, for example:&nbsp;<a href="https://blogs.nvidia.com/blog/bionemo-ai-drug-discovery-foundation-models-microservices/">BioPharma</a>&nbsp;and&nbsp;<a href="https://blogs.nvidia.com/blog/evolutionaryscale-esm3-generative-ai-nim-bionemo-h100/">Protein Design</a>,&nbsp;<a href="https://blogs.nvidia.com/blog/digital-humans-siggraph-2024/">Customer Service</a>,&nbsp;<a href="https://nvidianews.nvidia.com/news/nvidia-accelerates-worldwide-humanoid-robotics-development">Robotics</a>,&nbsp;<a href="https://blogs.nvidia.com/blog/waabi-autonomous-trucking/">Driverless Trucking</a>&nbsp;and&nbsp;<a href="https://nvidianews.nvidia.com/news/wave-of-ev-makers-choose-nvidia-drive-for-automated-driving">autonomous vehicles</a> incl.&nbsp;<a href="https://x.com/elonmusk/status/1798055141670781107">Tesla</a>,&nbsp;<a href="https://blogs.nvidia.com/blog/siemens-nvidia-industrial-metaverse/">Industrial Digital Twins</a>,&nbsp;<a href="https://blogs.nvidia.com/blog/sap-sapphire-ai-omniverse/">ERM</a>, and more. 
</p><blockquote><p>&#8220;As you listen to Huang&#8217;s keynotes, and hear all the talk about their vision of the future &#8211; AI Factories, Omniverse, Digital Twins, and all the rest &#8211; take them all with a grain of salt. They are good ideas (mostly). Some of them could turn out to be incredible businesses. But they each matter less than the fact that Nvidia is willing to try all of them out. The company&#8217;s vision of the future, like its products, is not deterministic.&nbsp;<strong>Nvidia is playing, and playing, the odds, and that ability to take risks is the key to their success.</strong>&#8221; &#8211;&nbsp;<a href="https://digitstodollars.com/2024/03/20/on-paths-not-taken/">Digits to Dollars, "On Paths Not Taken"</a></p></blockquote><p>This means NVIDIA's long-term moat won't be CUDA, but its broad accelerated computing software ecosystem embedded in an overwhelming number of industry verticals. Jensen will not only create those markets in the first place, but what NVIDIA builds here is also unlikely to be replicated any time soon by its AI accelerator competitors, who are still catching up with what CUDA has done in deep learning.</p><p>In some cases, it's not so much NVIDIA creating entirely new, zero-revenue-today markets, but injecting itself into and massively speeding up (20x) existing, widespread computing processes with real economic value behind them:</p><blockquote><p>"Developing new software library means opening up whole new markets, even basic ones like data processing with cuDF: SQL, spark and pandas." 
&#8211;&nbsp;<a href="https://youtu.be/H0WxJ7caZQU?t=2352">What&#8217;s Next in AI: NVIDIA&#8217;s Jensen Huang Talks With WIRED&#8217;s Lauren Goode</a>, 39min12s</p></blockquote><p>NVIDIA's strategy can thus be summarized as an "up- and outwards" push, verticalizing into higher-level software ecosystem as well as hardware systems-building to nurture and establish accelerated computing in as many industries as possible, merging NVIDIA solutions into the fabric of future economies and becoming fundamentally indispensable for all kinds of applications, products and processes. At least, that's the idea.</p><p>Judging by Jensen's potent mix of decades long experience, clear-headed first-principled, in-the-limited long-term thinking, Grove'ian paranoia, an organization with proven executive aptitude and a large and growing partner-ecosystem, I see the odds for this future coming about decidedly in NVIDIAs favor. I therefore expect NVIDIA&#8217;s margins to remain ridiculously elevated for the next decade.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.baumann.vc/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Marvin Baumann's VC Blog! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[LLM reasoning, AI performance scaling, and whether inference hardware will become commodified, crushing NVIDIA's margins]]></title><description><![CDATA[Current skills of LLMs are a mirage of human projection; solving AI reasoning requires AI architecture innovation; training and inference will converge; static transformer inference will commoditize.]]></description><link>https://blog.baumann.vc/p/llm-reasoning-ai-performance-scaling</link><guid isPermaLink="false">https://blog.baumann.vc/p/llm-reasoning-ai-performance-scaling</guid><dc:creator><![CDATA[Marvin Baumann]]></dc:creator><pubDate>Thu, 25 Jul 2024 17:45:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8311de72-be55-4d48-9b21-33ce09d726de_810x579.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>TL;DR: <em>Scaling static transformers won't bring us AGI; current skills of LLMs are partly a mirage due to human projection; AI reasoning will require new architectures while training-inference divide is set to converge, both implying continued demand for and dominance of NVIDIA/AMD general purpose GPUs; nonetheless static transformer inference is extremely useful and will become a basic computing building block in the future, implying commoditization, Jevons paradox demand explosion and path to profitable growth if not market capture for at least one of the transformer inference semiconductor startups or CSP custom silicon.</em></p><h2>Rise of the transformer and scaling laws for LLMs</h2><p>The&nbsp;<a 
href="https://arxiv.org/abs/1706.03762">2017 invention of the transformer</a>&nbsp;replaced "hard" recurrence with "soft" positional encoding + multi-headed attention to achieve better training scalability of sequence learning tasks like language understanding via massively parallel matrix multiplication ("matmul"). It worked: By 2019, OpenAI trained&nbsp;<a href="https://openai.com/index/better-language-models/">GPT-2</a>&nbsp;which they found to have developed so-called "emergent skills" of such a magnitude that they turned their back on open science for ostensibly "safety reasons", but likely just as much motivated by Sam Altman's commercial ambitions for OpenAI. Whatever the case, LLMs (large language models) were becoming really useful in text manipulation, language tasks and potentially even agentic skills. On the back of this research success,&nbsp;<a href="https://news.microsoft.com/2019/07/22/openai-forms-exclusive-computing-partnership-with-microsoft-to-build-new-azure-ai-supercomputing-technologies/">OpenAI struck a partnership with Microsoft in July 2019</a>&nbsp;to enable further scaling of their language models. This was motivated by the so-called&nbsp;<em>scaling hypothesis</em>&nbsp;or "scaling law" which was later empirically established in the 2022&nbsp;<a href="https://arxiv.org/abs/2203.15556">Chinchilla paper</a>&nbsp;by Google DeepMind researchers who investigated the relationship between LLM performance and two measures of LLM compute input: size of the model in terms of parameters (aka "weights"), and number of text tokens on which the model is trained. This firmly established the existence of, or at least the collective belief in, scaling laws of LLMs, i.e. models get better the more data, parameters and GPUs we give them. Later the community discovered that in practice training an LLM for much longer, on many more tokens (i.e. 
"overtraining") than what "Chinchilla" and also classical machine learning literature estimates as being "compute optimal" still increases model performance. These scaling laws seem to have held up so far, and despite performance scaling logarithmically with compute we have observed roughly linear scaling of LLM performance over time due to exponentially rising hardware compute capacity, as both hardware performance itself and capital deployment in AI have risen exponentially.</p><p>Next, let&#8217;s understand what these LLMs seemed to become capable of and what functionally enabled those skills.</p><h2>"Emergent skills" and the primacy of memorization in LLMs</h2><p>When testing the performance of autoregressive transformers on a range of skills, e.g. human-like conversation, high school exams, chess puzzles, etc., for which they weren't specifically trained, researchers at OpenAI and other labs noticed something incredible: The language models were able to "perform tasks" which the researchers didn't specifically intend them to perform. These skills were interpreted to arise "naturally" and were called emergent skills, with&nbsp;<a href="https://youtu.be/GI4Tpi48DlA?t=703">the hypothesis being</a>&nbsp;that the next-token prediction learning objective (i.e. loss function) of LLMs is forcing the transformer to abstract and encode a kind of "generalized world model" to achieve further performance on its learning task, which they equated to a pressure for the model to learn to maximally compress the training data into its limited set of weights. The implications were, of course, incredible: "So you're saying we can just throw more tokens and more compute at these ever-larger models and they will learn every skill just like that with progressive scale?"</p><p>Not so fast! We cannot jump to that conclusion directly. What we <em>do</em> know is that LLMs achieve insane compressions of huge datasets. 
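</p><p><em>To make that compression concrete, here is a toy, self-contained sketch of symmetric 4-bit weight quantization, one of the levers behind it. The weight values are invented for illustration and a single shared scale is used; real schemes (per-group scales, outlier handling) are more elaborate.</em></p>

```python
# Toy symmetric 4-bit quantization: floats -> integers in [-7, 7] + one scale.
def quantize_4bit(weights):
    scale = max(abs(w) for w in weights) / 7   # 4-bit signed range is -7..7
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.81, -0.33, 0.05, -1.20, 0.44]           # made-up weights
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, round(max_err, 3))                     # small reconstruction error

# Why a laptop suffices: an 8B-parameter model needs ~16 GB of memory for
# weights at 16-bit precision, but only ~4 GB at 4 bits per weight.
gb = 8e9 * 4 / 8 / 1e9
print(f"~{gb:.0f} GB of weights at 4-bit")
```

<p>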
Using 4-bit quantization you can even run a large language model on a modern MacBook and access most of the information on the internet that was contained in the training dataset relatively accurately. That is insane! What is less clear is whether the models achieve this by abstracting general skills from specific examples they saw in the training data and then applying those skills during inference to generate their answers, as people who are convinced of emergent skills and intelligence growth in LLMs tend to claim. </p><p>A direct example of why this might be false is that we tend to be able to trace an answer to a prompt that supposedly shows zero-shot skills of LLMs back to some obscure websites that wound up in the training set and provide the exact data for that particular task. The truth is, nobody can possibly know everything that is inside the training dataset, which basically contains the entire internet and all books and news articles ever written. For example, the ability of LLMs to evaluate chess positions stems from websites that provide exactly that analysis. When prompting the LLMs to evaluate a position, often enough the answer it gives is pretty much a 1:1 copy of that piece of the training set. Even more creative answers tend to be traced back to multiple pieces of the dataset between which the model interpolates. This makes sense, seeing as transformers layer-by-layer perform transformations on high-dimensional embedding spaces that represent the training data in a compressed way, and specific prompts allow us to query approximate interpolations of different points in those embedding spaces.</p><p>An indirect example of why the generalized-skills hypothesis (or the hypothesis of a world model which bears any similarity to brain-produced world models) might not be quite right is that of&nbsp;<a href="https://aiguide.substack.com/p/evaluating-large-language-models">counter-factual tasks</a>. 
Another chess example here is that LLMs perform incredibly well when analyzing the legality of chess openings, yet they are reduced to mere 50/50 guessing if we change the situation slightly by saying "suppose the knights and bishops change places, now determine whether the openings are legal according to standard chess rules, think step by step". Any human that knows the chess rules could easily do this; state-of-the-art LLMs cannot. This shows that the LLMs have not actually learned to apply chess rules but are more or less "copy-pasting" answers from their training data set, and fail if we change the conditions ever so slightly such that the task falls outside of their training dataset. This brittleness of their skills suggests LLMs are using memory more-or-less exclusively to solve the skill tests they are asked to do. This is supported by&nbsp;<a href="https://x.com/mattshumer_/status/1783157348673912832">views</a>&nbsp;in the community that the dataset essentially determines the performance of transformers on certain benchmarks.</p><p>In this context, I find the views of AI researchers&nbsp;<a href="https://x.com/ylecun/status/1738934781692309946">Yann Lecun</a>&nbsp;(Meta),&nbsp;<a href="https://x.com/rao2z/status/1740692722099630237">Subbarao Kambhampati</a>&nbsp;(Google),&nbsp;<a href="https://x.com/fchollet/status/1790196436665315645">Francois Chollet</a>&nbsp;(Google) and&nbsp;<a href="https://youtu.be/4Ps7ahonRCY?t=136">Tim Scarfe</a>&nbsp;(Machine Learning Street Talk) convincing. The idea is that autoregressive transformers mainly do so-called "approximative retrieval", i.e. it's not exact retrieval like you would expect from SQL queries of static databases, but it is also definitely not knowledge generated from scratch by reasoning over basic principles and basic knowledge. 
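</p><p><em>As a caricature of "approximative retrieval", consider a system that stores (embedding, answer) pairs and answers new queries by similarity-weighted interpolation over them. Everything here (the 2-d "embeddings", the numbers) is invented purely to illustrate the behavior: sensible-looking answers near the training data, fluent but arbitrary answers far away from it.</em></p>

```python
import math

# Toy "approximative retrieval": blend stored answers by similarity.
memory = {  # embedding -> memorized numeric "answer"
    (0.0, 0.0): 1.0,
    (1.0, 0.0): 2.0,
    (0.0, 1.0): 3.0,
    (1.0, 1.0): 4.0,
}

def answer(query, temperature=0.25):
    """Softmax-over-negative-distance weighted average of stored answers."""
    weights = {k: math.exp(-math.dist(query, k) / temperature) for k in memory}
    z = sum(weights.values())
    return sum(memory[k] * w / z for k, w in weights.items())

a_in = answer((0.05, 0.0))   # near a memorized point: lands close to 1.0
a_out = answer((9.0, 9.0))   # far out-of-distribution: still answers fluently,
                             # but only echoes whatever stored point is nearest
print(a_in, a_out)
```

<p>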
Francois' <a href="https://fchollet.substack.com/p/how-i-think-about-llm-prompt-engineering">perspective</a>&nbsp;of LLM prompting as interpolating between learned, mini vector programs is insightful as well and aligns with my view from above of interpolating between points in the embedding space. From that perspective, LLMs are capable of regurgitating memorized pieces of text and they have some capacity for creativity by mixing different learned concepts, e.g. the topics of gangster rap with the poetic style of Shakespeare. Since autoregressive LLMs are effectively prompting themselves (they feed in their past output), we can achieve some limited form of reasoning with prompt strategies like "think step by step" where the model uses reasoning-like textual schemata to induce itself to produce thought-like, semi-rational text and then tries to infer a conclusion from it via summarization. This kinda, sorta, sometimes works but it's definitely not close to what we'd like to have as reasoning machines.</p><p>Next, let's try to understand why expectations and interpretations of LLM performance have diverged so much in the machine learning community.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.baumann.vc/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">If you&#8217;d like to be notified once I upload my next article, enter your email below!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Anthropomorphization, or why we need to be careful when interpreting LLM 
performance</h2><p>The fundamental issue we have when interpreting the true capabilities of LLMs based on results of different benchmarks and skill tests is that autoregressive transformers are structurally very different from human cognition in the brain, while at the same time humans have a very strong tendency to project human-like form and behavior onto dynamics we observe (possibly an unintended consequence of our ability to have a theory of mind), which is called&nbsp;<strong>anthropomorphization</strong>&nbsp;(literally: human + form).</p><p>For that reason, when we see examples of successful task execution, even if you are the best machine learning engineer in the world and know the system inside-out, we just cannot help but presume that the process behind that task execution&nbsp;<strong>must be</strong>&nbsp;similar to how we execute that task. It is just so natural for us to interpret behavior by empathizing based off our own subjective experience. Thus, even if we&nbsp;<strong>know</strong>&nbsp;the exact architecture of a transformer and how we call it auto-regressively, it is just so easy for that anthropomorphic bias to slip into our interpretation of what we see. Put concretely, when we input some task&nbsp;<code>x</code> into an LLM <code>f</code>&nbsp;and observe&nbsp;<code>y = f(x)</code>, it is just so very hard to keep the interpretation of&nbsp;what <code>f</code> does untainted from subconscious anthropomorphic bias. We as humans then abstract and generalize based on that interpretation and start projecting it onto other use cases, skills and our forecasts of future abilities of LLMs and AI more broadly. 
That's why I am such a big fan of&nbsp;<a href="https://aiguide.substack.com/p/evaluating-large-language-models">counter-factual tasks</a>&nbsp;for checking our interpretations of LLM skills, since they give us an epistemologically clear-headed and straightforward way to&nbsp;<em>falsify</em>&nbsp;our beliefs/hypotheses about LLM performance by constructing simple counter-examples which an LLM could easily solve if it truly learned general skills the way it seems to at first glance. The results so far show that they do not, or are at least brittle in ways we have a hard time predicting. Why is that the case?</p><p>There is broad consensus that "intelligence" roughly speaking is a combination of memorization and/or intuition (system 1 type thinking) and reasoning (system 2 type thinking). The reality is that in everyday life when conversing with people or reading an article or a tweet, we never know whether the text was freshly generated "from scratch" in this very moment via reasoning or "actual skill application" or whether the text had been pre-conceived at an earlier time, possibly by someone else, and is now simply being repeated to you from memory. Since we only converse with human beings though, that is not a huge issue since we have self-experience and thus intuition about the origin of knowledge other humans share with us. However, large language models are an entirely different beast. We cannot intuit what it would be like to have read 1000x more than any human being can ever read in their entire lifetime and to be able to remember practically anything of it given the right stimulation (aka prompt). If a human talked like ChatGPT, we would probably have to presume that they are decently smart (though they do make a bunch of weird mistakes for a human, which is a tell). 
However, our intuition deceives us here since we cannot imagine that anyone could build all those answers and conversations just from Lego pieces of trillions of text tokens they read, and thus we are easily persuaded that LLMs are smart in the way we are. Subbarao Kambhampati made this point quite well and I encourage you to read it&nbsp;<a href="https://x.com/rao2z/status/1770118626986868962">here</a>.</p><p>This interpretation of LLM performance via memorization would also explain why apparent "overtraining" doesn't lead to diminishing and reversing returns as classical machine learning theory would suggest, which holds that training for too long on the training set leads to overfitting of the data and a deterioration of generalizability and thus worse performance on test sets. However, due to benchmark leakage (test data leaks into training data), and the current ability of LLMs stemming from effective compression and thus memorization of the training data, overfitting to data actually leads to progressively better "performance" beyond what we would expect. This need not be bad: if we had a dataset about literally everything and anything (impossible, since it would be infinite), we would want exactly such overfitting to that dataset. Since that is impossible, however, the key ingredient we are generally looking for in building artificial intelligence is an ability to "bridge the gap" from finite examples to infinite applications. That is true general intelligence, and we humans possess it. Our limitations are not due to generality but due to limited lifetime, limited cognitive capacity and limited cooperative ability amongst humans. Over time, however, there is in principle nothing that we could not solve, a thought beautifully set out in&nbsp;<a href="https://www.thebeginningofinfinity.com/book/excerpt/#introduction">David Deutsch's book The Beginning of Infinity</a>. ChatGPT does not share that same characteristic, i.e. 
even with infinite time, it will not be able to solve more than&nbsp;<a href="https://x.com/fchollet/status/1789811021747790108">a finite number of problems</a>, and it is also not capable of checking itself, of&nbsp;<a href="https://x.com/rao2z/status/1715800819239678013?s=20">critiquing itself</a>, to see if it's right. Thus, what about reasoning in AI?</p><h2>Reasoning and the need for architecture innovation in AI</h2><p>Due to the above criticism, Francois Chollet and Mike Knoop (Co-founder Zapier) recently re-invigorated a challenge called the "ARC Prize" (ARC for "Abstraction and Reasoning Corpus") on which modern AI has not made much progress over the past five years, while most other benchmarks are converging towards saturation (90+% performance). Why is that the case? The ARC-AGI benchmark was specifically designed to&nbsp;<strong>not</strong>&nbsp;be solvable via memorization, while being relatively easy for most humans, who achieve a median accuracy of 85%. Chollet and Knoop make the explicit point that current LLMs are&nbsp;<em>not good enough</em>&nbsp;since they lack general reasoning capability, and they have kickstarted the machine learning community into finding new solutions to achieve true general intelligence in machines (AGI). 
If it wasn't clear before, it is clear by now that we need further architecture innovation in AI, despite claims by a few remaining scaling maximalists, to achieve AGI.&nbsp;<a href="https://x.com/fchollet/status/1803096195684012371">Francois Chollet suggests</a>&nbsp;the eventual solution will likely be LLM-like systems serving as system 1-type intuition machines that guide a program search process coupled with a strict solution checker, iterating progressive solution candidates until the reasoning task is solved.</p><p>Let's now start to draw practical implications from this for the AI hardware industry and possible trends for the future and their consequences for AI inference semiconductor startups.</p><h2>AI innovation is not stopping, general-purpose GPUs will remain in demand</h2><p>We have not reached "the end of history" with regards to AI innovation yet. Static, autoregressive transformers are not the end-all, be-all that people expected 1.5 years ago, when startups were founded on that hypothesis. 
Thus AI accelerators or GPUs by NVIDIA and AMD, which can run any new AI reasoning algorithm researchers will invent, will likely retain their dominance and their pricing power: despite the usefulness of current LLMs, the usefulness of reasoning-capable AI systems will be far greater, and thus economic share will largely remain with the current AI hardware juggernauts, mainly NVIDIA.</p><h2>Hypothesis: training and inference are likely to converge in intelligent systems, look to library of LoRAs for inspiration</h2><p>While humans certainly profit from pre-acquired knowledge which shapes our worldview and helps us decipher uncertainties and solve complex tasks, for any problem we haven't memorized the answer to, we are capable of learning and investigating in the moment via&nbsp;<a href="https://en.wikipedia.org/wiki/Free_energy_principle#Active_inference">active inference</a>&nbsp;and coming up with solutions about things we did not understand before. From these first principles, it seems to me that the hard boundary in AI between "training" and "inference" people talk about, especially non-technical Wall Street types, is nonsensical and in its current state transitory. Any intelligent system crucially needs the capability to learn in the moment, as Francois Chollet points out as well (<a href="https://x.com/fchollet/status/1800914437874069567">here</a>&nbsp;and&nbsp;<a href="https://www.dwarkeshpatel.com/p/francois-chollet">here at 0:02:59</a>), and the "zero-shot in-prompt learning" published as a skill of state-of-the-art LLMs is not enough due to its failure at out-of-distribution learning, i.e. adapting to novel situations and doing real learning.</p><p>For that fundamental reason I expect future intelligent systems to constantly "train" in alternation with "inference", and that separation will converge over time until the distinction becomes meaningless. 
More specifically, there will still be pre-training but inference will look more like simultaneous, reciprocal training+inference. We can observe this trend already in comments by Sam Altman, Dario Amodei (Anthropic), and Mark Zuckerberg, who say that LLM releases will not necessarily be discrete checkpoints anymore going forward but will become continuous. Furthermore, synthetic data generation is seen as an opportunity to create a&nbsp;<a href="https://youtu.be/e-gwvmhyU7A?t=4481">run-away intelligence growth dynamic</a>&nbsp;where a foundation model generates data that it can ingest for further training which, again, implies reciprocity and convergence of training and inference. Similar approaches were just used for Meta's&nbsp;<a href="https://ai.meta.com/blog/meta-llama-3-1/">newest Llama3.1 LLMs</a>. We should also take note of Apple's approach, showcased during WWDC24's presentation of Apple Intelligence, of using small, on-device foundation models which dynamically use task-specific Low Rank Adapters ("LoRAs") that help the LLM perform particularly well for a currently given task, e.g. summarizing notifications, proposing a fully written email from bullet points, etc.</p><p>LoRA is a common, compute-efficient finetuning technique for pre-trained neural networks like LLMs or diffusion models (used in image generation). I'm personally betting that a&nbsp;<a href="https://arxiv.org/abs/2405.11157v1">library of LoRAs</a>-like approach will deliver the medium-term solution to in-situ learning of LLMs as a milestone on the path to AGI. In this approach a large pre-trained foundation model is augmented by a library of LoRA finetunes, with LoRAs being dynamically applied during run-time, separately or in combination, possibly using a second model that does routing of foundation models and gating between sets of LoRAs. 
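</p><p><em>To make the mechanics concrete (this is generic LoRA math, not Apple's or anyone's actual implementation): a LoRA adapter leaves the pre-trained weight matrix W frozen and adds a low-rank update delta_W = (alpha/r) * B * A, so switching task behavior means swapping two small matrices instead of the whole model. A minimal pure-Python sketch with hand-made matrices:</em></p>

```python
# Minimal LoRA mechanics: y = (W + (alpha / r) * B @ A) x, with W frozen.
# All matrices are tiny and hand-made; real adapters are learned in finetuning.
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

d, r, alpha = 4, 1, 2             # hidden size 4, rank-1 adapter
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
B = [[0.5], [0.0], [0.0], [0.0]]  # d x r   (trained in practice)
A = [[0.0, 1.0, 0.0, 0.0]]        # r x d   (trained in practice)

delta = [[(alpha / r) * v for v in row] for row in matmul(B, A)]
W_adapted = [[w + dw for w, dw in zip(rw, rd)] for rw, rd in zip(W, delta)]

x = [1.0, 2.0, 3.0, 4.0]
y_base = matvec(W, x)             # frozen model's output
y_adapted = matvec(W_adapted, x)  # same model + adapter
print(y_base, y_adapted)
```

<p>The storage math is the point: per task the adapter adds only 2*d*r numbers (8 here) against d*d (16) for a full matrix, and at LLM scale that gap is what would make a "library of LoRAs" cheap to hold and swap.</p><p>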
Task-specific feedback could then be used to re-train some of the LoRAs in near real-time to incorporate new learnings and improve future performance, or new LoRAs could be added to the library. This could be one avenue to get light-weight in-situ learning and adaptation to novel situations, and thus a convergence of training and inference. I think we will see first glimpses of that future in 2025 when Apple Intelligence LoRAs will finetune for personalization on users' iPhones during the night (due to the large power draw of the onboard GPU).</p><p>The implications of this trend, if it actually comes to pass, are that training-capable AI hardware (read: NVIDIA GPUs) will remain dominant for general intelligence systems. However, this need not imply vanilla, static transformer inference falls off the map. Let&#8217;s look at that case next.</p><h2>Groq and transformer-ASICs are useful, static transformer inference will commoditize, this could imply Jevons paradox and be profitable</h2><p>Though current LLMs are not all we had hoped they would be, i.e. AGI machines, they still remain extremely useful text manipulation and knowledge intuition tools, if used correctly in a framework of different tools and augmented by external data sources (RAG, web search, etc). Though I cannot foresee the future, I'd imagine that even if AI innovation stopped in mid-2024, the diffusion of this technology across the entire economy and society, which is still left to accomplish, would likely have huge economic impact, notwithstanding the missing reasoning piece. So there is clear value and clear demand for static (i.e. 
no out-of-distribution, in-situ learning to adapt to novel situations) inference of language model transformers using autoregressive decoding.</p><p>I do not subscribe to the "hardware lottery" hypothesis of hardware-algorithm co-evolution when applied narrowly to the transformer architecture, <a href="https://www.etched.com/announcing-etched">as transformer chip designer startup etched.ai does for example</a>. But that does not mean Etched, Groq and other startups of the sort, or custom inference silicon by Cloud Service Providers (CSPs), are not going to be needed. Quite the opposite: I think they will proliferate as vanilla inference of static LLMs becomes commoditized, from the semiconductors to the hardware programming interface (using OpenAI Triton and/or AMD ROCm) to the orchestration layer and then to the API and application layer. NVIDIA's CUDA moat was never going to, and was never intended to, hold forever for this narrow use case; more on that in a future blog post.</p><p>We saw hints of commoditization everywhere this year. LLMs in 2024 so far were mostly a story of increasing efficiency rather than increasing performance, as well as <a href="https://x.com/AravSrinivas/status/1816248208802336975/">open (free) models catching up to closed models</a>. Quantization, around since the beginning of 2023, marched forward, allowing decent-sized LLMs to be run on any edge device imaginable; as did algorithm&lt;-&gt;infrastructure optimization advance, exemplified by OpenAI's release of the vastly more efficient GPT-4o and 4o-mini after that. Groq also made a huge splash with significantly faster token/sec generation speeds on Llama and other open-source LLMs, opening up new use cases like natural, realistic audio-to-audio conversations between human and machine. 
Though there are&nbsp;<a href="https://irrationalanalysis.substack.com/p/very-long-incoherent-writeup">doubts about the generality of Groq's technological approach</a>, the general thrust is clear: static LLM inference will become cheaper and faster, and ubiquitously available. Moreover, high performance open source LLMs are here to stay for the foreseeable future with&nbsp;<a href="https://about.fb.com/news/2024/07/open-source-ai-is-the-path-forward/">Zuckerberg's open source strategy&nbsp;</a>and the&nbsp;<a href="https://ai.meta.com/blog/meta-llama-3-1/">release of Llama3.1 405B</a>&nbsp;that rivals currently available closed models by OpenAI and Anthropic.</p><p>Generation speed of LLMs especially will become truly impactful the more LLMs start to talk to each other instead of just to humans. The medium-term result of autonomous interaction of LLMs is unknowable at the moment, but a complexity explosion in this domain could very well bring new capabilities to&nbsp;<em>systems of static LLMs</em>. Existing&nbsp;<a href="https://openreview.net/forum?id=MGWsPGogLH">research indicates</a>&nbsp;that while a single transformer is not Turing complete, two interacting transformers might be; and a "society of minds" type application of LLMs interacting in a debate over multiple rounds&nbsp;<a href="https://composable-models.github.io/llm_debate/">has recently been shown</a>&nbsp;to dramatically improve LLM performance on some math and reasoning tasks. Thus, in the end, the above point about the need for architecture innovation to solve machine reasoning need not hold, since interacting agents of static LLMs might do the job just fine. 
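</p><p><em>The debate protocol can be sketched as a simple loop. The "agents" below are deterministic stand-ins that pull a numeric estimate toward the group's mean with varying stubbornness, standing in for an LLM re-prompted with its peers' answers; the round structure, not the mock agents, is the point.</em></p>

```python
# Skeleton of a multi-round "society of minds" debate: each round, every
# agent sees the others' latest answers and revises its own.
def make_agent(stubbornness):
    def revise(own, peers):
        return stubbornness * own + (1 - stubbornness) * (sum(peers) / len(peers))
    return revise

def debate(initial_answers, agents, rounds=3):
    answers = list(initial_answers)
    for _ in range(rounds):
        # synchronous update: every agent revises against last round's answers
        answers = [
            agent(ans, [a for j, a in enumerate(answers) if j != i])
            for i, (agent, ans) in enumerate(zip(agents, answers))
        ]
    return answers

agents = [make_agent(s) for s in (0.2, 0.5, 0.8)]
final = debate([10.0, 30.0, 80.0], agents)
print(final)  # the initially wide spread of answers contracts toward consensus
```

<p>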
I'm confident of the positive potential for autonomously interacting LLM agents, but to solve system 2 type reasoning, which in humans enables general intelligence, I stand by my above expectation of the need for more innovation, and thus my implication of continued demand for general-purpose GPUs.</p><p>Putting reasoning aside, with the commoditization trend of LLM inference clear, I expect Jevons paradox to hold in this case too and we'll see exploding demand and usage, implying steady demand for inference hardware providers. However, how this field of GPU designers, semiconductor startups and CSP custom silicon will consolidate is unclear to me. I will go out on a limb here, though, and say that due to the general usefulness of text manipulation and other narrow skills LLMs possess today I expect that not only will LLM API calls become as full-stack commoditized as database SQL queries are today but that&nbsp;<strong>static transformer inference will evolve to be a basic computing building block</strong>&nbsp;in the future, almost on the level of what matmuls are today.</p><p>Exciting times!</p>]]></content:encoded></item></channel></rss>