The focus in AI is moving beyond training large models to what happens when they are put to work. Inference, the point where AI generates real-time responses, is becoming the new center of investment and innovation.
As reported by The Wall Street Journal on March 16, 2026.
A significant shift is under way in artificial intelligence, and it has huge implications for technology companies big and small.
For the past half-decade, most of the focus in AI has been on training large language models (LLMs), a costly process that requires tens of thousands of chips, consumes enormous amounts of energy and happens in gigantic, remote data centers. Training involves feeding a model billions of bits of information—word definitions, historical facts, financial statistics, photos of kittens—using clusters of thousands of specialized microprocessor chips, running 24 hours a day for weeks or months at a time.
Now, as more companies deploy AI agents and try to monetize new tools built on the backs of LLMs, the spotlight has shifted to inference, the type of computing that allows trained AI models to respond to user queries.
Global spending on on-demand, high-performance cloud computing services for AI inference is expected to surpass such spending for AI training for the first time this year, according to research firm Gartner. By 2029, companies will spend nearly twice as much on such computing services for inference as they will for training ($72 billion vs. $37 billion).
That means big changes in the types of chips tech companies buy. Nvidia became the world’s most valuable company by selling chips, called GPUs, that have the raw processing power needed for model training. But companies that expect to be doing more inference can achieve performance gains by doing it on chips specialized for that task, said Jacob Feldgoise, a scholar who researches AI at Georgetown University.
Chip makers producing chips tailored for inference—including Google, Cerebras Systems and SambaNova, among others—have been signing multibillion-dollar deals at an accelerating clip. Nvidia is set to launch its own inference-specific processors after it paid $20 billion in December to license the technology and hire the top talent at Groq, a company that designs chips customized for inference.
So what is inference computing, exactly, and how is it different from the processing required for training? Why has demand shifted so decisively toward inference, and what does it mean for the market?
How does inference work?
If training an AI model is like teaching a chef to cook, inference is the day-to-day operation of the restaurant. Diners place their orders (often in the form of a query to a chatbot) and the chef prepares their meals (the chatbot’s response).
Inference consists of two phases, known as prefill and decode. Prefill happens when a user enters a prompt, forcing the model to interpret the query by processing each word, symbol or image it contains.
Decode is the process by which the model, using all it has learned in training, spits out a response to the query.
The two phases of inference demand different attributes from chips: Prefill requires more processing power, while decode requires more memory, in part because the model has to draw on all the knowledge it has accumulated to serve up nice, piping-hot tokens to the user.
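The split described above can be sketched in a few lines of toy Python. This is not how any real serving stack is implemented; the `prefill` and `decode` functions and the string-based "KV cache" below are stand-ins invented for illustration.

```python
def prefill(prompt_tokens):
    """Process every prompt token up front: the compute-heavy phase.

    In a real model, each prompt token's key/value vectors are computed
    in parallel across the whole prompt, which is why prefill rewards
    raw processing power.
    """
    cache = []
    for tok in prompt_tokens:
        cache.append(f"kv:{tok}")  # stand-in for one KV-cache entry
    return cache


def decode(cache, max_new_tokens):
    """Generate the answer one token at a time: the memory-heavy phase.

    Every step must re-read the entire cache built so far, which is why
    decode favors large, fast memory over raw compute.
    """
    output = []
    for _ in range(max_new_tokens):
        next_tok = f"tok{len(cache)}"  # stand-in for the model's prediction
        cache.append(f"kv:{next_tok}")
        output.append(next_tok)
    return output
```

Calling `decode(prefill(["what", "is", "AI"]), 5)` walks both phases in order: one pass over the prompt, then five sequential generation steps.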
Wait, what are tokens?
Although token sizes vary across different types of data, one token is generally thought of as about three-quarters of an English word. A simple chatbot query such as “What will the weather be today?” will be interpreted by a model as six to eight tokens.
Models usually produce answers one token at a time, and they have to spit out each token in the right order for the answer to make sense.
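The rule of thumb above (one token is roughly three-quarters of a word) makes for a quick back-of-the-envelope estimator. Real tokenizers split text very differently—on subwords, punctuation and whitespace—so treat this as an approximation, not a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~3/4 of a word per token, i.e. ~4/3 tokens per word."""
    words = len(text.split())
    return round(words / 0.75)

# The six-word query from the example above lands at the top of the
# "six to eight tokens" range this heuristic implies.
print(estimate_tokens("What will the weather be today?"))  # 8
```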
Companies that are currently trying to monetize AI tools—everything from accounting software to travel-booking services to image generators—are obsessed with cost metrics such as tokens-per-second-per-watt or tokens-per-second-per-dollar.
That puts a premium on the ability of inference chips to serve up results efficiently, said Tim Breen, chief executive of GlobalFoundries, a chip manufacturer. “Getting the cost of inference down is the name of the game now.”
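To make those metrics concrete, here is an illustrative calculation. All three input figures—throughput, power draw and hourly rental price—are made-up assumptions, not the specs of any actual chip:

```python
# Hypothetical accelerator (all numbers assumed for illustration):
throughput_tps = 1000.0  # tokens generated per second
power_watts = 700.0      # power draw while serving
price_per_hour = 2.0     # cloud rental price in dollars

# Efficiency metric: how many tokens per second each watt buys.
tps_per_watt = throughput_tps / power_watts

# Cost metric: total tokens produced per dollar of rental spend.
tokens_per_dollar = throughput_tps * 3600 / price_per_hour

print(f"{tps_per_watt:.2f} tokens/s per watt")       # 1.43 tokens/s per watt
print(f"{tokens_per_dollar:,.0f} tokens per dollar")  # 1,800,000 tokens per dollar
```

Small changes in any of the three inputs move these metrics directly, which is why inference providers compete so hard on chip efficiency and serving cost.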
Because training involves crunching so much data over long periods, the chips it is performed on must have huge amounts of processing power, and the data centers they sit in must have access to abundant energy as well as water for cooling the chips. Training also requires memory, but if a GPU doesn’t have enough it can either distribute some of its processing work to other chips, or wait until existing memory frees up.
The process of inference, in contrast, is performed on demand, in seconds, not weeks. “Ten seconds, and you’ve already got people tapping their thumbs on their phones, moving on to the next thing,” said Rodrigo Liang, CEO of the chip-design firm SambaNova.
Inference chips, therefore, must have larger amounts of high-bandwidth memory, and the data centers they sit in must be located close to populations of users to reduce latency. Chip startups such as Ayar Labs are also increasingly connecting components with fiber optics, which transmit data faster than copper wiring and require less cooling.
Said Ayar Labs CEO Mark Wade, “It’s all about inference scaling today.”
