AI beyond Prompt and Pray by Yoav Shoham

Generative AI Surges, but Enterprises Struggle to Deploy at Scale

Consumer adoption of generative AI, most famously exemplified by ChatGPT, has been the fastest of any technology product in history—faster than the PC, the internet, or the smartphone. The enterprise market has also embraced it, and it’s hard to find a CEO of a major corporation who doesn’t present their company as AI-first or planning to become so. Yet for all the mass AI investment and experimentation going on, the fraction of enterprise gen AI projects that reach production deployment is tiny (6%, according to an AWS study). Why?

Some of it is due to a natural learning curve—but most of it has to do with the unique nature of this particular technology. Gen AI has been centered around large language models (LLMs), multibillion-parameter systems that encode an intuition-defying amount of human knowledge. Borges would have been proud to see his vision of the universal library realized, and the librarians of Alexandria humiliated. LLMs are impressive not only in their knowledge but also in how that knowledge can be used in clever ways. For God’s sake, they even do arithmetic! They are knowledge bases and reasoning engines combined. LLMs are truly wondrous, and nothing here is meant to suggest otherwise.

Unreliable AI

But they are also unreliable. It’s just their nature. LLMs define a complex probability distribution, and when you provide them with input and request an answer—also known as “prompting” them—you get a probabilistic answer. And there’s the rub. LLMs are like a box of chocolates—you never know how good an answer you’ll get. Often you get answers that are remarkably correct, creative, and clever. But in other cases, you get answers that are not just wrong but in fact ridiculous. This is inherent to LLMs’ probabilistic nature. I said earlier that they can do arithmetic, but in fact, they do it poorly—and will never be as reliable as an HP calculator from the 1970s.

This leaves you with two options: Either avoid LLMs or prompt—and pray for the best. That may be good enough for students using AI to do their homework, but for enterprises it’s not. The cost of error is too high.

Until recently, AI didn’t have much more to offer—but that’s beginning to change with the move from LLMs to AI systems, which I’ll describe in more depth below.

LLMs suffer from four shortcomings: lack of access to recent or proprietary data; faulty reasoning; high usage costs; and lack of user control over all three.

The information-access aspect is essentially solved by now: So-called Retrieval Augmented Generation, or RAG, is already a staple of enterprise AI offerings. This is a first step toward the AI systems enterprises need: retrieving information from the web and other information sources external to the LLM, and then feeding that information to the LLM. But dealing with the other limitations calls for more.
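
To make the pattern concrete, here is a minimal sketch of RAG, assuming a toy document index and a stubbed-out LLM call; the function names and data are illustrative stand-ins, not any vendor’s API:

```python
# A minimal sketch of the RAG pattern. `search_index` and `call_llm` are hypothetical
# placeholders for an external document store and an LLM API, not any specific product.

def search_index(query: str, k: int = 3) -> list[str]:
    """Return the k most relevant passages from a store external to the LLM."""
    corpus = {  # stand-in for a vector or keyword index over filings, wikis, the web...
        "What did ACME report last quarter?": [
            "ACME 10-Q: revenue of $1.2B, up 8% year over year.",
            "ACME press release: operating margin narrowed to 11%.",
        ],
    }
    return corpus.get(query, [])[:k]

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM completion call."""
    return f"[answer grounded in the {prompt.count('SOURCE')} retrieved sources]"

def rag_answer(question: str) -> str:
    passages = search_index(question)
    context = "\n".join(f"SOURCE {i + 1}: {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer using ONLY the sources below; otherwise say 'not found'.\n"
        f"{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)

print(rag_answer("What did ACME report last quarter?"))
```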

Let’s start with reasoning. Imagine a financial analyst attempting to use an AI system to project a company’s financial future based on regulatory filings, industry reports, and media coverage. Besides accessing those information sources, the system would need to engage in several kinds of reasoning. Some of this reasoning is highly circumscribed, such as performing arithmetic; in such cases, the system can often use an appropriate tool (like a calculator). If you have an AI system, not just an LLM, you can invoke such tools.
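
Here is a hedged sketch of what tool invocation can look like: the model (simulated below) emits a structured request, and the surrounding system runs the calculator and returns an exact result. The JSON convention and function names are assumptions for illustration, not a specific framework:

```python
# Sketch of tool invocation: the surrounding system, not the LLM, does the arithmetic.
# The "model output" is simulated here; in a real system the LLM emits the tool request.
import json

def calculator(expression: str) -> str:
    """Exact arithmetic; only digits, '.', spaces, parentheses, and + - * / are allowed."""
    if not set(expression) <= set("0123456789. ()+-*/"):
        raise ValueError("unsupported expression")
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}

def run_step(model_output: str) -> str:
    """If the model asked for a tool, execute it; otherwise treat the text as the answer."""
    try:
        request = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output                    # plain text answer, no tool call
    return TOOLS[request["tool"]](request["input"])

# Simulated model turn: rather than guessing 1.08**10 itself, the model asks for the tool.
simulated_output = json.dumps({"tool": "calculator", "input": "1.08 ** 10"})
print(run_step(simulated_output))              # exact result, fed back into the conversation
```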

But the financial analysis would call for much more complex types of reasoning, such as mapping out possible scenarios and assessing their probabilities. This more open-ended type of reasoning doesn’t lend itself to highly specific algorithms and calls for more general reasoning and planning.

DeepSeek and reasoning

The most recent attempts to endow LLMs with general reasoning capabilities have been centered around the so-called chain-of-thought (CoT) models. They are sufficiently different from previous LLMs that some call them LRMs—large reasoning models. Although the CoT technique had been known and discussed before, OpenAI’s o1 model was the first to implement it robustly, and it attracted much attention. A few other similar models have appeared since, with DeepSeek’s R1 making the biggest splash.

An aside: The big splash is due not so much to R1’s innovations, though there were a few, but to two other factors. First, unlike OpenAI, DeepSeek shared the model’s weights and many training details, enabling others to use it for both further research and commercial applications, with few strings attached. This will likely force OpenAI to follow suit and, more generally, accelerate the “open source” trend (an inaccurate term, since sharing a model without its training code and training data limits its usefulness).

The second reason for the attention—and the temporary Wall Street selloff—was the claim that R1 was trained at a fraction of the cost of recent LLMs (under $6 million). This claim is unlikely to be true. Possibly the final training run cost that much, but models are developed over many training runs. R1 came on the heels of previous DeepSeek models, and in general the reinforcement learning behind the approach requires a powerful pretrained LLM, which someone had to pay for. So the project’s true overall cost is likely at least one, and probably two, orders of magnitude higher. On top of that, as AI adoption accelerates, global inference spend will likely dwarf overall training spend; so even if training costs drop, overall demand for AI accelerators like GPUs will not.

Back to reasoning. LRMs have value, but ultimately aren’t the answer. The way they work is (broadly speaking) as follows. In an attempt to compensate for the inherently myopic predictions of LLMs, LRMs take a pretrained LLM and further train it to produce longer chains of predictions. The intention is that these chains capture sound multi-step reasoning and culminate in correct final outputs. Sometimes they do, especially in well-structured domains such as math, where there are clear ways to definitively verify answers. But often, like good old LLMs, they don’t. We’re still in the realm of prompt and pray; an LRM is still very much a box of chocolates.

The reasons for this touch on deep technical matters and deserve a separate discussion. Briefly, though, one can point to three related factors: the suboptimal ways in which LRMs search the exponential space of possible chains; their inability to separate sound reasoning from statistically likely prediction; and their inability to spend a time and compute budget appropriate to a given problem. Regarding the last of these, the internet is replete with examples of o1 “thinking” for 30 minutes. At o1 pricing, that costs several dollars (exactly how much depends on various factors). Even at R1 pricing, which is an order of magnitude lower, doing this at scale with no control or oversight is not tenable.

So the term LRM is misleading. Models such as o1 and R1 are better thought of as large musing models, or LMMs. There’s a big gap between that and precise, reliable, controllable reasoning.
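
One symptom of that gap: in the rare domains where answers can be checked mechanically, today’s practical workaround is to sample several chains, verify the result externally, and cap the number of attempts, paying for reliability with extra calls. The sketch below illustrates the idea; `sample_chain` and the arithmetic checker are illustrative stand-ins, not any model’s API.

```python
# Sketch: sample several reasoning chains and keep only an answer an external checker
# accepts, within a fixed sample budget. `sample_chain` stands in for a reasoning-model
# call; the check works only because this toy problem (arithmetic) is verifiable.
import random

PROBLEM = "17 * 23"

def sample_chain(problem: str) -> str:
    """Stand-in for one LRM sample: right most of the time, confidently wrong otherwise."""
    truth = eval(problem, {"__builtins__": {}}, {})
    return str(truth if random.random() < 0.7 else truth + random.choice([-10, 10]))

def verified_answer(problem: str, max_samples: int = 5) -> str | None:
    for _ in range(max_samples):
        answer = sample_chain(problem)
        if answer == str(eval(problem, {"__builtins__": {}}, {})):  # external verification
            return answer
    return None  # budget exhausted without a verified answer

print(verified_answer(PROBLEM))
```

Outside such verifiable domains, even this workaround is unavailable—which is where AI systems come in.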

AI systems: the orchestrators

That’s where AI systems come in. AI systems don’t try to shoehorn everything into a statistical model, be it an LLM or LMM. They embrace these models but augment them: hooking them up to internet search and other external information sources (RAG systems are a basic form of this), invoking tools (such as a calculator), and running traditional code. The AI system functions as an intelligent orchestrator among these pieces, invoking the right element at the right time and imposing checks and balances to ensure high reliability and control cost. This is in fact the current practice in the enterprise: LLMs are never used as standalone solutions but are embedded in a more involved system.
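
As a rough illustration, and emphatically not any particular product’s design, here is a minimal sketch of such an orchestrator: retrieval grounds the question, the model drafts, a tool does the arithmetic, a verifier checks the output, and a budget caps the cost. All component names and prices are hypothetical stubs.

```python
# Sketch of an AI system as orchestrator: the LLM is one component among several, wrapped
# in retrieval, tools, verification, and a cost ceiling. Every component below is a stub;
# names and prices are illustrative, not any vendor's API.
from dataclasses import dataclass

@dataclass
class Budget:
    max_cost_usd: float
    spent: float = 0.0
    def charge(self, cost: float) -> None:
        self.spent += cost
        if self.spent > self.max_cost_usd:
            raise RuntimeError("cost ceiling reached; return a partial result or escalate")

def retrieve(query: str) -> list[str]:        # external information source (RAG)
    return ["ACME 10-Q excerpt: revenue of $1.2B, up 8% year over year."]

def calculator(expr: str) -> str:             # deterministic tool
    return str(eval(expr, {"__builtins__": {}}, {}))

def llm(prompt: str) -> dict:                 # statistical model
    return {"text": "Projected revenue next year: $1.2B * 1.08 =", "needs_tool": True}

def verify(answer: str, sources: list[str]) -> bool:   # check and balance
    return bool(sources) and "revenue" in answer        # e.g. answer must trace to a source

def orchestrate(question: str, budget: Budget) -> str:
    sources = retrieve(question)                               # 1. pull in external data
    budget.charge(0.01)
    draft = llm(f"Sources: {sources}\nQuestion: {question}")   # 2. let the model draft
    budget.charge(0.05)
    if draft["needs_tool"]:                                    # 3. route arithmetic to a tool
        draft["text"] += " " + calculator("1.2 * 1.08")
    if not verify(draft["text"], sources):                     # 4. impose checks and balances
        return "No verified answer within budget."
    return draft["text"]

print(orchestrate("Project ACME's revenue for next year.", Budget(max_cost_usd=0.25)))
```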

Today, these AI systems are essentially crafted manually, with humans doing much of the orchestration. The direction AI is headed now is to have AI itself play a role in crafting and running these systems, with human involvement when appropriate. By the end of this year, the technology will be mature enough for AI systems to play a major role in the enterprise.

This is the missing piece of the puzzle, the one that will enable AI systems to provide the robustness and reliability the enterprise needs. In 2025 we will get AI that’s smarter and capable of reliably solving complex, real-world problems at scale. Enterprises will gain the confidence to deploy these systems widely, unlocking immense value. With this, we will enter the third stage of the modern AI revolution in the enterprise: from sporadic experimentation, to mass experimentation but sporadic deployment, to, at last, mass deployment.

The full article is available on Fortune.