STEER-ME: Assessing LLMs in Information Economics

· 15 min read · workshop

Note: this blog post deviates slightly from our Workshop paper.

Introduction

Recently, Large Language Models (LLMs) have increasingly been deployed as decision-making engines, either acting directly as economic agents \citep{cai2023large, homo_silicus, voyager} or serving as essential components within broader systems designed for economic decision-making \citep{zhuge2023mindstorms, wang2023unleashing, huggingGPT}. While these systems constitute promising demonstrations of LLM capabilities, they have also exposed significant brittleness in LLM performance: models that succeed in one scenario often fail unpredictably in closely related contexts, suggesting at least some reliance on superficial pattern matching rather than robust economic reasoning \citep[e.g.,][]{hendrycks2021measuringmassivemultitasklanguage, ribeiro-etal-2020-beyond}. Despite this brittleness, most existing economic benchmarks evaluate narrow applications and do not rigorously assess the foundational strategic and computational reasoning skills necessary for reliable economic decision-making. We argue that before LLMs can be meaningfully evaluated or deployed in information economics, their capacity for this foundational reasoning must be systematically assessed.

To address this need, we developed the STEER benchmark \citep{ramansteer}, providing a comprehensive assessment of strategic reasoning foundational to economics. STEER was constructed by taxonomizing distinct "elements of economic rationality," ultimately comprising 64 strategic reasoning elements, including core concepts from game theory and foundational decision theory. Leveraging state-of-the-art LLMs, we systematically generated diverse questions across multiple domains (e.g., finance, medicine, public policy) and varying difficulty levels, creating an extensive and continually expandable dataset for benchmarking LLM economic reasoning. Building upon this methodology, we recently expanded the scope of our benchmark with STEER-ME \citep{ramansteerme}, introducing 58 computational microeconomic reasoning elements—such as competitive market analysis, optimal consumption, and utility maximization. STEER-ME is significantly more challenging than STEER, not only because it demands precise mathematical computation but also because the evaluated concepts frequently require careful sequential reasoning, which directly underpins many critical scenarios in information economics.

Information Economics in STEER-ME

In particular, STEER-ME evaluates elements essential for decision-making under uncertainty, such as correctly computing expected utility, managing state-contingent consumption, and evaluating the prices and risks inherent in uncertain economic environments. It further includes elements explicitly testing models' abilities to systematically update beliefs in response to new information, such as precisely applying Bayes' rule and optimally adapting decisions based on revised probabilities. These reasoning capabilities are foundational building blocks for canonical information economics scenarios—including costly information acquisition, adverse selection, Bayesian persuasion, and rational information disclosure—where economic agents must balance the expected value of information with its cost and dynamically update their beliefs to make rational decisions.
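For reference, the belief-updating step that these elements test is the standard Bayes' rule: given a prior $\Pr(\theta)$ over states and the likelihood $\Pr(s \mid \theta)$ of observing a signal $s$, the posterior is

$$\Pr(\theta \mid s) \;=\; \frac{\Pr(s \mid \theta)\,\Pr(\theta)}{\sum_{\theta'} \Pr(s \mid \theta')\,\Pr(\theta')},$$

and a rational agent then re-optimizes its decision against this posterior rather than the prior.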

A natural way to see these building blocks working in concert is the element we call Dynamic Profit Maximization. A firm is tasked with maximizing profit across two periods. It starts with some capital $K_1$, a known output function $\mathcal{O}$, and a cost of growing capital described by $\mathcal{Cost}$, and tomorrow it faces a price $p_2$ that may be deterministic, uncertain, or inspectable at cost $c$. Below, we show the three different objective functions that arise under each "flavor."
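In schematic form (a sketch: the benchmark's instances fix $\mathcal{O}$, $\mathcal{Cost}$, and the price distribution concretely, and terms that do not depend on $K_2$, such as today's revenue, are omitted here), the three objectives are:

Deterministic price:
$$\max_{K_2}\; p_2\,\mathcal{O}(K_2) - \mathcal{Cost}(K_2 - K_1)$$

Uncertain price $p_2 \sim F$:
$$\max_{K_2}\; \mathbb{E}_{p_2 \sim F}\!\left[p_2\,\mathcal{O}(K_2)\right] - \mathcal{Cost}(K_2 - K_1)$$

Costly inspection at fee $c$:
$$\max_{I \in \{0,1\},\, K_2,\, \phi}\; (1 - I)\Big(\mathbb{E}_{p_2 \sim F}\!\left[p_2\,\mathcal{O}(K_2)\right] - \mathcal{Cost}(K_2 - K_1)\Big) + I\Big(\mathbb{E}_{p_2 \sim F}\!\left[p_2\,\mathcal{O}(\phi(p_2)) - \mathcal{Cost}(\phi(p_2) - K_1)\right] - c\Big)$$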

The deterministic flavor is the simplest: the agent must choose a capital level $K_2$ to maximize profit, since tomorrow's price is known in advance. The uncertain flavor adds uncertainty to tomorrow's price, requiring the agent to form an expectation over the distribution of $p_2$. Finally, given uncertainty over $p_2$, the costly-inspection flavor allows the agent to pay $c$ to inspect tomorrow's price before making a decision. That is, the agent chooses (i) whether to pay the inspection fee, captured by an indicator $I$, and (ii) a decision rule $\phi$ that maps the realized price into tomorrow's capital. See below for examples of each flavor as an instantiated question.
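To make the pay-or-skip comparison concrete, here is a minimal Python sketch of the inspection decision. The functional forms ($\mathcal{O}(K) = \sqrt{K}$, a quadratic expansion cost) and the two-point price distribution are illustrative assumptions, not the benchmark's actual instances.

```python
def output(K):           # output function O(K): diminishing returns (illustrative)
    return K ** 0.5

def cost(delta):         # cost of growing capital by delta (illustrative; no refund for shrinking)
    return 0.0 if delta <= 0 else 0.5 * delta ** 2

def profit(p2, K1, K2):  # tomorrow's profit for a given price and capital choice
    return p2 * output(K2) - cost(K2 - K1)

def best_K2(p2, K1, grid):
    # Brute-force the profit-maximizing capital choice over a grid of candidates.
    return max(grid, key=lambda K2: profit(p2, K1, K2))

def value_of_inspection(prices, probs, K1, c, grid):
    # Skip: commit to a single K2. Because profit is linear in p2, optimizing
    # against the mean price maximizes expected profit.
    mean_price = sum(p * q for p, q in zip(prices, probs))
    skip_value = profit(mean_price, K1, best_K2(mean_price, K1, grid))
    # Pay: observe p2 first, then apply the decision rule phi(p2) = best_K2(p2).
    pay_value = sum(q * profit(p, K1, best_K2(p, K1, grid))
                    for p, q in zip(prices, probs)) - c
    return pay_value - skip_value   # inspect iff this net gain is positive

grid = [k / 10 for k in range(101)]   # candidate K2 values in [0, 10]
gain = value_of_inspection(prices=[1.0, 3.0], probs=[0.5, 0.5],
                           K1=1.0, c=0.2, grid=grid)
print("inspect" if gain > 0 else "skip", f"(net gain of inspecting = {gain:.3f})")
```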

Methods

Models We Evaluated

We evaluated four state-of-the-art models: Anthropic's Claude 3.5 Sonnet and Claude 3.7 Sonnet, and OpenAI's GPT-4o and o3. While Claude 3.5 Sonnet and GPT-4o are standard language models, Claude 3.7 Sonnet and o3 are reasoning models: they are fine-tuned to conduct more effective chain-of-thought reasoning, and o3 can additionally interleave code execution during its reasoning. For the standard LLMs we decoded with temperature 0, which is generally considered optimal for high-fidelity reasoning tasks. For the reasoning models we sampled with temperature 0.6, as recommended by each provider.

Evaluation Procedure

To standardize the evaluation, we presented each model with the same set of 500 questions per flavor and allowed each model to reason before coming to an answer. Each prompt lists the scenario and four labelled options; the model's task is to output the letter of its chosen option. We then scored each response for accuracy, counting an answer correct only when the returned letter matched the unique correct choice.
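A minimal sketch of this scoring rule is below; the exact answer-extraction logic in our evaluation harness may differ.

```python
import re

def accuracy(responses, answer_key):
    """Fraction of responses whose chosen letter matches the unique correct option."""
    correct = 0
    for response, answer in zip(responses, answer_key):
        # Pull out a standalone option letter (A-D) from the model's answer.
        match = re.search(r"\b([ABCD])\b", response.upper())
        if match and match.group(1) == answer:
            correct += 1
    return correct / len(answer_key)

# e.g. accuracy(["The answer is B.", "C", "(A)"], ["B", "C", "D"]) -> 2/3
```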

Results

Figure 1: Accuracy by flavor

As can be seen in Figure 1, o3 was nearly flawless in the baseline deterministic-price flavor and lost only three to five percentage points as uncertainty and information acquisition were layered on. This robustness is unsurprising: o3 is the only model in the cohort with native code execution, so once it forms the correct objective it can offload the calculus to its built-in Python sandbox, turning a conceptually challenging but mechanically straightforward optimization into a trivial call to a solver. Claude 3.7 Sonnet Thinking ranked a clear second. Although it lacks code execution, its fine-tuned reasoning gave it noticeably higher accuracy than both GPT-4o and Claude 3.5 Sonnet. However, it also showed the largest drop from the deterministic to the uncertain-price flavor of all the evaluated models, suggesting that Claude 3.7 Sonnet's performance gains are vulnerable to uncertainty in the information set. GPT-4o's accuracy mostly clustered in the low 40% range and exhibited little systematic difference between flavors. This flat profile implies that the step from deterministic optimization to value-of-information reasoning made little difference because the model was already near random guessing on the underlying calculus. Surprisingly, Claude 3.5 Sonnet's performance increased on the costly-inspection flavor.

Decomposing Costly Inspection Accuracy

This does not necessarily mean that Claude 3.5 Sonnet was particularly good at value-of-information reasoning, however. Because the costly-inspection question contains two "pay" menus and two "skip" menus, the evaluation can—and should—distinguish between two very different cognitive hurdles. First, a model must decide whether paying the fee is worthwhile. This is the harder piece of reasoning: it requires forming an expectation over tomorrow's prices, computing the marginal value of information, and comparing that value to the fee. Only after the correct branch is chosen does the model face the comparatively mechanical task of selecting the correct option for that branch. To tease these skills apart we re-graded every response in two layers: (1) Strategy accuracy: did the model pick the correct branch (pay vs. skip)? (2) Conditional-$K_2$ accuracy: given that it picked the correct branch, did the model also pick the menu whose $K_2$ values satisfy that branch's conditions?
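A sketch of this two-layer re-grading, assuming each graded record stores the model's chosen branch and menu alongside the correct ones (the field names are illustrative):

```python
def decompose(records):
    """Two-layer grading for the costly-inspection flavor.

    Each record is a dict with (illustrative) fields:
      "chosen_branch", "correct_branch": "pay" or "skip"
      "chosen_menu",   "correct_menu":   the option letter within that branch
    """
    on_correct_branch = [r for r in records
                         if r["chosen_branch"] == r["correct_branch"]]
    strategy_acc = len(on_correct_branch) / len(records)
    # Conditional-K2 accuracy: among responses that chose the right branch,
    # how often was the menu satisfying that branch's conditions selected?
    k2_acc = (sum(r["chosen_menu"] == r["correct_menu"] for r in on_correct_branch)
              / len(on_correct_branch)) if on_correct_branch else float("nan")
    return strategy_acc, k2_acc
```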

| Model | Strategy accuracy | Conditional-$K_2$ accuracy |
| --- | --- | --- |
| o3 | 97% | 98% |
| Claude 3.7 Sonnet Thinking | 70% | 82% |
| Claude 3.5 Sonnet | 54% | 80% |
| GPT-4o | 51% | 83% |
Table 1. Decomposing accuracy on the costly-inspection flavor of Dynamic Profit Maximization. A random guesser achieves 50% strategy accuracy and 25% overall accuracy (two branches, each with two menus, for four options in total).

Table 1 shows that o3's near-perfect score came from both layers: it almost always chose the correct branch and almost never mis-computed $K_2$. Claude 3.7 Sonnet Thinking got the branch decision right roughly seven times out of ten and, when it did, chose the correct capital level 82% of the time. It trailed o3 but clearly outperformed the non-reasoning models, suggesting that fine-tuning for chain-of-thought improved both its value-of-information heuristic and its raw optimization arithmetic despite the lack of tool use. Claude 3.5 Sonnet and GPT-4o hovered between 51.26% and 54.10% on strategy accuracy—essentially indistinguishable from random guessing—yet still achieved conditional accuracy in the low 80% range. This means they could solve the calculus when the branch was handed to them, yet lacked a reliable rule for judging when information is worth its cost.

Conclusions

Our evaluations provide concrete evidence that weak proficiency on seemingly "low-level" elements—expectation formation, Bayes updates—manifests downstream as brittle, surface-level behavior on more elaborate economic tasks. Even competitive models such as GPT-4o and Claude 3.5 Sonnet were no better than chance at judging whether paying for information was worthwhile. In contrast, o3's native code execution combined with fine-tuned reasoning delivered near-perfect performance across the deterministic, uncertain, and costly-inspection variants, illustrating how tool integration can compensate for gaps in symbolic reasoning.

These findings underscore three broader points. First, information-economics problems expose a dimension of reasoning—valuing information and acting on contingent states—that is not well tested by classic optimization benchmarks. Second, decomposing accuracy is essential for diagnosing where models actually fail; headline scores alone can be misleading when the multiple-choice structure embeds partial credit. Third, method matters: models that can offload algebra to external solvers enjoy a large advantage, suggesting that benchmark design should explicitly distinguish conceptual errors from computational ones.
