Abstract

The remarkable capabilities of large language models (LLMs) have prompted interest in LLM-based support for the case-based reasoning (CBR) process, including LLM implementations of each step of the CBR cycle. However, CBR research showing the tight coupling of retrieval and adaptation suggests potential benefits of a more unified approach. This paper presents an LLM-based CBR architecture and uses it to compare the performance of explicitly guiding the model to perform adaptation versus having adaptation arise implicitly within LLM processing. It tests four architectures varying along two dimensions, retrieval mechanism (explicit multi-stage, with separate LLM calls for retrieval sub-tasks, vs. unified, with a single LLM call for all retrieval reasoning) and adaptation mechanism (explicit multi-stage, with prompting steps guiding adaptation, vs. implicit, with a single LLM call), across six LLMs from three families, on the GSM8K and MATH data sets. Performance of the case-based unified retrieval architecture, which presents the retrieved cases and the problem in a single context window, matches published few-shot chain-of-thought benchmarks without requiring curated few-shot examples. In contrast, explicit multi-stage adaptation consistently degrades accuracy. A failure analysis identifies fourteen distinct ways in which multi-stage adaptation may destroy information. Our results suggest the simple adaptation hypothesis: In LLM-based CBR systems for such domains, adaptation should be implicitly performed by the LLM rather than by explicit multi-stage prompting.

Introduction

Large Language Models (LLMs) provide remarkable capabilities over a broad range of tasks but remain unreliable at structured, multi-step reasoning, with failure modes including hallucinated intermediate steps and inconsistent rule application (Huang and Chang 2023; Wei et al. 2022). CBR, by bringing to bear specific prior examples, offers a natural complement to the generalization of LLMs, which has prompted much interest in how CBR can support LLMs, and vice versa (Bach et al. 2025). One such area is LLM-based implementations of the entire CBR process (Wilkerson and Leake 2024).

A natural approach to such implementations is to apply LLM-based reasoning to successive steps of the CBR process. However, previous CBR research presents evidence for tight coupling between steps of the CBR process (e.g., Leake and Ye 2021; Smyth and Keane 1998). This raises the question of whether LLM-based CBR should implement phases of the CBR process through explicit guidance for each step or in a more unified way, as an effect of prompting the LLM to apply a case to a new problem.

This paper presents a general LLM-based architecture for CBR and applies it to four testbed systems for reasoning in a mathematical problem-solving domain. These systems vary along two dimensions. The first is the retrieval mechanism: an explicit retrieval pipeline, in which separate LLM calls handle similarity assessment, case ranking, and new case construction from retrieved documents, versus a unified retrieval approach, in which a single LLM call performs all retrieval reasoning. The second is the adaptation mechanism: explicit adaptation, in which dedicated LLM stages generate a reasoning chain that applies the retrieved case’s solution approach to the new problem and perform an error analysis that checks the adapted solution for consistency before producing a final answer, versus implicit adaptation, in which the retrieved case is passed directly to the solver and adaptation arises naturally from the LLM’s internal processing over the combined context.

We evaluate the system variants on two mathematical reasoning datasets across three LLM families. In these tests, unified retrieval with implicit adaptation achieves high accuracy on GSM8K across all models above 10B parameters, while explicit multi-stage adaptation consistently degrades performance. The performance effects of retrieval and adaptation design dominate those of model choice, with unified retrieval resulting in large performance gains on grade-school-level problems but not reliably on competition-level problems, which reveals the limits of surface-similarity retrieval for harder problems.

The paper makes the following contributions:

It presents a domain-agnostic LLM-based CBR architecture implemented with two variants for retrieval (explicit and unified) and adaptation (implicit and explicit), applicable to any case base, available as a community resource.¹ This repository also contains all experiment code.
It presents an evaluation across three model families, at multiple scales, and two benchmarks, providing evidence that architecture design dominates model choice for LLM-based CBR across model families: mid-size and large models converge to approximately 92% accuracy on GSM8K under unified retrieval, comparable to published 8-shot chain-of-thought results without curated examples.
It develops a failure taxonomy for explicit case adaptation, identifying fourteen distinct failure types across approximately 6,000 failure cases. The types suggest that the issue in the multi-step process is post-retrieval processing, not retrieval itself, and that explicit adaptation can degrade quality as each adaptation stage compounds errors by treating its predecessor’s output as authoritative without cross-referencing the original problem.

Reasoning in Large Language Models

Much recent work on LLM reasoning traces back to chain-of-thought (CoT) prompting (Wei et al. 2022), which showed that asking a model to produce intermediate reasoning steps can lead to large performance gains on arithmetic and commonsense tasks. Recent work includes self-consistency decoding, which samples multiple reasoning paths and takes a majority vote (Wang et al. 2023), tree-of-thought prompting, which explores branching solution paths (Yao et al. 2023), and least-to-most prompting, which decomposes hard problems into simpler sub-problems (Zhou et al. 2023). However, models can still produce chains that contain logical errors or struggle with problems that require combining familiar reasoning steps in new ways (Dziri et al. 2023). In our approach, rather than building reasoning chains from scratch, the reasoning steps in retrieved solved cases serve as implicit CoT examples, letting the model ground its reasoning in demonstrated solutions.

Retrieval-Augmented Generation:

RAG (Lewis et al. 2020), which retrieves related facts to augment queries, is a standard technique for grounding LLM outputs in external knowledge. RAG has demonstrated effectiveness in multiple domains. However, even strong embedding models struggle to retrieve structurally similar problems, suggesting that standard embeddings match on surface features rather than reasoning structure (Su et al. 2025). Recent work on solution-guided retrieval (Chen et al. 2025; Das, O’Nuallain, and Rahimi 2025; Shao et al. 2025) trains retrievers to match on solution structure; integrating such retrievers into CBR pipelines is a promising direction we discuss in Section 7.

Integrating Case-Based Reasoning with LLMs:

CBR-LLM integrations have recently gained traction; a community paper (Bach et al. 2025) lays out a research agenda and argues that the relationship is symbiotic: LLMs can help CBR, for example by automating case acquisition, learning similarity measures, and proposing adaptation rules or directly performing case adaptation, while CBR can help LLMs, for example, by structuring retrieval, planning reasoning chains, and grounding outputs in concrete precedents. Initial empirical research has tested LLM capabilities for CBR sub-tasks such as similarity assessment (Lenz, Hoffmann, and Bergmann 2025) and fuller CBR processing (Wilkerson and Leake 2024). That work found accuracy improvements but also the need for careful prompt design and risk of case hallucinations.

CBR-RAG (Wiratunga et al. 2024) applies CBR retrieval to structure Retrieval-Augmented Generation, for legal question answering. It uses indexing vocabulary and similarity knowledge to select contextually relevant cases, and is shown to provide better answer quality than standard RAG. This work suggests the promise of CBR in imposing a useful structure on what would otherwise be a generic retrieval step. The work in this paper applies a similar perspective to mathematical reasoning.

An Architecture for LLM-Based CBR and Variants

We have designed, implemented, and evaluated a general architecture for CBR-augmented reasoning using LLMs. The design is domain-agnostic: It can be applied to any domain with a case base of solved problems simply by providing it to the system.

Based on this architecture, we developed four testbeds varying along two independent dimensions: the retrieval mechanism (how cases are found and validated) and the adaptation mechanism (how retrieved cases are processed before answer generation). Crossing these two dimensions yields four configurations, which we compared to a zero-shot baseline with no retrieval. Table 1 summarizes the five configurations. No model fine-tuning is performed, and all reasoning is done through zero-shot prompting.

We consider a case to be a structured representation that contains the problem statement, a step-by-step reasoning chain, and the solution. The case base provided to the system contains this information in textual form. During retrieval, a structured case is generated with solution strategies and supporting facts from the retrieved documents, and is then passed to the adaptation process to generate the reasoning chain and the solution. Both retrieval variants construct these structured cases, either through a sequence of prompts (explicit, multi-stage) or with a single prompt (unified). This structured case allows us to transfer a unified solution strategy from the retrieved cases to the adaptation mechanism, letting that phase focus solely on problem-solving from the retrieved information.
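One way to picture the case representation described above is as a pair of small record types. The sketch below is illustrative only: the class and field names (`SourceCase`, `StructuredCase`, `to_prompt`) are assumptions chosen to mirror the prose, not identifiers from the released code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SourceCase:
    """A solved problem as stored (textually) in the case base."""
    problem: str
    reasoning_chain: str
    solution: str


@dataclass
class StructuredCase:
    """Hypothetical container for the case built during retrieval."""
    query: str
    solution_strategies: List[str] = field(default_factory=list)
    supporting_facts: List[str] = field(default_factory=list)
    sources: List[SourceCase] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the case as plain text for a downstream solver prompt."""
        lines = [f"Problem: {self.query}", "Strategies:"]
        lines += [f"- {strategy}" for strategy in self.solution_strategies]
        lines.append("Supporting facts:")
        lines += [f"- {fact}" for fact in self.supporting_facts]
        return "\n".join(lines)
```

Keeping strategies and supporting facts as separate fields reflects the description that retrieval hands a single consolidated strategy to the adaptation phase.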

Figure 1: System architecture, varying along two independent dimensions: retrieval mechanism (Multi-Node vs. Unified) and adaptation type (Retrieval-Only vs. Full-Architecture).

Table 1: System configurations. MN = Multi-Node, UR = Unified Retrieval, RO = Retrieval-Only, FA = Full-Architecture (explicit adaptation).
Config	Retrieval	Adaptation	LLM Calls	Description
LLM-BL	None	None	1	Direct prompting baseline
MN-RO	Multi-node	Minimal	6–9+	Multi-stage retrieval, case → solver
MN-FA	Multi-node	Explicit 3-stage	9–12+	Multi-stage retrieval + explicit adaptation
UR-RO	Unified	Minimal	2	Single retrieval call, case → solver
UR-FA	Unified	Explicit 3-stage	4	Single retrieval call + explicit adaptation

Retrieval Variants

The retrieval mechanism has two variants, multi-node and unified.

Multi-Node (MN) Pipeline: The explicit multi-stage retrieval pipeline consists of the following stages:

Front-end: Extracts keywords and structures the query for retrieval.
Retrieve: Performs vector similarity search over the case base.
Grade documents: An LLM call judges each retrieved document for relevance, filtering out cases that match on surface features but not reasoning structure.
Transform query (loop, max 3 iterations): If grading rejects all documents, the query is reformulated and retrieval is retried.
Generate case: Structures the approved documents into a case representation containing relevance summaries, problem breakdowns, supporting facts, and solution approaches.
Generate answer: Adaptation module produces the final answer using the structured case.

This pipeline requires 6–9+ LLM calls per question, depending on the number of retrieval retries. When all retrieval attempts fail, a fallback mechanism routes directly to LLM answer generation without case context.
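The grade-and-retry portion of the multi-node pipeline can be sketched as a small control loop. This is a minimal sketch under stated assumptions: `retrieve`, `grade`, and `transform` are hypothetical callables standing in for the vector search and the LLM calls, and "max 3 iterations" is read here as up to three query reformulations after the initial retrieval.

```python
from typing import Callable, List, Optional

MAX_TRANSFORMS = 3  # assumed reading of "max 3 iterations"


def multi_node_retrieve(
    question: str,
    retrieve: Callable[[str], List[str]],
    grade: Callable[[str, str], bool],
    transform: Callable[[str], str],
) -> Optional[List[str]]:
    """Return approved documents, or None to signal the no-case fallback."""
    query = question
    for attempt in range(MAX_TRANSFORMS + 1):
        docs = retrieve(query)
        # Each document is judged against the original question.
        approved = [d for d in docs if grade(question, d)]
        if approved:
            return approved
        if attempt < MAX_TRANSFORMS:
            query = transform(query)  # reformulate and retry
    return None
```

A `None` result corresponds to the fallback described above, in which the question is routed directly to LLM answer generation without case context.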

Unified Retrieval (UR) Pipeline: The unified retrieval pipeline collapses the multi-node pipeline into a single LLM call. After vector retrieval returns a set of K candidate documents, one LLM call performs document grading, relevance assessment, hallucination checking, fact extraction, and case structuring within a single context window. The output is a structured case representation identical to that produced by the multi-node pipeline, but generated in a single pass.

A confidence-based retry mechanism (max 2 retries) generates alternatives if the model is uncertain about the quality of the generated case. The LLM is prompted to rate how well the constructed case can guide a solution to the current problem, producing a confidence score in [0, 1]. If confidence is below a set threshold, retrieval and case generation are repeated.
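The confidence-based retry can be sketched as follows. The function and parameter names are hypothetical, `build_case` stands in for the combined retrieval and case-generation step, and keeping the highest-confidence case when every attempt falls below the threshold is an assumption; the text does not say which case is used in that situation.

```python
from typing import Callable, Tuple

MAX_RETRIES = 2       # "max 2 retries", as stated above
CONF_THRESHOLD = 0.3  # value reported in the experimental setup


def build_case_with_retry(
    question: str,
    build_case: Callable[[str], Tuple[str, float]],
    threshold: float = CONF_THRESHOLD,
    max_retries: int = MAX_RETRIES,
) -> Tuple[str, float]:
    """Repeat case construction while confidence stays below threshold.

    Returns the highest-confidence (case, confidence) pair seen.
    """
    best_case, best_conf = build_case(question)
    retries = 0
    while best_conf < threshold and retries < max_retries:
        case, conf = build_case(question)
        retries += 1
        if conf > best_conf:
            best_case, best_conf = case, conf
    return best_case, best_conf
```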

The multi-node and unified approaches reflect a potential trade-off: isolated calls could in principle allow each stage to specialize and validate intermediate outputs, while a single integrated context allows the LLM to act on all information at the hidden-representation level. Our experiments test which of these benefits dominates in practice.

Adaptation Variants

Each retrieval mechanism is paired with either of two adaptation strategies:

Retrieval-Only (RO): The structured case from retrieval is passed directly to a final answer generator. The LLM receives the original question plus the case and produces an answer. This represents implicit adaptation, in which the LLM is prompted to adapt retrieved patterns through its attention mechanism.

Full-Architecture (FA): After case generation, three explicit adaptation stages process the case before answer generation:

Pre-adapter 1: Analyzes the case to identify relevant solution patterns, analogies, and potential pitfalls.
Pre-adapter 2: Refines the identified patterns and generates specific transformation guidelines for the current problem.
Solution generator: Produces a step-by-step solution using the adapted case information.
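The structural difference between the two adaptation variants can be sketched as one call versus three chained calls. The `LLM` type alias and all prompt wording below are hypothetical placeholders, not the prompts used in the experiments; the point is only that each FA stage consumes the text produced by the stage before it.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def solve_retrieval_only(llm: LLM, question: str, case: str) -> str:
    """RO: a single call sees the question and the case together."""
    return llm(f"Case:\n{case}\n\nQuestion:\n{question}\n\nAnswer:")


def solve_full_architecture(llm: LLM, question: str, case: str) -> str:
    """FA: three chained calls, each building on its predecessor's output."""
    patterns = llm(
        f"Identify solution patterns, analogies, and pitfalls.\n"
        f"Case:\n{case}\nQuestion:\n{question}"
    )
    guidelines = llm(
        f"Refine into transformation guidelines.\n"
        f"Patterns:\n{patterns}\nQuestion:\n{question}"
    )
    return llm(
        f"Solve step by step.\nGuidelines:\n{guidelines}\nQuestion:\n{question}"
    )
```

In this sketch the retrieved case itself reaches only the first FA stage; later stages see it only through intermediate summaries, which is one place where detail can be dropped.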

Experimental Questions and Design

Our experiments address four questions:

Implicit vs. explicit adaptation: When retrieval is held constant, does implicit or explicit adaptation produce better results?
How can adaptation fail? What types of issues cause either type of adaptation to break down?
Architecture vs. model scale: Does architecture choice or model choice matter more for these systems?
Is retrieval the problem? Do adaptation failures stem more from poor retrieval quality, or from the adaptation process itself?

Experimental Setup:

We set the number K of documents retrieved initially to 4 and the confidence threshold for the unified retriever (UR) variant to 0.3.
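For concreteness, initial retrieval of the K = 4 nearest cases can be sketched with plain cosine similarity. Cosine similarity is an assumption here (the text says only "vector similarity search"), and the function names are illustrative; a real system would typically delegate this to a vector store.

```python
import math
from typing import List, Sequence, Tuple

K = 4  # number of documents retrieved initially, as stated above


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    case_vecs: List[Sequence[float]],
    k: int = K,
) -> List[Tuple[int, float]]:
    """Return (case index, similarity) for the k most similar cases."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(case_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```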

We evaluate six instruction-tuned LLMs from three model families, spanning 7B to 32B parameters (Table 2). No fine-tuning is performed; all models are used with zero-shot prompting.

Table 2: Models evaluated. All are instruction-tuned variants used without fine-tuning.
Family	Model	Parameters
Qwen 2.5	Qwen2.5-7B-Instruct	7B
Qwen 2.5	Qwen2.5-14B-Instruct	14B
Qwen 2.5	Qwen2.5-32B-Instruct	32B
Gemma 3	gemma-3-12b-it	12B
Gemma 3	gemma-3-27b-it	27B
Llama 3.1	Llama-3.1-8B-Instruct	8B

We evaluate on two mathematical reasoning benchmarks. GSM8K (Cobbe et al. 2021) contains 7,473 training and 1,319 test problems requiring grade-school level mathematical reasoning (arithmetic, fractions, word problems). Answers are integers, enabling exact-match evaluation. MATH (Hendrycks et al. 2021) contains 7,500 training and 5,000 test problems spanning seven subjects (algebra, geometry, number theory, counting and probability, intermediate algebra, precalculus, prealgebra) at competition level. Answers are LaTeX expressions (fractions, radicals, symbolic expressions), requiring symbolic equivalence checking.

Our baseline (LLM-BL) uses zero-shot direct prompting, which differs from published benchmarks that typically use 8-shot chain-of-thought (Wei et al. 2022). This makes our baselines deliberately weaker, to isolate the effect of retrieval augmentation. Published GSM8K results for Qwen 14B and 32B are 94.8% and 95.9%, respectively, under 8-shot CoT; our zero-shot baselines for the same models are 38.1% and 37.2%. Under unified retrieval, these same models recover to approximately 92% accuracy, nearly matching 8-shot CoT. Our experiments use the train split as the case base and automatically retrieve relevant cases for each test example. This suggests that much of the benefit of 8-shot CoT can be recovered by retrieving examples from a broad in-domain case base, without needing manually curated few-shot examples.

Evaluation Metrics:

GSM8K solution quality is evaluated by exact numeric match after extracting the final numerical answer from model output. MATH solution quality is evaluated by a three-tier comparison. The primary strategy parses the model answer and ground truth as LaTeX expressions and checks whether their symbolic difference simplifies to zero; when that fails, it falls back to numeric evaluation within tolerance. Its final fallback is to compare normalized string representations. All configurations are evaluated on the full test sets.
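The two scoring procedures can be sketched as follows. This is a simplified illustration, not the evaluation code: the regular expression and normalization are assumptions, and the symbolic tier is represented by an injected `symbolic_equal` callable (hypothetical) rather than an actual LaTeX parser, so that the fallback order is visible without extra dependencies.

```python
import re
from typing import Callable, Optional

_NUM = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number in the text, ignoring thousands commas."""
    matches = _NUM.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def gsm8k_match(output: str, gold: str) -> bool:
    """Exact numeric match on the final number of each string."""
    pred, ref = extract_final_number(output), extract_final_number(gold)
    return pred is not None and ref is not None and pred == ref


def _normalize(s: str) -> str:
    return re.sub(r"\s+", "", s).strip("$").lower()


def math_match(
    pred: str,
    gold: str,
    symbolic_equal: Optional[Callable[[str, str], bool]] = None,
    tol: float = 1e-6,
) -> bool:
    """Three tiers: symbolic check, numeric within tol, normalized string."""
    if symbolic_equal is not None:
        try:
            if symbolic_equal(pred, gold):
                return True
        except Exception:
            pass  # parse failure: fall through to the next tier
    try:
        if abs(float(pred) - float(gold)) <= tol:
            return True
    except ValueError:
        pass  # not plain numbers: fall through to string comparison
    return _normalize(pred) == _normalize(gold)
```

Taking the last number in the output is a common heuristic for GSM8K-style answers, but it can misfire when the model appends extra text after its answer.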

Retrieval Quality Tests:

Because the performance of later steps depends on upstream performance, we test retrieval quality before attributing downstream performance differences to architecture design. Using an LLM-based document grader (the same method used in our multi-stage pipeline), we evaluate retrieval quality across both datasets. Dense retrieval with Nomic Embed v1.5 achieves a 78.5% document approval rate on GSM8K (97.6% coverage, which is the proportion of queries with at least one relevant retrieved case) and 71.4% on MATH (93.9% coverage). These rates confirm that the retrieval pipeline provides relevant cases for the vast majority of queries. We analyze retrieval quality and its implications further in Section 6.4.
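The two retrieval metrics follow directly from per-document grader judgments. In the sketch below, the input layout (a mapping from query id to a list of booleans) and the function name are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def retrieval_quality(grades: Dict[str, List[bool]]) -> Tuple[float, float]:
    """Compute (approval_rate, coverage) as fractions in [0, 1].

    grades maps a query id to the grader's relevance judgment for each
    retrieved document. Approval rate is over documents; coverage is the
    share of queries with at least one approved document.
    """
    total_docs = sum(len(g) for g in grades.values())
    approved = sum(sum(g) for g in grades.values())
    covered = sum(1 for g in grades.values() if any(g))
    approval_rate = approved / total_docs if total_docs else 0.0
    coverage = covered / len(grades) if grades else 0.0
    return approval_rate, coverage
```

Because coverage only requires one approved document per query, it can sit well above the approval rate, as it does in the figures reported above.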

Experimental Results

GSM8K Results

Table 3 presents accuracy across all 30 model–configuration combinations on GSM8K. Several patterns emerge:

Table 3: GSM8K accuracy (%) across models and configurations. Best in **bold**. †Llama 8B results are affected by systematic formatting errors.
Model	LLM-BL	MN-RO	MN-FA	UR-RO	UR-FA
Qwen 7B	80.21	66.57	22.29	80.06	23.88
Qwen 14B	38.13	73.92	28.89	92.04	27.90
Qwen 32B	37.23	47.08	14.56	91.89	15.62
Gemma 12B	22.37	85.75	22.67	91.28	37.76
Gemma 27B	30.33	90.45	47.16	92.95	58.15
Llama 8B†	42.76	29.80	14.48	33.81	18.50

Unified retrieval dominates for 12B+ models. UR-RO achieves 91–93% for Qwen 14B/32B and Gemma 12B/27B, a remarkably tight range given that these models’ baselines span 22–38%. In these tests, architecture design eliminates the 16-percentage-point gap between model families.

Multi-node retrieval helps, but inconsistently. MN-RO produces large gains for Gemma (85–90%) but mixed results for Qwen (47–74%). The multi-node pipeline’s 6–9 sequential LLM calls create opportunities for error propagation that the unified approach avoids.

Explicit adaptation is consistently harmful. Both MN-FA and UR-FA degrade accuracy relative to their retrieval-only counterparts, with degradations ranging from 33 to 76 percentage points for the Qwen and Gemma models (Llama 8B, whose results are affected by formatting errors, drops by about 15 points). The effect is particularly stark for UR-FA: starting from the same high-quality retrieval that produces roughly 92% accuracy under UR-RO for the 12B+ models, adding three adaptation stages drops accuracy to 15–28% for Qwen and 38–58% for Gemma. This is the central finding of our work.
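The size of the degradation can be recomputed directly from the GSM8K results table. The short script below copies the accuracy values from that table; the helper name `degradation` is illustrative.

```python
# Accuracy values (percent) copied from the GSM8K results table.
GSM8K = {
    "Qwen 7B":   {"MN-RO": 66.57, "MN-FA": 22.29, "UR-RO": 80.06, "UR-FA": 23.88},
    "Qwen 14B":  {"MN-RO": 73.92, "MN-FA": 28.89, "UR-RO": 92.04, "UR-FA": 27.90},
    "Qwen 32B":  {"MN-RO": 47.08, "MN-FA": 14.56, "UR-RO": 91.89, "UR-FA": 15.62},
    "Gemma 12B": {"MN-RO": 85.75, "MN-FA": 22.67, "UR-RO": 91.28, "UR-FA": 37.76},
    "Gemma 27B": {"MN-RO": 90.45, "MN-FA": 47.16, "UR-RO": 92.95, "UR-FA": 58.15},
    "Llama 8B":  {"MN-RO": 29.80, "MN-FA": 14.48, "UR-RO": 33.81, "UR-FA": 18.50},
}


def degradation(table: dict, model: str, prefix: str) -> float:
    """Percentage-point drop from <prefix>-RO to <prefix>-FA for one model."""
    row = table[model]
    return round(row[f"{prefix}-RO"] - row[f"{prefix}-FA"], 2)


drops = {(m, p): degradation(GSM8K, m, p) for m in GSM8K for p in ("MN", "UR")}
non_llama = [v for (m, _), v in drops.items() if not m.startswith("Llama")]

print("smallest drop (Qwen/Gemma):", min(non_llama))
print("largest drop (Qwen/Gemma):", max(non_llama))
```

Every model and retrieval variant shows a positive drop; the smallest and largest among the Qwen and Gemma models both occur for Qwen 32B (multi-node and unified retrieval, respectively).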

Small models hit a capacity ceiling. Qwen 7B achieves approximately 80% under both the zero-shot baseline (80.21%) and UR-RO (80.06%), so even the best retrieval configuration yields no gain, suggesting that at 7B parameters the model’s intrinsic reasoning capacity is the binding constraint. Llama 8B generated systematic formatting failures that prevent fair evaluation of retrieval benefits.

MATH Results

To test whether these patterns generalize beyond grade-school mathematics, we evaluate on the competition-level problems of the MATH dataset. Table 4 presents accuracy across all five configurations. Results suggest the following observations:

Table 4: MATH accuracy (%) across models and configurations. Best in **bold**.
Model	LLM-BL	MN-RO	MN-FA	UR-RO	UR-FA
Qwen 7B	57.82	18.38	6.90	29.16	11.76
Qwen 14B	24.84	7.90	16.92	12.76	24.86
Qwen 32B	37.55	7.98	24.70	12.22	29.78
Gemma 12B	21.54	75.62	26.40	77.64	34.64
Gemma 27B	36.58	81.14	34.94	82.92	41.74
Llama 8B	5.52	10.10	6.96	10.20	8.70

Unified retrieval still outperforms multi-node. UR-RO matches or exceeds MN-RO for every model: Qwen 7B (+10.8 percentage points), Qwen 14B (+4.9), Qwen 32B (+4.2), Gemma 12B (+2.0), Gemma 27B (+1.8), and Llama 8B (+0.1). The advantage of unified context persists on harder problems.

Retrieval helps Gemma but hurts Qwen. The baseline results reveal a model-family split. Qwen models achieve strong baselines (Qwen 7B: 57.82%, Qwen 32B: 37.55%) that retrieval-augmented configurations cannot match; Qwen 7B’s best retrieval result (UR-RO, 29.16%) is half its baseline. In contrast, Gemma models see large gains from retrieval: Gemma 12B rises from 21.54% to 77.64% under UR-RO (+56 percentage points), and Gemma 27B from 36.58% to 82.92% (+46 points). This contrasts with GSM8K, where retrieval helped all models above 10B.

Explicit adaptation remains harmful. MN-FA and UR-FA consistently underperform their retrieval-only counterparts for Gemma models on MATH. Gemma 27B falls from 82.92% to 41.74% (-41 percentage points) and Gemma 12B from 77.64% to 34.64% (-43 points) when explicit adaptation is added to UR-RO. For Qwen 14B and 32B, UR-FA outperforms UR-RO (by 12 and 18 points, respectively), but the gain does not beat direct prompting: Qwen 32B’s UR-FA result (29.78%) remains below its zero-shot baseline (37.55%), and Qwen 14B’s best retrieval result (UR-FA, 24.86%) only matches its baseline (24.84%).

Cross-Dataset Comparison

Comparing GSM8K and MATH results shows which findings apply across both:

Generalized conclusions. Two conclusions hold across both datasets. Unified retrieval consistently outperforms multi-node retrieval, confirming that consolidated context is preferable to sequential processing regardless of problem difficulty. Explicit adaptation is harmful across both datasets for models where retrieval helps, with the same failure mechanisms identified in Section 6.2.

Dataset-dependent conclusions. The tight convergence under UR-RO seen on GSM8K (91–93% for all 12B+ models) breaks down on MATH. Gemma models benefit enormously from retrieval (Gemma 12B: 21.54% to 77.64%, Gemma 27B: 36.58% to 82.92%), while Qwen models are actively harmed by it (Qwen 7B: 57.82% to 29.16%, Qwen 32B: 37.55% to 12.22% under the same UR-RO architecture). On grade-school problems, surface-similar retrieved cases reliably share solution structure, so retrieval helps universally. On competition-level problems, surface similarity no longer guarantees solution relevance and models that cannot filter out misleading cases are worse off than with no retrieval at all. This points to a fundamental limitation of embedding-based retrieval for harder reasoning tasks, which we discuss further in Section 7.

Analysis

Question: Implicit vs. Explicit Adaptation

The contrast between UR-RO and UR-FA provides a clean test of implicit versus explicit adaptation. Both use identical retrieval (same unified retrieval call, same retrieved cases, same structured case). The only difference is what happens next: UR-RO passes the case directly to the solver, while UR-FA routes through three adaptation stages first.

On GSM8K, this difference is dramatic: UR-RO achieves 91–93% for 12B+ models, while UR-FA drops to 15–28% for Qwen and 38–58% for Gemma. The 430 failure cases all involve questions where the same retrieval input produced a correct answer under UR-RO but an incorrect one under UR-FA.

We interpret this as evidence that LLMs perform adaptation more effectively through their attention mechanism, operating over a unified context, than through explicit multi-stage token generation. When the model sees the question and retrieved cases together, it can selectively attend to relevant solution patterns, numerical details, and structural cues. When forced to generate an explicit adaptation plan across multiple stages, information is lost at each handoff.

This has implications for how we think about the “Revise” step in LLM-based CBR. Classical CBR assumes explicit adaptation is necessary to transfer solutions between cases. Our results suggest that for LLMs, the Revise step should be implicit and performed by the model’s internal reasoning over unified context rather than externalized as a multi-stage pipeline.

Question: How Explicit Adaptation Degrades Performance

To understand why explicit adaptation degrades performance, we conducted a systematic analysis of failure cases. We identified 430 questions on GSM8K for which UR-RO answered correctly but UR-FA answered incorrectly, with identical retrieval and only adaptation different. Using a heuristic rule-based classifier combining NLTK tokenization with numerical analysis, we classified 352 of these (81.9%) into eight failure modes. We then ran a parallel analysis on MATH (5,552 UR-RO-correct / UR-FA-wrong candidates), extending the classifier with six MATH-specific detectors to handle MATH’s diverse answer space (symbolic expressions, intervals, tuples, LaTeX-formatted values). The unified taxonomy across both datasets is shown in Table 5.

Table 5: Failure taxonomy for explicit adaptation across both datasets. Each counted candidate is a question on which UR-RO succeeded but UR-FA failed under identical retrieval. *Italicized* codes appear in both datasets and represent mechanisms that transfer; non-italicized codes are dataset-specific.
GSM8K (n = 430)				MATH (n = 5,552)
Code	Failure Mode	n	%	Code	Failure Mode	n	%
MAE	Minor Arithmetic Error	101	23.5	MTY	Mismatched Type	1,150	20.7
MFT	Missing Final Transformation	99	23.0	AIR	Answer In Reasoning	1,068	19.2
TSC	Temporal Semantic Confusion	53	12.3	WAF	Wrong Answer Form	395	7.1
CCH	Complete Context Hallucination	44	10.2	MFT	Missing Final Transformation	387	7.0
MC	Magnitude Confusion	24	5.6	MC	Magnitude Confusion	284	5.1
APE	Arithmetic Propagation Error	16	3.7	SCM	Symbolic Computation Mistake	216	3.9
NDS	Numerical Detail Stripping	8	1.9	UEX	Unevaluated Expression	196	3.5
EAIE	Error Analysis Introduces Error	7	1.6	NDS	Numerical Detail Stripping	158	2.8
				MAE	Minor Arithmetic Error	157	2.8
				AR	Adapter Refusal	34	0.6
	Unclassified	78	18.1		Unclassified	1,507	27.1
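The percentages and the overlap between the two taxonomies in the table above can be checked with a few lines of arithmetic. The counts below are copied from the table; the variable and function names are illustrative.

```python
# Failure counts copied from the taxonomy table.
GSM8K_COUNTS = {
    "MAE": 101, "MFT": 99, "TSC": 53, "CCH": 44,
    "MC": 24, "APE": 16, "NDS": 8, "EAIE": 7, "Unclassified": 78,
}
MATH_COUNTS = {
    "MTY": 1150, "AIR": 1068, "WAF": 395, "MFT": 387, "MC": 284,
    "SCM": 216, "UEX": 196, "NDS": 158, "MAE": 157, "AR": 34,
    "Unclassified": 1507,
}


def shares(counts: dict) -> dict:
    """Percentage of the dataset's candidates per failure code (1 decimal)."""
    total = sum(counts.values())
    return {code: round(100 * n / total, 1) for code, n in counts.items()}


gsm_codes = set(GSM8K_COUNTS) - {"Unclassified"}
math_codes = set(MATH_COUNTS) - {"Unclassified"}
shared = sorted(gsm_codes & math_codes)    # modes that transfer
distinct = len(gsm_codes | math_codes)     # size of the unified taxonomy
classified_gsm = sum(GSM8K_COUNTS.values()) - GSM8K_COUNTS["Unclassified"]

print(shared, distinct, classified_gsm)
```

The shared codes are the four modes that transfer between datasets, and the union gives the fourteen distinct failure types referred to elsewhere in the paper.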

The top three failure modes—Minor Arithmetic Error, Missing Final Transformation, and Temporal Semantic Confusion—account for 253 of 430 failures (59%) in GSM8K. These represent different ways in which information loss can occur:

Minor Arithmetic Error (MAE, 23.5%): The adaptation pipeline produces a solution that is on-topic and uses the right approach, but contains a small calculation error. For example, the pipeline correctly identifies all prices and quantities but computes 7.5 instead of 5.5 in one intermediate step.

Missing Final Transformation (MFT, 23.0%): The pipeline solves most of the problem correctly but omits a final conversion or transformation step. A typical case: computing 80 legs correctly but failing to convert to “pairs” by dividing by 2. The unified solver, with the full problem visible, catches this; the adapter pipeline, having abstracted the problem across three stages, loses track of what the question actually asks.

Temporal Semantic Confusion (TSC, 12.3%): The fact extraction stage conflates time-related quantities (e.g., interpreting “$300/week” as “monthly $300”), and subsequent stages propagate this error without cross-checking the original question.

A notable model-specific pattern: Complete Context Hallucination (CCH) is heavily concentrated in Qwen 14B, which accounts for 39 of 44 cases (89%). In these failures, the adaptation stages solve an unrelated retrieved case problem instead of the target question.

The common thread across all failure modes is that each adaptation stage treats its predecessor’s output as authoritative, without cross-referencing the original question, even when it has access to the full information at every step. The unified approach avoids this by keeping all information in a single context upon which the model’s attention mechanism freely operates.

MATH surfaces a new class of failure: presentation-stage breakdowns. The MATH analysis adds 5,552 candidates from six models, but only four of the eight GSM8K modes transfer (MAE, MFT, MC, NDS). The remaining MATH candidates concentrate in six new modes that do not appear on GSM8K. Two are dominant: Mismatched Type (MTY, 20.7%), in which the pipeline returns an answer of the wrong mathematical shape (e.g., a scalar when the ground truth is an interval, a coordinate point when an equation is expected, an integer when a decimal is expected), and Answer In Reasoning (AIR, 19.2%), in which the pipeline produces an answer that matches ground truth during the reasoning phases, but the final extraction stage fails to surface it cleanly. Together with Wrong Answer Form (WAF) and Unevaluated Expression (UEX), these four account for 50.5% of MATH failures. They are not new reasoning errors; they are the failures of a downstream stage that is supposed to deliver the answer in canonical form, and they appear on MATH only because MATH’s answer space is heterogeneous enough to expose them.

Model-specific concentrations differ between datasets. On GSM8K, CCH is concentrated in Qwen 14B (39/44). On MATH, AIR is concentrated in Gemma 12B and 27B (849/1,068), indicating that the Gemma models reason toward the right answer but ramble in their final output. AR is essentially Gemma-only (30/34 on Gemma 12B), reflecting these models’ tendency to commit to uncertainty rather than guessing. Qwen 14B’s MATH candidate pool is smaller (420) and, as noted in Section 5.2, UR-FA outperforms UR-RO on this model at the aggregate level, a model-specific exception to the broader pattern.

Caveats. The taxonomy is heuristic and illustrative rather than exhaustive; the unclassified failures (18.1% on GSM8K, 27.1% on MATH) reflect heterogeneous failures rather than a single missing mode. The broader message is that multi-stage adaptation breaks down in many ways, and any single benchmark can indicate only the subset its answer structure permits; GSM8K and MATH together give a more complete picture than either alone.

Question: Architecture vs. Model Scale

A striking finding across both datasets is that architecture design dominates model choice. On GSM8K, four models from two different families (Qwen 14B (92.04%), Qwen 32B (91.89%), Gemma 12B (91.28%), and Gemma 27B (92.95%)) converge to a 1.7-percentage-point range under UR-RO, despite baseline performances ranging from 22% to 38%. The 58-percentage-point gap between Gemma 12B’s baseline (22.37%) and Qwen 7B’s baseline (80.21%) is eliminated: under UR-RO, Gemma 12B (91.28%) substantially outperforms Qwen 7B (80.06%).
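The convergence claim can be verified from the GSM8K table by comparing the spread of the four 12B+ models under the baseline and under UR-RO. Values are copied from the table; the `spread` helper is illustrative.

```python
# GSM8K accuracy (percent) for the four 12B+ models, copied from the table.
BASELINE = {"Qwen 14B": 38.13, "Qwen 32B": 37.23, "Gemma 12B": 22.37, "Gemma 27B": 30.33}
UR_RO = {"Qwen 14B": 92.04, "Qwen 32B": 91.89, "Gemma 12B": 91.28, "Gemma 27B": 92.95}


def spread(scores: dict) -> float:
    """Gap in percentage points between the best and worst model."""
    return round(max(scores.values()) - min(scores.values()), 2)


baseline_spread = spread(BASELINE)  # gap under zero-shot prompting
ur_ro_spread = spread(UR_RO)        # gap under unified retrieval, implicit adaptation

print(baseline_spread, ur_ro_spread)
```

The baseline gap of roughly 16 points shrinks to under 2 points under UR-RO, which is the sense in which architecture design outweighs model choice on this benchmark.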

Models around 7–8B parameters appear to hit a capacity ceiling. On GSM8K, Qwen 7B achieves approximately 80% under both the zero-shot baseline and UR-RO, suggesting that its intrinsic reasoning capacity cannot be overcome by retrieval alone. This defines a practical “retrieval breakpoint”: below approximately 10B parameters, the model’s internal capabilities are the binding constraint; above it, architecture design determines performance.

On MATH, the convergence is less tight, with Gemma models outperforming Qwen models even under UR-RO. This suggests that on harder problems, model-specific capabilities play a larger role. Still, parameter count alone does not determine performance: a 12B Gemma model under UR-RO (77.64%) dramatically outperforms a 32B Qwen model under the same architecture (12.22%).

We note an asymmetry in the results: the same Dense Nomic retrieval that lifts Gemma 12B from 21.54% to 77.64% under UR-RO drops Qwen 32B from 37.55% to 12.22%, with retrieval quality identical for both families (Section 6.4). Our experiments aim to isolate architecture from retrieval quality: each model’s attention layers and internal representations operate on the same retrieved cases, so whether the pipeline succeeds for one model but fails for another depends on the model’s internals. Our analysis is external and behavioral; evaluating six models from three families provides the breadth needed to detect model-level effects such as this asymmetry. Retrieval helps one family by 56 percentage points and hurts another by 25 under identical input, a model-family effect that tempers any expectation that retrieval augmentation transfers uniformly across model families.

Question: Is Retrieval Quality the Problem?

One possible hypothesis is that adaptation failures stem from poor retrieval: the retrieved cases are not relevant enough to be useful. We test this using a separate retrieval quality evaluation comparing four retrieval configurations across both datasets (Table 6).

Table 6: Retrieval quality evaluation. Approval rate is the proportion of retrieved documents judged relevant by an LLM grader. Coverage is the proportion of queries with at least one relevant document.
Configuration	GSM8K Approval	GSM8K Coverage	MATH Approval	MATH Coverage
Dense (Nomic, 137M)	78.5%	97.6%	71.4%	93.9%
Dense (Qwen3, 8B)	85.0%	98.6%	86.9%	98.5%
Hybrid+Rerank (Nomic)	77.7%	97.0%	65.8%	92.3%
Hybrid+Rerank (Qwen3)	79.0%	97.0%	69.8%	94.3%
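The two metrics in the table can be made concrete with a small sketch. It assumes the grader's output is available as one list of boolean relevance judgments per query, and that approval is pooled over all retrieved documents rather than averaged per query; both are assumptions for illustration, not details confirmed by the evaluation.

```python
from typing import List


def approval_rate(judgments: List[List[bool]]) -> float:
    """Fraction of all retrieved documents judged relevant (pooled)."""
    total = sum(len(per_query) for per_query in judgments)
    relevant = sum(sum(per_query) for per_query in judgments)
    return relevant / total if total else 0.0


def coverage(judgments: List[List[bool]]) -> float:
    """Fraction of queries with at least one relevant retrieved document."""
    if not judgments:
        return 0.0
    covered = sum(any(per_query) for per_query in judgments)
    return covered / len(judgments)


# Toy data: four queries, three retrieved cases each (hypothetical).
toy = [
    [True, True, False],
    [False, False, False],
    [True, False, False],
    [True, True, True],
]

print(approval_rate(toy), coverage(toy))
```

On the toy data, half of the retrieved cases are approved while three of four queries are covered, which illustrates why coverage is typically the higher of the two numbers.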

Several findings emerge:

Retrieval quality is adequate. Our production configuration (Dense Nomic) achieves 78.5% approval on GSM8K and 71.4% on MATH, with 94–98% coverage. The vast majority of queries receive at least one relevant case.

Embedding model quality is the dominant factor for initial retrieval. Upgrading from Nomic (137M parameters) to Qwen3 Embedding (8B parameters) raises approval by 6.5 percentage points on GSM8K and by 15.5 on MATH, which exceeds any pipeline-level improvement.

BM25 hurts for math. Adding BM25 keyword matching (via Reciprocal Rank Fusion) degrades retrieval quality: approval falls by 0.8 percentage points on GSM8K and 5.6 on MATH for Nomic, and by 6.0 and 17.1 for Qwen3. The degradation grows with problem difficulty: it is larger on MATH than on GSM8K for both embedding models. This contradicts general-domain RAG results (Anthropic 2024), where hybrid retrieval reduces failures. For mathematical reasoning, keyword matching tends to retrieve problems with similar vocabulary but not similar answer structure.

The bottleneck is post-retrieval, not retrieval itself. Dense Nomic achieves 78.5% approval and powers UR-RO to 92% accuracy, but the same retrieval under UR-FA produces only 15–28% accuracy. The 430 UR-FA failure cases analyzed in Section 6.2 all use the same high-quality retrieval. Retrieval is therefore not the bottleneck; what happens after retrieval determines success.
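For readers unfamiliar with the Reciprocal Rank Fusion step mentioned in the BM25 finding above, the following is a minimal sketch of the standard formulation, in which each document scores the sum of 1/(k + rank) over the rankings that contain it. The constant k = 60 is a commonly used default and an assumption here, as are the function name and the toy identifiers; the reranking stage of the Hybrid+Rerank configurations is not reproduced.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids; best fused score first.

    k is a smoothing constant (60 is a common default, assumed here).
    Ties are broken by document id so the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings for one query.
dense = ["p1", "p2", "p3"]  # dense-embedding order
bm25 = ["p4", "p1", "p5"]   # keyword (BM25) order

fused = reciprocal_rank_fusion([dense, bm25])
print(fused)
```

In this toy example the keyword-only hit `p4` is promoted above the dense retriever's second choice `p2`. If `p4` shares vocabulary with the query but not its answer structure, fusion displaces a more useful case, which is consistent with the degradation reported above.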

Conclusion

We evaluated four LLM-based CBR architectures for mathematical reasoning across six models and two benchmarks. In our tests, explicit adaptation consistently and often dramatically harms performance compared to implicit adaptation. An analysis of failure cases identified fourteen mechanisms across two datasets by which explicit adaptation destroys information, with arithmetic errors, missed transformations, and semantic confusion accounting for most.

In contrast, unified retrieval with implicit adaptation achieves high accuracy on GSM8K for all models above 10B parameters, regardless of model family. This supports the simple adaptation hypothesis: for such domains, adaptation should be performed implicitly by the LLM, through its attention mechanism, rather than by explicit multi-stage prompting. Our results also show that architecture design dominates model choice on GSM8K. On MATH, however, retrieval benefits depend on the model family. This suggests that surface-similarity retrieval reaches its limits on competition-level problems.

Our evaluation is limited to two mathematical reasoning benchmarks, so solidifying these results as a general principle and delineating its scope of application will require broader testing. The principle that implicit adaptation outperforms explicit adaptation may not hold for domains where the adaptation step involves genuinely novel transformations not present in the retrieved cases. The limits of surface-similarity retrieval on MATH also suggest the need for a solution-aware retrieval process, in which cases are matched by the reasoning strategies they require rather than by surface features of the problem statement, a modern manifestation of the classic CBR indexing problem (Kolodner and Leake 1996).

Acknowledgments and Declaration on Generative AI

This research was supported in part by Lilly Endowment, Inc., through its support for the Indiana University Pervasive Technology Institute. The authors did not use any generative AI for the preparation of this paper.

References

Anthropic. 2024. “Contextual Retrieval.” Engineering at Anthropic (blog).

Bach, K., R. Bergmann, F. Brand, M. Caro-Martínez, V. Eisenstadt, M. W. Floyd, L. Jayawardena, et al. 2025. “Case-Based Reasoning Meets Large Language Models: A Research Manifesto for Open Challenges and Research Directions.”

Chen, Jianlyu, Junwei Lan, Chaofan Li, Defu Lian, and Zheng Liu. 2025. “ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval.” arXiv Preprint arXiv:2510.08252.

Cobbe, Karl, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, et al. 2021. “Training Verifiers to Solve Math Word Problems.” https://arxiv.org/abs/2110.14168.

Das, D., Sam O’Nuallain, and Razieh Rahimi. 2025. “RaDeR: Reasoning-Aware Dense Retrieval Models.” In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), 19970–97. ACL.

Dziri, N., X. Lu, M. Sclar, X. L. Li, L. Jiang, B. Y. Lin, Peter West, et al. 2023. “Faith and Fate: Limits of Transformers on Compositionality.” In Advances in Neural Information Processing Systems, 70293–332. Red Hook: Curran.

Hendrycks, Dan, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. “Measuring Mathematical Problem Solving with the MATH Dataset.” In Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

Huang, J., and K. C.-C. Chang. 2023. “Towards Reasoning in Large Language Models: A Survey.” In Findings of ACL, 1049–65. ACL.

Kolodner, J., and D. Leake. 1996. “A Tutorial Introduction to Case-Based Reasoning.” In Case-Based Reasoning: Experiences, Lessons, and Future Directions, edited by D. Leake, 31–65. Menlo Park, CA: AAAI Press.

Leake, David, and Xiaomeng Ye. 2021. “Harmonizing Case Retrieval and Adaptation with Alternating Optimization.” In Case-Based Reasoning Research and Development, ICCBR 2021, 125–39. Springer.

Lenz, Mirko, Maximilian Hoffmann, and Ralph Bergmann. 2025. “LLsiM: Large Language Models for Similarity Assessment in Case-Based Reasoning.” In Case-Based Reasoning Research and Development - 33rd International Conference, ICCBR 2025, 126–41. Cham: Springer.

Lewis, P., E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, et al. 2020. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS), 9459–74.

Shao, Rulin, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, et al. 2025. “ReasonIR: Training Retrievers for Reasoning Tasks.” arXiv Preprint arXiv:2504.20595.

Smyth, B., and M. Keane. 1998. “Adaptation-Guided Retrieval: Questioning the Similarity Assumption in Reasoning.” Artificial Intelligence 102 (2): 249–93.

Su, H., Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han-Yu Wang, Haisu Liu, et al. 2025. “BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval.” In 13th International Conference on Learning Representations, ICLR 2025 (Spotlight).

Wang, X., J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, et al. 2023. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” In 11th International Conference on Learning Representations, ICLR 2023.

Wei, J., X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, et al. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” In Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS), 24824–37.

Wilkerson, Kaitlynne, and David Leake. 2024. “On Implementing Case-Based Reasoning with Large Language Models.” In Case-Based Reasoning Research and Development - 32nd International Conference, ICCBR 2024, Merida, Mexico, July 1-4, 2024, Proceedings, 14775:404–17. Lecture Notes in Computer Science. Springer.

Wiratunga, N., R. Abeyratne, L. Jayawardena, K. Martin, S. Massie, I. Nkisi-Orji, R. Weerasinghe, A. Liret, and B. Fleisch. 2024. “CBR-RAG: Case-Based Reasoning for Retrieval Augmented Generation in LLMs for Legal Question Answering.” In Case-Based Reasoning Research and Development, ICCBR-24. Springer.

Yao, S., D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. 2023. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” In Proceedings of the 37th International Conference on Neural Information Processing Systems, 11809–22.

Zhou, D., N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, et al. 2023. “Least-to-Most Prompting Enables Complex Reasoning in Large Language Models.” In 11th International Conference on Learning Representations, ICLR 2023.

https://github.com/raregul/keeping-adaptation-simple
