To Reason or Not to Reason: Is 5% more accuracy worth >5x cost?


February 11, 2025

by Dhruva Bansal, Nihit Desai

Summary

In this series of experiments, we explore whether training LLMs to reason can enhance performance in data transformation and information extraction tasks and the resulting latency/cost tradeoffs from increased token usage.

  • Reasoning helps improve output quality for tasks such as classification, NER, and attribute extraction (the average improvement in output quality across the datasets in our evaluation was 4.9%).
  • Finetuning with reasoning data can enhance output quality, but only when the base model was already trained to use such reasoning traces effectively. In some cases, finetuning with reasoning traces led to performance degradation rather than improvement when the base model had not been trained with reasoning abilities.
  • Models finetuned with reasoning output a significantly higher number of tokens (>5x more than a model finetuned on the same dataset with only outputs and no reasoning traces). This has significant implications for workflows that need to operate at high throughput and low cost.

Overview of reasoning LLMs

Reasoning in Large Language Models (LLMs) refers to their ability to break down complex problems, and generate structured, step-by-step solutions. Unlike simple pattern recognition, effective reasoning enables LLMs to tackle tasks such as solving hard math and programming problems and making logical deductions. Over time, different techniques have emerged to enhance LLMs' reasoning capabilities.

  • Chain of Thought: One of the earliest breakthroughs was Chain-of-Thought (CoT) prompting, which improves problem-solving by explicitly guiding the model to generate intermediate reasoning steps. By mimicking human-like stepwise thinking, CoT prompting enables models to handle complex queries more effectively (a minimal prompting sketch follows this list).
  • Inference Time Scaling: Another approach called Inference-Time Scaling (likely used by OpenAI as part of o1, o3 series of models), enhances reasoning by dynamically allocating more computational resources during inference. Instead of modifying the model’s training process, this method allows LLMs to apply additional processing power selectively, improving their ability to reason through difficult problems on demand.
  • Reinforcement Learning: DeepSeek-R1 on the other hand takes a different path by refining reasoning through reinforcement learning and specialized training pipelines. Rather than adjusting computation at inference time, this technique improves logical consistency by training models on both correct and incorrect reasoning paths, iteratively strengthening their ability to derive accurate conclusions.
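To make the first of these techniques concrete, below is a minimal sketch of Chain-of-Thought-style prompting for a labeling task. The task, label set, and prompt wording are illustrative assumptions, not the prompts used in the experiments later in this post.

```python
# Minimal sketch of Chain-of-Thought (CoT) prompting vs. direct prompting
# for a classification task. The label set and prompt wording are
# illustrative assumptions.

def build_cot_prompt(text: str, labels: list[str]) -> str:
    """Ask the model to reason step by step before committing to a label."""
    return (
        "You are labeling customer support tickets.\n"
        f"Possible labels: {', '.join(labels)}\n\n"
        f"Ticket: {text}\n\n"
        "Think step by step about which label fits best, then answer on a "
        "final line formatted as 'Label: <label>'."
    )

def build_direct_prompt(text: str, labels: list[str]) -> str:
    """Same task, but the model answers with the label directly."""
    return (
        "You are labeling customer support tickets.\n"
        f"Possible labels: {', '.join(labels)}\n\n"
        f"Ticket: {text}\n\n"
        "Answer with only the label, formatted as 'Label: <label>'."
    )

if __name__ == "__main__":
    print(build_cot_prompt("My card was charged twice for the same order.",
                           ["billing", "shipping", "account", "other"]))
```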

These advancements illustrate the evolving landscape of LLM reasoning, shifting from explicit prompting to dynamic computational scaling and adaptive learning techniques. As research continues, the focus remains on striking a balance between interpretability, efficiency, and reasoning depth to push the boundaries of what LLMs can achieve.

Goal of this experiment

Recent advancements in LLM reasoning capabilities have demonstrated significant performance improvements in complex tasks such as logical reasoning, mathematical problem-solving, and multi-step research tasks. However, these gains often come with a trade-off: increased computational cost and latency.

This experiment explores whether training LLMs to reason can enhance performance in data transformation and information extraction tasks, which differ from traditional reasoning-heavy problems. These tasks typically prioritize:

  • High Throughput & Lower Cost – Information extraction tasks require efficient, scalable solutions that can process vast amounts of text with minimal latency.
  • Lower per-sample complexity – Unlike abstract reasoning tasks, information extraction often follows structured patterns and rule-based dependencies. Extracting date and dollar amount from a noisy receipt image is a bit less complex than solving a field equation in General Relativity!

Experiment Setup

Base Models

  • Refuel LLM v2 Small: A Llama3.1-8B base model, instruction tuned for data labeling, enrichment, and cleaning tasks spanning classification, entity resolution, matching, reading comprehension, and information extraction (announced here).
  • DeepSeek-R1-Distill-Llama-8B: A distilled version of DeepSeek-R1, derived from the Llama3.1-8B base model and instruction tuned on 800K samples with reasoning traces curated from DeepSeek-R1.


Datasets

We used three non-public datasets for this experiment, spanning a few domains and data types such as HRTech, Customer365, and AI agent evaluation.


Given that these datasets are proprietary, and given when they were constructed, we have a high degree of confidence that they were not used in the training or validation of the base models above.


For each of these datasets, we generated the reasoning traces and labels using DeepSeek-R1 (i.e. the same procedure used to generate the reasoning + output data from DeepSeek-R1 for training DeepSeek-R1-Distill-Llama-8B).
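For reference, this trace-generation step can be approximated with an OpenAI-compatible client pointed at a DeepSeek-R1 endpoint. The sketch below is based on DeepSeek's public API; the endpoint URL, model id, and the reasoning_content field are assumptions for illustration, not necessarily the exact pipeline we ran.

```python
# Sketch of generating (reasoning trace, label) pairs with DeepSeek-R1 via an
# OpenAI-compatible endpoint. Endpoint, model id, and the `reasoning_content`
# field are assumptions based on DeepSeek's public API documentation.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

def generate_trace_and_label(task_prompt: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer) for one input row."""
    response = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{"role": "user", "content": task_prompt}],
    )
    message = response.choices[0].message
    # The chain of thought is exposed separately from the final answer.
    reasoning = getattr(message, "reasoning_content", "")
    answer = message.content
    return reasoning, answer
```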


Training Schemes

In order to understand how training models with reasoning impacts output quality and latency, we compared the finetuned model performance for each {base model, dataset} pair mentioned above using the following two finetuning schemes:

  • FT-(R): Finetuning with Reasoning traces: The base model was finetuned for the dataset using both the ground truth labels and the reasoning traces from DeepSeek-R1 described in the dataset section above.
  • FT-(NR): Finetuning without Reasoning traces: The base model was finetuned for the dataset using only the ground truth labels, with no reasoning traces. This is the most common instruction tuning setup for finetuning LLMs. A sketch of the two data formats follows this list.
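Concretely, the only difference between the two schemes is what goes into the assistant turn of each training example. The sketch below illustrates this; the chat-message layout and the <think> tag convention are assumptions for illustration, and the exact serialization depends on the base model's chat template.

```python
# Sketch of how one training example differs between FT-(R) and FT-(NR).
# The message layout and <think> tag convention are illustrative assumptions.

def make_ft_r_example(prompt: str, reasoning: str, label: str) -> dict:
    """FT-(R): the assistant turn contains the DeepSeek-R1 reasoning trace
    followed by the ground-truth label."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant",
             "content": f"<think>\n{reasoning}\n</think>\n{label}"},
        ]
    }

def make_ft_nr_example(prompt: str, label: str) -> dict:
    """FT-(NR): the assistant turn contains only the ground-truth label."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": label},
        ]
    }
```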

For both schemes, we trained LoRA adapters with a rank of 64. The models were trained for 20 epochs with an initial learning rate of 1e-5 and cosine learning rate decay to 1e-6. We withheld 10% of the training dataset for validation, checkpointed every 2 epochs, and chose the model checkpoint with the lowest loss on the validation dataset.
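A configuration along these lines could be expressed with the Hugging Face peft and transformers libraries. The sketch below mirrors the hyperparameters stated above; the library choice and any values not mentioned in the text (such as lora_alpha) are assumptions.

```python
# Sketch of the finetuning configuration described above. Values not stated
# in the post (e.g. lora_alpha, batch size) are illustrative assumptions.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=64,                  # LoRA rank used for both schemes
    lora_alpha=128,        # assumption: alpha is not specified in the post
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="ft-experiment",
    num_train_epochs=20,           # 20 epochs
    learning_rate=1e-5,            # initial learning rate
    lr_scheduler_type="cosine",    # cosine decay; the 1e-6 floor would need a
                                   # custom scheduler and is omitted here
    save_strategy="epoch",         # the post checkpoints every 2 epochs and
                                   # keeps the lowest-validation-loss one; that
                                   # selection step is not shown here
)
```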


Results

Evaluation Setup

Model performance was assessed using task-specific evaluation metrics:

  • Macro F1 Score: A metric used to evaluate classification performance, particularly in cases of class imbalance. It calculates the F1 score independently for each class and then computes the unweighted mean, providing a balanced measure of precision and recall across all classes. We used this metric for Dataset 1 since it is a multilabel classification task.
  • ROUGE-L Score: A metric commonly used to assess the quality of generated text by comparing it to reference texts. ROUGE-L focuses on the longest common subsequence between the generated and reference texts, capturing both precision and recall aspects of text generation. We used this metric for Dataset 2 and 3 since both are freeform text extraction tasks.
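Both metrics can be computed with standard open-source libraries (scikit-learn for macro F1, the rouge-score package for ROUGE-L). The sketch below uses placeholder inputs rather than rows from the actual evaluation sets.

```python
# Sketch of the evaluation metrics: macro F1 for the multilabel classification
# dataset and ROUGE-L for the freeform extraction datasets. Inputs are
# placeholders, not data from the evaluation sets used in this post.
import numpy as np
from sklearn.metrics import f1_score
from rouge_score import rouge_scorer

# Macro F1 over binary indicator matrices (rows = samples, columns = classes).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])
macro_f1 = f1_score(y_true, y_pred, average="macro")

# ROUGE-L: longest-common-subsequence overlap between reference and prediction.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score("total amount due: $42.10",
                       "total due: $42.10")["rougeL"].fmeasure

print(f"macro F1 = {macro_f1:.3f}, ROUGE-L = {rouge_l:.3f}")
```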

For each dataset and base model, we provide evaluation numbers across three setups:

  1. Base model – Performance of the respective base model without any finetuning.
  2. FT-(R) – Performance of the respective base model, finetuned with outputs and reasoning traces.
  3. FT-(NR) – Performance of the respective base model, finetuned with only outputs and no reasoning traces.

This multi-tiered evaluation approach helps quantify the effectiveness of reasoning traces in enhancing LLM performance across different tasks, domains, and base models.

Output Quality

We observed two key results from the table above:

[1] Finetuning with reasoning helps improve quality, but only if the base model has been trained with reasoning abilities.

Our results demonstrate that reasoning data can significantly enhance model performance, but only when the base model has been trained to use such reasoning traces effectively. When a model designed for reasoning is fine-tuned with reasoning traces, the improvements are substantial. However, fine-tuning a model with reasoning traces can lead to performance degradation rather than improvement if the base model has not been trained with reasoning abilities.

For example:

  • DeepSeek-R1 Distill-Llama-8B fine-tuned with reasoning traces consistently outperformed the same model fine-tuned without reasoning data across multiple datasets.
  • Refuel LLM v2 Small, a model not explicitly trained with reasoning capabilities, showed mixed results when trained with reasoning traces, sometimes even underperforming compared to its non-reasoning fine-tuned counterpart.

This highlights that reasoning data is not inherently beneficial unless the model has been trained to take advantage of it.

[2] Base models trained with reasoning likely cannot learn effectively without reasoning traces

One of the most striking observations is that DeepSeek-R1 Distill-Llama-8B appears to rely heavily on reasoning traces (in addition to ground truth labels) to learn effectively. Across all datasets, this model performed significantly worse when fine-tuned without reasoning data. In fact, in some cases the model fine-tuned on ground truth (GT) data alone performed worse than the base model with no fine-tuning at all.

For example:

  • On Dataset 1, DeepSeek-R1 Distill-Llama-8B fine-tuned without reasoning traces performed worse (0.082 Macro F1) than the base model (0.096 Macro F1).
  • On Dataset 2, the reasoning-trained DeepSeek-R1 Distill-Llama-8B model had the highest ROUGE-L score (0.7705), while the non-reasoning version performed only slightly better than the base model (0.30 vs. 0.205).

These results suggest that models trained with reasoning abilities may rely on reasoning data for additional finetuning as well.


Output tokens and latency

Average number of tokens generated per input row

One clear trend in our results is that fine-tuning with reasoning traces consistently leads to a significant increase in the number of tokens generated during inference. Across all datasets, models trained with reasoning data produced substantially longer outputs compared to both their base versions and those fine-tuned without reasoning. Comparing results between Refuel LLM v2 Small FT-(NR) and DeepSeek-R1 Distill-Llama-8B across datasets, we see that:

  • On average, output length increased by 6.7x when switching from a non-reasoning model (Refuel LLM v2 Small) to a reasoning model (DeepSeek-R1 Distill-Llama-8B).
  • Despite this large jump in token usage (and, by extension, inference cost), the average performance improvement was only 4.9%.

This suggests that while reasoning traces do improve model effectiveness, they also lead to higher computational costs. This presents a crucial trade-off: the performance benefits of reasoning traces must be weighed against the increased cost and latency associated with generating longer outputs.
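To make this trade-off concrete, here is a back-of-the-envelope comparison using the averages reported above. Only the 6.7x output-token multiplier and the 4.9% quality gain come from our results; the baseline output length and per-token price are placeholder assumptions.

```python
# Back-of-the-envelope cost comparison. Only the 6.7x token multiplier and the
# 4.9% quality gain come from the results above; the baseline output length
# and the per-token price are placeholder assumptions.
baseline_output_tokens = 50        # assumed avg output tokens per row, FT-(NR)
price_per_1m_output_tokens = 0.60  # assumed $ per 1M output tokens
rows = 10_000_000                  # size of a hypothetical extraction workload

def cost(tokens_per_row: float) -> float:
    return rows * tokens_per_row * price_per_1m_output_tokens / 1_000_000

non_reasoning_cost = cost(baseline_output_tokens)
reasoning_cost = cost(baseline_output_tokens * 6.7)

print(f"non-reasoning model: ${non_reasoning_cost:,.0f}")
print(f"reasoning model:     ${reasoning_cost:,.0f}  (+4.9% avg quality)")
```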