We’re excited to add RefuelLLM-2-mini, a new 1.5B parameter model, to our Refuel-LLM family of models purpose built for data labeling, enrichment and cleaning.
RefuelLLM-2-mini (75.02%) outperforms comparable LLMs including Phi-3.5-mini (65.3%), Qwen2.5-3B (67.62%), Gemma2-2B (64.52%), Llama3-3B (55.8%) and Llama3-1B (39.92%) across a benchmark of ~30 data labeling tasks.
RefuelLLM-2-mini starts from a Qwen2-1.5B base model, trained on a corpus of 2750+ datasets spanning tasks such as classification, reading comprehension, structured attribute extraction and entity resolution, using the same recipe as Refuel-LLM-2 and Refuel-LLM-2-small (announced last year).
You can start using the model (with fine-tuning support) in Refuel Cloud starting today. We’re also open sourcing the model weights, available on Hugging Face.
Why
Over the past few months, there has been a surge in open-source LLMs in the 0.5–3B parameter range, such as Microsoft's Phi-3-mini (3.8B), Google’s Gemma-2 (2B), and Meta’s Llama-3.2 (1B & 3B). These Small Language Models (SLMs) have fewer parameters than larger language models (LLMs), resulting in lower latency and inference cost, but often at the expense of the broader reasoning capabilities typical of LLMs.
At Refuel, our users care about a very specific subset of use cases: transforming, enriching, cleaning and categorizing their enterprise data, at high quality and scale. With this effort, our goal is to bring the power of tailored SLMs, optimized for these tasks, to our users.
Results
Benchmark
We evaluate LLM output quality on the same data labeling and transformation benchmark introduced in the Refuel-LLM-2 announcement.
Output Quality
Output quality measures how well the LLM-generated output agrees with the provided ground-truth label.
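As a rough illustration, agreement with the ground truth can be computed as an exact-match rate. This is a simplified sketch: the actual benchmark likely uses task-specific matching (e.g. per-field comparison for attribute extraction), and the normalization here is an assumption.

```python
def output_quality(predictions, ground_truth):
    """Fraction of LLM outputs that exactly match the ground-truth
    label, after basic whitespace/case normalization (an assumption;
    the real benchmark may use task-specific matching)."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must align")
    matches = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, ground_truth)
    )
    return matches / len(predictions)
```

For example, `output_quality(["Yes", "no", "maybe"], ["yes", "no", "no"])` scores two of three outputs as correct.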
RefuelLLM-2-mini (75.02%) outperforms comparable LLMs including Phi-3.5-mini (65.3%), Qwen2.5-3B (67.62%), Gemma2-2B (64.52%), Llama3-3B (55.8%) and Llama3-1B (39.92%).
Compared to the base LLM we started with (Qwen2.5-1.5B), we see a significant improvement in quality, similar to what we observed for other models in the Refuel-LLM-2 family.
Quality of confidence scores
To benchmark the quality of confidence scores, we use AUROC. AUROC is an aggregate score that measures a classifier's ability to distinguish between the positive (“LLM output is correct”) and negative classes (“LLM output is incorrect”) across all score thresholds. More details about how confidence scores are computed from token-level generation probabilities can be found in our earlier work on estimating LLM confidence.
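AUROC has an intuitive rank-based formulation: it is the probability that a randomly chosen correct output receives a higher confidence score than a randomly chosen incorrect one. A minimal sketch of that computation (the function name and inputs are illustrative, not Refuel's evaluation code):

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic: the probability that
    a randomly chosen positive example ("LLM output is correct",
    label 1) is scored above a randomly chosen negative one (label 0).
    Ties count as half a win. O(n^2) for clarity, not speed."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect examples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A perfectly calibrated-in-rank model scores 1.0; a model whose confidence scores carry no information about correctness scores around 0.5.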
We observe that RefuelLLM-2-mini outputs much better calibrated confidence scores compared to other comparable models.
Latency
We observe that RefuelLLM-2-mini is fast, while delivering superior output quality compared to other models in its size range.
The latency numbers reported above were computed under the following setup:
Each request to the LLM consists of 2k input tokens and 100 output tokens (we set max tokens to 100 and removed stop tokens to force the LLM to deterministically output the same number of tokens for each request)
The models were hosted on an A6000 GPU using vLLM as the inference engine
We measure the latency per request, and report the final latency averaged across 100 requests, per model.
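The measurement loop above can be sketched as follows. This is a hedged illustration of the methodology, not the actual harness: `generate` stands in for one full request to the serving engine (vLLM on the A6000 in the setup described), and the fixed-token setup is what makes per-request latencies comparable across models.

```python
import statistics
import time

def measure_mean_latency(generate, num_requests=100):
    """Average per-request latency (seconds) over num_requests
    sequential calls. `generate` is a stand-in for one request to
    the inference engine (2k input tokens, 100 output tokens in
    the setup described above)."""
    latencies = []
    for _ in range(num_requests):
        start = time.perf_counter()
        generate()  # one full request; output length is pinned
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies)
```

Because stop tokens are removed and max tokens is fixed, every request produces the same number of output tokens, so the mean over 100 requests reflects throughput differences between models rather than variance in generation length.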
Training recipe
RefuelLLM-2-mini was trained on a corpus of 2750+ datasets spanning tasks such as classification, reading comprehension, structured attribute extraction and entity resolution. These datasets are a combination of:
Human annotated datasets like Flan, Task Source, and the Aya collection
Synthetic datasets like OpenOrca, OpenHermes and WizardLM
Proprietary datasets developed or licensed by Refuel
The dataset composition and training recipe used were the same as Refuel-LLM-2 and Refuel-LLM-2-small (announced earlier).
Access
RefuelLLM-2-mini is available starting today:
We’re open sourcing Refuel-LLM-2-mini, aka Qwen-2-Refueled, under a CC BY-NC 4.0 license. You can access the model on Hugging Face.
Acknowledgements
We thank the Qwen team for making the Qwen series of models available to the community.
We would like to thank the creators of the Flan, Tasksource, OpenHermes, OpenOrca, Aya collections and many more for building and making them open source. We would also like to thank AllenAI and TAU for building and open sourcing several long-context datasets that we used for training.