We’re excited to add RefuelLLM-2-mini, a new 1.5B parameter model, to our Refuel-LLM family of models purpose built for data labeling, enrichment and cleaning.
RefuelLLM-2-mini (75.02%) outperforms comparable LLMs including Phi-3.5-mini (65.3%), Qwen2.5-3B (67.62%), Gemma2-2B (64.52%), Llama3-3B (55.8%) and Llama3-1B (39.92%) across a benchmark of ~30 data labeling tasks.
RefuelLLM-2-mini starts from a Qwen2-1.5B base model, trained on a corpus of 2750+ datasets spanning tasks such as classification, reading comprehension, structured attribute extraction and entity resolution, using the same recipe as Refuel-LLM-2 and Refuel-LLM-2-small (announced last year).
You can start using the model (with fine-tuning support) in Refuel Cloud starting today. We’re also open sourcing the model weights, available on Hugging Face.
Why
Over the past few months, there has been a surge in open-source LLMs in the 0.5–3B parameter range, such as Microsoft's Phi-3-mini (3.8B), Google’s Gemma-2 (2B), and Meta’s Llama-3.2 (1B & 3B). These Small Language Models (SLMs) have fewer parameters than larger language models (LLMs), resulting in lower latency and inference cost, but often at the expense of the broader reasoning capabilities typical of LLMs.
At Refuel, our users care about a very specific subset of use cases: transforming, enriching, cleaning and categorizing their enterprise data, at high quality and scale. With this effort, our goal is to bring the power of tailored SLMs, optimized for these tasks, to our users.
Results
Benchmark
We evaluate LLM output quality on the same data labeling and transformation benchmark introduced in the Refuel-LLM-2 announcement.
Output Quality
Output quality measures how well the LLM-generated output agrees with the provided ground-truth label.
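As a rough illustration, agreement with the ground truth can be computed as an exact-match rate. This is a simplified sketch: the actual benchmark likely uses task-specific matching (e.g. per-field comparison for attribute extraction), and the normalization here is an assumption.

```python
def output_quality(predictions, ground_truth):
    """Fraction of LLM outputs that exactly match the ground-truth
    label, after basic whitespace/case normalization (an assumption;
    the real benchmark may use task-specific matching)."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must align")
    matches = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, ground_truth)
    )
    return matches / len(predictions)
```

For example, `output_quality(["Yes", "no", "maybe"], ["yes", "no", "no"])` scores two of three outputs as correct.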
RefuelLLM-2-mini (75.02%) outperforms comparable LLMs including Phi-3.5-mini (65.3%), Qwen2.5-3B (67.62%), Gemma2-2B (64.52%), Llama3-3B (55.8%) and Llama3-1B (39.92%).
Compared to the base LLM we started with (Qwen2.5-1.5B), we see a significant improvement in quality, similar to what we observed for other models in the Refuel-LLM-2 family.
Quality of confidence scores
To benchmark the quality of confidence scores, we use AUROC. AUROC is an aggregate score that measures a classifier's ability to distinguish between the positive (“LLM output is correct”) and negative classes (“LLM output is incorrect”) across all score thresholds. More details about how confidence scores are computed from token-level generation probabilities can be found in our earlier work on estimating LLM confidence.
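AUROC has an intuitive rank-based formulation: it is the probability that a randomly chosen correct output receives a higher confidence score than a randomly chosen incorrect one. A minimal sketch of that computation (the function name and inputs are illustrative, not Refuel's evaluation code):

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic: the probability that
    a randomly chosen positive example ("LLM output is correct",
    label 1) is scored above a randomly chosen negative one (label 0).
    Ties count as half a win. O(n^2) for clarity, not speed."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect examples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A perfectly calibrated-in-rank model scores 1.0; a model whose confidence scores carry no information about correctness scores around 0.5.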
We observe that RefuelLLM-2-mini outputs much better calibrated confidence scores compared to other comparable models.
Latency
We observe that RefuelLLM-2-mini is fast, while delivering superior output quality compared to other models in its size range.
The latency numbers reported above were computed under the following setup:
Each request to the LLM consists of 2k input tokens and 100 output tokens (we set max tokens to 100 and removed stop tokens to force the LLM to deterministically output the same number of tokens for each request)
The models were hosted on an A6000 GPU using vLLM as the inference engine
We measure the latency per request, and report the final latency averaged across 100 requests, per model.
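The measurement loop above can be sketched as follows. This is a hedged illustration of the methodology, not the actual harness: `generate` stands in for one full request to the serving engine (vLLM on the A6000 in the setup described), and the fixed-token setup is what makes per-request latencies comparable across models.

```python
import statistics
import time

def measure_mean_latency(generate, num_requests=100):
    """Average per-request latency (seconds) over num_requests
    sequential calls. `generate` is a stand-in for one request to
    the inference engine (2k input tokens, 100 output tokens in
    the setup described above)."""
    latencies = []
    for _ in range(num_requests):
        start = time.perf_counter()
        generate()  # one full request; output length is pinned
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies)
```

Because stop tokens are removed and max tokens is fixed, every request produces the same number of output tokens, so the mean over 100 requests reflects throughput differences between models rather than variance in generation length.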
Training recipe
RefuelLLM-2-mini was trained on a corpus of 2750+ datasets spanning tasks such as classification, reading comprehension, structured attribute extraction and entity resolution. These datasets are a combination of:
Human annotated datasets like Flan, Task Source, and the Aya collection
Synthetic datasets like OpenOrca, OpenHermes and WizardLM
Proprietary datasets developed or licensed by Refuel
The dataset composition and training recipe used were the same as Refuel-LLM-2 and Refuel-LLM-2-small (announced earlier).
Access
RefuelLLM-2-mini is available starting today:
We’re open sourcing Refuel-LLM-2-mini, aka Qwen-2-Refueled, under a CC BY-NC 4.0 license. You can access the model on Hugging Face.
Acknowledgements
We thank the Qwen team for making the Qwen series of models available to the community.
We would like to thank the creators of the Flan, Tasksource, OpenHermes, OpenOrca, Aya collections and many more for building and making them open source. We would also like to thank AllenAI and TAU for building and open sourcing several long-context datasets that we used for training.