We’re thrilled to introduce RefuelLLM-2 and RefuelLLM-2-small, the next generation of our large language models purpose-built for data labeling, enrichment, and cleaning.
You can try out the models in our LLM playground, or access them (with fine-tuning support) in Refuel Cloud starting today.
While data is the fuel for modern enterprises, enormous value remains locked up in messy, unclean, and unstructured data. Unlocking this value requires far too much manual effort: 80% of data teams’ time goes to "unsexy" data work (cleaning, normalization, labeling, etc.) before they can do anything meaningful on top of it.
State-of-the-art LLMs are good at everything but rarely great at the handful of tasks you actually care about. What’s more, they’re too expensive and slow for the scale of data that’s collected or generated today. We need something better.
Compared to the previous launch of Refuel LLM, we’ve added 10 more datasets to the benchmark:
We used Autolabel, our open-source library for LLM-powered data labeling, to run all the experiments that are a part of this report.
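For reference, a labeling run with Autolabel looks roughly like the snippet below. This is a minimal sketch based on Autolabel’s documented quickstart; the task name, dataset file, and label set are placeholders, and exact config keys may vary across library versions.

```python
from autolabel import LabelingAgent, AutolabelDataset

# Placeholder config for a simple classification task
config = {
    "task_name": "ToxicCommentClassification",
    "task_type": "classification",
    "model": {"provider": "openai", "name": "gpt-4"},
    "prompt": {
        "task_guidelines": "Classify the comment as 'toxic' or 'not toxic'.",
        "labels": ["toxic", "not toxic"],
        "example_template": "Input: {example}\nOutput: {label}",
    },
}

agent = LabelingAgent(config)
dataset = AutolabelDataset("comments.csv", config=config)

agent.plan(dataset)           # dry run: estimate cost and preview prompts
dataset = agent.run(dataset)  # label the dataset with the configured LLM
```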
Output quality measures how well the LLM-generated output agrees with the provided ground-truth label.
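Concretely, for a classification task this boils down to an agreement rate between the model’s outputs and the reference labels. The helper below is a hypothetical illustration of the exact-match case; tasks with free-form outputs typically use softer agreement measures such as F1 or semantic similarity.

```python
def output_quality(llm_outputs: list[str], ground_truth: list[str]) -> float:
    """Fraction of examples where the LLM output matches the reference label.

    Hypothetical helper for illustration; exact-match agreement only.
    """
    matches = sum(
        out.strip().lower() == gt.strip().lower()
        for out, gt in zip(llm_outputs, ground_truth, strict=True)
    )
    return matches / len(ground_truth)

# output_quality(["toxic", "not toxic"], ["toxic", "toxic"]) -> 0.5
```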
As mentioned in the benchmark section, we have included a few datasets specifically to evaluate LLM performance on long input contexts.
As mentioned in the benchmark section, we evaluated all LLMs on a collection of non-public datasets, spanning domains such as recruiting, financial services, STEM and e-commerce.
These datasets were not used in any training or validation splits for the RefuelLLM-2 model family. While including them in the benchmark hurts reproducibility, we believe it is critical to evaluate LLMs on non-public, task-specific datasets in order to understand their reliability and quality in real-world settings.
RefuelLLM-2’s superior quality is reinforced in the performance comparison shown above. Moreover, for both models, the quality improvement on held-out datasets relative to their respective base LLMs is a good indication of their ability to generalize.
Going one step further to understand the models’ reliability and quality in real-world settings, we also report LLM quality on datasets from specific industries and problem domains.
We observe that across verticals, RefuelLLM-2 is competitive with or superior to current state-of-the-art LLMs such as GPT-4-Turbo and Claude-3-Opus in output quality, at less than 1/10th the model size.
Building on learnings from our research on Labeling with Confidence, we use the average token generation probability as a heuristic for estimating the confidence of LLM outputs.
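As a sketch, one natural way to turn per-token log probabilities into a single score is the geometric mean of token probabilities, i.e. the exponential of the mean log probability. This is our reading of the heuristic, not necessarily the exact implementation:

```python
import math

def confidence_score(token_logprobs: list[float]) -> float:
    """Average token generation probability as a confidence heuristic.

    Computed as the geometric mean of per-token probabilities
    (exp of the mean log probability). Illustrative sketch only.
    """
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# e.g. token logprobs of [-0.05, -0.10, -0.02] -> confidence ~= 0.945
```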
To benchmark the quality of these confidence scores, we use AUROC, an aggregate score that measures a classifier's ability to distinguish between the positive class (“LLM output is correct”) and the negative class (“LLM output is incorrect”) across all score thresholds.
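Given per-example correctness labels and confidence scores, AUROC can be computed directly, e.g. with scikit-learn (the variable names and toy data below are illustrative):

```python
from sklearn.metrics import roc_auc_score

# 1 = LLM output matched the ground truth, 0 = it did not (toy data)
is_correct  = [1, 1, 1, 1, 0, 0, 0, 0]
# Confidence score assigned to each output
confidences = [0.97, 0.91, 0.88, 0.66, 0.71, 0.62, 0.55, 0.40]

# 1.0 = confidence perfectly ranks correct above incorrect; 0.5 = chance
print(roc_auc_score(is_correct, confidences))  # 0.9375 on this toy data
```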
We observe that RefuelLLM-2 and RefuelLLM-2-small produce much better-calibrated confidence scores than GPT-4 and Llama-3-70B.
Previous work in this domain has shown that RLHF-based post-training of LLMs can hurt log-probability calibration significantly. The RLHF training process can cause large spikes in the KL divergence between the model's output distribution and the original pre-trained distribution.
This can cause the model to deviate significantly from its original "world prior", hurting its ability to accurately estimate probabilities. Note that models from Anthropic (Claude) and Google don’t support returning token-level log probabilities, and hence don’t have a score assigned to them.
We train the models in two phases: the first makes the models experts at data labeling and enrichment tasks, while the second improves performance on longer-context examples.
Training for both phases was done on a cluster of 8xH100 80GB GPUs.
While the distribution of examples used in the two phases differed, both were sampled from the same collection of 2750+ unique tasks. Our training collection consists primarily of:
The final instruction-tuning dataset (after deduplication, sampling, and cleanup) consisted of ~4B tokens across both phases. We also leverage multipacking, which packs multiple short sequences into a single training example to increase training throughput; see the sketch below.
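The sketch below shows the idea behind multipacking under simplified assumptions: greedily concatenate tokenized examples into fixed-length sequences so less compute is spent on padding. Production implementations also add separator tokens and block-diagonal attention masks so packed examples don’t attend to each other; those details are omitted here.

```python
def pack_examples(examples: list[list[int]], max_len: int) -> list[list[int]]:
    """Greedily pack tokenized examples into sequences of at most max_len.

    Simplified next-fit packing; real multipacking also handles attention
    masking and position-id resets across example boundaries.
    """
    packs: list[list[int]] = []
    current: list[int] = []
    for tokens in sorted(examples, key=len, reverse=True):
        tokens = tokens[:max_len]  # truncate over-long examples
        if current and len(current) + len(tokens) > max_len:
            packs.append(current)
            current = []
        current = current + tokens
    if current:
        packs.append(current)
    return packs

# Four short examples packed into two max_len=8 sequences instead of four
print(pack_examples([[1, 2, 3], [4, 5], [6, 7, 8], [9]], max_len=8))
```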
Access to the RefuelLLM-2 models is available starting today: