We’re thrilled to announce the launch of Refuel LLM, a large language model purpose built for data labeling and enrichment tasks.
The model is publicly available via the LLM labeling playground and Autolabel, our recently released open-source library for data labeling with LLMs. We plan to share a more detailed technical summary and to open-source the model in the coming weeks.
As part of this launch, we are expanding our earlier benchmark for LLM data labeling to broaden the diversity of tasks and problem domains. In the following sections, we report results on this expanded benchmark.
* FLAN T5-XXL is an open-source model originally developed by Google; we used this checkpoint. Similarly, Llama-13b-chat is an open-source model developed by Meta. For this benchmark, we hosted both models on a single A6000 GPU (48 GB) on RunPod using vLLM.
Label quality measures how well the generated labels (by a human or LLM annotator) agree with the ground-truth label provided in the dataset. For question-answering datasets, we measure agreement with the F1 score between the generated and ground-truth answers. For all other datasets, we measure agreement with an exact match between the generated and ground-truth labels.
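To make these metrics concrete, here is a minimal sketch of how exact match and token-level F1 are commonly computed for this kind of evaluation. The normalization (lowercasing, stripping punctuation and extra whitespace) follows the SQuAD convention and may differ in details from our evaluation harness:

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def exact_match(pred: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(truth))

def token_f1(pred: str, truth: str) -> float:
    """Token-level F1 between a generated and a ground-truth answer."""
    pred_tokens = normalize(pred).split()
    truth_tokens = normalize(truth).split()
    # Multiset intersection counts tokens shared by both answers.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the Eiffel Tower", "Eiffel Tower")` gives partial credit (0.8) where `exact_match` would give 0.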
Refuel LLM outperforms human annotators and all closed-source LLMs except GPT-4. Compared to the open-source Llama-v2-13b we started with, we see significant improvements across tasks, especially on question answering and entity extraction. Here’s a table with more detailed results per dataset:
In addition to expanding our benchmark, we also evaluate all LLMs on held-out datasets that weren’t used while instruction-tuning Refuel LLM. Since closed-source models don’t disclose the datasets they were pre-trained or instruction-tuned on, we don’t know whether these datasets are held out for those models too. We select the following three datasets for evaluation in this setting:
These datasets were selected to ensure diversity in tasks and domains for this setting. Additionally, all of them were released in 2023, reducing the chance of dataset contamination in closed-source LLMs, whose pre-training cutoffs are 2021 or 2022.
While Refuel LLM provides superhuman performance out of the box, we also show that the model is easy to fine-tune for even better performance. We train Refuel LLM in low-data regimes, showcasing its quick adaptation to new tasks.
Specifically, we train on 500 rows for the symptom-to-disease dataset and 2,000 rows each for acronym_identification and SQuAD, and outperform GPT-4 in all three cases. For each of these datasets, we produced a model that outperformed GPT-4 in under 15 minutes on a cluster of 8x H100s.
We expect fine-tuning to be a critical part of our cloud product, and these results point to models that continuously improve as customers label more data via our platform.
Refuel LLM is instruction-tuned on top of the Llama-v2-13b base model, and purpose-built for data labeling and enrichment tasks.
The model was instruction-tuned on 5.24B tokens, spanning more than 2,500 unique tasks.
Most of our training data comes from the FLAN and Tasksource collections, in addition to a few proprietary datasets. A big focus of this effort was building an LLM that can output labels accurately without requiring any parsing. To this end, we preprocess all targets to contain only the label.
Additionally, we randomly transform 1% of our dataset to request a JSON output, enabling Refuel LLM to handle JSON outputs gracefully. Finally, we also preprocessed our training datasets to ensure they consist of diverse instructions, making the model easily steerable.
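A preprocessing pass of this shape can be sketched as follows. The field names (`input`, `label`, `target`) and the exact wording of the JSON instruction are illustrative assumptions, not our actual pipeline:

```python
import json
import random

def to_bare_label(example: dict) -> dict:
    """Keep only the label string as the target, so no output parsing is needed."""
    return {"input": example["input"], "target": example["label"]}

def to_json_target(example: dict) -> dict:
    """Rewrite the instruction to request JSON and emit a JSON object as the target."""
    return {
        "input": example["input"] + '\nRespond with a JSON object: {"label": ...}',
        "target": json.dumps({"label": example["label"]}),
    }

def preprocess(dataset: list[dict], json_fraction: float = 0.01, seed: int = 0) -> list[dict]:
    """Emit bare-label targets, converting a random fraction to JSON-output examples."""
    rng = random.Random(seed)
    return [
        to_json_target(ex) if rng.random() < json_fraction else to_bare_label(ex)
        for ex in dataset
    ]
```

Because the JSON examples are mixed in at a small fraction, the model defaults to bare labels but still learns the JSON output format.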
We train the model in multiple stages.
For all stages, we use a step learning-rate schedule with an initial learning rate of 1e-5, a weight decay of 0.1, a batch size of 56, and a sequence length of 4096 tokens.
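A step schedule holds the learning rate constant within an interval and multiplies it by a fixed factor at each boundary. As a minimal sketch (only the initial learning rate of 1e-5 comes from the text; the decay factor and interval below are placeholder assumptions):

```python
def step_lr(step: int,
            initial_lr: float = 1e-5,
            decay_factor: float = 0.5,   # assumption: not a disclosed value
            decay_every: int = 1000) -> float:
    """Step learning-rate schedule: constant within each interval,
    scaled by decay_factor at every interval boundary."""
    return initial_lr * decay_factor ** (step // decay_every)
```

With these placeholder values, the rate stays at 1e-5 for the first 1000 steps, then halves at each subsequent boundary.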
Access to Refuel LLM is available starting today at the following places:
We plan to share a technical report detailing the decisions we made while training and evaluating the model, and to open-source the model in the coming weeks.
If you discover any issues or have suggestions to share with us, come say hi to the team on our Discord.
This work was made possible by helpful discussions with Arvind and Enrico Shippole. We would like to thank the creators of the FLAN and Tasksource collections for building them and making them open source.
We are grateful to the maintainers of transformers, TGI, vLLM, and llama-recipes. We also thank Meta for making Llama-v2 available to the community. Finally, we want to thank Mosaic and RunPod for the infrastructure resources we relied on heavily for training and evaluation.