Announcing Refuel-LLM

October 16, 2023

by Refuel Team

We’re thrilled to announce the launch of Refuel LLM, a large language model purpose-built for data labeling and enrichment tasks.

Key takeaways

  • Refuel LLM (84.2%) outperforms trained human annotators (80.4%), GPT-3.5-turbo (81.3%), PaLM-2 (82.3%) and Claude (79.3%) across a benchmark of 15 text labeling datasets.
  • Refuel LLM is easy to finetune on a target domain, an important lever for improving performance and for reducing inference costs via shorter prompts. For each of the datasets used in our finetuning experiments, we were able to get to a model that outperformed GPT-4 with less than 15 minutes of training on a cluster of 8x H100s. On average, finetuning improved label quality by 12.2% across datasets.
  • Refuel LLM is a Llama-v2-13b base model, trained on over 2500 unique datasets (5.24B tokens) spanning categories such as classification, entity resolution, matching, reading comprehension and information extraction.

The model is publicly available via the LLM labeling playground, and Autolabel (our recently released open source library for data labeling with LLMs). We plan to share a more detailed technical summary, and open source the model in the coming weeks.

Results

Benchmark

As part of this launch, we are expanding our earlier benchmark for LLM data labeling to increase the diversity of tasks and problem domains. In the following sections, we report results on this expanded benchmark.

List of datasets used for the labeling benchmark

List of models

* FLAN T5-XXL is an open-source model originally developed by Google. We used this checkpoint. Similarly, Llama-13b-chat is an open-source model developed by Meta. For this benchmark, we hosted the models on a single A6000 GPU (48 GB) on RunPod IO using vLLM.

Label Quality

Label quality measures how well the generated labels (by human or LLM annotator) agree with the ground truth label provided in the dataset. For question answering datasets, we measure agreement with the F1 score between generated and ground truth answer. For all other datasets, we measure agreement with an exact match between the generated and ground truth label. 
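
For concreteness, here is a minimal sketch of these two agreement metrics in Python. The function names, tokenization, and normalization are illustrative assumptions rather than the benchmark’s exact implementation.

```python
# Sketch of the two agreement metrics: exact match (most datasets) and
# token-level F1 (question answering). Assumes each dataset provides
# (generated_label, ground_truth_label) string pairs.
from collections import Counter

def exact_match(generated: str, truth: str) -> float:
    """1.0 if the generated label matches the ground-truth label exactly, else 0.0."""
    return float(generated.strip().lower() == truth.strip().lower())

def token_f1(generated: str, truth: str) -> float:
    """Token-level F1 between generated and ground-truth answers (QA datasets)."""
    gen_tokens = generated.lower().split()
    truth_tokens = truth.lower().split()
    overlap = sum((Counter(gen_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

def label_quality(pairs, is_qa: bool = False) -> float:
    """Average agreement with the ground truth across a dataset."""
    metric = token_f1 if is_qa else exact_match
    return sum(metric(gen, truth) for gen, truth in pairs) / len(pairs)
```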

Label quality (% agreement with ground truth labels) averaged across 15 datasets

Refuel LLM outperforms human annotators and all closed source LLMs except GPT-4. Compared to the open source Llama-v2-13b model we started with, we see significant improvements across tasks, especially on question answering and entity extraction. Here’s a table with more detailed results per dataset:

Label quality (% agreement with ground truth labels).

Holdout datasets

In addition to expanding our benchmark, we also evaluate all LLMs on held-out datasets that weren’t used while instruction-tuning Refuel LLM. Since closed-source models don’t disclose the list of datasets they were pre-trained or instruction-tuned on, we don’t know whether these datasets are held out for those models as well. We select the following three datasets for evaluation in this setting:

  1. Symptom-to-disease
  2. Belebele
  3. MultiCoNER

These datasets were selected to ensure we have diversity in tasks and domains for this setting. Additionally, all these datasets were released in 2023, reducing the chances of dataset contamination in closed source LLMs as their pre-training cutoffs are 2021 or 2022.

Label quality (% agreement with ground truth labels) on holdout datasets

Fine Tuning

While Refuel LLM provides superhuman performance out of the box, we also show that the model is easy to finetune, improving performance even further. We finetune Refuel LLM in low-data regimes, showcasing how quickly it adapts to new tasks.

Specifically, we train on 500 rows for the symptom-to-disease dataset and 2000 rows for acronym_identification and SQuAD, and outperform GPT-4 in all three cases. For each of these datasets, we were able to get to a model that outperformed GPT-4 in less than 15 minutes of training on a cluster of 8x H100s.
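
As a rough illustration of what this kind of low-data finetuning looks like, here is a minimal sketch using Hugging Face transformers. The model id, dataset file, column names, and hyperparameters are assumptions for illustration; they are not the exact recipe used to finetune Refuel LLM.

```python
# Minimal low-data finetuning sketch with Hugging Face transformers.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "meta-llama/Llama-2-13b-hf"  # stand-in for the Refuel LLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Assume a small labeled set (e.g. ~500 rows) with "prompt" and "label" columns.
dataset = load_dataset("json", data_files="symptom_to_disease.jsonl")["train"]

def tokenize(example):
    # Concatenate the labeling prompt and the target label into one sequence.
    return tokenizer(example["prompt"] + "\n" + example["label"],
                     truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="refuel-llm-finetuned",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=1e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```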

We expect finetuning to be a critical part of our cloud product, and these results let us offer models that continuously improve as customers label more data via our platform.

Methodology

Refuel LLM is instruction-tuned on top of the Llama-v2-13b base model, and purpose-built for data labeling and enrichment tasks.

Datasets

The model was instruction-tuned on 5.24B tokens, comprising more than 2500 unique tasks. 

Most of our training data comes from the FLAN and Tasksource collections, in addition to a few proprietary datasets. A big focus of this effort was to build an LLM that can accurately output labels without requiring any parsing. To this end, we preprocess all targets so that they contain only the label.

Additionally, we randomly transform 1% of our dataset to query for a JSON output, thus enabling Refuel LLM to handle JSON outputs gracefully. Finally, we also pre-processed our training datasets to ensure that they consist of diverse instructions, making the model easily steerable.
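
To make the two transformations above concrete, here is an illustrative sketch. The field names and the example record are assumptions about the data schema, not the actual training format.

```python
# Sketch of the two preprocessing steps: label-only targets and a ~1% random
# transformation that asks for (and provides) a JSON-formatted output.
import json
import random

def to_label_only_target(example: dict) -> dict:
    """Keep only the bare label as the target, so no output parsing is required."""
    example["target"] = example["label"].strip()
    return example

def maybe_jsonify(example: dict, p: float = 0.01) -> dict:
    """With probability p (~1% of the data), request and provide a JSON output."""
    if random.random() < p:
        example["instruction"] += ' Respond as JSON: {"label": "<your label>"}'
        example["target"] = json.dumps({"label": example["label"].strip()})
    return example

raw_examples = [
    {"instruction": "Classify the sentiment of the review.",
     "input": "The product arrived late and broken.",
     "label": "negative"},
]
processed = [maybe_jsonify(to_label_only_target(ex)) for ex in raw_examples]
```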

Training procedure and Hyperparameters


We train the model in multiple stages.

  • The first stage focuses on improving the instruction-following behavior of the base LLM. This comprises the bulk of the training, and we use as much data as possible across datasets, without biasing towards any specific task or domain. We sample elements from each dataset with probability inversely proportional to its average number of output tokens (see the sketch after this list).
  • Later stages focus on training the model on tasks that are relevant to labeling. All tasks that require generating long outputs, such as open-ended generation and translation, are excluded, and only labeling-relevant tasks such as classification, question answering, and entity recognition are included.
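
Here is the sampling sketch referenced above: a small example of computing dataset-level sampling weights inversely proportional to the average number of output tokens. The dataset names and token counts are made up for illustration.

```python
# Sampling weights inversely proportional to average output length per dataset.
avg_output_tokens = {
    "topic_classification": 3.0,     # average output tokens per example
    "entity_extraction": 25.0,
    "reading_comprehension": 60.0,
}

inverse = {name: 1.0 / tokens for name, tokens in avg_output_tokens.items()}
total = sum(inverse.values())
sampling_weights = {name: w / total for name, w in inverse.items()}

# Datasets with short outputs (e.g. single-label classification) are sampled
# more often than datasets with long outputs.
print(sampling_weights)
```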

For all stages, we use a step learning-rate schedule with an initial learning rate of 1e-5, a weight decay of 0.1, a batch size of 56, and a sequence length of 4096 tokens.
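
As a hedged sketch, the hyperparameters above could be captured in a config like the following, together with a simple step learning-rate schedule. Only the initial learning rate, weight decay, batch size, and sequence length come from the text; the decay factor and milestone steps are assumptions.

```python
# Illustrative training config; values not listed in the post are assumptions.
training_config = {
    "base_model": "meta-llama/Llama-2-13b-hf",
    "initial_learning_rate": 1e-5,
    "lr_schedule": "step",
    "weight_decay": 0.1,
    "global_batch_size": 56,
    "max_sequence_length": 4096,  # tokens
}

def step_lr(step: int, initial_lr: float = 1e-5, decay: float = 0.1,
            milestones: tuple = (10_000, 20_000)) -> float:
    """Drop the learning rate by `decay` at each milestone step (milestones
    and decay factor are assumed, only the initial LR comes from the text)."""
    lr = initial_lr
    for milestone in milestones:
        if step >= milestone:
            lr *= decay
    return lr
```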

Access


Access to Refuel LLM is available starting today via the LLM labeling playground and Autolabel, our open source library for data labeling with LLMs.

We plan to share a technical report (detailing the decisions we made while training and evaluating the model), and open source the model in the coming weeks.

If you discover any issues or have suggestions to share with us, come say hi to the team on our Discord.

Acknowledgements


This work was made possible by helpful discussions with Arvind and Enrico Shippole. We would like to thank the creators of the FLAN and Tasksource collections for building them and making them open source.

We are grateful to the maintainers of transformers, TGI, vLLM and llama-recipes. We also thank Meta for making Llama-v2 available to the community. Finally, we want to thank Mosaic and Runpod for the infrastructure resources we relied on heavily for training and evaluation.