Announcing the launch of Refuel LLM, a large language model purpose-built for data labeling and enrichment tasks.
In this post, we examine different techniques for estimating the confidence of LLM-generated labels, and demonstrate how to leverage these estimates to automatically reject low-confidence labels and ensemble LLMs optimally.
In this report, we compare the latest models from OpenAI against their previous versions on a data labeling benchmark, and find that gpt-3.5-turbo performs worse on 6 of 8 datasets, while gpt-4 performance remains the same.
In this report, we show that LLMs can label datasets 20x faster and 7x cheaper than skilled human annotators, at the same or better quality.