Finding the right SLM for your needs - a guide to Small Language Models


October 31, 2024

by Rajas Bansal

Over the past few months, there has been a surge in open-source models with parameters in the “smaller” range (1-3B), including Microsoft's Phi-3-mini (3.8B), Google’s Gemma-2 (2B), and Meta’s Llama-3.2 (1B & 3B).

These Small Language Models (SLMs) feature fewer parameters than larger language models (LLMs), but their key advantage lies in efficient deployment on mobile and edge devices, alongside their superior speed and cost-effectiveness. However, this efficiency often comes at the expense of the broader reasoning capabilities typical of LLMs. Despite this, SLMs excel in specific tasks such as summarization, writing suggestions, and structured data extraction from text.

Training Techniques for SLMs

SLMs are primarily trained using two methods:

  1. Pruning: Reducing the parameters of an existing LLM (e.g., Llama-3.2 1B derived from Llama-3.1 8B).
  2. Knowledge Distillation: Leveraging the logits from a larger, more powerful "teacher" model. Studies [https://arxiv.org/pdf/2408.00118] show that learning from a teacher is more efficient than training a model from scratch (see the sketch below).
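
To make the distillation idea concrete, here is a minimal sketch of a logit-distillation loss in PyTorch. It assumes the teacher and student share a tokenizer and vocabulary and that logits are already aligned with the ground-truth labels; the function name and hyperparameters are illustrative, not the recipe used by any of the models above.

```python
# Minimal logit-distillation loss sketch (assumes shared teacher/student vocab
# and logits already aligned with labels; hyperparameters are illustrative).
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: the teacher's distribution, softened by the temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(student_log_probs, soft_targets,
                       reduction="batchmean") * (temperature ** 2)

    # Standard next-token cross-entropy against the ground-truth labels.
    ce_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    # Blend the two; alpha controls how closely the student follows the teacher.
    return alpha * kd_loss + (1 - alpha) * ce_loss
```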

Notably, SLMs are often over-trained relative to the Chinchilla scaling laws, which predict that roughly 20 billion tokens are compute-optimal for training a 1B parameter model. For comparison, Llama-3.2 1B was trained on a staggering 9 trillion tokens (roughly 450x the Chinchilla-optimal amount), the most among open-source counterparts. This additional training enhances the model’s capabilities while maintaining its small size and low latency.

Narrowing the Performance Gap

The performance gap between SLMs and LLMs is steadily shrinking. As the MMLU performance chart below demonstrates, the model size required to achieve a given benchmark score (e.g., 65 on MMLU) has decreased over time. Furthermore, as LLMs continue to improve, they will generate higher-quality synthetic data for SLM training, further accelerating advancements in SLM capabilities.

MMLU graph with respect to model size

Finding the right size model 

As mentioned, the efficiency of smaller language models often comes at the expense of the broader reasoning capabilities typical of LLMs. We set out to investigate the correlation between model size and performance by measuring each model under identical conditions: temperature set to 0, 2k input tokens, and 100 output tokens per request. We then measured the time taken to serve 500 requests.
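
As a rough illustration, the measurement loop looks something like the sketch below, assuming an OpenAI-compatible serving endpoint (e.g., a local vLLM server); the URL, model name, and prompt are placeholders, not our exact harness.

```python
# Rough latency-measurement sketch against an OpenAI-compatible endpoint.
# The URL, model name, and prompt below are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

PROMPT = "lorem ipsum " * 700   # stand-in for a ~2k-token input
NUM_REQUESTS = 500

start = time.perf_counter()
for _ in range(NUM_REQUESTS):
    client.chat.completions.create(
        model="meta-llama/Llama-3.2-1B-Instruct",  # swap in each model under test
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,    # deterministic decoding
        max_tokens=100,   # fixed output length for every model
    )
elapsed = time.perf_counter() - start
print(f"{NUM_REQUESTS} requests in {elapsed:.1f}s "
      f"({1000 * elapsed / NUM_REQUESTS:.0f} ms/request)")
```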

Here’s what we learned:

Latency

All experiments were conducted on an A6000 GPU. We used the instruct versions of the models unless otherwise specified.

Latency per model


We observed that SLMs yielded significantly lower latency than LLMs, ranging from roughly 50% down to as little as 10% of LLM latency, due to the smaller amount of computation required per request.

Within a model family, latency scales roughly with parameter count; for example, Llama-3.2 3B has approximately three times the latency of Llama-3.2 1B. However, the scaling factor can differ between model families depending on the architecture: Qwen models are faster than Gemma at the same parameter count, owing to differences in architectural choices.

We then sought to investigate whether this speed comes at the expense of quality. We ran these models on our benchmark (https://www.refuel.ai/blog-posts/announcing-refuel-llm-2), reporting numbers on the full benchmark as well as on the held-out portion (certain private datasets collected by Refuel).

Benchmark per model

When evaluating quality, Gemma-2-2B, Qwen2.5-3B, and Phi-3-mini (3.8B) achieved performance comparable to Llama-3.1-8B (which was released only a few months earlier). This shows that SLMs are steadily improving and now come quite close to LLMs in performance.

The Gemma models are the strongest in their size range, followed by the Qwen2.5 models, with the Llama models the weakest on our benchmark. Gemma and Qwen2.5 appear to be the most useful SLMs when trading performance for latency and throughput.

We then investigated the relationship between performance and the amount of training data needed, across three publicly available datasets: “Banking” (https://huggingface.co/datasets/legacy-datasets/banking77), “Company” (missing reference), and “Ledgar” (https://huggingface.co/datasets/coastalcph/lex_glue/viewer/ledgar).

Fine Tuning Data Needed

Fine tuning data needed graph


The experiment produced the following observations:

Case 1: Limited Data (500 examples or fewer):
If you only have around 100–200 examples, that is typically insufficient for fine-tuning a Small Language Model (SLM). Due to their smaller parameter count, SLMs lack the capacity to learn effectively from such limited data. In this case, a larger language model (LLM) is more appropriate, as its greater capacity allows it to generalize better from smaller datasets.

Case 2: Sufficient Data (2,000 examples or more):
When you have a larger dataset (e.g., around 2,000 examples), the story changes. At this scale, an SLM can learn effectively, and the performance you would achieve by fine-tuning an SLM can be comparable to that of fine-tuning an LLM. Since both models perform similarly with sufficient data, opting for an SLM is the more efficient choice—it provides similar results at a lower cost and with less computational overhead.

Ultimately, the decision comes down to the availability of training data. If your dataset is limited, an LLM is the better option due to its greater learning capacity. However, if you have enough data, an SLM can deliver equivalent performance to an LLM at a fraction of the cost. This makes SLMs the ideal choice for resource-conscious deployments when adequate training data is available.
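
If you do have a few thousand labeled examples, fine-tuning an SLM is straightforward with standard open-source tooling. Below is a minimal sketch using the Hugging Face transformers and peft libraries; the model name and LoRA settings are illustrative assumptions, not a recommendation tied to the results above.

```python
# Minimal LoRA fine-tuning setup sketch (Hugging Face transformers + peft).
# Model name and LoRA settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.2-1B-Instruct"   # any 1-3B SLM works here
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# LoRA keeps the number of trainable parameters small, which helps when the
# fine-tuning set is only a few thousand examples.
lora = LoraConfig(r=16, lora_alpha=32,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# ...tokenize the ~2,000 examples and train with transformers.Trainer as usual.
```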

Choosing Between SLMs and LLMs

The decision to use Small Language Models (SLMs) versus Large Language Models (LLMs) largely depends on the specific requirements of your product. If an SLM can deliver the same quality as an LLM at a lower cost and faster speed, the determining factor becomes the amount of training data available. You need to ask: what task are you trying to accomplish, and what volume of data will flow through the system?

In recent months, the performance of SLMs has steadily improved, particularly in handling simpler, non-reasoning tasks—such as extraction, cleaning, structuring—that don’t require complex problem-solving. These are exactly the types of tasks Refuel is focused on. If you have sufficient training data and the task is straightforward, using an SLM is likely the best choice, as it provides similar performance to an LLM, but at a much lower price point.

Do SLMs have any limitations?

While Small Language Models (SLMs) offer several advantages, they also come with certain limitations compared to Large Language Models (LLMs):

  1. Higher Data Requirements: SLMs generally need more training or fine-tuning data to achieve the same performance levels as LLMs. If you are working in a data-constrained environment, SLMs may not deliver optimal performance.
  2. Limited Reasoning Capabilities: SLMs are not well-suited for tasks that require extensive reasoning, complex context, or deep understanding. For these types of applications, LLMs are a better choice due to their higher capacity for handling nuanced, multi-step tasks.
  3. Hyperparameter Sensitivity: SLMs tend to be more sensitive to hyperparameters during fine-tuning, which may require more effort and experimentation to achieve an optimal model compared to fine-tuning an LLM.

How should I evaluate the right SLM for my use case?

When choosing a Small Language Model (SLM) for your specific use case, it's crucial to consider the model's latency. Latency is often influenced by the number of parameters in the model, with SLMs typically ranging from 1 billion to 3 billion parameters.

Depending on the complexity of your use case, you may need to balance performance with latency. Even SLMs with the same parameter count can have varying latencies, so it's important to compare models based on their performance in your specific environment.

Additionally, context window size is another factor to consider. While newer SLMs come with larger context windows, older models might be limited to 8,000 tokens. This could impact the model’s ability to handle longer inputs or complex interactions, so it’s worth evaluating based on your requirements.
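
A quick way to check a candidate model's context window is to read it off the model config, as in the sketch below; it assumes the Hugging Face transformers library, and the model name is just an example.

```python
# Assumes the Hugging Face transformers library; the model name is an example.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
# max_position_embeddings is where the context window is usually recorded;
# newer SLMs report 32k or more, while older models may show 8192.
print(config.max_position_embeddings)
```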

What are the hardware requirements for using an SLM?

SLMs, typically in the 1B–3B parameter range, are compact enough to run on a single A10 GPU without the need for quantization. In contrast, larger LLMs often require two A10 GPUs just to fit within memory. This is an important consideration when selecting a model, as the hardware requirements can significantly impact the cost and complexity of deployment.
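
As a rough back-of-the-envelope check, weight memory in fp16/bf16 is about two bytes per parameter. The sketch below is an assumption-laden estimate and ignores the KV cache and framework overhead.

```python
# Back-of-the-envelope weight-memory estimate for fp16/bf16 serving (about two
# bytes per parameter). Real usage also depends on the KV cache, batch size,
# and framework overhead, so treat these numbers as rough lower bounds.
def weight_memory_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    return num_params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes per GB

for name, size_b in [("Llama-3.2-1B", 1.2), ("Llama-3.2-3B", 3.2), ("Llama-3.1-8B", 8.0)]:
    print(f"{name}: ~{weight_memory_gb(size_b):.0f} GB of weights")
# A 1-3B model fits comfortably on a single 24 GB A10, whereas an 8B model's
# weights alone take ~16 GB, leaving little headroom once the KV cache is added.
```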

What sort of adoption will SLMs have in the future?

Many tasks that companies face involve simple data transformations, such as extracting names from emails, normalizing addresses, pulling data from LinkedIn profiles, or identifying a person’s work experience. These are straightforward extractive tasks that don’t require extensive reasoning but still benefit from the precision of language models. In such cases, Small Language Models (SLMs) can be highly effective, especially when large amounts of data need to be processed efficiently.

For example, when working with a massive database, the difference between using an SLM versus an LLM could mean completing the task in days rather than weeks.

SLMs are also ideal for edge use cases where models need to operate locally on devices. A prime example is Apple’s rumored 3 billion-parameter model running on mobile devices. In scenarios where low-latency, on-device processing is critical, SLMs offer the perfect solution for scaling these simpler tasks.

What sort of solutions make it easy to use SLMs successfully?

Refuel makes it easy for anyone to harness the power of Small Language Models (SLMs). We enable users to define their own tasks and deploy SLMs out-of-the-box with built-in feedback mechanisms. Refuel handles key processes like few-shot learning, ensuring that SLMs are effective and easy to implement for a wide range of use cases.

Additionally, we help users distill larger models when they have limited training data. In this process, users first leverage a larger model—such as GPT-4 or Refuel's own LLM-large—to label the dataset. This generates enough data points to fine-tune a smaller model, allowing users to benefit from the efficiency and reduced latency of SLMs.
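
Conceptually, the labeling step of that workflow looks something like the sketch below, using the OpenAI Python client as a stand-in for whichever large model does the labeling; the prompt, model name, and examples are illustrative, and this is not Refuel's actual API.

```python
# Simplified "label with a large model, then fine-tune a small one" sketch.
# Prompt, model name, and examples are illustrative, not Refuel's API.
from openai import OpenAI

client = OpenAI()
unlabeled = ["Wire transfer of $500 failed", "Card declined at checkout"]  # toy examples

labeled = []
for text in unlabeled:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Classify this banking request into a single intent label:\n{text}"}],
        temperature=0,
    )
    labeled.append({"text": text, "label": resp.choices[0].message.content.strip()})

# `labeled` can now serve as fine-tuning data for a 1-3B SLM.
```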

Refuel also offers its own high-performance SLM, which excels in labeling tasks relevant to Refuel and our customers, consistently outperforming other SLMs in the same weight category for these specific use cases.