Knowledge Distillation: How to Make LLMs Lighter Without Sacrificing Accuracy


This article is an op-ed authored by Kirill Starkov.


The development of modern LLMs has led to incredible results: state-of-the-art performance and high quality, but also, unfortunately, steep computational costs. Engineers often choose smaller models simply because they are cheaper to run and don’t require special hardware.

Knowledge distillation was invented to address this issue: it is a way to save time and money while preserving high-quality performance. Our expert, Kirill Starkov, Senior Machine Learning Engineer, comments on this technology and shares his own experience.

How does knowledge distillation work?

The idea of knowledge distillation (KD) can be explained through a ‘teacher-student’ interaction: knowledge is transferred from a large language model to a small one. The ‘student’ model aims to perform almost as well as its ‘teacher’ while being far more suitable for deployment.

There are two ways to train the ‘student’ model: hard- and soft-label distillation. 

‘Hard-label distillation has three stages:

  1. Prompt collection
  2. Answer generation by the “teacher” model
  3. Labeled dataset formation

After that, the small model learns to imitate the answers of the large model, using the labeled dataset as ground truth.’
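For illustration, here is a minimal sketch of that pipeline in Python, assuming the Hugging Face transformers library; the model names are placeholders, not the models discussed in the article.

```python
# Hypothetical sketch of hard-label distillation: the teacher answers a set of
# prompts, and the student is fine-tuned on the resulting (prompt, answer)
# pairs with ordinary cross-entropy. Model names below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "big-teacher-llm"    # placeholder for a large model
student_name = "small-student-llm"  # placeholder for a small model

tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name).eval()
student = AutoModelForCausalLM.from_pretrained(student_name)

# Stage 1: prompt collection
prompts = ["Explain knowledge distillation in one sentence.",
           "What is cross entropy?"]

# Stage 2: the teacher generates answers (greedy decoding for simplicity)
records = []
with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        out = teacher.generate(ids, max_new_tokens=64)
        records.append(tok.decode(out[0], skip_special_tokens=True))

# Stage 3: the labeled dataset (prompt + teacher answer) is treated as ground
# truth; the student is trained with standard next-token cross-entropy.
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
for text in records:
    batch = tok(text, return_tensors="pt")
    loss = student(**batch, labels=batch["input_ids"]).loss  # hard-label CE
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```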

Hard-label distillation is simpler and has lower computational costs than soft-label distillation, but the latter is more accurate because it transfers the large model’s full predictive distribution.

‘Soft labels teach better than hard targets because, when they have high entropy, they provide more information per training case and much less variance in the gradient between training cases. As a result, the “student” model can be trained on much less data than the original “teacher” model.’
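As a toy illustration of this point (not from the article), the snippet below compares a one-hot hard target with the teacher’s temperature-scaled softmax: the soft version exposes how close the runner-up classes are, which is exactly the extra signal the student learns from.

```python
# Minimal illustration: a hard one-hot target vs. the teacher's full
# distribution. A higher temperature T makes the class similarities that the
# teacher has learned more visible to the student. Numbers are toy values.
import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([4.0, 2.5, 0.5, -1.0])  # toy logits for 4 classes

hard_target = torch.tensor([1.0, 0.0, 0.0, 0.0])      # one-hot: class 0 only
soft_t1 = F.softmax(teacher_logits / 1.0, dim=-1)     # T = 1
soft_t4 = F.softmax(teacher_logits / 4.0, dim=-1)     # T = 4, softer

print(hard_target)  # tensor([1., 0., 0., 0.])
print(soft_t1)      # roughly [0.79, 0.18, 0.02, 0.01]
print(soft_t4)      # roughly [0.42, 0.29, 0.17, 0.12] -- class similarities preserved
```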

One of the most important quantities in ML training is the loss function, typically cross entropy. KD requires an additional kind of loss: the soft loss. ‘Soft loss is a weighted cross entropy, where we assign different weights to prevent false positives or false negatives from the “teacher” model.’

The Kullback-Leibler divergence (KL divergence) formula is used to compute the distillation loss:

L_KD = KL(softmax(z_t / T) || softmax(z_s / T)) ⋅ T²

where T is the temperature (usually > 1), and z_t and z_s are the logits of the teacher and student models, respectively.

Hard Target Loss function

L_CE = CrossEntropy(y_true, softmax(z_s))

Total Loss (Combined)

L = α ⋅ L_CE + (1 − α) ⋅ L_KD

Where α is a hyperparameter (commonly 0.1 to 0.9)
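Putting these three formulas together, a minimal PyTorch sketch of the combined loss might look as follows; the function name and the toy batch are illustrative only, and the T² rescaling follows the distillation-loss formula above.

```python
# Sketch of the combined distillation loss. PyTorch's kl_div expects
# log-probabilities for the input and probabilities for the target, so it
# computes KL(teacher || student) as in the formula above. Variable names
# (z_t, z_s, T, alpha) mirror the article's notation.
import torch
import torch.nn.functional as F

def distillation_loss(z_s, z_t, y_true, T=2.0, alpha=0.5):
    """L = alpha * L_CE + (1 - alpha) * L_KD."""
    # Soft part: KL divergence between temperature-scaled distributions,
    # rescaled by T^2 so its gradient magnitude matches the hard loss.
    p_t = F.softmax(z_t / T, dim=-1)            # teacher probabilities
    log_p_s = F.log_softmax(z_s / T, dim=-1)    # student log-probabilities
    l_kd = F.kl_div(log_p_s, p_t, reduction="batchmean") * (T ** 2)

    # Hard part: ordinary cross entropy against the ground-truth labels.
    l_ce = F.cross_entropy(z_s, y_true)

    return alpha * l_ce + (1.0 - alpha) * l_kd

# Toy usage: a batch of 8 examples over a 100-class output space.
z_s = torch.randn(8, 100, requires_grad=True)   # student logits
z_t = torch.randn(8, 100)                       # teacher logits
y_true = torch.randint(0, 100, (8,))
loss = distillation_loss(z_s, z_t, y_true, T=2.0, alpha=0.3)
loss.backward()
```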

Knowledge distillation implementation 

Knowledge distillation is often used in projects with limited operational resources, where deploying cumbersome LLMs is impossible.

‘Knowledge distillation is a must-have in computer vision and object detection programmes. Smaller models are suitable for deployment on devices with limited processing resources, such as security cameras and drones.’

Small models are also used in natural language processing programmes. ‘NLP requires real-time responses with high speed and efficiency, so trained “student” models are perfect for chatbots, translation programmes, and other applications running on mobile devices.’

Deployment case: DSSL Computer Vision 

As mentioned before, knowledge distillation is used in modern CV technologies. Kirill Starkov decided to improve a security detector device by deploying a small distilled model.

‘In that case we saw that knowledge distillation is actually useful, because we checked results with a special metric: mean average precision.’

Mean Average Precision (mAP) measures the accuracy of object detectors. It provides a single number that summarizes the precision-recall curve, reflecting how well a model is performing across different threshold levels. ‘Before KD deployment our mAP was 27.4; after—34.2.’
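For readers unfamiliar with the metric, here is a simplified sketch of how average precision for a single class can be computed from ranked detections (mAP is the mean of AP over all classes); real benchmarks such as Pascal VOC and COCO use interpolated variants, so treat this as an illustration only.

```python
# Simplified average precision for one class: sort detections by confidence,
# accumulate precision and recall, and take the area under the resulting
# precision-recall curve. This is an illustrative sketch, not the article's
# actual evaluation code.
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    order = np.argsort(-np.asarray(scores))                  # highest confidence first
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / num_ground_truth
    # Step-wise area under the precision-recall curve.
    area = np.sum((recall[1:] - recall[:-1]) * precision[1:])
    return float(area + recall[0] * precision[0])

# Toy example: 5 detections for a class with 4 ground-truth objects.
ap = average_precision(scores=[0.9, 0.8, 0.7, 0.6, 0.5],
                       is_true_positive=[1, 1, 0, 1, 0],
                       num_ground_truth=4)
print(round(ap, 3))
```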

Advantages and disadvantages of knowledge distillation

KD is ultimately about better efficiency: common advantages are reduced operational costs, faster inference, and preservation of the complex patterns learned by the large model.

But this technology has some disadvantages. A mismatch between training conditions and inference can lead to exposure bias, because the ‘student’ model never learns how to correct its own mistakes.

Soft-label distillation is computationally expensive during training, since full probability distributions rather than individual token indices are stored and processed.
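A rough back-of-the-envelope comparison (with assumed numbers, not figures from the article) shows the scale of the difference:

```python
# Assumed illustration: hard labels store one token index per position, while
# soft labels store a full probability distribution over the vocabulary.
vocab_size = 50_000          # assumed vocabulary size
seq_len = 1_024              # assumed sequence length

hard_bytes = seq_len * 4                 # one int32 index per token
soft_bytes = seq_len * vocab_size * 2    # one float16 per vocabulary entry per token

print(hard_bytes)            # 4096 bytes, i.e. about 4 KB per sequence
print(soft_bytes / 2**20)    # roughly 98 MB per sequence
```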

It also requires deeper student-teacher integration to access the internal probabilities of a large model, making it more difficult to implement than standard approaches.

This story was originally published on 23 October 2021.
