Dynamic Depth Decoding: Faster Speculative Decoding for LLMs
Oscar Brown / Zhengjie Wang / Andrea Do / Nikhil Mathew / Cheng Yu
ML Research Labs, Canberra, Australia
Australian National University
Abstract
The acceleration of Large Language Models (LLMs) with speculative decoding provides a significant runtime improvement without any loss of accuracy. Currently, EAGLE-2 is the state-of-the-art speculative decoding method, improving on EAGLE with a dynamic draft tree. We introduce Dynamic Depth Decoding (DDD), which optimises EAGLE-2’s tree drafting method using a dynamic depth. This extends the average speedup that EAGLE-2 achieves over EAGLE by 44%, giving DDD an average speedup of 3.16x.
Introduction
Large Language Models (LLMs) (Brown et al., 2020; Touvron et al., 2023) have demonstrated impressive performance across a wide range of tasks. However, their large number of parameters makes inference too slow for many applications.
Speculative Decoding (Leviathan et al., 2023) addresses this by accelerating an LLM, known as the target model. For each forward pass, the algorithm uses a much smaller draft model to generate a sequence of tokens that is then input to the target model. A single run of the target model is sufficient both to verify the drafted tokens up to the first incorrect one and to generate the token that should follow the verified sequence. This yields a speedup by producing several tokens per forward pass of the target model. Notably, speculative decoding is lossless, since every emitted token is verified as correct by the target model.
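To make the draft-then-verify loop concrete, the following is a minimal sketch of a greedy speculative decoding step; the draft_model and target_model interfaces and the accept-until-first-mismatch rule are simplifications for illustration, not the exact rejection-sampling scheme of Leviathan et al. (2023).

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, prefix_ids, k=5):
    """One draft-then-verify step (greedy variant, illustrative only)."""
    # 1) Draft k candidate tokens autoregressively with the small model.
    draft_ids = prefix_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2) A single target-model pass scores every drafted position at once.
    target_logits = target_model(draft_ids).logits
    prefix_len = prefix_ids.shape[1]
    # The target's own choice at each drafted position (shifted by one).
    target_choice = target_logits[:, prefix_len - 1:-1, :].argmax(dim=-1)
    drafted = draft_ids[:, prefix_len:]

    # 3) Accept drafted tokens up to (not including) the first disagreement.
    matches = (target_choice == drafted)[0].int()
    n_accept = int(matches.cumprod(dim=0).sum().item())

    # 4) The target model also provides the token following the accepted
    #    prefix, so at least one new token is generated per forward pass.
    accepted = draft_ids[:, :prefix_len + n_accept]
    bonus = target_logits[:, prefix_len + n_accept - 1, :].argmax(dim=-1, keepdim=True)
    return torch.cat([accepted, bonus], dim=-1)
```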
Extrapolation Algorithm for Greater Language-model Efficiency (EAGLE) (Li et al., 2024b) is a state-of-the-art speculative decoding method, whose key feature is the construction of a draft model from the embedding layer and LM head of the target model with a single trainable head in between. In its first release, EAGLE generated a tree of tokens from the draft model and adjusted the target model's attention mask so that the entire tree could be input to the target model simultaneously. This tree has the structure shown in Figure 2, with the best tokens generated from each previous token placed on the left. Although the tree selects the tokens with the highest draft logprobs after each token, its structure is static, with no dependence on the draft model's output.
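A rough sketch of this draft-model construction is given below; the choice of a single transformer decoder layer as the trainable head and the way the target model's hidden features are consumed are assumptions for illustration, not EAGLE's exact architecture.

```python
import torch.nn as nn

class EagleStyleDraftHead(nn.Module):
    """Illustrative draft model: the target's embedding layer and LM head are
    reused and frozen, with a single trainable block in between."""

    def __init__(self, target_embedding, target_lm_head, hidden_size, n_heads=8):
        super().__init__()
        self.embed = target_embedding      # shared with the target model
        self.lm_head = target_lm_head      # shared with the target model
        # The only trainable component sits between the two shared layers.
        self.trainable_block = nn.TransformerDecoderLayer(
            d_model=hidden_size, nhead=n_heads, batch_first=True
        )
        for shared in (self.embed, self.lm_head):
            for p in shared.parameters():
                p.requires_grad = False

    def forward(self, input_ids, target_features):
        # EAGLE also conditions on the target model's hidden features;
        # here they are simply consumed as decoder memory for illustration.
        x = self.embed(input_ids)
        h = self.trainable_block(x, memory=target_features)
        return self.lm_head(h)
```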
EAGLE-2 (Li et al., 2024a) improves on this static tree method by introducing a dynamic draft tree. The tree is built with a beam search: after each run of the draft model, the top-k token sequences are chosen as the next input to the draft model, using the sum of the logprobs along each sequence as the heuristic for selecting the top-k.
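The following sketch illustrates this kind of beam-style tree expansion; the depth, beam width, branching factor, and the bookkeeping of drafted nodes are illustrative assumptions rather than EAGLE-2's implementation.

```python
import heapq
import torch

@torch.no_grad()
def expand_draft_tree(draft_model, prefix_ids, depth=6, beam_width=8, branch=4):
    """Beam-style draft-tree expansion: keep the top-k sequences by summed
    logprob after each draft step and feed them back into the draft model."""
    beams = [(0.0, prefix_ids)]      # (cumulative logprob, token ids)
    drafted_nodes = []               # everything sent to the target for verification

    for _ in range(depth):
        candidates = []
        for score, ids in beams:
            logprobs = torch.log_softmax(draft_model(ids).logits[0, -1], dim=-1)
            top_lp, top_tok = logprobs.topk(branch)
            for lp, tok in zip(top_lp.tolist(), top_tok.tolist()):
                new_ids = torch.cat([ids, torch.tensor([[tok]], device=ids.device)], dim=-1)
                candidates.append((score + lp, new_ids))
        # The sum of logprobs along each sequence is the selection heuristic.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        drafted_nodes.extend(beams)

    return drafted_nodes
```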
Conclusion
In this work, we introduce Dynamic Depth Decoding, an optimisation of EAGLE-2's decoding algorithm that increases the speedup of the current state-of-the-art speculative decoding method. We discover an opportunity to use the draft model's confidence to determine whether to continue drafting. Since the heuristic check breaks lazy evaluation, we find that it is optimal to check the heuristic only a few times. We also compare our decoding algorithm to EAGLE and EAGLE-2 over a variety of models. Future work on speculative decoding that significantly improves on the speedup of EAGLE-2 will most likely focus on optimising the draft model and the verification process, rather than the drafting algorithm.
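This confidence-based stopping rule could be sketched roughly as follows; the aggregation of beam confidences, the threshold, and the depths at which the heuristic is checked are assumptions for illustration, not the values used in DDD.

```python
def should_continue_drafting(beam_scores, threshold=-3.0):
    """Continue drafting only while the draft model remains confident.
    `beam_scores` are cumulative logprobs of the sequences currently in the
    beam; the max-aggregation and threshold are illustrative assumptions."""
    return max(beam_scores) > threshold


def draft_with_dynamic_depth(expand_step, max_depth=8, check_depths=(3, 5)):
    """Expand the draft tree, but evaluate the stopping heuristic only at a
    few depths, since checking it at every step breaks lazy evaluation."""
    beam_scores = [0.0]
    depth = 0
    for depth in range(1, max_depth + 1):
        beam_scores = expand_step(depth)   # returns the new beam's cumulative logprobs
        if depth in check_depths and not should_continue_drafting(beam_scores):
            break
    return depth
```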
Using Fine-tuning and Min Lookahead Beam search to improve Whisper
Andrea Do / Oscar Brown* / Zhengjie Wang / Nikhil Mathew / Zixin Liu / Jawwad Ahmed / Cheng Yu
* Australian National University
Abstract
The performance of Whisper in low-resource languages is still far from perfect. In addition to a lack of training data for low-resource languages, we identify some limitations in the beam search algorithm used by Whisper. To address these issues, we fine-tune Whisper on additional data and propose an improved decoding algorithm. For Vietnamese, fine-tuning Whisper-Tiny with LoRA improves WER by 38.49 over the zero-shot Whisper-Tiny setting, a further reduction of 1.45 compared to full-parameter fine-tuning. Additionally, using the Filter-Ends and Min Lookahead decoding algorithms reduces WER by 2.26 on average across a range of languages compared to standard beam search. These results generalise to larger Whisper model sizes. We also prove a theorem that Min Lookahead outperforms the standard beam search algorithm used in Whisper.
Introduction
Whisper has remarkable performance in transcribing multilingual speech audio into text [1]. While its performance on English and other high-resource languages is impressive, the limited availability of training audio data for low-resource languages remains a challenge. As Whisper is open-source, researchers may enhance its performance with new training datasets and methods. In this paper, we investigate unconventional fine-tuning and decoding algorithms to improve Whisper's performance in a low-resource scenario. While fine-tuning is common in practice, a systematic comparison between different fine-tuning strategies for an encoder-decoder model like Whisper has yet to be documented. In the work of Jain et al. [2], the authors froze most of the model's parameters and fine-tuned only the final layer.
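As a concrete illustration of these two fine-tuning styles, the sketch below freezes all but the final decoder layer and, alternatively, attaches LoRA adapters using Hugging Face transformers and peft; the model name, LoRA hyperparameters, and target modules are assumptions, not the paper's exact configuration.

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Partial-parameter fine-tuning: freeze everything, then unfreeze only the
# final decoder layer (in the spirit of the final-layer setup of Jain et al.).
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
for p in model.parameters():
    p.requires_grad = False
for p in model.model.decoder.layers[-1].parameters():
    p.requires_grad = True

# LoRA fine-tuning: attach low-rank adapters to the attention projections so
# that only a small number of new parameters are trained.
lora_cfg = LoraConfig(
    r=32,                                  # rank chosen for illustration only
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
lora_model = get_peft_model(
    WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny"), lora_cfg
)
lora_model.print_trainable_parameters()
```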
Conversely, Rouditchenko et al. [3] fine-tuned the entire model on unseen languages. Both studies lack comprehensive explanations for their choice of fine-tuning strategies. To fill this gap, we conduct a comprehensive study of fine-tuning strategies on Whisper, including full-parameter fine-tuning and partial-parameter fine-tuning, where gradients are updated only in parts of the model. We selected Vietnamese as our target language, but we believe the results translate to other low-resource languages since we did not utilise any language-specific features in our fine-tuning experiments. Whisper uses a beam search decoding algorithm with beam width n = 5 and log-probability (logprob) as the score function [1], as opposed to the greedy algorithm, which chooses the token with the greatest logprob at each decoding step. Although beam search outperforms the greedy algorithm, we suggest it can be further improved by filtering out certain sequences and performing a lookahead, as sketched below.
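For reference, a simplified version of the baseline beam search with summed logprob scoring is sketched below; the decoder_step callable and the end-of-text handling are illustrative assumptions, and details of Whisper's actual implementation such as temperature fallback are omitted.

```python
import torch

@torch.no_grad()
def beam_search(decoder_step, start_ids, eot_id, beam_width=5, max_len=448):
    """Standard beam search with summed logprob as the score function.
    `decoder_step(ids)` is assumed to return the logits for the next token."""
    beams = [(0.0, list(start_ids), False)]   # (cumulative logprob, tokens, finished)

    for _ in range(max_len):
        candidates = []
        for score, ids, done in beams:
            if done:
                candidates.append((score, ids, True))
                continue
            logprobs = torch.log_softmax(decoder_step(ids), dim=-1)
            top_lp, top_tok = logprobs.topk(beam_width)
            for lp, tok in zip(top_lp.tolist(), top_tok.tolist()):
                candidates.append((score + lp, ids + [tok], tok == eot_id))
        # Keep the beam_width highest-scoring sequences for the next step.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(done for _, _, done in beams):
            break

    return max(beams, key=lambda c: c[0])[1]
```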
Conclusion
Despite having fewer trainable parameters, fine-tuning Whisper-Medium and Whisper-Large with high-rank LoRA yields performance improvements comparable to full-parameter fine-tuning. Decoupling the input and output embeddings does not harm model performance and can occasionally surpass the results achieved through full-parameter fine-tuning. Furthermore, we suggest Filter-Ends and Min Lookahead as improvements to Whisper's decoding algorithm. We prove that Min Lookahead is expected to outperform standard beam search, and empirical results verify this, with particularly strong performance on low-resource languages. Future studies should perform fine-tuning experiments on more low-resource languages and investigate increasing the beam's diversity as a potential improvement to the decoding algorithm.