As large language models (LLMs) advance rapidly, their integration into the medical domain is increasingly being explored. Because an LLM's computational cost and sequence length limits depend directly on its token count, subword tokenization is essential for managing inputs efficiently. However, appropriate subword segmentation is difficult because of complex medical terminology and abbreviations. Because subword tokens are used to train LLMs, the tokenizer plays an important role in an LLM’s performance. In this paper, we define two evaluation criteria: (1) the out-of-vocabulary (OOV) rate, which reflects the extent to which medical terms are preserved, and (2) the token split rate (TSR), which captures the stability of token boundaries. We then propose two corresponding metrics for assessing tokenizers. First, we compare the distribution of each criterion with the ground-truth (GT) distribution using Kullback–Leibler (KL) divergence and normalize the resulting score. Second, we fit a Gaussian regression model to the GT distribution and measure the tokenizer’s error with the normalized root-mean-square error (NRMSE). The proposed evaluation scheme offers practical, objective evidence for selecting an appropriate tokenizer in medical LLM applications.
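The two metrics can be illustrated with a minimal sketch. The abstract does not specify the binning, the KL normalization, or the exact form of the Gaussian fit, so everything below is an assumption made for illustration: the function names are hypothetical, per-document criterion values (OOV rate or TSR) are assumed to lie in [0, 1], the KL score is squashed with 1 - exp(-KL), the Gaussian is fitted by the sample mean and standard deviation of the GT values, and the RMSE is normalized by the range of the fitted curve.

```python
import math

def histogram(values, bins=10, lo=0.0, hi=1.0):
    """Normalized histogram over [lo, hi]; the returned probabilities sum to 1."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp values on or outside the edges
        counts[idx] += 1
    return [c / len(values) for c in counts]

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) with additive smoothing so empty bins do not yield log(0)."""
    p_s = [x + eps for x in p]
    q_s = [x + eps for x in q]
    zp, zq = sum(p_s), sum(q_s)
    return sum((a / zp) * math.log((a / zp) / (b / zq)) for a, b in zip(p_s, q_s))

def normalized_kl_score(p, q):
    """Assumed normalization: maps KL in [0, inf) to [0, 1]; 0 means identical."""
    return 1.0 - math.exp(-kl_divergence(p, q))

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def nrmse_against_gaussian(gt_values, tok_values, bins=10, lo=0.0, hi=1.0):
    """RMSE between the tokenizer's histogram density and a Gaussian fitted to
    the GT values, divided by the range of the fitted curve (assumed choice)."""
    mu = sum(gt_values) / len(gt_values)
    var = sum((v - mu) ** 2 for v in gt_values) / len(gt_values)
    sigma = max(math.sqrt(var), 1e-9)  # guard against a degenerate GT sample
    width = (hi - lo) / bins
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    expected = [gaussian_pdf(c, mu, sigma) for c in centers]
    observed = [p / width for p in histogram(tok_values, bins, lo, hi)]
    rmse = math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, expected)) / bins)
    spread = max(expected) - min(expected)
    if spread == 0:
        raise ValueError("fitted Gaussian is flat over the chosen bins")
    return rmse / spread

# Made-up per-document OOV rates, for illustration only (not results).
gt_rates = [0.10, 0.12, 0.15, 0.11, 0.13, 0.14, 0.12, 0.16]
tok_a = [0.11, 0.12, 0.14, 0.12, 0.13, 0.15, 0.11, 0.17]  # close to GT
tok_b = [0.55, 0.62, 0.70, 0.66, 0.58, 0.73, 0.61, 0.69]  # far from GT

gt_hist = histogram(gt_rates)
score_a = normalized_kl_score(gt_hist, histogram(tok_a))
score_b = normalized_kl_score(gt_hist, histogram(tok_b))
nrmse_a = nrmse_against_gaussian(gt_rates, tok_a)
nrmse_b = nrmse_against_gaussian(gt_rates, tok_b)
```

Under these assumptions, lower values on both metrics indicate a tokenizer whose OOV-rate or TSR distribution lies closer to the ground truth; in the toy data above, `tok_a` scores lower than `tok_b` on both.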