Robust and scalable subword tokenizer evaluation for medical LLM

Multilingual Abstract

As large language models (LLMs) advance rapidly, their integration into the medical domain is increasingly being explored. Because an LLM's computational cost and sequence length limits are directly related to its token count, subword tokenization is essential for managing inputs efficiently. However, appropriate subword segmentation is difficult because medical text is dense with complex terminology and abbreviations. Because LLMs are trained on subword tokens, the tokenizer plays an important role in an LLM's performance. In this paper, we define two evaluation criteria: (1) the out-of-vocabulary (OOV) rate, which measures the extent to which medical terms are preserved, and (2) the token split rate (TSR), which captures the stability of token boundaries. We then propose two corresponding metrics for assessing tokenizers. First, we compare the distribution of each criterion with the ground-truth (GT) distribution using Kullback–Leibler (KL) divergence and normalize the resulting score. Second, we fit a Gaussian regression model to the GT distribution and measure the tokenizer's error with the normalized root-mean-square error (NRMSE). The proposed evaluation scheme offers practical, objective evidence for selecting an appropriate tokenizer in medical LLM applications.
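The two scoring steps named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the function names are hypothetical, the distributions are plain histograms over matching bins, the normalization `exp(-KL)` is one possible choice among several, and NRMSE is normalized by the range of the observed values. The Gaussian regression fit itself is not reproduced; the sketch takes the model's predictions as given.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same bins."""
    # Smooth and renormalize so that empty bins never cause log(0)
    # or a division by zero.
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))


def normalized_kl_score(p, q):
    """Map KL divergence into (0, 1]; 1 means the distributions match.

    Assumption: exp(-KL) is used only as an example normalization.
    """
    return math.exp(-kl_divergence(p, q))


def nrmse(predicted, observed):
    """Root-mean-square error divided by the range of the observed values."""
    n = len(observed)
    rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(predicted, observed)) / n)
    span = max(observed) - min(observed)
    return rmse / span
```

Under these assumptions, a tokenizer whose OOV or TSR histogram matches the GT histogram would score close to 1 on the KL-based metric and close to 0 on NRMSE.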

Korean Abstract

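With the rapid advancement of very large language models (LLMs), language model research in the medical field is also attracting attention. Because an LLM's network output length is directly tied to the number of tokens, subword tokenization techniques that reduce the token count are essential. However, medical text contains many technical terms and abbreviations, which makes correct subword segmentation difficult. Because LLMs are trained on subword units, the tokenizer plays an important role in LLM performance. In this paper, we therefore define two evaluation indicators, the out-of-vocabulary (OOV) rate, which measures how well medical terms are preserved, and the token split rate (TSR), which measures segmentation stability, and on this basis propose two tokenizer evaluation methods. First, we use KL divergence to compare the distribution of each indicator with the ground-truth (GT) distribution and quantify tokenizer performance. Second, we perform GT-based Gaussian regression analysis and compute the error as the normalized root-mean-square error (NRMSE) to compare tokenizer performance. The proposed evaluation approach is expected to provide a rational and practical basis for tokenizer selection.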

Table of Contents