
Multilingual Abstract

Objectives: This study aimed to identify effective vectorization and classification models for predicting a psychiatric diagnosis from text-based medical records.

Methods: Electronic medical records (n = 494) describing the present illness were collected retrospectively from inpatient admission notes covering three diagnoses: major depressive disorder, type 1 bipolar disorder, and schizophrenia. The data were split into 400 training records and 94 independent validation records. Records were vectorized with two models, term frequency-inverse document frequency (TF-IDF) and Doc2vec. Classification models, including stochastic gradient descent, logistic regression, support vector classification, and deep learning (DL), were applied to predict the three diagnoses. Five-fold cross-validation was used to find an effective model, and accuracy, precision, recall, and F1-score were measured to compare the models.

Results: In five-fold cross-validation on the training data, the DL model with Doc2vec was the most effective at predicting the diagnosis (accuracy = 0.87, F1-score = 0.87). However, these metrics dropped on the independent test data set with the final working DL models (accuracy = 0.79, F1-score = 0.79), while logistic regression and support vector machine with Doc2vec performed slightly better (accuracy = 0.80, F1-score = 0.80) than the DL models with Doc2vec and the other models with TF-IDF.

Conclusions: The results suggest that the vectorization method may affect classification performance more than the choice of machine learning model. However, the data set had several limitations, including small sample size, imbalance among the categories, and limited generalizability. Multi-site research with larger samples is therefore needed to improve the machine learning models.
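
The evaluation steps named in the abstract (TF-IDF vectorization, five-fold cross-validation, and accuracy, precision, recall, and F1-score) can be illustrated with a minimal standard-library sketch. It is not the authors' code. The whitespace tokenizer, the smoothed IDF variant, the contiguous (unshuffled) folds, the macro averaging, and the function names `tfidf_vectorize`, `k_fold_indices`, and `macro_scores` are all assumptions made for illustration, since the abstract does not specify these details. Doc2vec and the classifiers themselves are left out.

```python
import math
from collections import Counter


def tfidf_vectorize(docs):
    """Return (vocab, vectors) of L2-normalised TF-IDF weights.

    Assumption: whitespace tokens and a smoothed IDF,
    idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) for k contiguous folds (no shuffling)."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n_labels = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n_labels,
        "recall": sum(recalls) / n_labels,
        "f1": sum(f1s) / n_labels,
    }
```

With 400 training records, `k_fold_indices(400, 5)` gives five folds of 320 training and 80 validation indices each. Because the abstract reports imbalance among the categories, a stratified and shuffled split would likely be preferable in practice. Contiguous folds are used here only to keep the sketch short.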
