A Study on the Evaluation and Error Analysis of Deep Learning-Based Korean Named Entity Recognition
You, Hyun-Jo; Song, Youngsook; Kim, Min Soo; Yun, Gihyun; Cheong, Yunam. 한국언어학회 (The Linguistic Society of Korea), 2021, 언어 (Korean Journal of Linguistics), Vol.46 No.3
Named entity recognition (NER) is a natural language processing task that recognizes and classifies named entities in unstructured text. The targets of NER are not limited to typical proper names for persons, locations, and organizations; they also include date, time, and quantity expressions, and can be further expanded to names of events, animals, plants, materials, and other encyclopedic entities. A real-world NER system is also expected to be tuned to process domain-specific terminology. In this study, the researchers built and tested a BERT-based Korean NER system and proposed methods for evaluation and error analysis. The system was trained on a 140K-word NER corpus and evaluated on a 60K test set. The study proposes categorizing errors into four classes: detection, boundary, segmentation, and labelling. Error rates vary greatly between entity labels, from 1% to 30%, and the labels fall into three groups: the most accurate time and quantity expressions, relatively accurate proper names, and highly error-prone terminology. We expect that the error analysis will provide insights into better data collection and post-processing correction.
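The abstract names four error classes but does not define them, so the following is only a minimal sketch of how such a per-label error analysis might be set up. The category definitions, the `(start, end, label)` span format with exclusive end offsets, and the function names `classify_gold_span` and `error_rates` are all assumptions made for illustration, not the paper's method.

```python
from collections import Counter, defaultdict


def classify_gold_span(gold_span, predicted):
    """Assign one gold (start, end, label) span to a coarse outcome.

    The category definitions below are illustrative assumptions,
    not the definitions used in the paper.
    """
    start, end, label = gold_span
    # Predicted spans sharing at least one position with the gold span
    # (end offsets are treated as exclusive).
    overlaps = [p for p in predicted if p[0] < end and start < p[1]]
    if not overlaps:
        return "detection"       # entity missed entirely
    if len(overlaps) > 1:
        return "segmentation"    # one gold entity split across predictions
    p_start, p_end, p_label = overlaps[0]
    if (p_start, p_end) != (start, end):
        return "boundary"        # found, but with the wrong extent
    if p_label != label:
        return "labelling"       # right extent, wrong class
    return "correct"


def error_rates(gold, predicted):
    """Return {label: share of gold spans with that label that are not correct}."""
    outcomes = defaultdict(Counter)
    for span in gold:
        outcomes[span[2]][classify_gold_span(span, predicted)] += 1
    return {
        label: 1 - counts["correct"] / sum(counts.values())
        for label, counts in outcomes.items()
    }


# Toy data with made-up labels; not drawn from the paper's corpus.
gold = [(0, 2, "PER"), (3, 5, "ORG"), (6, 7, "DAT"), (8, 10, "TRM"), (11, 14, "LOC")]
pred = [(0, 2, "PER"), (3, 4, "ORG"), (6, 7, "QT"), (11, 12, "LOC"), (12, 14, "LOC")]
for span in gold:
    print(span, classify_gold_span(span, pred))
print(error_rates(gold, pred))
```

Counting outcomes per gold label in this way is one route to the kind of between-label comparison the abstract reports. Spurious predictions that overlap no gold span are not counted in this sketch.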
Korean TimeML: A Study of Event and Temporal Information in Text
Hyun Jo You; Ha Yeon Jang; Yu Mi Jo; Seung Ho Nam; Hyo Pil Shin; Yoon Shin Kim. 한국언어정보학회 (Korean Society for Language and Information), 2011, 언어와 정보 (Language and Information), Vol.15 No.1
TimeML is a markup language for events and temporal expressions in natural language, proposed in Pustejovsky et al. (2003) and later standardized as ISO-TimeML (ISO 24617-1:2009). In this paper, we propose a further specification of ISO-TimeML for the Korean language, based on a concrete and thorough examination of real-world texts. Since Korean differs significantly from English, which is the first and almost the only language extensively tested with TimeML, one continually runs into theoretical and practical difficulties in applying TimeML to Korean. We focus on the consistent and efficient application of TimeML: how to apply TimeML consistently in accordance with the specific properties of Korean, and, for efficiency, what should and should not be annotated, i.e. which information is meaningful for the temporal interpretation of Korean text.
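For readers unfamiliar with the markup, here is a minimal sketch of a TimeML-style annotation built with Python's standard library. It uses the inline style of the original TimeML for readability. The sentence is invented, the `<s>` wrapper element and the function name are hypothetical, and tagging the whole inflected verb as one EVENT is a simplification: it does not reproduce the Korean-specific guidelines proposed in the paper.

```python
import xml.etree.ElementTree as ET


def build_annotated_sentence():
    """Build an inline TimeML-style annotation of one invented Korean sentence."""
    # "회의는 2011년 3월 15일에 열렸다." (The meeting was held on March 15, 2011.)
    sentence = ET.Element("s")  # hypothetical wrapper element
    sentence.text = "회의는 "

    # Temporal expression: a calendar date with an ISO 8601 value.
    timex = ET.SubElement(
        sentence, "TIMEX3", {"tid": "t1", "type": "DATE", "value": "2011-03-15"}
    )
    timex.text = "2011년 3월 15일"
    timex.tail = "에 "

    # Event: the whole verb form is tagged here, which is a simplification.
    event = ET.SubElement(sentence, "EVENT", {"eid": "e1", "class": "OCCURRENCE"})
    event.text = "열렸다"
    event.tail = "."
    return sentence


xml_text = ET.tostring(build_annotated_sentence(), encoding="unicode")
print(xml_text)
```

Joining the text nodes of the result gives back the unannotated sentence, which is a convenient sanity check when annotating inline.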
Dong Ho Ko; Hyun Jo You. 전북대학교 인문학연구소 (Institute of Humanities, Jeonbuk National University), 2012, 건지인문학 (Geonji Humanities), Vol.8 No.-
To raise the level of Manchu research in Korea and to place it at the forefront of worldwide research, it is necessary to build a database of written Manchu comprising lexical and text data translated from Chinese. This database is distinctive in that its data are drawn from a large body of Manchu literature that is hard to obtain, and in that it results from interdisciplinary collaboration in the humanities between Manchu and Chinese specialists. The work will proceed as follows. (1) Roman-alphabet transliterations of the Manchu materials, together with the Chinese materials, will be entered in Unicode 6.1. (2) The entered data will be transformed into structured texts in XML format. (3) An integrated dictionary database, a Manchu-Chinese parallel corpus, and a word index database will be built.
Lee, Junkyu; You, Hyun Jo. 한국응용언어학회 (Applied Linguistics Association of Korea), 2011, 응용 언어학 (Korean Journal of Applied Linguistics), Vol.27 No.2
Formulaic sequences, or multiword expressions (MWEs), have attracted significant interest in various domains of applied linguistics, yet research on them has been largely limited to English education. In contrast to the ample research findings on formulaic sequences in English, little is known about their counterparts in Korean. This study therefore addresses a fundamental question: what are formulaic sequences in Korean? To answer it, the study proposes a new method of finding formulaic sequences in Korean that integrates corpus linguistics methods and data-mining techniques. It demonstrates how the new method can be implemented to find recurrent noun combinations in a large Korean newspaper corpus, highlighting that the method allows us to locate MWEs and differs in this respect from previous collocation search methods for Korean. The study also illustrates how the extracted formulaic sequences can be used in pedagogical practice.
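The abstract does not spell out the mining procedure, so the following is only a frequency-based sketch of what "finding recurrent noun combinations" could look like, not the authors' method. It assumes the corpus has already been POS-tagged and reduced to one list of nouns per sentence, and that a combination means contiguous nouns. The function name `recurrent_noun_ngrams` and the parameters `n` and `min_count` are hypothetical.

```python
from collections import Counter


def recurrent_noun_ngrams(noun_sentences, n=2, min_count=2):
    """Count contiguous noun n-grams and keep those seen at least min_count times.

    noun_sentences: iterable of lists of nouns, one list per sentence
    (assumed to come from an upstream POS tagger, not shown here).
    """
    counts = Counter()
    for nouns in noun_sentences:
        for i in range(len(nouns) - n + 1):
            counts[tuple(nouns[i:i + n])] += 1
    return {gram: count for gram, count in counts.items() if count >= min_count}


# Invented toy input, not taken from the newspaper corpus used in the study.
toy_corpus = [
    ["경제", "성장", "전망"],
    ["경제", "성장", "둔화"],
    ["고용", "시장"],
]
print(recurrent_noun_ngrams(toy_corpus))
```

A raw frequency threshold is the simplest possible filter. Association measures or frequent-pattern mining could be swapped in at the same point.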
Cho In-Sik; You Hyun-Jo; Shin Hyo-Pil. 한국사전학회 (Korean Association for Lexicography), 2004, 한국사전학 (Korean Lexicography), Vol.- No.3
The purpose of this study is to identify the properties of special words and to show the process of extracting them from a large corpus. A special word corresponds to the notion of an unknown word, that is, a word not covered by the lexical database of a natural language processing (NLP) system. Unknown words generally cause many ambiguities and thus reduce the accuracy of NLP systems. Special words in this work include various expressions about current events or fashions, abbreviated words, and naturalized words. We devised a semi-automatic procedure for constructing a special-word dictionary, based mainly on language-dependent heuristics. We also note, however, that other statistical considerations, including frequencies and probability distributions, may be required to extract unknown words in a more automatic fashion.
Mun Hyoung Kim; Yu Mi Jo; Hyun Jo You; Ha Yeon Jang; Seung Ho Nam; Hyo Pil Shin; Yoon Shin Kim. 한국언어정보학회 (Korean Society for Language and Information), 2012, 언어와 정보 (Language and Information), Vol.16 No.1
This study introduces set-denoting time expressions in Korean, which can be divided into simple and complex types. It was found that while the simple-type expressions are easily represented within ISO-TimeML, a time-expression markup language, some complex-type set-denoting expressions are not. This study therefore analyzes why complex-type expressions are difficult to represent and suggests introducing @measure and @interpretation attributes in the TIMEX3 tag. The @measure attribute represents the time interval, and the @interpretation attribute is used to distinguish distributive readings from cumulative readings. Additionally, this paper suggests that a mapping between these and other attributes is required in TLINK.
Song Youngsook; Cheong Yunam; You Hyun-jo. 한국어의미학회 (The Society of Korean Semantics), 2022, 한국어 의미학 (Korean Semantics), Vol.76 No.-
This paper analyzes the hierarchical structures of named entities in the NIKL Named Entity Corpus, which is annotated with 553,830 flat named entity tags. The study is intended as a basis for developing a method to build a Korean nested named entity corpus. Flat named entity recognition identifies mentions as linear spans, whereas the nested approach analyzes the hierarchical internal structure of named entities, which may consist of smaller component named entities. We extracted candidate mentions for nested named entity analysis from the NIKL Named Entity Corpus and classified them into three categories: serial named entities, complex named entities, and phrases with a named entity head. These candidates were then reviewed manually to select the targets of nested named entity analysis. Finally, we discussed the span and the internal structure of named entities and proposed principles and guidelines for the construction of a Korean nested named entity corpus.
A Study on Korean Speakers' Acquisition of Lexicalization Patterns of Motion Event Expressions in Russian
Suhyoun Lee; Hyug Ahn; Hyun-jo You; Juae Ha; Hakyung Jung. 서울대학교 러시아연구소 (Institute for Russian Studies, Seoul National University), 2023, 러시아연구 (Russian Studies), Vol.33 No.1
This study experimentally examines the native-language transfer that arises when native speakers of Korean, classified as a verb-framed language in Talmy's (1985; 2000) lexicalization typology, acquire the way the semantic structure of motion events is lexicalized in Russian, a satellite-framed language. The results show that, as predicted, the semantic element Direction, which differs between the two languages in whether it is marked at all, was difficult to learn, and that the lexicalization pattern of Manner, which unlike in Russian is only optionally marked in Korean, also gave rise to native-language transfer effects. Nevertheless, the high accuracy rate for Manner can be explained by the lexicalization of Korean manner compound verbs as single lexical items, which weakens the difference between the two languages' lexicalization patterns; paradoxically, this suggests that Talmy-style typological differences can lead to meaningful differences in language acquisition. Path, meanwhile, showed the lowest accuracy rate, which can be interpreted as resulting not only from Path being expressed differently in Korean and Russian (by a verb and by a satellite, respectively) but also from the differing cognitive salience of the various Path types.
This experimental study examines native language transfer effects in Korean speakers’ acquisition of lexicalization patterns of motion events in Russian, based on Talmy’s (1985; 2000) typological study. Speakers of Korean (a verb-framed language) and Russian (a satellite-framed language) use distinct strategies to encode semantic elements such as Path, Manner, and Direction when describing motion events. The experiments show that Direction, which is not expressed in Korean, is a challenging aspect for Korean speakers to learn. Manner, optionally expressed in Korean, is frequently omitted, and instead, the verbs idti/xodit’ ‘walk’ are employed in place of other Manner verbs as pseudo-generic verbs roughly equivalent to kada ‘go’ and oda ‘come’ in Korean. Path appears to be the most difficult to learn among the three semantic elements. This may be attributed either to the typological difference between Korean and Russian regarding how Path is encoded (verb vs. satellite) or to different degrees of cognitive salience of diverse Path types.