RISS (Research Information Sharing Service)

      • KCI-indexed

        Morpheme Recovery Using a Naïve Bayes (NB) Model

        Kim, Jae-Hoon; Jeon, Kil-Ho. Korea Information Processing Society, 2012. The KIPS Transactions: Part B, Vol.19 No.3

        Korean is an agglutinative language, so part-of-speech (POS) tagging is difficult without morphological analysis, and during morphological analysis the various spelling changes must be recovered into their base forms. This is one of the notorious problems in Korean morphological analysis and has mainly been solved with morpheme recovery rules. With rules, it is difficult to select the recovery that best fits the given context, so the rules generate morphological ambiguity, which is then resolved by POS tagging. In this paper, we propose a morpheme recovery scheme based on a machine learning method, the Naïve Bayes model. The input features of the model are the syllables surrounding the syllable in which the spelling change occurs, and the output categories are the recovered syllables. In experiments on the ETRI tree-tagged corpus, a morpheme-level POS tagging system using the proposed model achieved an $F_1$-score of 97.5%, which indicates that the model is very useful for morpheme recovery in Korean.
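        The classification setup described in this abstract (surrounding syllables as features, recovered syllables as categories) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `SyllableNaiveBayes`, the three-syllable context window, the add-one smoothing, and the toy training pairs are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class SyllableNaiveBayes:
    """Toy Naive Bayes classifier: context syllables -> recovered syllables.

    Hypothetical sketch; the window size and smoothing are assumptions,
    not details taken from the paper.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # additive smoothing constant
        self.class_counts = Counter()               # label -> count
        self.feature_counts = defaultdict(Counter)  # (position, label) -> syllable counts
        self.vocab = defaultdict(set)               # position -> syllables seen there

    def fit(self, examples):
        for context, label in examples:
            self.class_counts[label] += 1
            for pos, syllable in enumerate(context):
                self.feature_counts[(pos, label)][syllable] += 1
                self.vocab[pos].add(syllable)
        return self

    def predict(self, context):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # log P(label) + sum over positions of log P(syllable | label)
            score = math.log(count / total)
            for pos, syllable in enumerate(context):
                seen = self.feature_counts[(pos, label)][syllable]
                # "+ 1" reserves probability mass for unseen syllables
                denom = count + self.alpha * (len(self.vocab[pos]) + 1)
                score += math.log((seen + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data (made up): (previous syllable, changed syllable, next syllable)
# mapped to the recovered syllable sequence.
train = [
    (("<s>", "갔", "다"), "가았"),
    (("<s>", "했", "다"), "하였"),
    (("<s>", "봤", "다"), "보았"),
    (("<s>", "갔", "어"), "가았"),
    (("부", "했", "다"), "하였"),
]

model = SyllableNaiveBayes().fit(train)
recovered = model.predict(("<s>", "갔", "지"))  # next syllable unseen in training
```

        With this toy data, the contracted syllable 갔 is still mapped to 가았 even though the following syllable was never seen, because smoothing keeps the unseen feature from zeroing out the score.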

      • A Method for Correcting Errors in the Sejong Corpus

        Jae-Hoon Kim, Hyung-Won Seo, Kil-Ho Jeon, Myung-Gil Choi. Korean Society of Marine Engineering, 2010. Proceedings of the Korean Society of Marine Engineering Conference, Vol.2010 No.4

        The Sejong corpus is a Korean corpus annotated with various kinds of linguistic information. Depending on the annotation, it comprises a raw corpus, a part-of-speech (POS) tagged corpus, a syntactic treebank, and so on. This paper concerns the POS-tagged corpus, which is annotated with POS information and used to develop natural language processing (NLP) systems such as information retrieval and information extraction. The Sejong POS-tagged corpus was built by the National Institute of the Korean Language over 9 years and consists of 10.6 million words. However, it is hard to use for developing some NLP systems because it contains various types of errors. We address errors in which the original word does not match the concatenation of its tagged morphemes. In this paper, we present a method for detecting and correcting these errors, together with our results. First, error detection finds string mismatches between original words and the concatenation of their analyzed morphemes. The mismatches are error candidates, but they also include valid forms produced by irregular or phonemic conjugations. We developed a program to filter out the valid forms. The remaining mismatches are corrected according to error type as follows: 1) unnecessarily inserted or deleted words were corrected with manually written regular expressions; 2) special symbols that annotators had not recognized correctly were corrected manually; 3) the remaining errors, which account for a very small portion, were also corrected manually. As a result, the Sejong POS-tagged corpus has been improved enough to be useful for some applications.
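        The detection step described in this abstract, comparing each original word with the concatenation of its tagged morphemes, can be sketched as follows. The line format (`word<TAB>morph/TAG + morph/TAG`), the function names, and the sample lines are assumptions for illustration; the paper's actual program and its filter for valid conjugated forms are not reproduced here.

```python
def parse_analysis(analysis):
    """Split 'morph/TAG + morph/TAG' into (morph, tag) pairs.

    rpartition keeps a '/' that belongs to the morpheme itself.
    """
    pairs = []
    for token in analysis.split(" + "):
        morph, _, tag = token.rpartition("/")
        pairs.append((morph, tag))
    return pairs


def find_mismatches(lines):
    """Return (word, concatenated morphemes) pairs that do not match."""
    mismatches = []
    for line in lines:
        word, analysis = line.rstrip("\n").split("\t")
        surface = "".join(morph for morph, _ in parse_analysis(analysis))
        if surface != word:
            mismatches.append((word, surface))
    return mismatches


def drop_valid_forms(mismatches, valid_forms):
    """Hypothetical filter: remove mismatches known to be valid conjugations."""
    return [pair for pair in mismatches if pair not in valid_forms]


# Made-up sample lines in the assumed format.
sample = [
    "학교에\t학교/NNG + 에/JKB",         # matches, not a candidate
    "갔다\t가/VV + 았/EP + 다/EF",       # mismatch, but a valid conjugated form
    "하늘을\t하늘늘/NNG + 을/JKO",       # mismatch caused by an inserted syllable
]

candidates = find_mismatches(sample)
errors = drop_valid_forms(candidates, valid_forms={("갔다", "가았다")})
```

        Here `valid_forms` is a hand-written set standing in for the filtering program mentioned in the abstract; only the pairs left in `errors` would go on to regular-expression or manual correction.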
