Korean Abstract

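This study builds a corpus of spontaneous speech in the Seoul dialect of Korean. Sound files were recorded from 40 speakers, and the label files annotated at the eojeol (phrasal word) and phoneme levels for each sound file, together with a corpus search tool, are to be distributed free of charge for research use. Spontaneous speech is more variable than controlled or read speech: phoneme sequences and phonological processes can appear unexpectedly, yet it best reflects the form of language that people use in everyday life. Ten speakers each in their teens, 20s, 30s, and 40s were recorded, with five men and five women in each age group. The recorded files were post-processed, and automatic labels were generated by forced alignment, an ancillary process of automatic speech recognition. Researchers then manually fine-tuned the phoneme boundaries of the automatically generated phoneme labels. The recorded interviews total about 40 hours, and manual labelling of 1,135,263 phoneme labels was completed. Label accuracy was verified by computing phoneme agreement among the six participating labellers. To this end, one minute of test speech was manually labelled, which yielded 98.1% agreement, and the mean deviation among labellers in phoneme boundaries was 9.04 msec.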

Multilingual Abstract

The study describes the development of a corpus of spontaneous Korean speech. The Korean language includes various local dialects, and this corpus focuses on the Seoul dialect. Although spontaneous speech has much more variability than controlled or read speech, it provides genuine insight into everyday speech. Data were gathered from interviews with 20 men and 20 women from four age groups, from teenagers to people in their 40s. Each age group was balanced by sex, with five men and five women. All the interviews were digitally recorded and manually labelled at different annotation levels, such as phonemes and phrasal words. About 40 hours of interviews were collected and 1,135,263 phoneme labels were annotated. A test of labelling consistency among labellers was carried out, and the agreement on phoneme identification was 98.1%. Mean deviation in phoneme segmentation was 9.04 msec. The corpus will be available to the research community free of charge.
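The two consistency figures reported above (agreement on phoneme identification and mean deviation in phoneme segmentation) can be illustrated with a minimal sketch. The abstract does not say how the scores were computed across labellers, so everything here is an assumption: the label format (a list of phoneme and boundary-time pairs), position-by-position alignment of equal-length label sequences, averaging over all labeller pairs, and the function names and toy data, which are hypothetical.

```python
from itertools import combinations
from statistics import mean

# Hypothetical toy data: each labeller's output for the same test clip is a
# list of (phoneme, boundary_time_ms) pairs. Real label files will differ.
labeller_1 = [("a", 100), ("n", 250), ("i", 400)]
labeller_2 = [("a", 110), ("n", 240), ("i", 400)]
labeller_3 = [("a", 100), ("m", 260), ("i", 410)]
labelings = [labeller_1, labeller_2, labeller_3]


def pairwise_agreement(labelings):
    """Percent of positions with the same phoneme, averaged over labeller pairs.

    Assumes all sequences have equal length and are aligned by position.
    """
    scores = []
    for a, b in combinations(labelings, 2):
        matches = sum(pa == pb for (pa, _), (pb, _) in zip(a, b))
        scores.append(100.0 * matches / len(a))
    return mean(scores)


def mean_boundary_deviation_ms(labelings):
    """Mean absolute boundary difference (ms) over labeller pairs.

    Only positions where both labellers chose the same phoneme are compared.
    """
    deviations = []
    for a, b in combinations(labelings, 2):
        for (pa, ta), (pb, tb) in zip(a, b):
            if pa == pb:
                deviations.append(abs(ta - tb))
    return mean(deviations)


agreement = pairwise_agreement(labelings)
deviation = mean_boundary_deviation_ms(labelings)
print(f"agreement: {agreement:.1f}%  mean boundary deviation: {deviation:.2f} ms")
```

Real label sequences from different labellers may differ in length, in which case a sequence alignment step would be needed before comparing positions. This sketch leaves that step out.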

A Preliminary Study for Building a Korean Spontaneous Speech Corpus (한국어 자연발화 음성코퍼스 구축을 위한 기초 연구)
