Robust Speech Detection in Telephone Network Environments Using the AURORA Noise Reduction Algorithm
Suh Youngjoo, Ji Mikyong, Kim Hoi-Rin. The Korean Society of Phonetic Sciences and Speech Technology (대한음성학회), 2003, Malsori (말소리), Vol.48 No.-
This paper proposes a noise reduction-based speech detection method for telephone channel environments. We adopt the AURORA front-end noise reduction algorithm, which is based on the two-stage mel-warped Wiener filter approach, as a preprocessor for a frequency-domain speech detector. The speech detector uses the energies of useful bands, derived from a mel filter bank, as its feature parameters. The preprocessor first removes the noise components from the incoming noisy speech signal, and the speech detector then locates the speech regions in the noise-reduced signal. Experimental results show that the proposed method improves the performance of both the speech detector and the subsequent speech recognizer.
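The detection stage described above can be illustrated with a minimal sketch. It is not the authors' detector: it assumes the mel filter-bank energies have already been computed from the noise-reduced signal, and the function name `detect_speech`, the choice of useful bands, the number of leading noise-only frames, and the 6 dB margin are all hypothetical.

```python
import math

def detect_speech(band_energies, useful_bands, noise_frames=3, margin_db=6.0):
    """Flag frames whose useful-band log energy exceeds the noise floor.

    band_energies: one list of mel filter-bank energies per frame.
    useful_bands:  indices of the bands to sum (hypothetical choice).
    noise_frames:  leading frames assumed to contain noise only.
    margin_db:     assumed decision margin above the noise floor.
    """
    def log_energy(frame):
        total = sum(frame[b] for b in useful_bands)
        return 10.0 * math.log10(max(total, 1e-12))

    levels = [log_energy(frame) for frame in band_energies]
    floor = sum(levels[:noise_frames]) / noise_frames
    return [level > floor + margin_db for level in levels]

# Toy input: three noise-only frames, two speech-like frames, one noise frame.
frames = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 50.0, 60.0, 1.0],
    [1.0, 40.0, 55.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
print(detect_speech(frames, useful_bands=[1, 2]))
```

A fixed margin over an initial noise estimate is the simplest possible decision rule; a practical detector would also track the noise floor over time and smooth the frame decisions.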
Fast Speaker Identification Using a Universal Background Model Clustering Method
Park, Jumin; Suh, Youngjoo; Kim, Hoirin. The Acoustical Society of Korea (한국음향학회), 2014, The Journal of the Acoustical Society of Korea (韓國音響學會誌), Vol.33 No.3
In this paper, we propose a new method to drastically reduce computational complexity in Gaussian Mixture Model (GMM)-based Speaker Identification (SI). In general, the computational cost of a GMM-based SI system is proportional to three factors: the length of the test utterance, the number of enrolled speakers, and the GMM size. This high cost makes SI systems difficult to deploy in real applications despite their broad applicability, so the trade-off between computational complexity and identification accuracy is a primary issue in practice. To reduce computational complexity sharply with little loss of accuracy, we introduce a method based on the Universal Background Model (UBM) clustering approach and show that it can be applied to real-time speaker identification. In experiments with the proposed algorithm, we obtained a speed-up factor of 6 with a negligible loss of accuracy.
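The cost structure named in the abstract (frames × speakers × mixture components) and the idea of pruning it by clustering can be sketched as follows. The abstract does not describe the algorithm, so this is only one plausible reading: it assumes each speaker GMM is adapted from the UBM so that component indices correspond across speakers, that the UBM components have been grouped into clusters in advance, and that each frame is scored only on the components of its nearest cluster. All names and the toy data are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def frame_loglik(x, gmm, component_ids):
    """Log-likelihood of one frame using only the listed components."""
    return logsumexp([
        math.log(gmm["weights"][k]) + log_gauss(x, gmm["means"][k], gmm["vars"][k])
        for k in component_ids
    ])

def nearest_cluster(x, centroids):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[c])),
    )

def identify(frames, speaker_gmms, centroids=None, clusters=None):
    """Return (best speaker, number of Gaussian evaluations).

    Centroid distance computations are not counted as evaluations.
    """
    scores = {name: 0.0 for name in speaker_gmms}
    evaluations = 0
    for x in frames:
        # Assumed pruning step: keep only the components of the nearest cluster.
        ids = None if clusters is None else clusters[nearest_cluster(x, centroids)]
        for name, gmm in speaker_gmms.items():
            use = range(len(gmm["weights"])) if ids is None else ids
            scores[name] += frame_loglik(x, gmm, use)
            evaluations += len(use)
    return max(scores, key=scores.get), evaluations

def make_gmm(means):
    return {"weights": [0.25] * 4, "means": means, "vars": [[1.0, 1.0]] * 4}

# Toy UBM with four components in two well-separated clusters.
ubm_means = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]]
clusters = [[0, 1], [2, 3]]
centroids = [[0.5, 0.0], [10.5, 10.0]]

# Two toy speakers whose means are shifted copies of the UBM means.
speakers = {
    "A": make_gmm([[m[0] + 0.3, m[1]] for m in ubm_means]),
    "B": make_gmm([[m[0] - 0.3, m[1]] for m in ubm_means]),
}
test_frames = [[1.4, 0.0], [1.6, 0.1], [11.4, 10.0], [11.5, 9.9]]

print(identify(test_frames, speakers))                       # ('A', 32)
print(identify(test_frames, speakers, centroids, clusters))  # ('A', 16)
```

In this toy setup the pruned search halves the number of Gaussian evaluations and picks the same speaker. The factor of 2 is a property of the toy data only and says nothing about the speed-up factor of 6 reported in the paper.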
Voice Transformation in an HTS Speech Synthesizer Using the Correlation between Fundamental Frequency and Vocal Tract Length
Yoo, Hyogeun; Kim, Younggwan; Suh, Youngjoo; Kim, Hoirin. The Korean Society of Speech Sciences (한국음성학회), 2017, Phonetics and Speech Sciences (말소리와 음성과학), Vol.9 No.1
The main advantage of statistical parametric speech synthesis is its flexibility in changing voice characteristics. A personalized text-to-speech (TTS) system can be implemented by combining a speech synthesis system with a voice transformation system, and such systems are widely used in many application areas. It is known that the fundamental frequency and the spectral envelope of a speech signal can be modified independently to convert voice characteristics. It is also important to maintain the naturalness of the transformed speech. In this paper, a hidden Markov model (HMM)-based speech synthesis system (HTS) using the STRAIGHT vocoder is constructed, and voice transformation is performed by modifying the fundamental frequency and the spectral envelope. The fundamental frequency is transformed by scaling, and the spectral envelope is transformed by frequency warping to control the speaker's vocal tract length. In particular, this study proposes a voice transformation method that uses the correlation between fundamental frequency and vocal tract length. Subjective evaluations were conducted to assess preference and mean opinion scores (MOS) for the naturalness of the synthetic speech. Experimental results showed that the proposed voice transformation method achieved higher preference than the baseline systems while maintaining the naturalness of the speech.
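The two modifications named above, F0 scaling and frequency warping of the spectral envelope, can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's method: the warping here is a plain linear stretch of the frequency axis, and the coupling rule in `alpha_from_f0_ratio` (a power law with exponent 0.3) is a hypothetical stand-in for whatever F0 and vocal tract length relationship the authors estimated.

```python
def scale_f0(f0_contour, ratio):
    """Scale voiced F0 values (Hz); 0.0 marks unvoiced frames and is kept."""
    return [f * ratio if f > 0.0 else 0.0 for f in f0_contour]

def warp_envelope(envelope, alpha):
    """Linearly warp the frequency axis of one spectral envelope frame.

    Output bin k reads the input at position k / alpha, so alpha > 1 moves
    spectral peaks upward (as for a shorter vocal tract) and alpha < 1
    moves them downward. Positions are linearly interpolated.
    """
    n = len(envelope)
    warped = []
    for k in range(n):
        pos = min(k / alpha, n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        warped.append((1.0 - frac) * envelope[lo] + frac * envelope[hi])
    return warped

def alpha_from_f0_ratio(ratio, coupling=0.3):
    """Hypothetical coupling between F0 scaling and the warping factor."""
    return ratio ** coupling

# Toy example: raise F0 by 50% and warp an envelope with a single peak.
f0 = [100.0, 0.0, 120.0]
envelope = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(scale_f0(f0, 1.5))
print(warp_envelope(envelope, 2.0))
print(alpha_from_f0_ratio(1.5))
```

Deriving the warping factor from the F0 ratio, instead of setting the two independently, is what keeps the transformed pitch and the implied vocal tract length consistent with each other; the specific exponent above is a placeholder.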
A Study on End-to-End Synthesis for a Korean Text-to-Speech (TTS) System
Choi, Yeunju; Jung, Youngmoon; Kim, Younggwan; Suh, Youngjoo; Kim, Hoirin. The Korean Society of Speech Sciences (한국음성학회), 2018, Phonetics and Speech Sciences (말소리와 음성과학), Vol.10 No.1
A typical statistical parametric speech synthesis (text-to-speech, TTS) system consists of separate modules, such as a text analysis module, an acoustic modeling module, and a speech synthesis module. This causes two problems: 1) expert knowledge of each module is required, and 2) errors generated in one module accumulate as they pass through the following modules. An end-to-end TTS system can avoid these problems by synthesizing voice signals directly from an input string. In this study, we implemented an end-to-end Korean TTS system using Google's Tacotron, an end-to-end TTS system based on a sequence-to-sequence model with an attention mechanism. We used 4392 utterances spoken by a Korean female speaker, an amount that corresponds to 37% of the dataset Google used for training Tacotron. Our system obtained a mean opinion score (MOS) of 2.98 and a degradation mean opinion score (DMOS) of 3.25. We discuss the factors that affected training of the system. Experiments demonstrate that the post-processing network needs to be designed with the output language and input characters in mind, and that, depending on the amount of training data, the maximum value of n for the n-grams modeled by the encoder should be small enough.