Speech-based interfaces for mobile devices must detect human voices in the audio signals they receive. Such systems must suppress and reject noise extraneous to the target user’s voice, both to improve the performance of speech recognition applications and to reduce computational cost. Furthermore, speech recognition systems need to track the direction from which the user’s voice originates so that they can focus on the user’s mouth and enhance the signal; when the user wants to interact with the system, only the user’s voice is then accepted.
To this end, we introduce the ‘always listening and focusing’ concept, whereby the system tracks a legitimate user at all times by using multiple sources of information such as the speaker, speech, and video. This concept aims to simulate human listening in order to recognize behavior, so that the meaning of the signal and the concerns of the user can be examined in a mobile environment. This thesis proposes a novel algorithm based on this concept that works with multiple sources of information, including a microphone array and a video camera. The proposed algorithm combines four components: sound source localization, to locate the source of the voice signal and to reject noise in three-dimensional space; beamforming, to enhance the voice signal and reduce noise; voice activity detection, to isolate the voice interval and to reject noise in the time domain; and speaker recognition, to verify the identity of a legitimate user. Furthermore, the system determines the direction in which the user is facing and rejects the voice if the user is talking to somebody else. The proposed algorithm is named ‘Audio-Visual Space-Time voice activity detection’. Experiments with simulated and real-world data indicate that the proposed method significantly reduces the error rate.
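The way these cues combine can be illustrated with a minimal sketch of a per-frame accept/reject decision. It is not the thesis's implementation: the abstract does not say how the cues are fused, so the cascade order, the `FrameCues` fields, the `accept_frame` function, and all threshold values below are hypothetical, and the energy test merely stands in for whatever voice activity detector is actually used.

```python
from dataclasses import dataclass


@dataclass
class FrameCues:
    """Hypothetical per-frame measurements; names and units are assumptions."""
    source_angle_deg: float  # direction of arrival estimated from the microphone array
    user_angle_deg: float    # direction of the tracked user, e.g. from the camera
    frame_energy_db: float   # energy of the beamformed frame
    speaker_score: float     # similarity to the enrolled user, in [0, 1]
    facing_device: bool      # True if the user's face is oriented toward the device


def accept_frame(cues: FrameCues,
                 max_angle_error_deg: float = 15.0,
                 energy_threshold_db: float = -40.0,
                 speaker_threshold: float = 0.5) -> bool:
    """Return True only if every cue agrees that the target user is addressing the device."""
    # Space: reject sound that does not arrive from the user's direction.
    # The modulo keeps the angular difference in [-180, 180) degrees.
    angle_error = abs((cues.source_angle_deg - cues.user_angle_deg + 180.0) % 360.0 - 180.0)
    if angle_error > max_angle_error_deg:
        return False
    # Time: reject frames too quiet to contain speech (stand-in for a real detector).
    if cues.frame_energy_db < energy_threshold_db:
        return False
    # Identity: reject voices that do not match the enrolled user.
    if cues.speaker_score < speaker_threshold:
        return False
    # Vision: reject speech directed at somebody else.
    return cues.facing_device
```

In this sketch a frame is accepted only when all four checks pass, so any single cue can veto the decision; a real system might instead weight the cues or smooth the decision over several frames.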