A speech interface can be useful for various devices, including smartphones, car navigation systems, smart TVs, and humanoid robots. In particular, the ability to control such devices from a distance is one of the most attractive characteristics of speech-based interfaces. However, the performance of speech-based interfaces, including speech recognition and speaker recognition, degrades significantly in real-life conditions, where unrelated noise occurs frequently. Speech enhancement based on sound source localization can improve the quality of such interfaces by determining the location of the speaker and then boosting the signal from that location while suppressing sounds from other locations.
Conventional sound source localization methods, however, cannot reliably estimate a speaker’s location in severe noise conditions. These methods select the loudest sound source within a given area as the target location, even though that source may be unrelated to human speech. For speech-based interfaces, locations that correlate strongly with human speech should be given preference. In real-life applications, however, speech-like noise such as babble noise also occurs frequently, so preference should go more specifically to locations that correlate strongly with the target speaker. To accomplish this, this paper combines several speech analysis algorithms, including voice activity detection and speaker verification, with a sound source localization algorithm. By incorporating features that are closely correlated with human speech and with the target speaker, unrelated noise, including speech-like background noise, can be effectively suppressed.
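The idea of preferring speech-like, speaker-matched locations over merely loud ones can be illustrated with a minimal sketch. It assumes that each candidate direction already has an acoustic power estimate, a voice activity probability, and a speaker verification score, and it fuses them by simple multiplication. The function name, the multiplicative fusion rule, and all numbers are hypothetical and chosen only for illustration; they are not the fusion scheme used in this paper.

```python
def select_target_direction(power, speech_prob, speaker_score):
    """Pick the candidate direction with the highest combined score.

    power         : acoustic power per candidate direction (non-negative)
    speech_prob   : voice-activity probability per direction, in [0, 1]
    speaker_score : speaker-verification score per direction, in [0, 1]

    The multiplicative fusion below is an illustrative assumption,
    not the method proposed in the paper.
    """
    if not (len(power) == len(speech_prob) == len(speaker_score)):
        raise ValueError("all inputs must have the same length")
    total = sum(power)
    if total <= 0:
        raise ValueError("total power must be positive")
    # Normalize power so that it is comparable with the two probabilities.
    combined = [
        (p / total) * v * s
        for p, v, s in zip(power, speech_prob, speaker_score)
    ]
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined


# Hypothetical scene with three candidate directions:
#   0: loud non-speech noise, 1: babble (speech-like), 2: target speaker
power = [9.0, 4.0, 5.0]
speech_prob = [0.05, 0.90, 0.85]
speaker_score = [0.50, 0.20, 0.90]

# Conventional choice: the loudest direction.
loudest = max(range(len(power)), key=power.__getitem__)

# Speech- and speaker-aware choice.
best, scores = select_target_direction(power, speech_prob, speaker_score)
```

In this made-up scene the loudest direction is the non-speech noise source, while the combined score favors the direction whose speech and speaker scores are both high, even though it is not the loudest.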
The proposed method was tested under a variety of conditions using both simulated and real data. Experimental results indicated that the proposed method outperformed a conventional algorithm across various noise types and signal-to-noise ratios. In particular, the proposed method performed much better than the conventional algorithm under severe noise conditions.