Speech-based interaction is commonly used for the exchange of information between a human and a robot. However, speech recognition for a robot is obstructed by noise in a real-life environment. Various speech-enhancement techniques have been studied to overcome this problem.
Sound source localization (SSL) is the technique of determining the direction of a sound source. Because this direction is used as prior information for speech-enhancement technologies such as beamforming, SSL is a crucial component of noise-robust speech-based human-robot interaction. The steered response power-phase transform (SRP-PHAT) method for SSL has been widely used owing to its robustness to reverberation. However, it is known that SRP-PHAT cannot be executed in real time because it must evaluate a very large number of candidate sound source locations. Thus, various CPU-based approaches have been proposed to overcome this problem.
Widely used GPU programming toolkits such as compute unified device architecture (CUDA) and open computing language (OpenCL) have advanced GPU computing by helping to integrate PC and GPU environments. To keep pace with this changing environment, it is vital to convert conventional algorithms into GPU-based algorithms for improved performance.
SRP-PHAT is divided into four stages: loading the time-difference-of-arrival (TDOA) table, calculating the cross-correlations, calculating the SRP energy map, and searching for the maximum-SRP coordinates. Each stage is then transformed into a GPU-based framework. If the configurations of the microphone array and the candidate coordinates remain unchanged, the TDOA values also remain unchanged; therefore, the TDOA values are pre-calculated. Cross-correlations are calculated only once per frame because they are commonly referenced by all the microphone pairs. On the basis of these cross-correlations, the SRP energy map for all the candidate coordinates is calculated. The candidate coordinates with the maximum SRP are selected as the direction of the sound source.
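The four stages described above can be illustrated with a minimal sequential sketch. It is not the GPU implementation from this study: it is a plain-Python reference that assumes a speed of sound of 343 m/s, TDOA values rounded to whole samples, circular cross-correlation, and a naive DFT in place of an FFT. All function names (`build_tdoa_table`, `gcc_phat`, `srp_phat`) are hypothetical and chosen only for illustration.

```python
import cmath
import math
from itertools import combinations

SPEED_OF_SOUND = 343.0  # m/s; an assumed value, not taken from the study


def build_tdoa_table(mics, candidates, fs):
    """Stage 1: pre-calculate the TDOA (in samples) per candidate and mic pair.

    The table depends only on the geometry, so it is built once and reused.
    """
    pairs = list(combinations(range(len(mics)), 2))
    table = []
    for c in candidates:
        dist = [math.dist(c, m) for m in mics]
        table.append([round((dist[i] - dist[j]) / SPEED_OF_SOUND * fs)
                      for i, j in pairs])
    return pairs, table


def dft(x, inverse=False):
    """Naive O(n^2) DFT, used only to keep the sketch dependency-free."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
               for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def gcc_phat(spec_i, spec_j):
    """Stage 2: PHAT-weighted circular cross-correlation of one mic pair."""
    cross = [a * b.conjugate() for a, b in zip(spec_i, spec_j)]
    # PHAT weighting keeps only the phase of each frequency bin.
    whitened = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    return [v.real for v in dft(whitened, inverse=True)]


def srp_phat(frames, pairs, table):
    """Stages 2-4 for one frame; returns (best candidate index, energy map)."""
    n = len(frames[0])
    spectra = [dft(f) for f in frames]
    # Cross-correlations are computed once per frame and then only looked up.
    cc = [gcc_phat(spectra[i], spectra[j]) for i, j in pairs]
    # Stage 3: each candidate sums the correlation values at its own lags.
    energy = [sum(cc[p][lag % n] for p, lag in enumerate(lags))
              for lags in table]
    # Stage 4: the candidate with the maximum SRP is the estimated source.
    best = max(range(len(energy)), key=energy.__getitem__)
    return best, energy
```

In this sketch, the loop over candidates in stage 3 has no dependencies between iterations, which is the kind of data parallelism a GPU version could exploit by assigning one thread per candidate coordinate.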
The experiment was carried out using a single-core CPU and a GPU with varying numbers of microphone channels and candidate coordinates. The execution times were measured on a 3.4-GHz CPU and a GPU with 288 CUDA cores. Compared with the execution time of a conventional single-core CPU-based SRP-PHAT, the execution times of the proposed method showed an 11- to 19-fold and a 19- to 25-fold improvement.
In this study, SRP-PHAT, optimized as a sequential implementation, was divided into four stages. Each stage was presented as a generalized parallel framework. Thus, users can adapt the algorithm to suit their application. In particular, the cross-correlation stage presents variable data parallelism with respect to the number of microphones. Similarly, the SRP energy map calculation stage presents variable data parallelism with respect to the number of candidate coordinates. Finally, the stage that searches for the maximum-SRP coordinates presents a performance improvement in execution time in proportion to log2(NC), where NC is the number of candidate coordinates.
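The log2(NC) behavior of the maximum-SRP search stage can be illustrated with a tree reduction. The exact search scheme is not given here, so the sketch below is an assumption: it simulates a pairwise parallel reduction sequentially, where each pass of the outer loop stands for one parallel step in which all comparisons would run at the same time on a GPU. The function name `parallel_argmax` is hypothetical.

```python
def parallel_argmax(energy):
    """Simulated tree reduction; returns (index of maximum, parallel steps)."""
    if not energy:
        raise ValueError("energy map must not be empty")
    vals = list(energy)
    idx = list(range(len(vals)))
    active = len(vals)
    steps = 0
    while active > 1:
        half = (active + 1) // 2
        # On a GPU, these active // 2 comparisons would be done by
        # separate threads in a single step.
        for t in range(active // 2):
            if vals[t + half] > vals[t]:
                vals[t] = vals[t + half]
                idx[t] = idx[t + half]
        active = half
        steps += 1
    return idx[0], steps
```

For an energy map with 8 candidates, this reduction finishes in 3 steps, compared with the 7 comparisons a sequential scan performs one after another; in general the number of steps is the ceiling of log2(NC).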