This thesis describes audio-to-visual conversion techniques for efficient multimedia communication. Audio signals are automatically converted to visual images of mouth shapes. Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interaction. Visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. Hidden Markov models (HMMs) are used to convert audio signals to a sequence of visemes.
This study compares four approaches to using HMMs. In the first approach, an HMM is trained for each viseme, and the audio signals are directly recognized as a sequence of visemes. In the second approach, each phoneme is modeled with an HMM, and a general phoneme recognizer is used to produce a phoneme sequence from the audio signals. The phoneme sequence is then converted to a viseme sequence. In the third approach, an HMM is trained for each triviseme, which is a viseme together with its left and right context, and the audio signals are directly recognized as a sequence of trivisemes. In the fourth approach, each triphone is modeled with an HMM, and a general triphone recognizer is used to produce a triphone sequence from the audio signals. In the third and fourth approaches, the triviseme or triphone sequence is then converted to a viseme sequence. The performance of the four viseme recognition systems is evaluated on the TIMIT speech corpus.
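The conversion step in the phoneme-based and triphone-based approaches can be illustrated with a minimal sketch. The phoneme-to-viseme table, the viseme class names, and the `left-center+right` triphone notation below are assumptions made for illustration; they are not the classes or label format used in this thesis.

```python
# Hypothetical many-to-one phoneme-to-viseme table (illustrative only;
# the actual viseme classes of the thesis are not reproduced here).
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "aa": "V_open", "ae": "V_open",
    "uw": "V_rounded", "ow": "V_rounded",
    "sil": "V_sil",
}


def center_of(triphone: str) -> str:
    """Return the center unit of a label written as 'left-center+right'.

    The notation is an assumption; a label without context is returned as-is.
    """
    unit = triphone
    if "-" in unit:
        unit = unit.split("-", 1)[1]   # drop the left context
    if "+" in unit:
        unit = unit.split("+", 1)[0]   # drop the right context
    return unit


def phonemes_to_visemes(phonemes):
    """Map each recognized phoneme to its viseme class."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


def triphones_to_visemes(triphones):
    """Strip the context from each triphone, then map the center phoneme."""
    return phonemes_to_visemes(center_of(t) for t in triphones)


if __name__ == "__main__":
    print(phonemes_to_visemes(["b", "aa", "m"]))
    print(triphones_to_visemes(["sil-b+aa", "b-aa+m", "aa-m+sil"]))
```

Because several phonemes share one viseme, the mapping is lossy in one direction only: a phoneme confusion inside the same viseme class does not count as a viseme error.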
The viseme-based recognizer shows a viseme recognition error rate of 33.9%, and the phoneme-based approach shows 29.7%. The triviseme-based approach shows an error rate of 22.7%, and the triphone-based approach shows 17.4%. When similar viseme classes are merged, the error rates are reduced to 26.9%, 19.6%, 18.8% and 10.7%, respectively. These results show that the triviseme-based system is more accurate than the monoviseme-based system.
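As a rough illustration of how such error rates and the effect of class merging could be computed, the sketch below scores a recognized viseme sequence against a reference using edit distance (substitutions, deletions and insertions divided by the reference length). This scoring definition is a common convention and is assumed here; the abstract does not state the exact metric, and the class labels and the `merge` table are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference label
                cur[j - 1] + 1,          # insertion of a hypothesis label
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def error_rate(ref, hyp, merge=None):
    """Error rate = edit distance / reference length.

    `merge` is an optional (hypothetical) table that maps a viseme class
    to the merged class it belongs to; unmapped labels are kept as-is.
    """
    if not ref:
        raise ValueError("reference sequence must not be empty")
    merge = merge or {}
    ref = [merge.get(x, x) for x in ref]
    hyp = [merge.get(x, x) for x in hyp]
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    ref = ["A", "B", "C", "D"]
    hyp = ["A", "E", "C", "D"]
    print(error_rate(ref, hyp))                    # one substitution in four labels
    print(error_rate(ref, hyp, merge={"E": "B"}))  # B and E treated as one class
```

Merging two easily confused classes removes the substitutions between them, which is one way the lower figures after merging can arise.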