Tokyo Tech News
Tokyo Tech News
Published: January 31, 2011
Automatic speech recognition (ASR) is the process used to convert spoken utterances into text using a computer. ASR systems usually consist of a voice activity detector (VAD) and a decoder, which uses acoustic and language models. Since state-of-the-art ASR systems are based on statistical models comprising millions of parameters and therefore need a huge amount of computation especially for large-vocabulary continuous speech recognition, it is essential to find effective ways to reduce computational complexity in order to realize real-time systems.
For this purpose, Paul Dixon, Tasuku Oonishi, and Sadaoki Furui have developed a WFST (Weighted Finite-State Transducer)-based decoder, named T3 (Tokyo Tech Transducer-based) Decoder—T3 is read as “T-cubed”. In the T3 decoder, on-the-fly composition is implemented for online transducer update, a commodity Graphics Processing Unit (GPU) is used for decoding, and a VAD is embedded in the decoder (Fig.1). The WFST decoder with on-the-fly composition capability provides a very flexible and efficient decoding structure, the GPU significantly reduces the recognition time, and the VAD-embedded decoder framework achieves high recognition accuracy even for noisy speech. By using these techniques, very fast and accurate ASR systems, such as a real-time high-accuracy continuous speech recognition system with a 500,000 word vocabulary, were achieved.
Structure of an efficient automatic speech recognition system
Reference
Sadaoki Furui
Program for Leading Graduate Schools
Professor Emeritus