Tokyo Tech News
Published: January 31, 2011
Automatic speech recognition (ASR) is the process of converting spoken utterances into text by computer. An ASR system usually consists of a voice activity detector (VAD) and a decoder, which uses acoustic and language models. State-of-the-art ASR systems are based on statistical models comprising millions of parameters, so they require a huge amount of computation, especially for large-vocabulary continuous speech recognition. Finding effective ways to reduce this computational complexity is therefore essential for realizing real-time systems.
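To make the role of the VAD concrete, the following is a minimal sketch of an energy-threshold detector that marks which frames of a signal would be passed on to a decoder. It is an illustration only and not the method used in the system described here: the function names, the frame length, and the threshold value are all assumptions chosen for the example.

```python
from typing import List, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame (hypothetical helper)."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_speech(samples: List[float],
                  frame_len: int = 4,
                  threshold: float = 0.01) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) spans, end exclusive, whose energy
    exceeds the threshold. Both default values are illustrative assumptions."""
    energies = frame_energies(samples, frame_len)
    segments: List[Tuple[int, int]] = []
    start = None
    for idx, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = idx                      # speech begins
        elif energy <= threshold and start is not None:
            segments.append((start, idx))    # speech ends
            start = None
    if start is not None:                    # signal ended during speech
        segments.append((start, len(energies)))
    return segments
```

In a pipeline of this kind, only the frames inside the returned spans would be handed to the decoder, which avoids spending computation on silence.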
For this purpose, Paul Dixon, Tasuku Oonishi, and Sadaoki Furui have developed a WFST (Weighted Finite-State Transducer)-based decoder named the T3 (Tokyo Tech Transducer-based) Decoder, where T3 is read as “T-cubed”. The T3 decoder implements on-the-fly composition for online transducer updates, uses a commodity Graphics Processing Unit (GPU) for decoding, and embeds a VAD in the decoder (Fig. 1). The WFST decoder with on-the-fly composition provides a very flexible and efficient decoding structure, the GPU significantly reduces recognition time, and the VAD-embedded decoder framework achieves high recognition accuracy even for noisy speech. With these techniques, very fast and accurate ASR systems have been achieved, including a real-time, high-accuracy continuous speech recognition system with a 500,000-word vocabulary.
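The idea behind on-the-fly composition can be sketched in a few lines. The example below is not the T3 implementation; it is a simplified illustration under stated assumptions: weights are costs that add along a path (tropical semiring), there are no epsilon labels, and the class and function names (Wfst, LazyComposition, best_path) are hypothetical. States of the composed transducer are pairs of component states, and their outgoing arcs are built only when the search first reaches them.

```python
from typing import Dict, List, Optional, Tuple

# An arc is (input_label, output_label, weight, next_state).
Arc = Tuple[str, str, float, object]


class Wfst:
    """A small explicit weighted transducer (hypothetical container)."""

    def __init__(self, start: int, finals: Dict[int, float],
                 arcs: Dict[int, List[Arc]]):
        self.start = start
        self.finals = finals
        self.arcs = arcs

    def arcs_from(self, state) -> List[Arc]:
        return self.arcs.get(state, [])

    def final_weight(self, state) -> Optional[float]:
        return self.finals.get(state)


class LazyComposition:
    """Composition of a and b, expanded on demand. Assumes no epsilon labels."""

    def __init__(self, a: Wfst, b: Wfst):
        self.a, self.b = a, b
        self.start = (a.start, b.start)
        self.cache: Dict[tuple, List[Arc]] = {}  # states expanded so far

    def arcs_from(self, state) -> List[Arc]:
        if state not in self.cache:
            sa, sb = state
            out = []
            for ia, oa, wa, na in self.a.arcs_from(sa):
                for ib, ob, wb, nb in self.b.arcs_from(sb):
                    if oa == ib:  # output of a must match input of b
                        out.append((ia, ob, wa + wb, (na, nb)))
            self.cache[state] = out
        return self.cache[state]

    def final_weight(self, state) -> Optional[float]:
        fa = self.a.final_weight(state[0])
        fb = self.b.final_weight(state[1])
        return None if fa is None or fb is None else fa + fb


def best_path(fst, inputs: List[str]):
    """Symbol-synchronous Viterbi search; returns (cost, outputs) or None."""
    hyps = {fst.start: (0.0, [])}  # state -> (best cost, output labels)
    for sym in inputs:
        nxt_hyps = {}
        for state, (cost, outs) in hyps.items():
            for il, ol, w, nxt in fst.arcs_from(state):
                if il != sym:
                    continue
                cand = cost + w
                if nxt not in nxt_hyps or cand < nxt_hyps[nxt][0]:
                    nxt_hyps[nxt] = (cand, outs + [ol])
        hyps = nxt_hyps
    best = None
    for state, (cost, outs) in hyps.items():
        fw = fst.final_weight(state)
        if fw is not None and (best is None or cost + fw < best[0]):
            best = (cost + fw, outs)
    return best
```

Because the composed machine is never built in full, one component (for example a language model) can in principle be replaced or updated without recompiling the whole search network, which is the flexibility the text attributes to on-the-fly composition. A production decoder would additionally need epsilon handling, pruning, and real acoustic scores, none of which are shown here.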
Fig. 1: Structure of an efficient automatic speech recognition system
Sadaoki Furui
Program for Leading Graduate Schools
Professor Emeritus