Fu-Hua Liu, Pedro J. Moreno, Richard M. Stern, Alejandro Acero
Department of Electrical and Computer Engineering
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
This paper describes a series of cepstral-based compensation procedures that render the SPHINX-II system more robust with respect to the acoustical environment. The first algorithm, phone-dependent cepstral compensation, is similar in concept to the previously described MFCDCN method, except that cepstral compensation vectors are selected according to the current phonetic hypothesis rather than on the basis of SNR or VQ codeword identity. We also describe two procedures for adapting the VQ codebook to new environments, as well as the use of reduced-bandwidth frequency analysis to process telephone-bandwidth speech. Use of the various compensation algorithms in consort reduces error rates for SPHINX-II by as much as 40 percent relative to the rate achieved with cepstral mean normalization alone, both in development test sets and in the context of the 1993 ARPA CSR evaluations.
In this paper we describe several new procedures that, when used in consort, can provide as much as an additional 40 percent improvement over baseline processing with CMN. These techniques include:
• Phone-dependent cepstral compensation
• Environmental interpolation of compensation vectors
• Codebook adaptation
• Reduced-band analysis for telephone-bandwidth speech
• Silence codebook adaptation
In Sec. 2 we describe these compensation procedures in detail, and we examine their effect on recognition accuracy in Secs. 3 and 4.
2. ENVIRONMENTAL COMPENSATION ALGORITHMS
We begin this section by reviewing the previously described MFCDCN algorithm, which is the basis for most of the new procedures discussed. We then discuss blind environment selection and environmental interpolation as they apply to MFCDCN. The complementary procedures of phone-dependent cepstral normalization and codebook adaptation are described next. We close this section with brief descriptions of reduced-bandwidth analysis and silence-codebook adaptation, which are very beneficial in processing telephone-bandwidth speech and speech recorded in the presence of strong background noise, respectively.
A continuing problem with current speech recognition technology is its lack of robustness with respect to environmental variability. For example, the use of microphones other than the ARPA standard Sennheiser HM-414 “close-talking” headset (CLSTLK) severely degrades the performance of systems like the original SPHINX system, even in a relatively quiet office environment [e.g., 1, 2]. Applications such as speech recognition in automobiles, over telephones, on a factory floor, or outdoors demand an even greater degree of environmental robustness. In this paper we describe and compare the performance of a series of cepstrum-based procedures that enable the CMU SPHINX-II speech recognition system to maintain a high level of recognition accuracy over a wide variety of acoustical environments. We also discuss the aspects of these algorithms that appear to have contributed most significantly to the success of the SPHINX-II system in the 1993 ARPA CSR evaluations for microphone independence (Spoke 5) and calibrated noise sources (Spoke 8). In previous years we described the performance of cepstral mapping procedures such as the CDCN algorithm, which is effective but fairly costly computationally. More recently we discussed the use of cepstral highpass-filtering algorithms such as the popular RASTA and cepstral mean normalization (CMN) algorithms. These algorithms are very simple to implement but somewhat limited in effectiveness, and CMN is now part of baseline processing for the CMU and many other systems.
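As a point of reference for the CMN baseline mentioned above, the following is a minimal sketch of per-utterance cepstral mean normalization in plain Python. It is not the SPHINX-II implementation. The function name, the list-of-lists frame layout, and the toy values are assumptions made only for illustration. The sketch shows why CMN suppresses a fixed linear channel: such a channel appears as a constant additive offset in the cepstral domain, and subtracting the utterance mean removes it.

```python
from typing import List, Sequence


def cepstral_mean_normalize(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Subtract the per-utterance mean cepstral vector from every frame.

    `frames` is assumed to hold one cepstral vector per analysis frame,
    all of the same dimension (hypothetical layout for this sketch).
    """
    if not frames:
        return []
    n_frames = len(frames)
    dim = len(frames[0])
    # Mean of each cepstral coefficient over the whole utterance.
    mean = [sum(frame[k] for frame in frames) / n_frames for k in range(dim)]
    return [[frame[k] - mean[k] for k in range(dim)] for frame in frames]


# Toy utterance: three frames of two cepstral coefficients (made-up values).
utterance = [[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]]
normalized = cepstral_mean_normalize(utterance)

# The same utterance with a constant cepstral offset added to every frame,
# standing in for a different microphone or channel (illustrative only).
offset = [0.5, -1.0]
shifted = [[frame[k] + offset[k] for k in range(2)] for frame in utterance]
normalized_shifted = cepstral_mean_normalize(shifted)
```

In this toy case `normalized` and `normalized_shifted` are identical, since the constant offset is absorbed into the mean. Additive noise and other environment effects that vary with the signal are not a constant cepstral offset, which is consistent with the limited effectiveness of CMN noted above.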
2.1. Multiple Fixed Codeword-Dependent Cepstral Normalization (MFCDCN)