Joseph Keshet¹, Dan Chazan² and Ben-Zion Bobrovsky¹
¹ Department of Electrical Engineering, Tel Aviv University, Tel Aviv 69978, Israel
² HRL Audio/Video Technologies Group, IBM Israel - Science and Technology, Haifa 31905, Israel
This paper presents a novel algorithm for precise spotting of plosives. The algorithm is based on a pattern matching technique implemented with margin classifiers, such as support vector machines (SVMs). A special hierarchical treatment, which uses loss-based multi-class decisions, is presented to overcome the problem of false detections on fricatives and silence. Furthermore, a method for smoothing the overall decisions by sequential linear programming is described. The proposed algorithm was tested on the TIMIT corpus, where it achieved very high spotting accuracy. The algorithm presented here is applied to plosive detection, but can easily be adapted to any class of phonemes.
The plosive consonants (/b/, /d/, /g/, /k/, /p/ and /t/) are unique among phoneme categories in English, since they involve three distinct stages that are sequential in time:
1. Closure (occlusion): The articulators totally block the air-stream and the air pressure increases just behind the obstruction. For the voiced plosives (/b/, /d/ and /g/), there is an underlying voicing activity during part of this stage.
2. Burst: The articulators quickly move away from each other. An explosive burst of air rushes through the opening, involving energy in most or all of the audible spectrum.
3. Transition: A transition segment to the next sound.
Today, the hidden Markov model (HMM) is the predominant acoustic model in continuous speech recognition systems. Inherently, the HMM suffers from three basic restrictions: the assumption of conditional independence of observations given the state sequence, feature extraction imposed by frame-based observations, and a duration model implicitly given by a geometric distribution. These restrictions result in a very poor model for plosives, and hence reduce recognition rates on this important class.
Several special-purpose plosive recognizers that are not based on HMMs have been proposed. Torres and Iparraguirre proposed two knowledge-based classifiers for identification of Spanish unvoiced stops, which were designed and tested over a consonant-vowel (CV) context and resulted in a satisfactory rate of identification. Morris et al. compared the baseline performance of human perception of the consonantal place of articulation with the performance of two automatic speech recognition techniques (Kohonen self-organizing map and Gaussian mixture classifier) on multilingual VC and CV vocalic transition segments. Ali et al. suggested a new set of acoustic features and a knowledge-based acoustic-phonetic system for automatic recognition of isolated stops taken from continuous speech. Lin, Lee and Lin presented methods for CV alignment of Mandarin Chinese speech, using fuzzy implication to find abrupt changes in spectral difference and spectral distance measures. They reported that their system's performance was comparable to that of a human expert, though this system might fail to handle continuous English speech.
Although there have been some studies on segmenting and recognizing plosives within known patterns of speech (such as CV, VC, etc.), so far no work has been carried out on accurate segmentation and recognition of plosives in fluent speech. We propose a two-stage scheme to carry out the recognition of plosives. During the first stage, the exact location of the plosive is spotted; in the second stage, the plosive is classified as a specific type, given its location. Note that the purpose of this work is to obtain a model for plosives. While for clean speech the proposed two-stage approach may be an effective scheme for actually...