Modular Neural Networks for Speech Recognition Diploma thesis
Jürgen Fritsch
Interactive Systems Laboratories
Carnegie Mellon University, USA
University of Karlsruhe, Germany
Supervisors: Prof. Dr. Alex Waibel, Michael Finke
7/31/96, Pittsburgh, PA
Keywords: Hybrid Speech Recognition, Neural Networks, Hierarchical Mixtures of Experts, Mixture Models, Acoustic Modeling, Connectionist Context Modeling, Constructive Methods.
I hereby declare that I wrote this thesis independently and used no aids other than those indicated.
Pittsburgh, July 31, 1996
Abstract
In recent years, researchers have established the viability of so-called hybrid NN/HMM systems for large-vocabulary, speaker-independent continuous speech recognition, in which neural networks (NN) estimate the acoustic emission probabilities for hidden Markov models (HMM), which in turn provide statistical temporal modeling. Work in this direction is based on a proof that neural networks can be trained to estimate posterior class probabilities. Advantages of the hybrid approach over traditional mixture-of-Gaussians systems include discriminative training, fewer parameters, contextual inputs and faster sentence decoding. However, hybrid systems usually have training times that are orders of magnitude higher than those observed in traditional systems. This is largely due to the costly, gradient-based error-backpropagation learning algorithm applied to very large neural networks, which often requires the use of specialized parallel hardware.
This thesis examines how a hybrid NN/HMM system can benefit from the use of modular and hierarchical neural networks such as the hierarchical mixtures of experts (HME) architecture. Based on a powerful statistical framework, it is shown that modularity and the principle of divide-and-conquer, applied to neural network learning, reduce training times significantly. We developed a hybrid speech recognition system based on modular neural networks and the state-of-the-art continuous density HMM speech recognizer JANUS. The system is evaluated on the English Spontaneous Scheduling Task (ESST), a 2400-word spontaneous speech database. We developed an adaptive tree growing algorithm for the hierarchical mixtures of experts, which is shown to make better use of the parameters of the architecture than a pre-determined topology. We also explored alternative parameterizations of expert and gating networks based on Gaussian classifiers, which allow even faster training because of near-optimal initialization techniques.
Finally, we enhanced our originally context-independent hybrid speech recognizer to model polyphonic contexts, adopting decision-tree-clustered context classes from a Gaussian mixture system.
Acknowledgments
I would like to thank Alex Waibel for his support and for the opportunity to work for almost one year in a progressive environment on a challenging project at Carnegie Mellon University, Pittsburgh. I would also like to thank Michael Finke, who was in charge of supervising this thesis and who, in my opinion, did a pretty darn good job of it. I'm especially thankful for the inspiring discussions we had every so often. I'm still trying hard to get him to accept an invitation to dinner. I also wish to thank all the other members of the nnspeech group for their support and for the good times we had. Last, but by no means least, I'd like to thank my parents for giving me all their support (including financial) and encouragement in my plans to experience the US and study at CMU.
Pittsburgh, July 1996
Contents
1 Introduction
2 Neural Networks
  2.1 Introduction
  2.2 Multi Layer Perceptrons
  2.3 Radial Basis Function networks
  2.4 Statistical Interpretation
    2.4.1 Perceptrons
    2.4.2 Multi Layer Perceptrons
    2.4.3 Unsupervised Learning