
Thesis: Exploring Deep Learning Methods for Discovering Features in Speech Signals

Navdeep Jaitly
Ph.D. Thesis
University of Toronto

This thesis makes three main contributions to the area of speech recognition with Deep Neural Network – Hidden Markov Models (DNN-HMMs).
Firstly, we explore how effective features learnt from speech databases using Deep Learning are for speech recognition. This contrasts with prior work, which has largely confined itself to traditional features such as Mel-frequency cepstral coefficients and Mel log filter banks. We start by showing that features learnt on raw signals using Gaussian-ReLU Restricted Boltzmann Machines can achieve accuracy close to that achieved with the best traditional features. These features are, however, learnt using a generative model that ignores domain knowledge. We therefore develop methods that use capsules to discover features endowed with semantics that are meaningful in the domain. To this end, we extend previous work on transforming autoencoders and propose a new autoencoder with a domain-specific decoder to learn capsules from speech databases. We show that capsule instantiation parameters can be combined with Mel log filter banks to improve phone recognition on TIMIT. On WSJ the word error rate does not improve, even though we get strong gains in classification accuracy. We speculate that this is due to a mismatch between the two objectives: word error rate is measured over an utterance, whereas frame error rate is measured on the sub-phonetic class of each frame.
Secondly, we develop a data augmentation method for speech datasets. Such methods yield strong gains in object recognition but have largely been ignored in speech recognition. Our data augmentation encourages the learning of invariance to the vocal tract length of speakers. The method is shown to improve the phone error rate on TIMIT and the word error rate on a 14-hour subset of WSJ.
Lastly, we develop a method for learning and using a longer-range model of targets, conditioned on the input. This method predicts the labels for multiple frames together and uses a geometric average of these predictions during decoding. It produces state-of-the-art results on phone recognition with TIMIT and also produces significant gains on WSJ.
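
The geometric averaging of multi-frame predictions can be illustrated with a minimal sketch. It assumes, purely for illustration, that a model centred at each frame emits one label distribution for every frame in a symmetric window around it, and that all predictions covering a given frame are weighted equally. The function name, the `context` parameter, and the data layout are hypothetical, and the thesis's actual formulation (including how the result is used inside the DNN-HMM decoder) may differ.

```python
import math


def geometric_average_posteriors(window_preds, context):
    """Combine overlapping multi-frame predictions into per-frame posteriors.

    Assumed layout (hypothetical): window_preds[t][j] is the label
    distribution that the model centred at frame t assigns to frame
    t + (j - context), for j in 0 .. 2 * context.
    """
    num_frames = len(window_preds)
    num_labels = len(window_preds[0][0])
    eps = 1e-12  # guards against log(0)
    combined = []
    for t in range(num_frames):
        log_sum = [0.0] * num_labels
        count = 0
        for offset in range(-context, context + 1):
            # The window centred at `centre` predicts frame t in slot `offset`.
            centre = t - offset
            if 0 <= centre < num_frames:
                dist = window_preds[centre][offset + context]
                for k in range(num_labels):
                    log_sum[k] += math.log(dist[k] + eps)
                count += 1
        # Geometric mean is exp(mean of logs); renormalise so it sums to 1.
        geo = [math.exp(s / count) for s in log_sum]
        total = sum(geo)
        combined.append([g / total for g in geo])
    return combined
```

Averaging in the log domain keeps the computation numerically stable, and a label must be supported by every overlapping prediction to keep a high combined probability, which is the property that distinguishes a geometric from an arithmetic average.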
