
EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding

The performance of automatic speech recognition (ASR) has improved tremendously due to the application of deep neural networks (DNNs). Despite this progress, building a new ASR system remains a challenging task, requiring various resources, multiple training stages and significant expertise. This paper presents our Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems. Acoustic modeling in Eesen involves learning a single recurrent neural network (RNN) predicting context-independent targets (phonemes or characters). To remove the need for pre-generated frame labels, we adopt the connectionist temporal classification (CTC) objective function to infer the alignments between speech and label sequences. A distinctive feature of Eesen is a generalized decoding approach based on weighted finite-state transducers (WFSTs), which enables the efficient incorporation of lexicons and language models into CTC decoding. Experiments show that compared with the standard hybrid DNN systems, Eesen achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.
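As a rough illustration of how CTC relates per-frame outputs to a label sequence without pre-generated frame labels: the network may emit a blank symbol or repeat a label across frames, and a frame-level path is mapped to a label sequence by merging consecutive repeats and then dropping blanks. The sketch below shows only that collapsing step. It is not Eesen's WFST-based decoder, and the function name `ctc_collapse` and the choice of index 0 for the blank are assumptions made for this example.

```python
def ctc_collapse(frame_labels, blank=0):
    """Map a per-frame label path to a label sequence.

    Consecutive repeats are merged first, then blanks are dropped.
    The blank index (0) is an assumption for this sketch.
    """
    collapsed = []
    previous = None
    for label in frame_labels:
        # Keep a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return collapsed


# A blank between two identical labels keeps them as separate outputs.
path = [0, 3, 3, 0, 3, 5, 5, 0]
print(ctc_collapse(path))  # [3, 3, 5]
```

Because a blank separates the two runs of label 3, the collapsed sequence keeps both occurrences, which is how CTC can represent doubled labels.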

1507.08240v3 EESEN End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding.pdf

Thesis: Exploring Deep Learning Methods for Discovering Features in Speech Signals


Info
Navdeep Jaitly
Ph.D. Thesis
2014
University of Toronto

This thesis makes three main contributions to the area of speech recognition with Deep Neural Network – Hidden Markov Models (DNN-HMMs).
Firstly, we explore the effectiveness of features learnt from speech databases using Deep Learning for speech recognition. This contrasts with prior works that have largely confined themselves to using traditional features such as Mel Cepstral Coefficients and Mel log filter banks for speech recognition. We start by showing that features learnt on raw signals using Gaussian-ReLU Restricted Boltzmann Machines can achieve accuracy close to that achieved with the best traditional features. These features are, however, learnt using a generative model that ignores domain knowledge. We develop methods to discover features that are endowed with meaningful semantics that are relevant to the domain using capsules. To this end, we extend previous work on transforming autoencoders and propose a new autoencoder with a domain-specific decoder to learn capsules from speech databases. We show that capsule instantiation parameters can be combined with Mel log filter banks to produce improvements in phone recognition on TIMIT. On WSJ the word error rate does not improve, even though we get strong gains in classification accuracy. We speculate this may be because of the mismatched objectives of word error rate over an utterance and frame error rate on the sub-phonetic class for a frame.
Secondly, we develop a method for data augmentation in speech datasets. Such methods result in strong gains in object recognition, but have largely been ignored in speech recognition. Our data augmentation encourages the learning of invariance to the vocal tract length of speakers. The method is shown to improve the phone error rate on TIMIT and the word error rate on a 14-hour subset of WSJ.
Lastly, we develop a method for learning and using a longer-range model of targets, conditioned on the input. This method predicts the labels for multiple frames together and uses a geometric average of these predictions during decoding. It produces state-of-the-art results on phone recognition with TIMIT and also produces significant gains on WSJ.
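The geometric averaging of several predictions for one frame can be sketched as follows. This is a minimal illustration, assuming each prediction is a probability vector over the same classes and that the average is renormalized to sum to one. The function name `geometric_average` and the small probability floor `eps` are assumptions for this example, not details taken from the thesis.

```python
import math


def geometric_average(predictions, eps=1e-12):
    """Combine several probability vectors for the same frame.

    Computes exp(mean(log p)) per class, then renormalizes so the result
    sums to one. `eps` guards against log(0); it is an assumption here.
    """
    n_preds = len(predictions)
    n_classes = len(predictions[0])
    unnormalized = []
    for c in range(n_classes):
        mean_log = sum(math.log(max(p[c], eps)) for p in predictions) / n_preds
        unnormalized.append(math.exp(mean_log))
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Two predictions that disagree symmetrically average out to a uniform vector.
print(geometric_average([[0.9, 0.1], [0.1, 0.9]]))
```

Compared with an arithmetic mean, the geometric mean penalizes a class heavily when any single prediction assigns it a very low probability.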


Thesis: Deep Learning Approaches to Problems in Speech Recognition, Computational Chemistry, and Natural Language Text Processing


Info
George E. Dahl
Ph.D. Thesis
2015
University of Toronto

The deep learning approach to machine learning emphasizes high-capacity, scalable models that learn distributed representations of their input. This dissertation demonstrates the efficacy and generality of this approach in a series of diverse case studies in speech recognition, computational chemistry, and natural language processing. Throughout these studies, I extend and modify the neural network models as needed to be more effective for each task.
In the area of speech recognition, I develop a more accurate acoustic model using a deep neural network. This model, which uses rectified linear units and dropout, improves word error rates on a 50-hour broadcast news task. A similar neural network results in a model for molecular activity prediction substantially more effective than production systems used in the pharmaceutical industry. Even though training assays in drug discovery are not typically very large, it is still possible to train very large models by leveraging data from multiple assays in the same model and by using effective regularization schemes. In the area of natural language processing, I first describe a new restricted Boltzmann machine training algorithm suitable for text data. Then, I introduce a new neural network generative model of parsed sentences capable of generating reasonable samples and demonstrate a performance advantage for deeper variants of the model.
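For readers unfamiliar with the two ingredients named for the acoustic model, rectified linear units and dropout, a minimal sketch follows. It uses the common "inverted dropout" formulation, in which surviving activations are rescaled during training. That choice, the function names, and the dropout rate are assumptions for this example, and the dissertation's exact formulation may differ.

```python
import random


def relu(activations):
    """Rectified linear unit: negative values become zero."""
    return [max(0.0, a) for a in activations]


def dropout(activations, rate, rng, training=True):
    """Inverted dropout (an assumed variant for this sketch).

    During training each unit is zeroed with probability `rate` and the
    survivors are scaled by 1 / (1 - rate). At test time the input is
    returned unchanged.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = relu([-1.5, 0.3, 2.0, -0.2])
print(hidden)                                     # [0.0, 0.3, 2.0, 0.0]
print(dropout(hidden, 0.5, rng))                  # random mask applied
print(dropout(hidden, 0.5, rng, training=False))  # unchanged at test time
```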
