
Thesis: Exploring Deep Learning Methods for Discovering Features in Speech Signals


Info
Navdeep Jaitly
Ph.D. Thesis
2014
University of Toronto

This thesis makes three main contributions to the area of speech recognition with Deep Neural Network – Hidden Markov Models (DNN-HMMs).
Firstly, we explore the effectiveness of features learnt from speech databases using Deep Learning for speech recognition. This contrasts with prior works that have largely confined themselves to using traditional features such as Mel Cepstral Coefficients and Mel log filter banks for speech recognition. We start by showing that features learnt on raw signals using Gaussian-ReLU Restricted Boltzmann Machines can achieve accuracy close to that achieved with the best traditional features. These features are, however, learnt using a generative model that ignores domain knowledge. We develop methods to discover features that are endowed with meaningful semantics that are relevant to the domain using capsules. To this end, we extend previous work on transforming autoencoders and propose a new autoencoder with a domain-specific decoder to learn capsules from speech databases. We show that capsule instantiation parameters can be combined with Mel log filter banks to produce improvements in phone recognition on TIMIT. On WSJ the word error rate does not improve, even though we get strong gains in classification accuracy. We speculate this may be because of the mismatched objectives of word error rate over an utterance and frame error rate on the sub-phonetic class for a frame.
Secondly, we develop a method for data augmentation in speech datasets. Such methods result in strong gains in object recognition, but have largely been ignored in speech recognition. Our data augmentation encourages the learning of invariance to vocal tract length of speakers. The method is shown to improve the phone error rate on TIMIT and the word error rate on a 14 hour subset of WSJ.
Lastly, we develop a method for learning and using a longer range model of targets, conditioned on the input. This method predicts the labels for multiple frames together and uses a geometric average of these predictions during decoding. It produces state of the art results on phone recognition with TIMIT and also produces significant gains on WSJ.
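The decoding step described above, combining several predictions for the same frame by a geometric average, can be illustrated with a minimal sketch. The abstract does not specify the details, so everything here is an assumption made for illustration: the function name `geometric_average`, the layout of `predictions` (one dictionary per centre frame, mapping a frame offset to a class distribution), the `eps` floor, and the final renormalization.

```python
import math


def geometric_average(predictions, eps=1e-12):
    """Combine overlapping per-frame predictions by a geometric mean.

    predictions[t] is a dict mapping an offset to the class distribution
    that the model, centred on frame t, assigns to frame t + offset.
    This layout is hypothetical; the thesis may organise it differently.
    """
    num_frames = len(predictions)
    num_classes = len(predictions[0][0])
    combined = []
    for s in range(num_frames):
        log_sum = [0.0] * num_classes
        count = 0
        for t in range(num_frames):
            dist = predictions[t].get(s - t)
            if dist is None:
                continue  # the model centred on t made no prediction for s
            for c, p in enumerate(dist):
                log_sum[c] += math.log(max(p, eps))
            count += 1
        # Geometric mean = exp(mean of logs); renormalize to sum to 1
        # (the renormalization is an assumption of this sketch).
        unnorm = [math.exp(v / count) for v in log_sum]
        total = sum(unnorm)
        combined.append([u / total for u in unnorm])
    return combined


# Two frames, two classes; each model output covers itself and one neighbour.
example = [
    {0: [0.9, 0.1], 1: [0.6, 0.4]},
    {-1: [0.4, 0.6], 0: [0.2, 0.8]},
]
averaged = geometric_average(example)
```

Averaging in the log domain means a class must be supported by every overlapping prediction to keep a high score, which is the usual motivation for a geometric rather than an arithmetic mean.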


Thesis: Deep Learning Approaches to Problems in Speech Recognition, Computational Chemistry, and Natural Language Text Processing


Info
George E. Dahl
Ph.D. Thesis
2015
University of Toronto

The deep learning approach to machine learning emphasizes high-capacity, scalable models that learn distributed representations of their input. This dissertation demonstrates the efficacy and generality of this approach in a series of diverse case studies in speech recognition, computational chemistry, and natural language processing. Throughout these studies, I extend and modify the neural network models as needed to be more effective for each task.
In the area of speech recognition, I develop a more accurate acoustic model using a deep neural network. This model, which uses rectified linear units and dropout, improves word error rates on a 50 hour broadcast news task. A similar neural network results in a model for molecular activity prediction substantially more effective than production systems used in the pharmaceutical industry. Even though training assays in drug discovery are not typically very large, it is still possible to train very large models by leveraging data from multiple assays in the same model and by using effective regularization schemes. In the area of natural language processing, I first describe a new restricted Boltzmann machine training algorithm suitable for text data. Then, I introduce a new neural network generative model of parsed sentences capable of generating reasonable samples and demonstrate a performance advantage for deeper variants of the model.
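For readers unfamiliar with the two ingredients named in the abstract, rectified linear units and dropout, the following is a minimal sketch of a single hidden layer that uses both. It is not the dissertation's model: the helper names (`dense`, `relu`, `dropout`, `hidden_layer`), the toy weights, and the use of "inverted" dropout (rescaling kept activations during training) are assumptions made for illustration.

```python
import random


def dense(x, weights, bias):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Rectified linear unit: negative activations are clamped to zero."""
    return [max(0.0, v) for v in x]


def dropout(x, rate, rng, training=True):
    """Inverted dropout (an assumption of this sketch): during training each
    unit is zeroed with probability `rate` and survivors are scaled by
    1 / (1 - rate); at test time the input passes through unchanged."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def hidden_layer(x, weights, bias, rate, rng, training):
    """Dense layer followed by ReLU and then dropout."""
    return dropout(relu(dense(x, weights, bias)), rate, rng, training)


# Toy identity weights, purely illustrative.
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
rng = random.Random(0)
test_time = hidden_layer([1.0, -2.0], weights, bias, 0.5, rng, training=False)
train_time = hidden_layer([1.0, -2.0], weights, bias, 0.5, rng, training=True)
```

At test time the layer is deterministic, while in training each positive activation is either dropped or doubled (for a rate of 0.5), so its expected value matches the test-time output.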


Paper: Anticipating the Future by Watching Unlabeled Video

In many computer vision applications, machines will need to reason beyond the present and predict the future. This task is challenging because it requires leveraging extensive commonsense knowledge of the world that is difficult to write down. We believe that a promising resource for efficiently obtaining this knowledge is the massive amount of readily available unlabeled video. In this paper, we present a large scale framework that capitalizes on temporal structure in unlabeled video to learn to anticipate both actions and objects in the future. The key idea behind our approach is that we can train deep networks to predict the visual representation of images in the future. We experimentally validate this idea on two challenging “in the wild” video datasets, and our results suggest that learning with unlabeled videos significantly helps forecast actions and anticipate objects.

Paper: Visualizing Object Detection Features

We introduce algorithms to visualize feature spaces used by object detectors. The tools in this paper allow a human to put on “HOG goggles” and perceive the visual world as a HOG-based object detector sees it. We found that these visualizations allow us to analyze object detection systems in new ways and gain new insight into the detector’s failures. For example, when we visualized the features for high scoring false alarms, we discovered that, although they are clearly wrong in image space, they look deceptively similar to true positives in feature space. This result suggests that many of these false alarms are caused by our choice of feature space, and indicates that creating a better learning algorithm or building bigger datasets is unlikely to correct these errors. By visualizing feature spaces, we can gain a more intuitive understanding of our detection systems.