Past Seminars


January 28, 2005 Brian Langner Speech In Noise, and How it Affects the Understandability of Natural and Synthetic Speech This talk will describe speech in noise, a speaking style employed by people to improve their understandability when speaking in noisy conditions. I will discuss evidence of understandability improvements for natural speech, including recent experimental data. Further, since even the highest quality speech synthesizers can be difficult to understand, I look at the viability of using this speaking style to improve understandability of synthetic speech. A technique for obtaining and using speech in noise for speech synthesis is described, in addition to data and results from understandability tests. 
February 18, 2005 Laura Tomokiyo Mayfield Speaking Globally: Building the Voices of the World Intercultural communication between humans is a complicated thing. Not only do speakers have to master grammar and pronunciation well enough to make themselves understood, they must also find the most appropriate way to say it. Is the same true for synthetic speech? As machines are called upon to speak in all languages, the industry must meet this need in a culturally sensitive way. Cepstral would like to highlight some of the interesting challenges that we have seen arise in the creation of synthetic voices. Our work in voice development for languages such as spoken Arabic, Thai, and Pashto has uncovered some important challenges. Gender roles, dialect community relationships, and speaking style can greatly affect how listeners perceive synthetic speech. Voice models can be extraordinarily difficult to find; the very naturalness of the synthetic voice may cause the speaker to worry about retribution if his voice is recognized. Although sociolinguistic concerns do play a role in the development of voices for high-density languages such as English, the lack of linguistic resources to build the voice, and of exposure to speech technology on the part of the listener, can mean that cultural issues play a far more important role in the acceptability of synthetic speech for most of the languages of the world.
February 28, 2005 Hynek Hermansky Towards a "Primal Sketch" of Speech Most automatic speech recognizers attempt to find the model of the utterance that is most consistent with the observed "data". The "data" typically represent a transformation of a sequence of short-term spectral vectors, i.e. the sequence of spectral components of short (10-20 ms) segments of the speech signal. The talk discusses an alternative representation of the acoustic signal based on likelihoods of (so far not well specified) sound events and argues that it is more consistent with current knowledge of the mammalian hearing system.
March 4, 2005 Veera Venkataramani Support Vector Machines for Automatic Speech Recognition in a Code Breaking Framework Code Breaking is a divide and conquer approach for sequential pattern recognition tasks where we identify weaknesses of an existing system and then use specialized decoders to strengthen the overall system. We study the technique in the context of Automatic Speech Recognition. Using the lattice cutting algorithm, we first analyze lattices generated by a state-of-the-art speech recognizer to spot possible errors in its first-pass hypothesis. We then train specialized decoders for each of these problems and apply them to refine the first-pass hypothesis. We study the use of Support Vector Machines (SVMs) as discriminative models over each of these problems. The estimation of a posterior distribution over hypotheses in these regions of acoustic confusion is posed as a logistic regression problem. GiniSVMs, a variant of SVMs, can be used as an approximation technique to estimate the parameters of the logistic regression problem. We first validate our approach on a small vocabulary recognition task, namely, alphadigits. We show that the use of GiniSVMs can substantially improve the performance of a well-trained MMI-HMM system. We also find that it is possible to derive reliable confidence scores over the GiniSVM hypotheses and that these can be used to good effect in hypothesis combination. We will then analyze lattice cutting in terms of its ability to reliably identify, and provide good alternatives for, incorrectly hypothesized words in the Czech MALACH domain, a large vocabulary task. We describe a procedure to train and apply SVMs to strengthen the first-pass system, resulting in small but statistically significant recognition improvements. We conclude with a discussion of methods, including clustering, for obtaining further improvements on large vocabulary tasks.
March 4, 2005 Arthur Toth Cross-Speaker Articulatory Position Modeling for Speech Production The primary parameterizations of speech used in automatic speech recognition and synthesis are based on DSP techniques. MFCC, LPCC, and derived features can be readily extracted from acoustic signals and allow the construction of relatively high-performance speech systems. However, these features (though related) are a bit removed from the actual physical process of speaking. While a person speaks, the produced sound is the result of respiration and voicing, combined with the motions of articulators, which affect the shape of the vocal tract. The locations of these articulators should also be useful for the parameterization of speech, and should enable the construction of new models. Unfortunately, articulatory position data are difficult to collect. Given these circumstances, a few natural questions to ask are: 1.) What makes articulatory position data worthwhile? 2.) How good are the models that can be created with articulatory position data and can they be improved? 3.) How can the small amount of available articulatory position data be leveraged? This talk will discuss the results of various experiments concerning the use of articulatory position data to predict quantities that are useful for speech synthesis and further experiments to examine whether these data can be applied to models for other speakers for whom no articulatory position data are available. 
March 18, 2005 Wooil Kim Feature compensation method using parallel combination of Gaussian mixture model This talk will discuss an effective feature compensation scheme based on the speech model in order to achieve robust speech recognition. In the proposed method, the parallel model combination method is applied to the estimation of the noise-corrupted speech model. Employing the model combination method eliminates an additional training procedure and its computational expense. It also brings an effective adaptation of the noise model and a more reliable estimation of the noisy-speech model. In addition, another scheme will be introduced to cope with time-varying background noise environments. In this scheme, interpolation of multiple mixture models is applied and a technique for mixture sharing is proposed for reducing the computational complexity. The performance is examined on Aurora 2.0 and on a speech corpus recorded while driving a car. The experimental results indicate that the proposed schemes are effective in realizing robust speech recognition.
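For readers unfamiliar with parallel model combination, the sketch below shows one common simplification, the log-add approximation, applied to the mean vectors of a clean-speech Gaussian mixture. It assumes that the means live in the log-spectral domain and that only means are updated; the function name and the numbers are hypothetical, and the talk's actual formulation may differ.

```python
import numpy as np

def combine_means_log_add(clean_means: np.ndarray, noise_mean: np.ndarray) -> np.ndarray:
    """Approximate noisy-speech means from clean-speech and noise means.

    Assumption: all means are log-spectral values, so speech and noise
    add in the linear spectral domain: y = log(exp(x) + exp(n)).
    """
    # np.logaddexp computes log(exp(a) + exp(b)) in a numerically stable way
    # and broadcasts the single noise mean over every mixture component.
    return np.logaddexp(clean_means, noise_mean)

# Hypothetical clean-speech mixture with two components and three channels.
clean_means = np.array([[2.0, 1.0, 0.5],
                        [0.0, 3.0, 1.5]])
# Hypothetical noise model: a single Gaussian mean estimated from noise frames.
noise_mean = np.array([1.0, 1.0, 1.0])

noisy_means = combine_means_log_add(clean_means, noise_mean)
print(noisy_means)
```

Because the combination works on model parameters alone, no noisy training data is needed, which is the saving the abstract refers to.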
March 18, 2005 Florian Metze Speech Recognition using Discriminative Combination of Articulatory Feature Detectors "Articulatory Features" such as "VOICED" and "NASAL" are commonly used to describe speech production and the resulting speech sounds. For notational purposes, typical combinations of these features are compacted into so-called "phones", if they also help lexical disambiguation. The phonetic transcriptions of words are tabulated in dictionaries which form the basis of most of today's state-of-the-art HMM-based speech recognition systems. As phones are more of a convenient short-hand notation for the description of speech than an inherent property of speech, there is interest in removing them from the statistical modeling process used for automatic speech recognition in order to improve speech-to-text performance. Approaches relying on articulatory properties alone, however, have a computational complexity that renders them unsuitable for mainstream applications. This talk will present a stream architecture which allows the use of both standard phone models and novel articulatory feature models for speech recognition. It will demonstrate how articulatory features are better suited to model human speech and present recent advances in the discriminative training of context-dependent articulatory feature detectors to improve speech recognition.
April 8, 2005 Richard Stern   
April 22, 2005 Scott Judy and Ulas Bardak The VERA System: Voice-Enabled Reminding Assistant People often miss important appointments and other items in their schedules because they simply forget to look at their planners or PDAs at the right moment. And how often do meetings go unscheduled and other important communications go unmade just because it is so hard to catch busy people and coordinate their schedules? We introduce VERA (Voice-Enabled Reminding Assistant), our project for the Spring 2005 Dialog Lab (11-754) and our approach to this problem. VERA is a dialog system based on the Ravenclaw architecture that allows registered users to set up alarms and reminders for themselves and other registered users. The system incorporates the VoIP application Skype in the place of a conventional phone line connection, and then calls users at the right time to deliver information to them. VERA can use multiple contacts while attempting to reach someone, such as cell phone, office phone, and Skype running on the user's computer. There is no limit as to how many contacts can be listed for a given user as long as each contact is a phone number or valid Skype ID. Once a connection is established, VERA attempts to confirm the identity of the person she is speaking to using several strategies. Daily, weekly, and monthly recurring events are supported, as well as urgent messages, in which a receipt confirming the success or failure of the delivery is sent by VERA to the originator. The system is divided into two separate modules, VERA-IN and VERA-OUT, which handle incoming and outgoing calls, respectively. In other words, VERA-IN collects from the user the information to be relayed, the user it is intended for, and the time of delivery, while VERA-OUT is responsible for actually delivering that information to the right user at the right time. VERA-OUT also has a web-based interface which allows users to directly enter tasks without the need for calling in.
In our talk we explain VERA-OUT and our plans for developing it further. 
April 29, 2005 Jürgen Fritsch Conversational Speech and Language Technologies for Structured Document Generation I will present Multimodal Technologies' AnyModal CDS, a clinical documentation system for generating structured and encoded medical reports from speech. Set up as a back-end service, the system is completely transparent to dictating physicians and does not require active enrollment or changes in dictation behavior, while producing complete and accurate documents. Instead of producing flat literal transcripts, AnyModal CDS recognizes and interprets the meaning and intent behind dictated speech, producing rich structured documents with semantically annotated content. In the talk, I will discuss some of the enabling speech and language technologies, focusing on (1) continuous, semi-supervised adaptation of speech recognition models based on non-literal transcripts, and (2) modeling and identification of structure and relevant concepts in spoken documents based on combinations of statistical and finite state grammars with semantic annotations. 
May 27, 2005 Antoine Raux Let's Go Public! Taking a Spoken Dialogue System into the Real World In this talk, I will describe how we made the CMU Let's Go spoken dialog system available to the general public. The Let's Go Public spoken dialog system provides bus schedule information to the Pittsburgh population at night, when the human operators are not working. I will explain the changes we made to our original research system in order to make it usable for the general public and will present an analysis of the calls and dialogue strategies we have used to ensure high performance. This talk is based on a paper by Antoine Raux, Brian Langner, Dan Bohus, Alan W Black and Maxine Eskenazi submitted to Interspeech 2005. 
June 10, 2005 Tina Bennett Large Scale Evaluation of Corpus-based Synthesizers: The Blizzard Challenge 2005 Because speech synthesis systems are often built using both different data and techniques, it has been difficult to compare the impact of various techniques. In order to improve our understanding of their effects on the quality of resulting synthesized speech, a large scale evaluation of systems using common data was devised by Drs. Alan Black and Keiichi Tokuda. The Blizzard Challenge 2005 called upon current groups working in speech synthesis around the world to build their best voices from the same datasets. This first-of-its-kind evaluation in speech synthesis was hosted by CMU this past spring. In this talk I will describe the organization and design of the Blizzard Challenge and quickly describe the CMU team's submission. The remainder of the talk will focus on a discussion of results and lessons learned from conducting an evaluation of this sort. A special session devoted to the Blizzard Challenge 2005 will be held at this year's Interspeech conference in Lisbon, Portugal. 
September 30, 2005 Brian Langner InterSpeech Review  
September 30, 2005 Stefanie Tomko InterSpeech Review  
October 14, 2005 Roger Hsiao Kernel Eigenspace-based MLLR Adaptation Eigenspace-based adaptation methods have been shown to be effective for fast speaker adaptation when the amount of adaptation data is small (for example, less than 10 s). Recently, the application of kernel methods to improve the performance of these eigenspace-based adaptation methods was proposed. The kernel eigenspace-based MLLR (KEMLLR) adaptation method is one of the kernelized speaker adaptation methods; it tries to exploit the possible non-linearity in the speaker transformation supervector space. In KEMLLR, speaker-dependent MLLR transformation matrices are mapped to a kernel-induced high dimensional feature space, and kernel principal component analysis (KPCA) is used to derive a set of eigenmatrices in the feature space. A new speaker is then represented by a linear combination of the leading eigenmatrices. In this talk, KEMLLR adaptation will be compared with other adaptation methods including MAP, MLLR, eigenvoice (EV), and embedded kernel eigenvoice (eKEV) on the Resource Management and Wall Street Journal tasks using 5 s or 10 s of adaptation speech.
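The KPCA step mentioned in the abstract above can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: the speaker transforms are random stand-ins for vectorised MLLR matrices, the RBF kernel and its `gamma` value are arbitrary choices, and the maximum-likelihood estimation of a new speaker's combination weights is not shown.

```python
import numpy as np

def rbf_kernel(X: np.ndarray, Y: np.ndarray, gamma: float) -> np.ndarray:
    """Gaussian (RBF) kernel between the rows of X and the rows of Y."""
    sq_dist = (np.sum(X**2, axis=1)[:, None]
               + np.sum(Y**2, axis=1)[None, :]
               - 2.0 * X @ Y.T)
    return np.exp(-gamma * sq_dist)

def kernel_pca(X: np.ndarray, gamma: float, n_components: int):
    """Return expansion coefficients, eigenvalues and the centred kernel matrix."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one = np.full((n, n), 1.0 / n)
    # Centre the data in feature space without computing the mapping explicitly.
    Kc = K - one @ K - K @ one + one @ K @ one
    eigvals, eigvecs = np.linalg.eigh(Kc)            # ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the leading components
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale so that each feature-space eigenvector has unit norm.
    alphas = eigvecs / np.sqrt(eigvals)
    return alphas, eigvals, Kc

rng = np.random.default_rng(0)
n_speakers, dim = 20, 4
# Stand-in data: each row is a vectorised dim x (dim + 1) transform [A b].
transforms = rng.normal(size=(n_speakers, dim * (dim + 1)))

alphas, eigvals, Kc = kernel_pca(transforms, gamma=0.01, n_components=3)
# Coordinates of the training speakers along the leading kernel components.
projections = Kc @ alphas
print(projections.shape)
```

A new speaker would then be described by a small weight vector over these leading components, which is why only a few seconds of adaptation speech are needed to estimate it.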
October 14, 2005 Tina Bennett Eurospeech 2005 - Paper Review  
October 20, 2005 Dr. Udo Bub Speech Projects at T-Labs  
November 11, 2005 Brian Langner Towards Improving Spoken Presentation of Lists In this talk, I will describe recent work on improving the spoken presentation of lists and groupings of information, focusing on the appropriate amount of information to provide. For example, within the bus information domain, such lists and groups include things like "any of 61C, 61D, 59U, or 64A" and "all of the 61's except for the 61D". I describe an experiment designed to examine the limits of list size on understandability, methods for increasing the amount of provided understandable information, as well as conclusions we can draw from the results of this experiment. Further, I will discuss future directions and goals for this research, and additionally, its use within text-to-speech systems.  
December 7, 2005 Ian R. Lane Out-of-Domain Utterance Detection based on Topic Classification Confidence One significant problem for spoken language systems is how to cope with users' OOD (out-of-domain) utterances which cannot be handled by the back-end application system. In this talk, we propose a novel OOD detection framework, which makes use of the classification confidence scores of multiple topics and applies a linear discriminant model to perform in-domain verification. The verification model is trained using a combination of deleted interpolation of the in-domain data and minimum-classification-error training, and does not require actual OOD data during the training process, thus realizing high portability. When applied to a read-style speech task, the proposed approach achieves an absolute reduction in OOD detection errors of up to 8.1 points (40% relative) compared to a baseline method based on the maximum topic classification score. Furthermore, the proposed approach realizes comparable performance to an equivalent system trained on both in-domain and OOD data, while requiring no OOD data during training. We also apply this framework to spontaneous dialogue speech, and extend the framework in two ways. First, we introduce topic clustering, which enables reliable topic confidence scores to be generated even for indistinct utterances, and second, we implement methods to effectively incorporate dialogue context.
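As a rough illustration of the verification step described in the abstract above, the sketch below contrasts a linear discriminant over per-topic confidence scores with the maximum-score baseline. The weights, bias, threshold and scores are made-up values for illustration; in the talk the model is trained with deleted interpolation and minimum-classification-error training, which is not shown here.

```python
from typing import Sequence

def max_score_baseline(confidences: Sequence[float], threshold: float) -> bool:
    """Baseline: accept as in-domain if the best single topic score clears a threshold."""
    return max(confidences) >= threshold

def linear_verifier(confidences: Sequence[float],
                    weights: Sequence[float],
                    bias: float) -> bool:
    """Accept as in-domain if the weighted sum of all topic confidences is positive."""
    if len(confidences) != len(weights):
        raise ValueError("one weight is required per topic")
    score = sum(w * c for w, c in zip(weights, confidences)) + bias
    return score > 0.0

# Hypothetical discriminant parameters for three topics.
weights = [1.0, 0.8, 1.2]
bias = -0.9

# No single topic dominates, but the utterance is plausible across several topics.
spread_utterance = [0.45, 0.40, 0.35]
# Low confidence for every topic, as expected for an out-of-domain request.
ood_utterance = [0.10, 0.05, 0.10]

print(linear_verifier(spread_utterance, weights, bias))   # weighted sum is 0.29
print(max_score_baseline(spread_utterance, 0.5))          # best score 0.45 is below 0.5
print(linear_verifier(ood_utterance, weights, bias))      # weighted sum is -0.64
```

The first utterance shows the intended benefit: combining evidence from all topics can accept an input that a single best-score threshold would reject.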