Past Seminars


Showing 21 items
June 13, 2005 Kevyn Collins-Thompson Prosody models based on automatically-derived focus words in narrative text The goal of this Speech II final project was to increase the appeal and clarity of voice synthesis for children's stories, without requiring any human pre-tagging of the text. To attempt this, I created very simple models of three 'focus word' types: topic words, difficult words, and modifier words, which are derived from the story text based on vocabulary statistics and shallow parsing. There is also a time-dependent 'novelty' component. These models drive duration and pitch changes in the Festival synthesizer. For example, the pitch of topic words is emphasized depending on their relevance and novelty, and the duration of difficult words can be stretched via a custom phoneme table. Future applications of this approach could include varying the language models used to estimate focus words, so that the same text could be adapted to different listeners, depending on their language background and skills. 
October 31, 2003 Qin Jin   
October 31, 2003 Kornel Laskowski   
October 17, 2003 Dr. Tomoki Toda A Voice Conversion Algorithm Based on Gaussian Mixture Model with Dynamic Frequency Warping Speech of various speakers can be synthesized by utilizing a voice conversion technique that can control speaker individuality. The voice conversion algorithm based on the Gaussian Mixture Model (GMM), which is a conventional statistical voice conversion algorithm proposed by Stylianou et al., can convert speech features continuously by using the correlation between a source feature and a target feature. However, the quality of the converted speech is degraded because the converted spectrum is excessively smoothed by the statistical averaging operation. In this talk, we propose a GMM-based algorithm with Dynamic Frequency Warping (DFW) to avoid such over-smoothing. In the proposed algorithm, the converted spectrum is calculated by mixing the GMM-based converted spectrum and the DFW-based converted spectrum, so that spectral conversion accuracy does not deteriorate. Evaluation experiments show that the proposed algorithm can improve the quality of the converted speech while maintaining the same conversion accuracy for speaker individuality as the GMM-based algorithm.
September 26, 2003 Tanja Schultz Introduction into Multilingual Speech Recognition In this talk I will give an overview of our research activities in the area of Multilingual Speech Recognition, including multilingual acoustic modeling, rapid language adaptation, and new trends in using other-than-phonemes-as-recognition-units. I will also discuss challenges and solutions for implementing multilingual speech interfaces. With respect to a tighter coupling of speech recognition and translation, I will present an approach to correcting disfluencies in spontaneous speech. This presentation will be given as a video conference transmission from Germany.
September 16, 2003 Prof. Ron Cole Perceptive Animated Interfaces: A New Generation of Interactive Learning Tools Advances in human communication technologies are leading to a new generation of learning tools that engage students in natural face-to-face conversations with animated characters that behave like sensitive and effective teachers. These perceptive animated interfaces combine several emerging technologies (spoken dialogue interaction, language understanding, computer vision and character animation) to interpret student behaviors (mouse clicks, typed and spoken responses, facial expressions, gaze, hand and body gestures) and to infer the student's state of knowledge and cognitive state. The system uses this information in learning tasks to enable the animated character to respond to the student in real time using speech, facial expressions, head and eye movements and hand and body gestures. At the University of Colorado, research and development of perceptive animated interfaces occurs in the context of the Colorado Literacy Tutor, a comprehensive literacy program designed to improve reading and comprehension of text. My talk will present the vision of perceptive animated interfaces; demonstrate literacy tutors currently deployed in K-2 classrooms that use perceptive animated agents to teach foundational reading skills, fluent reading and comprehension of text; provide demonstrations of emerging technologies leading to the next generation of animated agents; and discuss key research challenges for the future.
August 8, 2003 Prof. Keiichi Tokuda An HMM-Based Approach to Speech Synthesis This talk summarizes our HMM-based approach to speech synthesis. It will be similar to the talk which I gave at Sphinx Lunch more than one year ago, but I will try to include not only the technical description but also recent results and demos. The basic idea of the approach is very simple: just train HMMs and generate speech directly from them. To realize such a speech synthesis system, however, we need some tricks, which will be presented in the talk. The attraction of the approach is that voice characteristics of synthesized speech can easily be changed by transforming HMM parameters. In fact, it is shown that we can change voice characteristics of synthetic speech by applying a speaker adaptation technique which has been used in speech recognition systems. The relation between the HMM-based approach and other unit selection approaches will also be discussed.
July 25, 2003 John Kominek & Alan W. Black ARCTIC -- A standard phonetically balanced speech database for speech synthesis research Although there are many standardized databases for evaluating speech recognition research, there are almost none for speech synthesis. In order to help address this issue, we have constructed new single-speaker phonetically balanced databases that will be distributed without restriction for use in evaluation, algorithm comparison and as baseline voices for the speech synthesis community. This talk will present the motivation behind providing these as well as a detailed description of how they were designed and recorded and how baseline voices are built. We will also present our intention for their use in the immediate future.
July 11, 2003 Fei Huang Named Entity Translations, Offline or Online? Named Entity (NE) translation is both semantically important and technically challenging, because OOV words occur frequently in person, location or organization names. High-accuracy NE translation can contribute to multilingual natural language processing tasks such as statistical machine translation, cross-lingual information retrieval and cross-lingual information extraction. In this talk I will present both offline and online methods for NE translation. In offline NE translation, starting from a bilingual corpus where NEs are automatically tagged for each language, NE pairs are aligned so as to minimize the overall multi-feature alignment cost. An NE transliteration model is presented and iteratively trained using named entity pairs extracted from a bilingual dictionary. The transliteration cost, combined with the named entity tagging cost and word-based translation cost, constitutes the multi-feature alignment cost. These features are derived from several information sources using unsupervised and partly supervised methods. A greedy search algorithm is applied to minimize the alignment cost. Experiments show that the proposed approach extracts NE translingual equivalences with 81% F-score and improves the translation score from 7.68 to 7.74. In online NE translation, a set of documents topically relevant to the translation hypothesis is retrieved, and the NEs in these documents are extracted and then matched against NEs in the source document being translated. The NE pairs with minimum transliteration cost are considered translation equivalents, which further increases the translation quality from 7.87 to 7.96.
June 27, 2003 Mikiko Mashimo Evaluation of cross-language voice conversion using bilingual and non-bilingual databases I will present our work that tests an extension of a single-language voice conversion technique to cross-language conversion. Voice conversion means converting the voice of one speaker to sound like that of another speaker, and it would be useful for many applications if extended across languages. To modify the source speech, we concentrated on reducing spectral and F0 differences between the speakers. The performance was investigated by objective and perceptual evaluation using data from Japanese-English bilingual speakers for training in a system based on a Gaussian mixture model and a high-quality vocoder. Results indicate that training with cross-language models also reduces the voice differences between the speaker pairs. Some demonstrations and future work are also included in the talk.
June 13, 2003 Antoine Raux Modeling the Prosody of Emphasis in English - A Unit Selection Approach In this presentation, I will describe a way to model the prosodic features of word emphasis for speech synthesis. A decision tree was built for phoneme duration, based on a database of spoken English utterances containing emphasized words. Further, the F0 curve of a synthesized utterance is obtained by selecting and concatenating portions of F0 curves from the database, based on their segmental and suprasegmental properties. The obtained F0 curve is then applied to the synthesized utterance. This method has, over parametric or rule-based models of F0, the same advantage that concatenative speech synthesis has over generative speech synthesis: it captures elements of natural prosody without requiring that they be analyzed and modeled explicitly and extensively. Using examples, I will illustrate the process of generating the F0 curve and compare this approach with a rule-based F0 model.
May 30, 2003 Prof. B. Yegnanarayana  In this talk I propose to discuss a method for manipulating the duration and pitch without significantly distorting the quality of speech. This is accomplished using the knowledge of epochs, which are the instants of significant excitation of the vocal tract system. The method to extract the epochs from speech is based on the properties of the group-delay functions. Naturalness is preserved since a significant portion of the excitation information is retained in the form of the Linear Prediction (LP) residual. The technique will be illustrated with a few audio demonstrations.
May 16, 2003 John Kominek Phoneme Segmentation for Unit Selection Synthesis As part of improved support for building unit selection voices, Festival now includes two algorithms for automatic labeling of wavefile data. The first method employs dynamic time warping to align a given wavefile against a known reference. This usually requires having a synthesizer already built for the target language -- a restriction to be averted if possible. The second, more recent addition makes use of the HMM-based acoustic modeling component from Sphinx-2. We have found that one technique is not clearly superior to the other but that the error characteristics are distinctly different. DTW is the more accurate method in 60-70% of cases, but is also more prone to gross labeling errors. Gross label errors are disastrous in a synthetic voice and need to be corrected before an acceptable voice can be constructed. This talk will illustrate these findings and indicate how a hybrid approach can eliminate such outliers without compromising overall accuracy. 
May 16, 2003 Brian Langner Language Generation and Synthesis for the Let's Go! Project I will discuss the spoken dialog system being created for the Let's Go! project, which deals with bus information for Pittsburgh-area buses. Specifically, I will discuss issues related to language generation and synthesis for this system. This work involves determining how stops are to be generated by the NLG, and designing and using a unit selection voice that has adequate coverage to say all of those stop names. 
May 2, 2003 Hua Yu Implicit Pronunciation Modeling for Conversational Speech Recognition Modeling pronunciation variation is key for recognizing conversational speech. Rather than being limited to dictionary modeling, we argue that triphone clustering is an integral part of pronunciation modeling. We propose a new approach called Enhanced Tree Clustering. This approach, in contrast to traditional decision tree based state tying, allows parameter sharing across phonemes. We show that accurate pronunciation modeling can be achieved through efficient parameter sharing in the acoustic model. Combined with a Single Pronunciation Dictionary, a 1.8% absolute word error rate improvement is achieved on Switchboard, a large vocabulary conversational speech recognition task. If time permits, I'll also talk about Gaussian Transition Models, a new approach that implicitly models trajectory information in order to overcome the frame-independence assumption in HMMs.
May 2, 2003 Michael Katzenmaier Focus of Attention based on Speech Recognizer Hypothesis  
April 18, 2003 Tina Bennett Using Acoustic Models to Choose Pronunciation Variations for Synthetic Voices Within-speaker pronunciation variation is a well-known phenomenon; however, attempting to capture and predict a speaker's choice of pronunciations has been mostly overlooked in the field of speech synthesis. We propose a method to utilize acoustic modeling techniques from speech recognition in order to detect a speaker's choice between full and reduced pronunciations. 
April 18, 2003 Jason Zhang Identifying Speakers in Children's Stories for Speech Synthesis Within the framework of rendering children's stories as synthetic speech, we are looking at deeper text analysis in order to better choose appropriate voices in synthesis. We examine how to automatically identify spoken text within a story and, by identifying the characters in the story, assign each quote to a particular character. The resulting marked-up story may then be rendered with a standard speech synthesizer with appropriate voices for the characters. This work presents each of the basic stages in identification and the algorithms, both rule-driven and data-driven, used to achieve this. A variety of story texts are used to test our system. Results are presented with a discussion of the limitations and recommendations on how to improve speaker assignment in more texts.
April 18, 2003 Arthur Toth An Iterative Technique for Segmenting Speech and Text Alignment Numerous speech-processing tasks use data in the form of a large audio file with an associated text transcript. In many cases, it is necessary to split the audio file into smaller segments for processing, and to make the corresponding divisions in the transcript. Some approaches to this problem have relied on additional information beyond the text which facilitates the segmentation. However, such information is often not readily available, and it would be useful to have a technique that could perform such segmentation and alignment without depending on it. We will describe an automatic technique that segments the audio file based solely on acoustic information, and then attempts to determine where these segments occur in the text. The discussion will include the results of testing this technique on a portion of the Boston University Radio Corpus. 
April 4, 2003 Pam (Sinaporn) Suebvisai Comparison of Thai tones articulation between Tonal language speakers and Non-tonal language speakers  
April 4, 2003 Kevyn Collins-Thompson   