Past Seminars: JSS-2004
Date | Speaker | Title | Abstract |
---|---|---|---|
November 19, 2004 | Antoine Raux | ICSLP Paper Review | |
November 12, 2004 | Shirin Saleem & John Kominek | ICSLP Paper Review | |
November 5, 2004 | Alan Black & Satanjeev Banerjee | ICSLP Paper Review | |
October 29, 2004 | Tanja Schultz | ICSLP Paper Review | |
October 15, 2004 | Arthur Toth | | |
October 15, 2004 | Wilson Tam | | |
September 17, 2004 | Woosung Kim, JHU | Language Model Adaptation for Automatic Speech Recognition and Statistical Machine Translation | Language modeling is crucial in many natural language applications such as automatic speech recognition and machine translation. Due to the complexity of natural language grammars, statistical techniques have been dominant for language modeling. All statistical modeling techniques, in principle, work under two conditions: 1) a reasonable amount of training data is available, and 2) the training data comes from the same or a similar population as the test data to which we eventually want to apply our model. This talk presents new methods---language model adaptation---for handling the case when those conditions are not met. To tackle the data scarcity problem in resource-deficient languages, we propose methods that take advantage of a resource-rich language such as English, using cross-lingual information retrieval followed by machine translation. We then apply the language model adaptation technique to a resource-rich language, English, and to a different application, statistical machine translation. Experimental results show that our adaptation techniques are effective for statistical machine translation as well as speech recognition. |
September 3, 2004 | Szu-Chen (Stan) Jou | Adaptation for Soft Whisper Recognition Using a Throat Microphone | In this talk, we report our work on recognizing soft whispery speech which is recorded with a throat microphone. The goal is to provide a noise-robust way to communicate privately. Our approach applies various adaptation methods to this task. Since the amount of adaptation data is small and the testing data is very different from the training data, a series of adaptation methods is necessary. We will discuss our combination of adaptation methods, which include maximum likelihood linear regression, feature-space adaptation, and re-training with downsampling, sigmoidal low-pass filter, or linear multivariate regression. |
September 3, 2004 | Yunghui Li | Applying Articulatory Features in ASR for non-native Speech | In this talk we will present the E-language learning system ELLS, a joint project of the US Department of Education and the Chinese Ministry of Education. After briefly introducing the internet-based learning environment, we will characterize the non-native speech we face in this project and then describe our experiments with articulatory features (AFs). By combining the AFs with traditional phoneme models, enhancing the context-free grammar of the recognizer, and performing supervised and unsupervised adaptation schemes, we improved speech recognition performance on non-native, low-proficiency speech. |
August 25, 2004 | Christian Fügen | Decoding along Context Free Grammars with Ibis / Language Modeling experiences of the past evaluations | |
August 25, 2004 | Florian Metze | The Interface between ASR and Summarization/MT | |
August 20, 2004 | Tina Bennett | Initial Investigations in Evaluation of Speech Synthesis Evaluation | Evaluation of speech synthesis systems is a challenging problem. There are several commonly used methods for evaluation, but it is sometimes unclear how to best utilize them. We seek to learn how to conduct evaluations such that they are straightforward and their results are maximally useful for learning about users' preferences and making improvements to our systems. A small evaluation was conducted to explore this topic. Volunteers with varying levels of experience in listening to synthetic speech were included. Each participant listened to audio samples from three distinct systems in tests devised for four genres: news, novels, conversation, and semantically unpredictable sentences (SUS). Mean opinion scores were used in the first three tests, whereas a "type what you hear" format was used for the SUS test. Results and insights from the evaluation will be discussed. |
August 20, 2004 | Brian Langner | Modifying synthetic voices to improve understanding of speech in noise | In this talk, we discuss modifying synthetic voices to produce speech more like speech in noise, with the goal of improving understandability in noisy conditions. Speech in noise has several differences from normal speech, including spectral, durational, and pitch differences, which are taken into account by the voice conversion process used to modify existing synthetic voices. We describe the voice conversion process, as well as the methods used to obtain clean speech in noise and build a synthetic voice from it. We also discuss the results of several listening tests using modified and non-modified voices. |
July 30, 2004 | Prof. P.N. Girija, University of Hyderabad | Duration in different types of Telugu sentences with various emotions/attitudes and Role of Intonation in Telugu emotional/attitudinal sentences | Duration is one of the main features required to develop any emotional speech synthesis system. To explore the characteristic features of Telugu, the duration of vowels is studied under the following contexts: phonetic contrasts, intrinsic duration, influence of adjacent sounds, positional variation, and syllable structure. Duration variations in different emotions and attitudes of declarative, interrogative, and imperative sentences have also been studied. Language is probably one of the most important tools that distinguish human beings from other animals, and speech is the most important medium of language. Speech sounds consist of segmental and non-segmental features. The non-segmental features include paralinguistic features such as vocal effects and prosodic features. Among the prosodic features, intonation features such as pitch, pitch level, pitch movement, pitch range, and loudness have been studied for different emotions/attitudes in different types of sentences. |
July 2, 2004 | David Reitter, MIT | Aspects of a situationalized multimodal human-computer interface | It's a common perception that some meetings are more effective than others. Those meetings that involve the physical presence of participants allow them to rely on multiple communication channels (multimodality), among them natural language, eye gaze, and body posture. When channels are missing, such as in a conference call, communicative elements such as topic tracking (coherence) and turn-taking behavior become harder to manage. This is equally true in user interfaces: when communication is restricted to a single channel that is limited in bandwidth, noisy, or otherwise error-prone, humans encounter difficulties - for example when they use a small-screen computer interface or a voice-based dialogue system over the phone. Humans can usually integrate multimodal information without effort, which leads us to ask: can multimodality improve language-based interaction on bottleneck devices? The UI on the Fly project explores ways to do this. While today's user interfaces make different input and output methods available (mouse and keyboard, screen and speakers), our interfaces go beyond that. They ensure cross-channel coordination for both input and output, so the communication channels can be used in parallel. These interfaces convey not just redundant, but also complementary information. For example, they can augment a graphical user interface (GUI) with helpful audio commentary. In mobile situations, screen-based output may be simplified, or eliminated entirely, in response to a specific use situation, e.g. when driving. Similarly, the system can adapt to the needs of hard-of-hearing or visually impaired users. We address the adaptivity of the user interface with a dynamic generator. Multimodal Functional Unification Grammar (MUG) is a unification-based formalism (based on FUF, Elhadad & Robin 1992) that provides the means to dynamically generate content that is coordinated across several communication modes, which currently include natural language on the screen and by voice, and GUI elements. The interface can adapt the content presented in each mode to the user's preferences and usage situation. The generation process satisfies the hard constraints defined by the dialogue act (input) and the grammar, and finds the optimal (or a near-optimal) solution according to an objective function. This heuristic defines the trade-off between the predicted cognitive complexity of the output and its utility, following classical communication principles (cf. Grice's maxims). This way, the system can select from among several possible output forms generated by the grammar. The input to the formalism is a compositional semantic representation, which suggests it is suitable for principle-based dialogue management rather than hard-coded finite-state dialogue models. MUG aims to establish cross-modal coherence of output: we postulate that a user interface should be consistent, but not entirely redundant, in its simultaneous, mode-specific outputs. That means coordinating output on the lexical level and, to some extent, on the syntactic level. Dialogue acts should be coordinated, too. Emphasized screen objects - similar to pointing gestures in input - and deictic expressions influence and depend on the salience of objects. We adopt Centering theory (Grosz, Joshi, Weinstein 1995) as a framework to model coherence in the generation of referring expressions. The concept is embodied in the MUG System, a new development environment that makes it easier to develop, debug, and study unification grammars for generation. |
May 28, 2004 | Arthur Toth | Using Dynamic Bayesian Networks to Model Speech: Survey and Preliminary Experiments | This talk will cover some background information about a type of graphical model called a Dynamic Bayesian Network, which is a generalization of the Hidden Markov Model. I will describe some of the useful algorithms for Dynamic Bayesian Networks, and some of the attempts researchers have made to apply them to modeling speech. Then I will describe a few preliminary experiments I have conducted and their results. |
May 25, 2004 | Dr. Shinji Watanabe, NTT Laboratories | Variational Bayesian Estimation and Clustering for speech recognition (VBEC) | In this talk, I will introduce our proposed framework, Variational Bayesian Estimation and Clustering for speech recognition (VBEC), which is based on the Variational Bayesian (VB) approach. VBEC is a total Bayesian framework, i.e., all speech recognition procedures (acoustic modeling and speech classification) are based on VB posterior distributions, unlike the Maximum Likelihood (ML) approach, which is based on ML parameters. The total Bayesian framework yields two major Bayesian advantages over the ML approach for mitigating over-training effects: it can select an appropriate model structure without any data set size condition, and it can classify categories robustly using a predictive posterior distribution. By using these advantages, VBEC (1) allows the automatic construction of acoustic models along two separate dimensions, namely clustering triphone Hidden Markov Model states and determining the number of Gaussians, and (2) enables robust speech classification based on Bayesian Predictive Classification using VB posterior distributions (VB-BPC). The capabilities of the VBEC functions were confirmed in large vocabulary continuous speech recognition experiments for read and spontaneous speech tasks. The experiments confirmed that VBEC automatically constructed accurate acoustic models and robustly classified speech, i.e., it totally mitigated the over-training effects and achieved high word accuracy thanks to the VBEC functions. |
May 14, 2004 | Prof. Nam Soo Kim, Seoul National University | Robust Speech Preprocessing for Mobile Applications | These days, mobile communication has become an indispensable part of our daily lives and provides a variety of services. Diversified mobile applications have become available due to recent progress in speech processing technologies such as speech coding, recognition, and synthesis. However, since there are many factors that deteriorate speech quality, more robust approaches should be considered when designing a speech application for mobile communication. In this talk, I present three techniques that have been developed by our research group to reduce the effects of background noise and speech coder distortion in mobile communication. The first is a soft-decision-based spectral enhancement algorithm that suppresses the noise components and produces a cleaned speech waveform. The second is a feature compensation technique, referred to as the interacting multiple model (IMM) approach, applied to achieve robust speech recognition. The last is a preprocessor that modifies the signal before it is applied as input to a low-bit-rate speech coder. Our research has mainly focused on speech preprocessing algorithms, which require minimal modification to the system. |
April 30, 2004 | Jonathan Brown | Residual Effects of Russian as a Native Language | The work presented in this talk details the residual effects of Russian as a native language for a particular speaker. Because the speaker does not have a stereotypical Russian accent, I detail the elements of her speech which are different from that of a native English speaker and attempt to explain the differences in terms of her first language. This work was done for the final project for the first half of Speech II. |
April 30, 2004 | Shirin Saleem | Role of native tongue and medium of instruction in school on 'Indian English' | This talk describes the influence of two factors - native tongue and medium of instruction in school - on the pronunciation of English words by native Indian speakers. Eight speakers of two Indian languages, Bengali and Kannada, were compared on the basis of the above two factors, and the major patterns in pronunciation were noted. This project was done for the Speech II class. |
April 16, 2004 | Dr. Chiori Hori | Spoken Interactive Open-domain QA system | Human-machine dialog systems using a speech interface have been intensively researched in the field of spoken language processing (SLP). Such conversational dialogs that exchange information through question answering (QA) are a natural communication modality. However, state-of-the-art dialog systems only operate for specific-domain question answering (SDQA) dialogs. To achieve more natural communication between human beings and machines, spoken dialog systems for open domains are necessary; specifically, open-domain question answering (ODQA) is an important function in natural communication. Our goal is to construct a spoken interactive ODQA system, which includes an ASR system and an ODQA system. Two main issues need to be addressed to construct spoken interactive ODQA systems: 1. The spoken QA problem: recognition errors due to out-of-vocabulary words degrade the performance of QA systems, because some information indispensable for extracting answers is deleted or substituted by other words. To cope with this problem, a real-time recognizer with a 1.8-million-word vocabulary is constructed using a weighted finite-state transducer (WFST), with an approximation of on-the-fly composition applied during decoding. 2. The interactive ODQA problem: since users' questions are not restricted, system queries for additional information to extract answers, and effective interaction strategies using such queries, cannot be prepared before the user inputs the question. To acquire additional information based on the user's question and the retrieved corpus, the system asks disambiguating queries, generated by combining interrogatives with phrases from the user's question as query modifiers. |
March 12, 2004 | Paisarn Charoenpornsawat | Feature-based Thai Word Segmentation | Word segmentation is a problem in several Asian languages that have no explicit word boundary delimiter, e.g. Chinese, Japanese, Korean, and Thai. We propose to use feature-based approaches for Thai word segmentation. A feature can be anything that tests for specific information in the context around the word in question, such as context words and collocations. To automatically extract such features from a training corpus, we employ two learning algorithms, namely RIPPER and Winnow. Experimental results show that both algorithms appear to outperform the existing Thai word segmentation methods, especially for context-dependent strings. (A toy sketch of a Winnow-style boundary classifier follows after the table.) |
January 16, 2004 | Florian Metze | The ISL RT-03 Switchboard System | This talk describes the ISL large vocabulary conversational telephony speech recognition system, which was tested in NIST's RT-03 Switchboard evaluation. We present our experiments on improving preprocessing, acoustic modelling, and language modelling. The system features phone-dependent semi-tied full covariances, semi-tied clustering of septa-phones, clustering across phones, feature adaptive training, robust estimation of VTLN and MLLR, as well as context-dependent interpolation of language models. We present detailed results for each stage of our multi-pass transcription scheme. System development started in 2002 with a word error rate of 35.1% on our internal 1h development set. The final system performed at 21.8%, a 38% relative improvement. The error rate on the RT-03 CTS evaluation set is 23.4%. |
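
As a quick illustration of the error-rate arithmetic quoted in the RT-03 abstract above, the minimal sketch below computes word error rate as edit distance over reference words and checks the quoted 35.1% to 21.8% relative improvement. The example hypothesis and reference strings are invented for illustration and are not taken from the evaluation data.

```python
# Minimal word-error-rate (WER) illustration: WER is the Levenshtein
# (edit) distance between hypothesis and reference word sequences,
# divided by the number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Invented example, not from the Switchboard data:
print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167

# Relative improvement as quoted in the abstract:
# (35.1 - 21.8) / 35.1 ~= 0.379, i.e. roughly a 38% relative reduction.
print((35.1 - 21.8) / 35.1)
```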
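For the feature-based Thai word segmentation abstract above: Winnow is a mistake-driven linear classifier with multiplicative weight updates over active features. The sketch below is only a toy illustration of that update applied to a binary word-boundary decision; the feature names, training examples, and hyperparameters are invented and do not reproduce the authors' implementation or feature set.

```python
# Toy Winnow classifier for a binary "is there a word boundary here?"
# decision over binary context features. Feature names and training
# data are invented for illustration only.
from collections import defaultdict

class Winnow:
    def __init__(self, threshold: float = 1.0, alpha: float = 2.0):
        self.threshold = threshold      # decision threshold
        self.alpha = alpha              # promotion/demotion factor
        self.weights = defaultdict(lambda: 1.0)  # weights start at 1

    def predict(self, features: set) -> bool:
        score = sum(self.weights[f] for f in features)
        return score >= self.threshold

    def update(self, features: set, label: bool) -> None:
        if self.predict(features) == label:
            return
        # Promote active features on a missed boundary,
        # demote them on a false alarm.
        factor = self.alpha if label else 1.0 / self.alpha
        for f in features:
            self.weights[f] *= factor

# Invented context features around a candidate boundary position:
train = [
    ({"prev=maa", "next=gin", "prev_is_known_word"}, True),
    ({"prev=ma", "next=a", "inside_known_word"}, False),
]
clf = Winnow()
for _ in range(5):                      # a few passes over the toy data
    for feats, label in train:
        clf.update(feats, label)
print(clf.predict({"prev=maa", "next=gin", "prev_is_known_word"}))
```

Because promotion and demotion touch only the features active in a misclassified example, this style of update scales to the very large, sparse feature spaces (context words, collocations) mentioned in the abstract.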