JSS-2010

December 3, 2010 | Stefan Steidl | Vocal Emotion Recognition: State-of-the-Art in Classification of Real-Life Emotion
For better human-computer interaction, systems should be able to recognize the emotional state of the user. In order to obtain high recognition rates, it is important to train the system with realistic data. However, vocal emotion recognition is still mainly based on acted emotions. The FAU Aibo Emotion Corpus is one of the few corpora of naturally occurring emotion-related states that are available for scientific research and are large enough for machine classification. In my talk, I will describe the corpus and present results on the relevance of different feature types. Furthermore, I will show how state-of-the-art recognition performance degrades from acted emotions, through naturally occurring states, to an open-microphone setting as used in the INTERSPEECH 2009 Emotion Challenge.

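As a rough illustration of the feature-based approach behind such systems, the minimal sketch below computes utterance-level statistical functionals over frame-level descriptors and trains a simple classifier. The frame features, labels, and feature dimensions are synthetic placeholders; this is not the FAU Aibo or Emotion Challenge pipeline.

```python
import numpy as np
from sklearn.svm import SVC

def functionals(frames):
    """Map a (num_frames, num_llds) matrix of frame-level low-level
    descriptors (e.g. MFCCs, pitch, energy) to one fixed-length
    utterance vector of statistical functionals."""
    return np.concatenate([
        frames.mean(axis=0),
        frames.std(axis=0),
        frames.min(axis=0),
        frames.max(axis=0),
    ])

# Synthetic stand-ins for per-utterance frame features and emotion labels.
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(rng.integers(80, 200), 16)) for _ in range(40)]
labels = rng.integers(0, 2, size=40)          # e.g. negative vs. idle

X = np.stack([functionals(u) for u in utterances])
clf = SVC(kernel="linear").fit(X, labels)     # train on utterance-level vectors
print(clf.predict(X[:5]))
```
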
August 2, 2010 | Patrick Nguyen | Speech Recognition with Segmental Conditional Random Fields: News from the JHU Workshop 2010
Segmental Conditional Random Fields (SCRFs) relax the frame-level Markov assumption to a word-level Markov assumption. They provide a flexible framework for integrating heterogeneous word-level information, and new kinds of processing, heretofore impractical to shoe-horn into HMMs, become possible. At the JHU Summer Workshop 2010, a group of international researchers is exploring new approaches based on coherent demodulation, deep belief networks, DTW templates, long-range neural networks, point-process models and more. In this talk I will introduce SCRFs and Flat Direct Models and then give a brief overview of results observed during the workshop.

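A minimal sketch of the segmental idea described in the abstract: hypotheses are scored with word-level (whole-segment) feature functions rather than frame-level ones, and decoding searches over segmentations with a word-level Markov assumption. The feature functions, vocabulary, and weights here are illustrative placeholders, not those used at the workshop.

```python
import numpy as np

def segment_features(obs, start, end, word):
    """Illustrative word-level feature functions over a whole segment
    obs[start:end] hypothesised as `word`."""
    seg = obs[start:end]
    return np.array([
        seg.mean(),            # e.g. a detector score averaged over the segment
        end - start,           # segment duration
        float(word == "yes"),  # word identity feature
    ])

def best_segmentation(obs, vocab, weights, max_len=10):
    """Viterbi-style DP over segment end points: each step extends the best
    path by one whole word segment (word-level Markov assumption)."""
    T = len(obs)
    best = np.full(T + 1, -np.inf)
    best[0], back = 0.0, {}
    for t in range(1, T + 1):
        for s in range(max(0, t - max_len), t):
            for w in vocab:
                score = best[s] + weights @ segment_features(obs, s, t, w)
                if score > best[t]:
                    best[t], back[t] = score, (s, w)
    # Trace back the winning sequence of (word, start, end) segments.
    segs, t = [], T
    while t > 0:
        s, w = back[t]
        segs.append((w, s, t))
        t = s
    return list(reversed(segs))

obs = np.concatenate([np.ones(5), np.zeros(4)])   # toy detector stream
print(best_segmentation(obs, ["yes", "no"], np.array([2.0, 0.1, 0.5])))
```
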
June 21, 2010 | Björn Schuller | Emotion in Speech - A Look Around the Corner
Everyday speech is emotional and rich in social signals such as laughter. Future spoken dialogue and speech retrieval systems will therefore also have to look at "this side", be it to raise the robustness of automatic speech recognition and spoken language understanding or to integrate soft skills into tomorrow's human-robot and human-machine communication. As the field of emotion recognition has considerably matured and grown over the last decade, we may now start to tackle full realism in processing: non-acted, non-elicited, fully spontaneous and natural emotion, without pre-selection of high inter-labeler agreement instances or verbal limitations, analysed from non-segmented, non-transcribed and potentially noisy speech. In addition, optimal system integration becomes relevant as we increasingly build real-life applications: on-line and incremental prediction, confidences, feature and emotion encoding standards, and synergistic yet flexible fusion with efficient or distributed analysis will become more and more important, as will acoustic, linguistic and emotional speaker profiling and studies with working prototype engines that have users or target speakers in the loop. In these respects, this talk provides an overview of recent results and trends in the analysis of emotional aspects and non-linguistic vocalisations in human speech, following the complete chain of processing from input to output and indicating future challenges and directions.

May 7, 2010 | Alexander Schmitt | LET'S GO Witchcraft: Mining and Model Testing with the Witchcraft Workbench and the CMU LET'S GO Corpus
This talk will report on advances in the Witchcraft project. Witchcraft is a new platform-independent open-source workbench designed for the analysis, mining and management of large spoken dialogue system corpora. What makes Witchcraft unique is its ability to visualize the effect of classification and prediction models on ongoing system-user interactions. This allows us to simulate the deployment of any kind of statistical prediction model on ongoing interactions, e.g. in emotion, age and gender recognition tasks. Witchcraft can handle predictions from both discriminative classifiers and regression models. The workbench particularly targets Interactive Voice Response (IVR) systems and the management of large corpora from telephone applications, but is sufficiently general to cover all kinds of SDS corpora. We adapted the CMU LET'S GO corpus to demonstrate Witchcraft.

May 7, 2010 | Tim Polzehl | Emotion and Personality Classification from Speech
This talk presents advances in user-state classification, e.g. emotion and personality classification. In 2009, the first emotion recognition benchmark was organized and held at the INTERSPEECH 2009 conference. Starting from the latest results in automatic emotion classification, we outline general approaches from the conference and recent results from T-Labs. We estimate general performance figures on an anger recognition task for IVR data. Going beyond emotion, we introduce preliminary results on user personality classification. First results from experiments under laboratory conditions outline the general applicability and the challenges of classifying users according to the "Big 5" personality traits.

April 26, 2010 | Ivan Tashev | Sound Capture and Processing for Telecommunications and Speech Recognition
The talk will present technologies for speech enhancement and microphone array processing designed at Microsoft Research. They are used in devices and applications such as Windows Live Messenger and Office Communicator, in-car infotainment systems (see Ford Sync at www.syncmyride.com), and voice control and communications (see Project Natal for Xbox at http://www.xbox.com/en-AU/live/projectnatal/). The talk is illustrated with several demos of the presented technologies.

February 26, 2010 | Hagen Soltau | Dynamic Network Decoding Revisited
We present a dynamic network decoder capable of using large cross-word context models and large n-gram histories. Our method for constructing the search network is designed to process large cross-word context models very efficiently, and we optimize the search network to minimize run-time overhead for the dynamic network decoder. The search procedure uses the full LM history for lookahead, and path recombination is done as early as possible. In a systematic comparison with a static FSM-based decoder, we find that the dynamic decoder runs at a speed comparable to the static decoder when large language models are used, while the static decoder performs best for small language models. We discuss the use of very large vocabularies of up to 2.5 million words for both decoding approaches and analyze the effect of weak acoustic models for pruning.

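A toy sketch of the early-recombination idea mentioned in the abstract (assumed data structures, not the decoder's actual implementation): active hypotheses are merged as soon as they share the same search-network node and the same full language-model history, keeping only the best score, since such paths can never diverge again.

```python
from collections import namedtuple

# A decoding hypothesis: network node, full LM word history, accumulated score.
Hyp = namedtuple("Hyp", "node history score")

def recombine(hyps):
    """Merge hypotheses with the same (network node, full LM history) key,
    keeping only the best-scoring one."""
    best = {}
    for h in hyps:
        key = (h.node, h.history)
        if key not in best or h.score > best[key].score:
            best[key] = h
    return list(best.values())

hyps = [
    Hyp(node=42, history=("the", "cat"), score=-12.3),
    Hyp(node=42, history=("the", "cat"), score=-11.7),  # survives recombination
    Hyp(node=42, history=("a", "cat"),   score=-11.9),  # different history, kept
]
print(recombine(hyps))
```
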
February 23, 2010 | Kenichi Kumatani | Beamforming with Super-Gaussian Criteria for Distant Speech Recognition
Microphone array techniques for hands-free speech recognition can relieve users of the necessity of donning close-talking microphones (CTMs) before dictating or otherwise interacting with automatic speech recognition (ASR) systems. Far-field speech recognition systems are used in many applications such as humanoid robots, voice control systems for automobiles, and automatic speech annotation systems in meetings. A main problem in far-field speech recognition is that recognition performance degrades seriously when the speaker is far from the microphones; noise and reverberation must be removed from the signals received at the microphones in order to avoid this degradation. Acoustic beamforming techniques have the potential to enhance speech from the far field with little distortion. In this talk, I review conventional techniques and then present new microphone array processing methods for distant speech recognition:
* Beamforming based on maximum super-Gaussian criteria. Distant speech can be enhanced by these techniques without the signal cancellation problem encountered in conventional adaptive beamforming algorithms. Maximum negentropy and maximum kurtosis criteria are considered in this talk.
* Minimum mutual information (MMI) beamforming. It can separate sound sources without the signal cancellation problem and does not suffer from the problems seen in BSS techniques.
* Filter bank design for adaptive beamforming. Undesired aliasing effects caused by adaptive processing can be alleviated by the new filter bank system.
The effectiveness of the new algorithms is demonstrated through a series of ASR experiments on data captured with real sensors in realistic acoustic environments and spoken by real speakers. Speech samples processed with these new techniques are also played. Moreover, I will discuss strategies for real-time operation of these beamforming algorithms and show recent progress.

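A minimal numerical sketch of the maximum-kurtosis criterion named above (illustrative only; the methods in the talk operate on subband samples after a filter bank, not raw time-domain signals): the beamformer output is a weighted sum of the microphone channels, and the empirical excess kurtosis of that output serves as the super-Gaussian objective, since clean speech is more super-Gaussian than noisy mixtures.

```python
import numpy as np

def excess_kurtosis(x):
    """Empirical excess kurtosis; speech is super-Gaussian, so larger
    values indicate an output dominated by the desired speech."""
    x = x - x.mean()
    return np.mean(x**4) / (np.mean(x**2) ** 2) - 3.0

def beamform(channels, weights):
    """Simple weighted-sum beamformer across microphone channels
    (channels: shape (num_mics, num_samples))."""
    return weights @ channels

# Toy two-microphone scene: a super-Gaussian "speech" source plus Gaussian noise.
rng = np.random.default_rng(1)
speech = rng.laplace(size=16000)
noise = rng.normal(size=(2, 16000))
channels = np.vstack([speech, speech]) + 0.5 * noise

# Evaluate the criterion for two candidate weight vectors; a real system
# would search for the weights maximising it under a distortionless constraint.
for w in (np.array([1.0, 0.0]), np.array([0.5, 0.5])):
    print(w, excess_kurtosis(beamform(channels, w)))
```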