
JSS-2011, 2012


Showing 8 items
Date: March 6, 2012
Speaker: Meg Mitchell
Title: Generating Descriptions of Visual Objects
Abstract: What do people describe when they look at objects? Can we model what they say? (Why does this matter?) This talk will characterize what makes up a visual description and define some of the methods necessary to automatically generate such language. Taking this a bit further, I describe an end-to-end prototype system that reads in computer vision output and generates natural language descriptions. Time permitting, I argue that improving visual descriptions can also improve computer vision, and that working on the interaction between the two may lead to advances in both computer vision and natural language generation. My prototype vision-to-language system, largely developed during the Hopkins summer workshop 2011 in collaboration with vision researchers at Stony Brook and language researchers at U. Maryland, is available at:
Date: February 17, 2012
Speaker: Jun Ohya
Title: Computer Vision, Computer Graphics, Virtual Reality and Merging these Technologies with Art
Abstract: This talk overviews our laboratory's research activities. Our areas include computer vision, computer graphics, virtual reality, and merging these component technologies with art. Building on these, more system-oriented projects such as virtual telecommunication environments, Cyber Theater, advanced digital libraries, robotics systems, human interface systems, and medicine/welfare applications are being researched. After overviewing the above-mentioned component technologies and system-oriented projects, a few projects related to computer vision are detailed: (1) a method for tracking multiple objects in time-sequential stereo images acquired by moving stereo cameras, and (2) recognizing sign-language vocabulary that uses both hand gestures and facial expressions in video sequences.
Date: October 25, 2011
Speaker: Martha Larson
Title: Beyond the User-Item Matrix: Recommendation Techniques for Social Settings
Abstract: Social information not only provides a rich source of information for improving recommendation, but also makes it possible to define new and interesting recommendation tasks. This talk covers a series of recent recommendation techniques developed at the Delft Multimedia Information Retrieval lab as part of an ongoing effort to improve recommendation by moving beyond the user-item matrix. Topics covered include context-aware recommendation, location recommendation, cross-domain recommendation, and trust-aware recommendation.
Date: September 6, 2011
Speaker: Daniel Povey
Title: Applications of weighted finite state transducers in a speech recognition toolkit
Abstract: The open-source speech recognition toolkit "Kaldi" uses weighted finite-state transducers (WFSTs) for training and decoding, and uses the OpenFst toolkit as a C++ library. I will give an informal overview of WFSTs and of the standard AT&T recipe for WFST-based decoding, and will mention some problems (in my opinion) with the basic recipe and how we addressed them while developing Kaldi. I will also describe how to use WFSTs to achieve "exact" lattice generation, in a sense that will be explained. This is an interesting application of WFSTs because, unlike most WFST mechanisms, it does not have any obvious non-WFST-based analog.
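As a rough illustration of the machinery behind such decoders: the core WFST operation is composition, which chains two transducers so the output symbols of the first feed the input symbols of the second, with weights combined in the tropical semiring (weights add along a path). The sketch below is a minimal epsilon-free composition in plain Python; the Arc representation and state-pair construction are illustrative assumptions, not Kaldi's or OpenFst's actual API.

```python
from collections import namedtuple

# An arc of a WFST: source state, input symbol, output symbol,
# weight (tropical semiring: weights add along a path), destination state.
Arc = namedtuple("Arc", "src isym osym weight dst")

def compose(arcs1, finals1, arcs2, finals2, start=(0, 0)):
    """Epsilon-free composition of two WFSTs via the state-pair construction.

    Returns the arcs and final states of the composed machine, whose
    states are pairs (state of machine 1, state of machine 2).
    """
    # Index the second machine's arcs by (source state, input symbol).
    out2 = {}
    for a in arcs2:
        out2.setdefault((a.src, a.isym), []).append(a)

    result, finals = [], set()
    seen, stack = {start}, [start]
    while stack:
        q1, q2 = stack.pop()
        # A pair state is final iff both component states are final.
        if q1 in finals1 and q2 in finals2:
            finals.add((q1, q2))
        for a1 in arcs1:
            if a1.src != q1:
                continue
            # Match machine 1's output symbol against machine 2's input symbol.
            for a2 in out2.get((q2, a1.osym), []):
                dst = (a1.dst, a2.dst)
                result.append(Arc((q1, q2), a1.isym, a2.osym,
                                  a1.weight + a2.weight, dst))
                if dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
    return result, finals

# Toy example: T1 maps "a" -> "b", T2 maps "b" -> "c";
# their composition maps "a" -> "c" with the weights summed.
arcs, finals = compose([Arc(0, "a", "b", 1.5, 1)], {1},
                       [Arc(0, "b", "c", 0.5, 1)], {1})
```

In a real decoder the same operation chains the HMM topology, context-dependency, lexicon, and grammar transducers; OpenFst additionally handles epsilon transitions and on-demand expansion, which this sketch omits.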
Date: June 17, 2011
Speaker: Timo Mertens
Title: Subword-based Pronunciation Modeling for Non-native Automatic Speech Recognition
Abstract: We propose a novel lexicon adaptation framework for non-native Automatic Speech Recognition based on linguistic subword knowledge. We exploit the linguistic subword structure of the target language to learn generalizable mispronunciation patterns from non-native speech. By arranging subword segmentations with varying amounts of context and different linguistic features in parse tables, we demonstrate how a statistical pronunciation model can be trained in both supervised and semi-supervised fashion. This model can then be used to predict a set of mispronunciations for unseen words based on subword mispronunciations learned from the parse tables. Compared to traditional phone-rewriting rules, parse tables model more context in the form of phone clusters or syllables, and encode abstract features such as word-internal position or syllable structure. Lexicon adaptation by itself results in word error rate reductions of up to 7.9% and 3.3% absolute on Italian- and German-accented English, respectively. In combination with acoustic model adaptation, improvements of up to 12.4% and 11.3% absolute were achieved.
Date: April 1, 2011
Speaker: Tomi Kinnunen
Title: Robust MFCC extraction for text-independent speaker verification
Abstract: In speech and audio applications, the short-term signal spectrum is often represented using mel-frequency cepstral coefficients (MFCCs) computed from a windowed discrete Fourier transform (DFT). This method of computing the MFCCs has a few well-known shortcomings, such as sensitivity to variations in environment and channel. In this presentation, I present some recent work done in our group in the context of text-independent speaker verification. I will discuss two alternative methods to compute the MFCCs: the first based on parametric temporally weighted linear predictive models, and the second based on a non-parametric multiple-window (or multitaper) technique. I will present experimental results under environmental and channel degradations on the NIST 2002 and NIST 2008 SRE corpora. The proposed MFCC computation methods are shown to be generally more robust against signal degradations than the baseline methods.
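For readers unfamiliar with the baseline pipeline the talk builds on, the standard windowed-DFT MFCC front end can be sketched in plain NumPy: frame the signal, window it, take the power spectrum, pool it through a triangular mel filterbank, and apply a DCT to the log energies. The parameter values here (16 kHz audio, 25 ms frames with 10 ms hop, 26 filters, 13 cepstra) are common defaults, not taken from the talk, and the talk's two alternative front ends replace exactly this windowing step.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160,
         n_mels=26, n_ceps=13):
    """Baseline MFCCs: Hamming-windowed DFT -> mel filterbank -> log -> DCT."""
    # Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # Power spectrum of each windowed frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Triangular filters spaced uniformly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge

    # Log filterbank energies (small floor avoids log(0)).
    logmel = np.log(power @ fbank.T + 1e-10)

    # DCT-II decorrelates the log energies into cepstral coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  (2 * n + 1) / (2.0 * n_mels)))
    return logmel @ dct.T
```

The channel sensitivity the abstract mentions enters through the single tapered window: a multitaper front end would average the power spectra of several orthogonal windows at the step where `np.hamming` appears above.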
Date: March 25, 2011
Speaker: David Suendermann
Title: Deployed Spoken Dialog Systems' Alpha and Omega: Adaptation and Optimization
Abstract: Commercial spoken dialog systems process billions of calls every year, producing immense cost savings for the customer care industry. As the bottom line of these systems is tied to the applications' performance, optimization and adaptation to the ever-changing caller are crucial topics in the dialog system industry. This talk will cover the two main system components subject to adaptation and optimization:
Date: February 21, 2011
Speaker: Anton Batliner
Title: Featuring Emotion
Abstract: Batliner spoke on the features used for emotion processing, by giving both an overview and demonstrating characteristic approaches with exemplary studies. He also presented the new INTERSPEECH 2011 Speaker State Challenge.