Advanced Speech Technology Symposium
Sponsored by AVIOS
In addition to the 25th anniversary celebration put on by the AVIOS board, AVIOS also organized the Advanced Speech Technology Symposium at SpeechTEK in New York on August 20th and 21st, 2007.
The program is summarized below:
C101 - Advances in Speech Recognition Processing
10:15 a.m. – 11:15 a.m.
MODERATOR: Thomas Schalk, Vice President, Voice Technology - ATX Group
Advances and improvement in core speech recognition technology are difficult to demonstrate, since accuracy is strongly dependent on application, particular speakers, background noise, and other variables. Beyond accuracy, speech recognition technology can be improved by better handling of complex or “natural” dialogs. Audio channels and speech platforms are important components of today’s speech applications. In this session, speakers explore the advances in core speech technology, audio channel processing, and speech platform integration and go behind the scenes of Vista to expose interesting aspects of the integration of speech technology.
Speech Technology in Vista
Fil Alleva, General Manager, Speech - Microsoft, Inc.
Windows Speech Recognition (WSR) in Vista is a practical solution for speech-enabled access to Windows-based PCs for users who find keyboard and mouse interfaces to be less productive than they would like. The technology behind WSR combines automated personalization, the Microsoft Speech Recognizer, SAPI 5.3, the accessibility framework, the text services framework, and Windows Desktop Search to deliver the Windows Speech user experience.
Speech Processing for DSR Versus NSR
Veeru Ramaswamy, Chief Technology Officer - Vianix
There are two methods for compressing and transmitting digital speech for server-based automatic speech recognition. Distributed Speech Recognition (DSR) schemes gained popularity in the late 1990s due to limited data channel bandwidth availability. The evolution of higher-bandwidth channels and advances in voice compression now allow Network Speech Recognition (NSR) applications to achieve the speech recognition accuracy of DSR in similar bandwidth and provide additional benefits. This presentation compares voice-based NSR with feature-based DSR recognition schemes.
C102 - Advances in Text-to-Speech Processing
11:30 a.m. – 12:30 p.m.
MODERATOR: Thomas Schalk, Vice President, Voice Technology - ATX Group
Text-to-speech synthesis is getting better and more flexible, and is now used globally in a wide spectrum of speech applications. Advances in standards have improved text-to-speech quality. The Speech Synthesis Markup Language (SSML) provides a standard way to control speech synthesis and text processing parameters. The Pronunciation Lexicon Specification (PLS) is designed to enable interoperable specification of pronunciation information. This session reviews some much-needed clarifications about how text in multiple languages should be annotated and describes work being done to link SSML and PLS more seamlessly.
The Internationalization of the W3C Speech Synthesis Markup Language
Daniel Burnett, Speech Standards Lead Engineer - Nuance Communications, Inc.
In SSML, how do you mark tones, or use pinyin for pronunciation, or indicate a change in language but not a change in voice? Learn about the changes in SSML that provide improved support for Mandarin, Cantonese, Japanese, Hindi, and other world languages. This session also explains multi-language annotation and how to link with PLS.
Applying the Pronunciation Lexicon Specification to ASR & TTS
Patrizio Bergallo, Senior System Architect - Loquendo
Many speech applications demonstrate the need to define the pronunciation of certain words (for instance proper names, locations, etc.) or to expand acronyms/abbreviations, both for ASR and TTS usage. This presentation describes the W3C PLS (Pronunciation Lexicon Specification) that defines lexicon documents to be referenced by SRGS grammars and SSML prompts.
C103 - Advances in Natural Language Processing
1:45 p.m. – 2:45 p.m.
MODERATOR: Thomas Schalk, Vice President, Voice Technology - ATX Group
The demand for natural language has reached an all-time high as directed dialog applications continue to be criticized for being inefficient and not flexible enough. There is little dispute that out-of-grammar handling is generally poor when an active grammar is large. In-grammar accuracy for extensive vocabularies has been achieved by using large amounts of speech data to extract statistical information to represent acoustical units. Likewise, statistical approaches have been applied to advance natural language understanding. Most recently, statistical approaches are being applied to voice interface design with the goal of improving user experience. This session reveals some exciting advances in natural language that will affect the future of the user experience.
Creating More Natural Language Interfaces Using Robust Parsing
Krishna Govindarajan, Speech Science Global Discipline Leader, Professional Services - Nuance Communications, Inc.
For current state-of-the-art speech recognition systems, in-grammar accuracy is quite good, especially for directed-dialog systems. However, because callers vary in how they respond, a portion of the utterances are not covered by the grammar, i.e., they are out-of-grammar (OOG). OOGs affect the “perceived” accuracy of the system and are one of the primary items addressed during tuning. This presentation discusses the concepts of “near OOGs,” “far OOGs,” and related concepts.
No Data Like More Data: Experimental Voice User Interface in Action
Roberto Pieraccini, Chief Technology Officer - SpeechCycle, Inc.
Jonathan Bloom, Senior VUI Designer - SpeechCycle, Inc.
Today we are extending the data exploitation paradigm to voice user interface (VUI) design. Statistics and machine-learning sciences are now complementing the art of designing the best prompts and interaction strategies with the goal of optimizing automation and improving user experience. Using a few case studies, this presentation shows how to “experimentally” choose among competing VUI designs without disrupting the user experience while optimizing global indicators of performance.
C104 - Speech-to-Speech Translation
3:00 p.m. – 4:00 p.m.
MODERATOR: Bill Scholz, President - NewSpeech LLC
Recent innovative integration of recognition and synthesis technology has led to the realization of fully automatic speech-to-speech translation. This session explores the latest techniques for implementing automated language translation and considers the technology behind the integration: how to manage out-of-grammar responses, the effects of using robust parsing versus SLMs, and incorporating an open source speech analytics solution called Unstructured Information Management Architecture (UIMA).
Speech-to-Speech Infrastructure Based on UIMA
Jan Kleindienst, Manager, Conversational Interactions and Architectures - IBM Prague
This presentation shows a distributed infrastructure for integrating third-party recognition, translation, and synthesis technologies into speech-to-speech system combinations. The infrastructure is built over the Unstructured Information Management Architecture (UIMA), an open-source framework for speech analytics. The Web infrastructure has successfully been used for the remote automatic evaluation of speech-to-speech systems on a pan-European scale.
Integrating Language Translation Software with Speech Recognition
Hannah Grap, Marketing Communications Manager - Language Weaver, Inc.
As automated language translation technology moves to statistically based computational methods, the timing is right to integrate language translation and speech recognition technologies. Case study examples and demos of existing integrated solutions will give the audience an overview of how to leverage speech applications across languages.
C105 - Voice Search
4:15 p.m. – 5:00 p.m.
MODERATOR: Thomas Schalk, Vice President, Voice Technology - ATX Group
Voice search is perhaps the hottest topic in recent speech deployments. Analogous to searching the Web with text, voice search can encompass a number of services, including directory search and searches for specific information, such as news or sports scores. What are the requirements for achieving effective dialogs when searching by voice? How does dynamic content, such as location-based ads, fit into the voice-user interface? What other analogies are there between voice searching and Web searching? This session is a must for those interested in learning about the trends in voice search.
Optimizing Software Architecture for Voice Search
Leo Chiu, Chief Technology Officer - Apptera, Inc.
Voice search is very hard to do well when you consider the millions of different accents, behaviors, and speech patterns a software program would have to decipher. What is the best way to architect the solution so that it has the best chance of providing an effective consumer experience? What are the business considerations for making it work in the real world? In this presentation you will hear thoughts and learnings from the edge of the “voice search” frontier.
Data Mining for Voice Search
Charles Galles, Principal Speech Solutions Architect - Intervoice
Voice search topics and Web content change all the time. How can an architect prepare the recognizer to recognize fundamentally new words and topics? With all of the activity on the Internet, are there any useful data sources for recognizer training? This presentation will explore how the Web and other data sources may be leveraged to keep your voice search solution current.
C201 - New Approaches to Dialog Design
10:45 a.m. – 12:00 p.m.
MODERATOR: Bill Scholz, President - NewSpeech LLC
As designers are urged to create ever-more sophisticated self-service applications, the pressure for evolving new techniques grows in importance. New Eclipse-based graphical tools oriented around the identification, definition, and reuse of hierarchical dialog patterns and novel nonlinear call flows assisted by agents are described in this dialog design session.
A Graphical Tool for Pattern-Based Dialog Design
Dominique Boucher, Lead Software Developer - Nü Echo Inc.
This presentation shows an Eclipse-based, graphical environment for developing speech applications that specifically addresses the problem of capturing and expressing recurring dialog patterns. This tool transforms the process of designing and implementing dialogs by specifically orienting the design process around the identification, definition, and reuse of hierarchical dialog patterns.
Non-Linear Call Flow Design
Clifford Harlow, Vice President, Client Services - Spoken Communications
Most speech IVR applications are unable to skip utterances that they don’t understand. In contrast, live agents can gather information out of sequence or discern intent. By uniquely combining speech technology with humans, callers can have a more natural, free-flowing self-service experience because they are not locked into a rigid call flow.
Adaptive Voice Dialogs Based on Automatic Speaker Classification
Joachim Stegmann, Head, Advanced Voice Solutions - T-Systems Enterprise Services GmbH
This presentation describes the technology and applications of automatic speaker classification (e.g., age, gender, language, and emotions) in voice portals. It shows how dialog parameters should be adapted to achieve improved user acceptance in IVR systems. The first results from pilot implementations within Deutsche Telekom prove the feasibility and show advantages compared to conventional, non-adaptive systems.
C202 - Artificial Intelligence & VUI Design
1:30 p.m. – 2:30 p.m.
MODERATOR: Bill Scholz, President - NewSpeech LLC
The growing sophistication of VUI designs demands the incorporation of new technologies, including those borrowed from other disciplines. This session focuses on the novel application of artificial intelligence technology concurrently using a dialog engine and a problem-solving engine. It also illustrates the use of natural language to understand the semantics and context of any phrase being processed, making it much easier to develop the answers.
Artificial Intelligence in Voice Self-Service Applications
Mahesh Rajagopalan, President & Co-Founder - Resolvity
Jacek Jarmulak, Senior AI Scientist - Resolvity
This presentation discusses how AI technologies may be used in voice self-service applications to separate the product support logic from the call flow logic, take advantage of the problem-solver’s knowledgebase to develop dialogs, improve speech recognition, create dynamic call flows, and provide an effective and efficient troubleshooting experience. Learn about strengths and weaknesses, rule-based systems, Bayesian inference, decision-trees, and knowledgebases.
Improve Your VUI Design with an AI-Based Conversational Dialog Solution
Peter Trompetter, Vice President, Global Development - GyrusLogic, Inc.
Natural language understanding is an excellent augmentation to an existing or new VUI for better automated call completion and customer satisfaction. Hear about a solution that makes it easier to develop the natural language application after understanding the semantics and context of any phrase.
C203 - Advances in Video & Multimedia Application Design
2:45 p.m. – 3:45 p.m.
MODERATOR: Bill Scholz, President - NewSpeech LLC
The availability of a robust 3G infrastructure throughout Europe and much of Asia has released pent-up customer demand to add live video to extend the utility of voice communications. This session illustrates how video menus, pictures of products, live video clips, and video commercials can be managed, as well as how sample speech/video-enabled self-service applications for universities, travel, retail, and home health can be developed. Also, the use of the Adobe Flash Player, a popular standard for delivering rich Web content to develop multimedia content, will be explained and illustrated.
Speech-Enabled Video Applications: A New Level of Customer Service
Valentine Matula, Director, Multimedia Research - Avaya Inc.
Around the world (including in the U.S. in 2007), many consumers have access to live 2-way video. Learn how speech-enabled self-service applications can become even more effective by showing the caller a visual display or video at the same time that they use the speech application—menus, pictures of products, live video clips, and video commercials. See sample speech-enabled self-service and proactive contact/outbound applications for universities, travel, retail, and home healthcare, and hear about the process of application authoring.
Architecture for Web Multimodal Applications
Jan Sedivy, Manager, VTS - IBM, Czech Republic
Learn about extending the Adobe Flash Player with speech recognition. A lightweight, embedded VoiceXML browser (VoiceXML 2.0-compatible) is easy to control through an XML protocol from ActionScript to speech-enable existing or new Flash applications. The VoiceXML is controlled by a browser extension for the IE and Firefox browsers. The browser uses the IBM ViaVoice Embedded Engine for speech recognition. Hear the key aspects of the design and about the challenges faced during the implementation.
C204 - Speech-to-Text Transcription
4:15 p.m. – 5:15 p.m.
MODERATOR: Bill Scholz, President - NewSpeech LLC
Recognition technology has matured to the point that recorded telephone-quality audio from unknown speakers can be accurately transcribed. Applications such as speech-enabled e-mail have become highly needed in the mobile environment because typing is not always practical when using hand-held devices. Recent applications of speech-to-text for searching and transcribing voice data will be illustrated for other applications, including medical data transcription and the near-real-time conversion of voice mail to text.
Technology & Applications Associated with Broadcast Transcription
Sara Basson, Program Director, Speech Transcription Strategy - IBM Research
As speech transcription technology improves and evolves, more opportunities emerge for captioning broadcast media. This presentation outlines some remaining challenges, such as latencies and understandability. It also addresses issues in combining speech transcription with other natural language technologies, such as search, translation, and named entity detection.
Are We Ready? A Look at the Latest Speech-to-Text Applications
Marie Meteer, Vice President, Speech & NLP - EveryZing
Speech-to-text has steadily improved in accuracy during the past 2 decades, but the question remains: “Is it good enough?” The answer lies not in the technology, but in the applications. Using her experience with BBN’s STT engine, Marie Meteer describes how STT performance affects a variety of applications: where it works, where it fails, and where supporting technologies can make the difference.