Subject: submission for tutorial at ISDA-2003

1) Title: Automatic Recognition of Natural Speech

2) Instructor: Prof. Douglas O'Shaughnessy
INRS-Telecommunications (University of Quebec)
800 Gauchetiere West, Suite 6900, Montreal, Quebec H5A 1K6, Canada
514-875-1266 x2012 (fax: 514-875-0344)
(email: dougo@inrs-telecom.uquebec.ca)

3) Concise abstract: The automatic conversion of natural conversational speech into text is a computer task on the verge of finding wide commercial application. Current automatic speech recognition (ASR) systems are still quite limited in their capacity to handle natural speech, but applications are nonetheless growing each year. For an ISDA audience, this tutorial will discuss the modern techniques of automatic speech recognition, emphasizing the breadth of knowledge needed to approach near-human performance in this complex task. We will review essential aspects of human speech production from the acoustic-phonetic point of view, pertinent speech analysis methods (e.g., mel-cepstrum), statistical methods (e.g., hidden Markov models), and language models. We will describe the strengths and weaknesses of each technique and attempt to predict future trends, in which more structure will likely be imposed on techniques that so far have been driven mostly by mathematical simplicity. In all aspects of the tutorial, we emphasize that communications principles are the basis for many decisions in how to design ASR.

4) Scope: This tutorial is specifically focused in depth on one topic: automatic speech recognition. At the same time, it surveys many aspects of this problem, which is quite interdisciplinary. Since ASR simulates a component of human speech communication, it involves acoustics (modeling the motion of air waves in the vocal tract), linguistics (phonetic units of speech), psychology (how humans perceive speech), engineering (building a practical communications system), and computer science (designing efficient algorithms).
This tutorial will touch on all of the relevant aspects.

5) Audience: Any educated student or professional with at least a bachelor's degree in a field related to electrical engineering or computer science. People interested in an overview of the technical aspects of speech recognition. People wanting a broad overview of the task, rather than many very specific details of individual steps in the process (since time does not permit this; all questions, however detailed, are of course answered). People wanting clear explanations, without being snowed under by mounds of mathematical equations, which risks losing half a tutorial audience after the first three slides.

6) Motivation: The automatic conversion of natural conversational speech into text is a highly interdisciplinary task involving aspects of computer science, engineering, acoustics, linguistics, and psychology. The recent major advances in this field have come from improvements in recognition algorithms as well as in computational speed, memory, and power, but also from the integration of concepts from human speech production and perception and the use of powerful models of natural language. For an ISDA audience, this tutorial will discuss the modern techniques of automatic speech recognition from a communications point of view, emphasizing the breadth of knowledge needed to approach near-human performance in this complex task and the fact that ASR simulates a component of a human communication process.
7) Objectives of the tutorial:
- Understand the problems of converting speech to text, from a communications point of view
- Understand the common methods of automatic speech recognition (ASR)
- Gain an appreciation of current technology for current and future products and services
- Give predictions for future developments and applications in speech recognition

9) Outline of the tutorial (roughly equal time per unit):
A) ASR as a pattern recognition and communications task
B) Basic ideas on human speech production
C) Acoustic-phonetics
D) Basic ideas on human speech perception
E) Basic digital analysis methods for speech signals
F) Methods of parameterization and feature extraction
G) Overview of ASR approaches
H) Stochastic techniques
I) Language models
J) Current performance levels
K) Applications and commercial products in ASR
L) Future research; predictions

10) Biographical information: Dr. O'Shaughnessy has worked in the speech communications field for 30 years, first as a student at MIT (BSc and MS in 1972, PhD in 1976, all in electrical engineering and computer science), then as director of a research team at INRS in the areas of speech analysis, coding, synthesis, recognition, and enhancement. After working on the MITalk synthesis project in the early 1970s, he developed one of the first French text-to-speech systems in the early 1980s. His textbook "Speech Communication: Human and Machine" (Addison-Wesley, 1987; now in a second edition, IEEE Press, 2000) is well known and has been widely used in university courses on speech; it indicates the breadth of knowledge he brings to bear on issues of speech communication. His most recent focus has been on speech recognition, where his research group publishes regularly in the ICASSP, ICSLP, and Eurospeech proceedings. He is an associate editor for the Journal of the Acoustical Society of America and just completed a term as associate editor for the IEEE Transactions on Speech and Audio Processing.
He also teaches every year as an adjunct professor in the electrical engineering department at McGill University. He is the General Chair for ICASSP-2004 in Montreal.

11) References:
A. Acero (1993) Acoustical and Environmental Robustness in Automatic Speech Recognition (Kluwer: Boston, MA)
Y. Gong (1995) Speech recognition in noisy environments, Speech Communication 16, 261-291
J-C. Junqua & J-P. Haton (1996) Robustness in Automatic Speech Recognition (Kluwer)
M. Gales (1998) Maximum likelihood linear transformations for HMM-based speech recognition, Computer Speech and Language 12, 75-98
B-H. Juang, W. Chou & C-H. Lee (1997) Minimum classification error rate methods for speech recognition, IEEE Trans. Speech and Audio Processing 5, 257-265
H. Cung & Y. Normandin (1997) MMIE training of large vocabulary recognition systems, Speech Communication 22, 303-314
R. Sitaram & T. Sreenivas (1997) Incorporating phonetic properties in hidden Markov models for speech recognition, J. Acoustical Society of America 102, 1149-1158
S. Martin, J. Liermann & H. Ney (1998) Algorithms for bigram and trigram word clustering, Speech Communication 24, 19-37
R. Rosenfeld (1996) A maximum entropy approach to adaptive statistical language modeling, Computer Speech and Language 10, 187-228
V. Zue (1985) The use of speech knowledge in automatic speech recognition, IEEE Proceedings 73, 1602-1615
N. Morgan & H. Bourlard (1995) Neural networks for statistical recognition of continuous speech, IEEE Proceedings 83, 742-770
D. O'Shaughnessy (2000) Speech Communications: Human and Machine (IEEE Press)
L. Rabiner & B. Juang (1993) Fundamentals of Speech Recognition (Prentice-Hall: Englewood Cliffs, NJ)
J. Deller, J. Proakis & J. Hansen (1993) Discrete-Time Processing of Speech Signals (Macmillan: New York)
X. Huang, Y. Ariki & M. Jack (1990) Hidden Markov Models for Speech Recognition (Edinburgh Univ. Press: Edinburgh, UK)
H. Kitano (1994) Speech-to-Speech Translation (Kluwer: Norwell, MA)

12) Supplementary material:

a) More detailed description: We will first briefly examine human speech production from an acoustic-phonetic view. The standard methods of speech analysis (e.g., FFT and mel-based cepstrum) will be presented and discussed in terms of efficiency and robustness. The differences in objectives between speech coding and speech recognition will be noted. We will present the modern stochastic techniques for speech recognition (i.e., hidden Markov models), with simple examples to emphasize understanding for a non-expert audience. The issues of adequate training corpora and the many trade-offs for different practical applications will be discussed (e.g., continuous vs. isolated-word recognition; small vs. large vocabularies). The differences between read speech and conversational speech will be examined, in terms of disfluencies, variable speaking rate, and increased use of function words. The added difficulties of recognizing speech over the telephone and with hands-free terminals will be explained. The importance of appropriate language models will be emphasized, with both basic N-gram models and more complex class-based and distance models discussed. We will discuss the inadequacy of simply using N-grams as vocabulary size increases, despite the increasing availability of training texts and the increasing power of computers. We will describe the current state of the art in recognition of natural speech, both commercial and research, noting where current systems do well and where they come up short. The possibilities of integrating knowledge-based sources (e.g., aspects of expert systems) into the current stochastic approaches to speech recognition will be examined.
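As a taste of the kind of simple example the tutorial will use for its stochastic techniques, the toy hidden Markov model below (plain Python; the two states and all probabilities are invented for illustration, not taken from the tutorial) scores an observation sequence with the forward algorithm.

```python
# Toy HMM: score an observation sequence with the forward algorithm.
# States, observations, and all probabilities are made up for illustration.
states = ["S1", "S2"]
init = {"S1": 0.6, "S2": 0.4}                  # initial state probabilities
trans = {"S1": {"S1": 0.7, "S2": 0.3},         # state transition probabilities
         "S2": {"S1": 0.4, "S2": 0.6}}
emit = {"S1": {"a": 0.5, "b": 0.5},            # emission probabilities
        "S2": {"a": 0.1, "b": 0.9}}

def forward(obs):
    """Return P(obs | model), summing over all state paths."""
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
                 for s in states}
    return sum(alpha.values())

print(round(forward(["a", "b", "a"]), 4))  # 0.0696
```

In a real recognizer each word (or phoneme) has its own HMM, and the model whose forward probability is highest wins; the computation here is the same, just repeated per model and done in log space for numerical safety.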
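Similarly, a minimal bigram language model fits in a few lines; the tiny corpus and the add-one smoothing used to handle unseen word pairs are illustrative choices, not the specific models the tutorial will present.

```python
from collections import Counter

# Toy bigram language model with add-one (Laplace) smoothing.
# The corpus is invented for illustration.
corpus = "we went to the store and then we went home".split()
vocab = set(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w_prev, w):
    # Add-one smoothing gives unseen pairs a small nonzero probability
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + len(vocab))

print(p_bigram("we", "went"))   # seen pair: relatively high probability
print(p_bigram("store", "we"))  # unseen pair: small but nonzero
```

The sparsity problem noted above shows up immediately: with vocabulary size V there are V^2 possible bigrams and V^3 trigrams, so as V grows most N-grams never occur in any training text, and without smoothing their counts would yield zero probabilities.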
Predictions as to the future course of speech recognition research will be made, in the face of the current success of limited-application recognizers (but the continued failure to approach human performance on more general tasks).

b) Course materials: Each participant in the tutorial will receive a booklet containing all slides used in the actual presentation. In addition, each slide will be augmented by more detailed information, with suitable references.

c) Sample slides:

ASR is pattern recognition:
- normalize data in speech signal (normalization)
- extract parameters and features (data reduction)
- find the best match in memory (similarity measures)
- make decisions based on costs (optimal decision)

Training phase:
- design algorithm to map utterance to text
- develop database model (rules and statistics)
- develop in non-real-time
- training establishes a "reference memory" or dictionary of speech patterns (often in the form of stochastic networks), which are assigned text labels (as phonemes, or more usually words and phrases)
- in speaker-independent (SI) systems, training combines manual and automatic methods by the developer, whereas speaker-dependent (SD) recognizers may be trained by customers using automatic procedures (software provided by the developer)

Test phase:
- same signal processing
- apply rules, make decision
- real-time (if possible) and efficient
- minimize cost and maximize convenience

Simplistic approach:
- store all possible speech signals and their corresponding texts
- then just need a table look-up
- Moore's Law will solve the ASR problem?
- storage doubling every year; computation power doubling every 1.5 years
- suppose the maximum utterance lasts 10 s and the coding rate is 4 kbps
- 40 000 bits: so 2^40000 signals (about 10^12000)
- simplifying: 1-s words, 25 frames/s, 10 coefficients/frame, 4 bits/coefficient (1 kb/word): 2^1000 (or about 10^300) signals
- suppose each person spoke 1 word every second for 1000 hours: about 10^17 short utterances
- from another viewpoint, use VQ (vector quantization): 10 bits/frame and 25 frames/s -> 125 bits for a brief (half-second) word: still more than 10^30 possible signals

How do pattern recognizers work?
- capture signal (speech)
- digitize and compress data
- find closest (most likely) model
- render decision

Signal processing:
- not just to cut costs
- also to focus analysis on important aspects of the signal (and thus raise accuracy)
- use the same analysis to create the model and to test it

Evaluation:
- simplest measure: percentage of words correctly transcribed to text
- or actions correctly performed
- in real-time applications, can allow feedback: partial responses, repetitions

Possible practical approaches to ASR:
- examine how humans interpret speech, and simulate their processes
- or instead treat it simply as a pattern recognition (PR) problem and exploit the power of computers:
  - expert-system method
  - stochastic method

Practical systems are limited:
- by the amount of training data (in SD systems, users tire of long sessions)
- by memory limitations in the computer
- by available computation (searching among many possible texts for the optimal one)
- by inadequate models (e.g., popular methods make some poor assumptions to reduce computation and memory, at the cost of reduced accuracy)
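The back-of-the-envelope counts in the "simplistic approach" slide above can be checked directly; the sketch below reproduces them (Python integers are arbitrary-precision, so quantities like 2^125 are exact).

```python
import math

# 10 s utterance at 4 kbps -> 40 000 bits, hence 2**40000 distinct signals
bits_full = 10 * 4000
print(int(bits_full * math.log10(2)))   # decimal digits: about 12 000, as stated

# Simplified: 1-s word, 25 frames/s, 10 coefficients/frame, 4 bits/coefficient
bits_word = 25 * 10 * 4                 # = 1000 bits -> 2**1000, roughly 10**300
print(int(bits_word * math.log10(2)))

# VQ: 10 bits/frame at 25 frames/s -> 125 bits for a half-second word
print(2 ** 125 > 10 ** 30)              # still more than 10**30 possible signals
```

Even the most aggressive quantization leaves vastly more possible signals than any conceivable table of stored utterances, which is the slide's point: Moore's Law alone cannot solve ASR by lookup.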
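The evaluation slide's "percentage of words correctly transcribed" is conventionally reported via its complement, the word error rate (WER): the minimum number of substitutions, insertions, and deletions turning the hypothesis into the reference, divided by the reference length. A minimal sketch using the standard edit-distance dynamic program (not code from the proposal itself):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / len(reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic Levenshtein dynamic program over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion ("me") and one substitution ("weather" -> "whether"): 2/4
print(word_error_rate("show me the weather", "show the whether"))  # 0.5
```

Because insertions are counted, WER can exceed 100% on a very noisy hypothesis, which is why it is preferred over a raw "percent correct" figure.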