Subject: submission for tutorial at ISDA-2003

1) Title: Automatic Recognition of Natural Speech

2) Instructor: Prof. Douglas O'Shaughnessy
INRS-Telecommunications (University of Quebec)
800 Gauchetiere west, Suite 6900, Montreal, Quebec H5A 1K6 Canada
514-875-1266 x2012 (fax 514-875-0344)
(email: dougo@inrs-telecom.uquebec.ca)

3) Concise abstract:
The automatic conversion of natural conversational speech into text
is a computer task on the verge of finding wide commercial application.
Current automatic speech recognition (ASR) systems are still quite limited
in their capacity to handle natural speech, but applications are nonetheless
growing each year.  For an ISDA audience, this tutorial will discuss the
modern techniques of automatic speech recognition, emphasizing the breadth
of knowledge needed to approach near-human performance in this complex task.
We will review essential aspects of human speech production from the
acoustic-phonetic point of view, pertinent speech analysis methods (e.g.,
mel-cepstrum), statistical methods (e.g., hidden Markov models), and
language models.  We will describe the strengths and weaknesses of each
technique and attempt to predict future trends, in which more structure
will likely be imposed on techniques that have so far been driven mostly
by mathematical simplicity.  Throughout the tutorial, we emphasize that
communications principles underlie many ASR design decisions.

4) Scope:
This tutorial is focused in depth on one topic: automatic speech
recognition.  At the same time, it surveys many aspects of this problem,
which is quite interdisciplinary.  Since ASR simulates a component of
human speech communication, it involves acoustics (modeling the motion
of air waves in the vocal tract), linguistics (the phonetic units of speech),
psychology (how humans perceive speech), engineering (building a
practical communications system), and computer science (designing an
efficient algorithm).  This tutorial will touch on all of the relevant
aspects.

5) Audience:
Any educated student or professional with at least a bachelor's degree in a
field related to electrical engineering or computer science.  People
interested in an overview of the technical aspects of speech recognition.
People wanting a broad overview of the task, rather than many very specific
details of individual steps in the process (time does not permit the latter;
all questions, however detailed, will of course be answered).  People
wanting clear explanations, without being snowed under by mounds of
mathematical equations, a practice that risks losing half a tutorial
audience after the first three slides.

6) Motivation:

The automatic conversion of natural conversational speech into text
is a highly interdisciplinary task involving aspects of computer
science, engineering, acoustics, linguistics, and psychology.  The recent
major advances in this field have come not only from improvements in
recognition algorithms and in computational speed, memory, and power,
but also from the integration of concepts from human speech
production and perception and from the use of powerful models of natural
language.  For an ISDA audience, this tutorial will
discuss the modern techniques of automatic speech recognition from a
communications point of view,
emphasizing the breadth of knowledge needed to approach
near-human performance in this complex task and the fact that ASR simulates
a component of a human communication process.

7) Objectives of the tutorial:
- Understand the problems of converting speech to text, from a
communications point of view
- Understand the common methods of automatic speech recognition (ASR)
- Gain an appreciation of the technology behind current and future products
and services
- Consider predictions for future developments and applications in speech
recognition


9) Outline of the tutorial (roughly equal time per unit):
    A) ASR as a pattern recognition and communications task
    B) Basic ideas on human speech production
    C) Acoustic-phonetics
    D) Basic ideas on human speech perception
    E) Basic digital analysis methods for speech signals
    F) Methods of parameterization and feature extraction
    G) Overview of ASR approaches
    H) Stochastic Techniques
    I) Language Models
    J) Current Performance Levels
    K) Applications and commercial products in ASR
    L) Future research; predictions

10) Biographical information:

Dr. O'Shaughnessy has worked in the speech communications field for
30 years, first as a student at MIT (BSc and MS in 1972, PhD in
1976, all in electrical engineering and computer science) and then
as the director of a research team at INRS in the areas of
speech analysis, coding, synthesis, recognition, and enhancement.
After working on the MITalk synthesis project in the early 1970s, he
developed one of the first French text-to-speech systems in the early
1980s.  His textbook "Speech Communication: Human and Machine"
(Addison-Wesley, 1987, and now in a second edition from IEEE Press, 2000)
is well known and has been widely used in university courses on speech.
It indicates the breadth of knowledge he brings to bear on issues of speech
communication.

His most recent focus has been on speech
recognition, an area in which his research group publishes regularly in the
ICASSP, ICSLP, and Eurospeech proceedings.  He is an associate editor
for the Journal of the Acoustical Society of America and recently completed
a term as associate editor for the IEEE Transactions on Speech and Audio
Processing.  He also teaches every year as an adjunct professor in the
electrical engineering department at McGill University.  He is the General
Chair for ICASSP-2004 in Montreal.

11) References:
    A. Acero (1993) Acoustical and Environmental Robustness in Automatic
Speech Recognition (Kluwer: Boston, MA)
    Y. Gong (1995) Speech recognition in noisy environments, Speech
Communication 16, 261-291
    J-C. Junqua & J-P. Haton (1996) Robustness in Automatic Speech
Recognition (Kluwer)
    M. Gales (1998) Maximum likelihood linear transformations for HMM-based
speech recognition, Computer Speech and Language 12, 75-98
    B-H. Juang, W. Chou & C-H. Lee (1997) Minimum classification error rate
methods for speech recognition, IEEE Transactions on Speech and Audio
Processing 5, 257-265
    H. Cung & Y. Normandin (1997) MMIE training of large vocabulary
recognition systems, Speech Communication 22, 303-314
    R. Sitaram & T. Sreenivas (1997) Incorporating phonetic properties in
hidden Markov models for speech recognition, Journal of the Acoustical
Society of America 102, 1149-1158
    S. Martin, J. Liermann & H. Ney (1998) Algorithms for bigram and
trigram word clustering, Speech Communication 24, 19-37
    R. Rosenfeld (1996) A maximum entropy approach to adaptive statistical
language modeling, Computer Speech and Language 10, 187-228
    V. Zue (1985) The use of speech knowledge in automatic speech
recognition, Proceedings of the IEEE 73, 1602-1615
    N. Morgan & H. Bourlard (1995) Neural networks for statistical
recognition of continuous speech, Proceedings of the IEEE 83, 742-770
    D. O'Shaughnessy (2000) Speech Communications: Human and Machine (IEEE
Press)
    L. Rabiner & B. Juang (1993) Fundamentals of Speech Recognition
(Prentice-Hall: Englewood Cliffs, NJ)
    J. Deller, J. Proakis & J. Hansen (1993) Discrete-Time Processing of
Speech Signals (Macmillan: New York)
    X. Huang, Y. Ariki & M. Jack (1990) Hidden Markov Models for Speech
Recognition (Edinburgh Univ. Press: Edinburgh, UK)
    H. Kitano (1994) Speech-to-Speech Translation (Kluwer: Norwell, MA)


12) Supplementary material:

    a) More detailed description:

We will first briefly examine human speech production
from an acoustic-phonetic view.  The standard methods of speech
analysis (e.g., FFT and mel-based cepstrum) will be presented and
discussed in terms of efficiency and robustness.  The differences in
objectives between speech coding and speech recognition will be noted.
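
As an illustration of the kind of analysis covered, the following is a
minimal Python/numpy sketch of computing mel-cepstral coefficients for one
frame of speech; the specific parameter values (16-kHz sampling, a 512-point
FFT, 26 mel filters, 13 coefficients) are our own illustrative assumptions
rather than recommendations from the tutorial.

    import numpy as np
    from scipy.fftpack import dct

    def hz_to_mel(f):
        # standard mel-scale frequency warping
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filterbank(n_filters, n_fft, fs):
        # triangular filters spaced evenly on the mel scale
        mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            left, center, right = bins[i - 1], bins[i], bins[i + 1]
            for k in range(left, center):
                fbank[i - 1, k] = (k - left) / max(center - left, 1)
            for k in range(center, right):
                fbank[i - 1, k] = (right - k) / max(right - center, 1)
        return fbank

    def mfcc_frame(frame, fs=16000, n_fft=512, n_filters=26, n_ceps=13):
        # windowed FFT -> power spectrum -> mel filterbank -> log -> DCT
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
        energies = mel_filterbank(n_filters, n_fft, fs) @ spectrum
        return dct(np.log(energies + 1e-10), type=2, norm='ortho')[:n_ceps]

    # toy usage: one 25-ms "frame" of random noise standing in for real speech
    print(mfcc_frame(np.random.randn(400)))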

We will present the modern stochastic techniques for speech
recognition (i.e., hidden Markov models), with simple examples to
make them accessible to a non-expert audience.  The issues
of adequate training corpora and the many trade-offs among different
practical applications will be discussed (e.g., continuous vs.
isolated-word recognition; small vs. large vocabularies).  The differences
between read speech and conversational speech will be examined, in
terms of disfluencies, variable speaking rate, and increased use of
function words.  The added difficulties of recognizing speech over the
telephone and with hands-free terminals will be explained.
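
To give a flavour of the stochastic techniques, here is a minimal
Viterbi-decoding sketch in Python/numpy for a toy discrete hidden Markov
model; the two-state model and all of its probabilities are invented purely
for illustration.

    import numpy as np

    def viterbi(obs, pi, A, B):
        # most likely state sequence for a discrete-output HMM (log domain)
        n_states, T = len(pi), len(obs)
        delta = np.zeros((T, n_states))           # best log-probability per state
        psi = np.zeros((T, n_states), dtype=int)  # back-pointers
        delta[0] = np.log(pi) + np.log(B[:, obs[0]])
        for t in range(1, T):
            for j in range(n_states):
                scores = delta[t - 1] + np.log(A[:, j])
                psi[t, j] = np.argmax(scores)
                delta[t, j] = scores[psi[t, j]] + np.log(B[j, obs[t]])
        # backtrack from the best final state
        path = [int(np.argmax(delta[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1]

    # toy two-state HMM with three discrete observation symbols
    pi = np.array([0.6, 0.4])                          # initial probabilities
    A = np.array([[0.7, 0.3], [0.4, 0.6]])             # transition probabilities
    B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])   # emission probabilities
    print(viterbi([0, 1, 2, 2], pi, A, B))             # prints [0, 0, 1, 1]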

The importance of appropriate language models will be emphasized,
covering both basic N-gram models and more complex class-based and
distance models.  We will discuss the inadequacy of simply
using N-grams as vocabulary size increases, despite the increasing
availability of training texts and the increasing power of computers.
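
As a concrete, deliberately tiny illustration of the N-gram idea, the
following Python sketch estimates add-one-smoothed bigram probabilities from
a three-sentence corpus; the corpus and the choice of smoothing are our own
illustrative assumptions.

    from collections import Counter

    # toy training corpus; real language models are trained on millions of words
    corpus = ["we will discuss speech recognition",
              "we will discuss language models",
              "speech recognition needs language models"]

    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    vocab_size = len(unigrams)

    def bigram_prob(w1, w2):
        # add-one (Laplace) smoothed estimate of P(w2 | w1)
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

    print(bigram_prob("speech", "recognition"))  # seen bigram: 3/12 = 0.25
    print(bigram_prob("speech", "models"))       # unseen bigram: 1/12, small but nonzero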

We will describe the current state of the art in recognition of natural
speech, both commercial and research, noting where current systems do well
and where they fall short.  The possibilities of integrating
knowledge-based sources (e.g., aspects
of expert systems) into the current stochastic approaches to speech
recognition will be examined.  Predictions as to the future course of
speech recognition research will be made, in light of the current
success of limited-application recognizers (and the continued failure
to approach human performance on more general tasks).

   b) Course materials:
Each participant in the tutorial will receive a booklet containing all
slides used in the actual presentation.  In addition, each slide will be
augmented by more detailed information, with suitable references.

   
c) Sample slides:

   ASR is pattern recognition:
    - normalize data in speech signal (normalization)
    - extract parameters and features (data reduction)
    - find the best match in memory (similarity measures)
    - make decisions based on costs (optimal decision)

    - Training phase:
        - design algorithm to map utterance to text
        - develop database model (rules and statistics)
        - develop in non-real-time

- training establishes a "reference memory" or dictionary of speech patterns
(often in the form of stochastic networks), which are assigned text labels
(as phonemes, or more usually words and phrases); a toy illustration of the
training and test phases is sketched after this slide.

- In speaker-independent (SI) systems, training combines manual and
automatic methods by the developer, whereas speaker-dependent (SD)
recognizers may be trained by customers using automatic procedures (software
provided by the developer).

    - Test phase: 
        - same signal processing
        - apply rules, make decision
        - real-time (if possible) and efficient
        - minimize cost and maximize convenience
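
To make the two phases concrete, here is a toy template-matching sketch in
Python/numpy: training builds the "reference memory" of labelled feature
templates, and the test phase applies the same analysis and picks the
closest stored pattern.  The two-word vocabulary and the feature vectors are
invented, and real systems use stochastic models rather than this
nearest-template rule.

    import numpy as np

    def train(examples):
        # training phase: build the reference memory (one mean template per label)
        memory = {}
        for label, vec in examples:
            memory.setdefault(label, []).append(np.asarray(vec, dtype=float))
        return {label: np.mean(vecs, axis=0) for label, vecs in memory.items()}

    def recognize(memory, vec):
        # test phase: same analysis, then choose the closest stored template
        vec = np.asarray(vec, dtype=float)
        return min(memory, key=lambda label: np.linalg.norm(memory[label] - vec))

    # toy two-word vocabulary with invented two-dimensional "features"
    templates = train([("yes", [1.0, 0.2]), ("yes", [0.9, 0.3]),
                       ("no",  [0.1, 1.0]), ("no",  [0.2, 0.8])])
    print(recognize(templates, [0.95, 0.25]))    # prints: yes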

Simplistic approach:
    - Store all possible speech signals and their corresponding texts
    - Then just need a table look-up
    - Will Moore's Law solve the ASR problem?
        - storage doubling every year
        - computation power doubling every 1.5 years
    - Suppose the maximum utterance lasts 10 s and the coding rate is 4 kbps
    - 40 000 bits: so 2^40000 possible signals (about 10^12000)
    - Simplifying: 1-s words, 25 frames/s, 10 coefficients/frame, 4
bits/coefficient (1 kb/word): 2^1000 (or 10^300)
    - Suppose each person on Earth spoke 1 word every second for 1000 hours:
about 10^17 short utterances
    - From another viewpoint, use VQ (vector quantization): 10 bits/frame
and 25 frames/s -> 125 bits for a brief (half-second) word: still more than
10^30 possible signals (see the order-of-magnitude check below)
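
A quick Python check of the orders of magnitude used above; the figure of
roughly 6 billion speakers in the "1 word per second" estimate is our own
assumption.

    import math

    # a 10-s utterance coded at 4 kbps
    bits_10s = 10 * 4000
    print(bits_10s, "bits -> about 10^%d signals" % round(bits_10s * math.log10(2)))

    # a 1-s word: 25 frames x 10 coefficients x 4 bits
    bits_word = 25 * 10 * 4
    print(bits_word, "bits -> about 10^%d signals" % round(bits_word * math.log10(2)))

    # 1 word per second for 1000 hours, summed over roughly 6 billion speakers
    utterances = 6e9 * 1000 * 3600
    print("about %.1e short utterances" % utterances)

    # VQ: 10 bits/frame, 25 frames/s, half-second word
    bits_vq = 10 * 25 // 2
    print("about %.1e possible VQ-coded signals" % float(2 ** bits_vq))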

How do pattern recognizers work?
    - capture signal (speech)
    - digitize and compress data
    - find closest (most likely) model
    - render decision

Signal processing:
    - not just to cut costs
    - also to focus analysis on important aspects of the signal (and thus
raise accuracy)
    - use the same analysis to create the models and to test them


Evaluation:
    - simplest measure: percentage of words correctly transcribed to text
      (closely related to the word error rate sketched below)
        - or actions correctly performed
    - in real-time applications, can allow feedback
        - partial responses
        - repetitions
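
As a concrete version of this simplest measure, here is a minimal Python
sketch of the standard word-error-rate computation (a word-level edit
distance); the reference/hypothesis pair is invented for illustration.

    def word_error_rate(reference, hypothesis):
        # word-level edit distance (substitutions + insertions + deletions),
        # divided by the number of reference words
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[-1][-1] / len(ref)

    # 2 substitutions + 1 insertion over 4 reference words -> 0.75
    print(word_error_rate("please recognize natural speech",
                          "please wreck a nice speech"))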

Possible practical approaches to ASR:
    - examine how humans interpret speech, and simulate their processes
    - instead, treat it simply as a pattern-recognition problem
    - exploit power of computers
    - expert-system method
    - stochastic method

- Practical systems are limited:
    - by the amount of training data
        - in SD systems, users tire of long sessions
        - memory limitations in the computer
    - by available computation (searching among many possible texts for the
optimal one)
    - by inadequate models (e.g., popular methods make some poor assumptions
to reduce computation and memory, at the cost of reduced accuracy)