4.3.1 Scoring: Error Analysis
Evaluating, or scoring, the performance of speech recognition
systems is critical to advances in their design and development.
Several evaluation metrics can be used, depending on the
complexity of the ASR system to which they are applied.
Here we describe a commonly used metric, word error rate (WER).
This metric evaluates the number and type of recognition
errors made by the decoder.
Scoring the result hypothesized by the decoder requires
comparing it to the reference transcription, i.e., a text
sequence of words (or other tags) representing what was
actually spoken (in other words, the correct answer). The two
sequences must be aligned in order to determine the total
number and type of errors made by the decoder.
An example is given below:
Decoder Hypothesis:      | HAUL MOOSE FOR TREES
Reference Transcription: | CUT TALL SPRUCE TREES
The scoring software supports both a time-aligned mode
and a text-alignment mode. The latter has historically
been the more commonly used, though recent research is
increasingly shifting toward time-aligned scoring.
In time-aligned scoring,
the hypothesis and reference transcriptions include
start and stop times for each word. These are typically
generated from a forced alignment process described in
Section 4.4.
In this case, errors can be easily tabulated because
there is a temporal alignment of the two sequences.
However, historically, a text alignment algorithm, also
known as a string edit algorithm, has been used to compare
the two text sequences. The output of this alignment
algorithm, which simply tries to minimize the number
of edits required to map the hypothesis onto the reference,
is shown below:
                         | T0  | T1   | T2     | T3  | T4
Time-Aligned Hypothesis: | *** | HAUL | MOOSE  | FOR | TREES
Time-Aligned Reference:  | CUT | TALL | SPRUCE | *** | TREES
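The string edit comparison behind a text alignment can be sketched as a
standard dynamic program over the two word sequences. The following is a
minimal illustration, assuming a unit cost for every substitution,
insertion, and deletion; it is not the scoring software's implementation,
and the function name edit_distance is hypothetical.

```python
def edit_distance(hyp, ref):
    """Minimum number of word edits that map hyp onto ref (unit costs)."""
    # d[i][j] = cheapest way to align the first i hypothesis words
    # with the first j reference words.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(1, len(hyp) + 1):
        d[i][0] = i  # i hypothesis words with nothing spoken: insertions
    for j in range(1, len(ref) + 1):
        d[0][j] = j  # j spoken words with nothing hypothesized: deletions
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            # Matching words cost 0; differing words cost one substitution.
            substitution = d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            insertion = d[i - 1][j] + 1
            deletion = d[i][j - 1] + 1
            d[i][j] = min(substitution, insertion, deletion)
    return d[len(hyp)][len(ref)]


hyp = "HAUL MOOSE FOR TREES".split()
ref = "CUT TALL SPRUCE TREES".split()
distance = edit_distance(hyp, ref)
```

With unit costs, this word pair can be aligned using three substitutions,
so the sketch returns 3. It reports only the minimum cost, not a
slot-by-slot alignment, and a purely text-based alignment need not
coincide with one derived from word times. Scoring tools may also weight
the three edit types differently.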
In this example, the decoder recognized one word correctly
("TREES"). Note that the above alignment might differ
from what the recognizer actually produced. However,
it is convenient to score this way since it decouples
the scoring software from the recognizer output.
(We understand this sounds silly, but this is one of those
"historical accidents" in speech research.)
To understand how errors are counted and categorized, we must
examine each time period, Ti. At time T0, the word "cut" was
spoken, but the decoder hypothesized no word. This is considered
a deletion error because the decoder removed or deleted a word
that was actually spoken. At time T1, the word "tall" was spoken,
but the decoder hypothesized "haul". This is considered a
substitution error because the decoder substituted the word "haul"
for "tall". A similar error was made at time T2, where "moose" was
hypothesized for "spruce". At time T3, the decoder hypothesized
the word "for" when nothing was actually spoken. This is
considered an insertion error because the decoder inserted a word
into silence, where no word was actually spoken.
In summary, the decoder made two substitution errors, one insertion
error, one deletion error, and hypothesized one word correctly
out of a total of four words actually spoken.
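The tally above can be turned into a word error rate using the usual
definition, WER = (substitutions + deletions + insertions) / number of
reference words. The sketch below is a minimal illustration, assuming the
alignment is supplied as two equal-length lists of slots in which "***"
marks an empty slot; the names GAP, tally_errors, and word_error_rate are
hypothetical and not part of any scoring package.

```python
from collections import Counter

GAP = "***"  # assumed placeholder for "no word" in an aligned slot


def tally_errors(hyp_slots, ref_slots):
    """Classify each aligned slot as correct or as one error type."""
    counts = Counter()
    for hyp, ref in zip(hyp_slots, ref_slots):
        if hyp == ref:
            counts["correct"] += 1
        elif hyp == GAP:
            counts["deletion"] += 1      # spoken, but nothing hypothesized
        elif ref == GAP:
            counts["insertion"] += 1     # hypothesized, but nothing spoken
        else:
            counts["substitution"] += 1  # wrong word hypothesized
    return counts


def word_error_rate(counts):
    """Errors divided by the number of words actually spoken."""
    errors = counts["substitution"] + counts["deletion"] + counts["insertion"]
    # Every reference word is either correct, substituted, or deleted.
    n_ref = counts["correct"] + counts["substitution"] + counts["deletion"]
    return errors / n_ref


hyp_slots = ["***", "HAUL", "MOOSE", "FOR", "TREES"]
ref_slots = ["CUT", "TALL", "SPRUCE", "***", "TREES"]
counts = tally_errors(hyp_slots, ref_slots)
wer = word_error_rate(counts)
```

For this alignment there are four errors against four reference words, so
the WER is 4/4, or 100%. Because insertions count as errors, WER can
exceed 100% when the decoder hypothesizes many extra words.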
For further general discussion of evaluation and scoring, see
evaluation metrics from our on-line speech recognition course
notes. To learn more about how to score using WER, continue to
NIST scoring in the next section.