4.4.1 Forced Alignment: Overview
As we have seen, a speech recognition system uses a search engine
together with an acoustic model and a language model. These models
define the set of possible words, phonemes, or other units against
which the speech data is matched to find the correct spoken utterance.
The search engine processes the features extracted
from the speech data to identify occurrences of the words, phonemes,
or other units it is equipped to search for, and returns the
results.
Forced alignment is similar to this process, but
differs in one major respect. Rather than being given a set
of possible words to search for, the search engine is given an exact
transcription of what is spoken in the speech data. The system then
aligns the transcription with the speech data, identifying which
time segments in the speech data correspond to particular words in the
transcription.
Forced alignment can also be used to align the phonemes of the
transcription with the speech data, similar to the image
below, although with more explicitly defined boundaries marking where
each phoneme begins and ends.
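
The idea can be illustrated with a minimal sketch. It is not the
implementation used by the system described here. It assumes that some
acoustic model has already produced a log-likelihood for every frame
under every unit (word or phoneme) of the transcription. The function
name force_align, the parameter names, and the scores in the example
are all hypothetical. Because the transcription fixes the order of the
units, the search only has to decide where each boundary falls, which
a small Viterbi-style dynamic program can do:

```python
def force_align(labels, log_likes):
    """Align transcript units to frames (illustrative sketch only).

    labels:    transcript units in spoken order (words or phonemes).
    log_likes: log_likes[t][n] is an assumed log-likelihood of frame t
               under unit n, as an acoustic model might supply.
    Returns a list of (label, first_frame, last_frame), inclusive.
    """
    num_frames, num_units = len(log_likes), len(labels)
    if any(len(row) != num_units for row in log_likes):
        raise ValueError("each frame needs one score per transcript unit")
    if num_frames < num_units:
        raise ValueError("need at least one frame per transcript unit")

    neg_inf = float("-inf")
    # score[t][n]: best total score with frame t assigned to unit n.
    score = [[neg_inf] * num_units for _ in range(num_frames)]
    # advanced[t][n]: True if unit n starts at frame t on the best path.
    advanced = [[False] * num_units for _ in range(num_frames)]
    score[0][0] = log_likes[0][0]  # the path must start in the first unit

    for t in range(1, num_frames):
        for n in range(num_units):
            stay = score[t - 1][n]
            advance = score[t - 1][n - 1] if n > 0 else neg_inf
            best = stay
            if advance > stay:
                best = advance
                advanced[t][n] = True
            if best > neg_inf:
                score[t][n] = best + log_likes[t][n]

    # Trace back from the last unit at the last frame to find boundaries.
    segments = []
    n, last = num_units - 1, num_frames - 1
    for t in range(num_frames - 1, 0, -1):
        if advanced[t][n]:
            segments.append((labels[n], t, last))
            last = t - 1
            n -= 1
    segments.append((labels[n], 0, last))
    segments.reverse()
    return segments


# Made-up scores: the first three frames favour "hello", the rest "world".
example_scores = [
    [-1.0, -5.0],
    [-1.0, -5.0],
    [-1.0, -4.0],
    [-5.0, -1.0],
    [-5.0, -1.0],
]
segments = force_align(["hello", "world"], example_scores)
```

Here segments holds one (label, first_frame, last_frame) entry per
unit. Multiplying the frame indices by the frame duration of the
feature extractor converts them into time boundaries.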