4.2.2 Network Decoding: Recognition Using Word Models
In this section, we will focus on speech recognition using word models.
Word models are one of many different types of acoustic models that can
be used in our recognition system. We will
use the recognizer to decode a list of test utterances and will briefly
explain the recognition process.
Let's start by decoding a list of test utterances. We'll use utterances from
the TIDIGITS subset introduced in
Section 2. The features for this
subset have already been extracted.
Go to the directory $ISIP_TUTORIAL/sections/s04/s04_02_p02/:
cd $ISIP_TUTORIAL/sections/s04/s04_02_p02/
Then run the following command:
isip_recognize -param params_decode_ihd.sof -list $ISIP_TUTORIAL/databases/lists/identifiers_test.sof -verbose all
Expected Output:
Command: isip_recognize -parameter_file params_decode.sof -list /ftp/pu./projects/speech/software/tutorials/production/ fundamentals/current/example./databases/lists/identifiers_test.sof -verbose all
Version: 1.23 (not released) 2003/05/21 23:10:45
loading audio database: $ISIP_TUTORIA./databases/db/tidigits_audio_db_test.sof
*** no symbol graph database file was specified ***
*** no transcription database file was specified ***
loading front-end: $ISIP_TUTORIAL/recipes/frontend.sof
loading language model: $ISIP_TUTORIAL/models/word_models/compare/lm_word_jsgf_8mix.sof
loading statistical model pool: $ISIP_TUTORIAL/models/word_models/compare/smp_word_8mix.sof
*** no configuration file was specified ***
opening the output file: $ISIP_TUTORIAL/sections/s04/s04_02_p02/results.out
processing file 1 (ah_111a): $ISIP_TUTORIA./databases/sof_8k/test/ah_111a.sof
hyp: ONE ONE ONE
score: -8946.990234375 frames: 138
processing file 2 (ah_1a): $ISIP_TUTORIAL/databases/sof_8k/test/ah_1a.sof
hyp: ONE
score: -5084.52880859375 frames: 79
....
The console output provides brief diagnostic information about
the results, including the hypothesis for the current utterance, the
number of frames processed, and a (log) likelihood score for the
hypothesis. (Technically, this is simply a score presented
on a log scale that reflects the similarity between the utterance
and the best sequence of models that could have produced it.)
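Because the score is a sum over frames, longer utterances naturally receive lower (more negative) totals. The following Python sketch, which is not part of the ISIP toolkit, reads the "hyp:" and "score: ... frames: ..." lines shown above and divides each score by its frame count. The function name per_frame_scores and the assumption that every "hyp:" line is followed by one "score:" line are ours, based only on the excerpt shown.

```python
import re

# Matches lines such as "score: -8946.990234375 frames: 138".
SCORE_LINE = re.compile(r"score:\s*(-?\d+(?:\.\d+)?)\s+frames:\s*(\d+)")


def per_frame_scores(console_text):
    """Return (hypothesis, total_score, frames, score_per_frame) tuples.

    Assumes each "hyp:" line is followed by one "score:" line, as in the
    console excerpt; this is a hypothetical helper, not an ISIP utility.
    """
    results = []
    hyp = None
    for line in console_text.splitlines():
        line = line.strip()
        if line.startswith("hyp:"):
            hyp = line[len("hyp:"):].strip()
            continue
        match = SCORE_LINE.match(line)
        if match and hyp is not None:
            total = float(match.group(1))
            frames = int(match.group(2))
            results.append((hyp, total, frames, total / frames))
            hyp = None
    return results


# Lines copied from the expected output above.
sample = """hyp: ONE ONE ONE
score: -8946.990234375 frames: 138
hyp: ONE
score: -5084.52880859375 frames: 79
"""

for hyp, total, frames, average in per_frame_scores(sample):
    print(f"{hyp}: {average:.2f} per frame over {frames} frames")
```

For the two utterances shown, the per-frame averages come out at roughly -64.8 and -64.4, which are far closer to each other than the raw totals are.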
Now, let's briefly examine the components needed to complete the recognition
process. The command line takes two input files: a parameter file and a list
of audio utterance identifiers.
The parameter file is explained in detail in
Section 4.2.6.
The identifier list is described in detail in
Section 2.4.2.
It simply defines the utterances to be processed by listing their
utterance identifiers.
The parameter file's main purpose is to provide a reference to the
three main components of the recognizer: a front end, an acoustic
model library, and a hierarchy of language models.
You can view the parameter file,
params_decode.sof,
in your browser. It contains the following information:
@ Sof v1.0 @
@ HiddenMarkovModel 0 @
algorithm = "DECODE";
implementation = "VITERBI";
output_mode = "DATABASE";
output_type = "TEXT";
output_file = "$ISIP_TUTORIAL/sections/s04/s04_02_p02/results.out";
frontend = "$ISIP_TUTORIAL/recipes/frontend.sof";
audio_database = "$ISIP_TUTORIAL/databases/db/tidigits_audio_db_test.sof";
language_model= "$ISIP_TUTORIAL/models/word_models/compare/lm_word_jsgf_8mix.sof";
statistical_model_pool = "$ISIP_TUTORIAL/models/word_models/compare/smp_word_8mix.sof";
This is a text Sof file that references the essential files needed
to configure and run the recognizer. The parameters algorithm and
implementation specify the recognition mode (e.g., DECODE) and the type
of search algorithm to be used (e.g., VITERBI). The parameters output_file
and output_type direct the recognizer to store the results
in text format in the file "results.out". The default output
format is binary, which is necessary for large-scale experiments.
However, we use text mode so we can easily view the results.
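To inspect these settings from a script, the name = "value"; lines of the text parameter file can be read with a few lines of Python. This is only a sketch that handles the simple layout shown above; it is not a general Sof parser, and the function name read_parameters is hypothetical.

```python
import re

# One assignment per line, e.g.  algorithm = "DECODE";
ASSIGNMENT = re.compile(r'^\s*(\w+)\s*=\s*"([^"]*)"\s*;\s*$')


def read_parameters(text):
    """Collect name/value pairs from lines of the form  name = "value";

    Header lines such as "@ Sof v1.0 @" do not match and are skipped.
    """
    params = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            params[match.group(1)] = match.group(2)
    return params


# A few lines copied from the parameter file shown above.
sample = '''@ Sof v1.0 @
@ HiddenMarkovModel 0 @
algorithm = "DECODE";
implementation = "VITERBI";
output_type = "TEXT";
language_model= "$ISIP_TUTORIAL/models/word_models/compare/lm_word_jsgf_8mix.sof";
'''

params = read_parameters(sample)
print(params["algorithm"], params["implementation"], params["output_type"])
```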
The parameter frontend specifies the front end used to convert
audio data to features. This process is discussed extensively in
Section 3.
The recognizer needs this input file so that it can check whether
the front end used to generate the acoustic models is compatible
with the front end used to generate the features.
The parameter audio_database specifies the audio database
used to map the identifiers in the input list, identifiers_test.sof,
to the correct audio data. Each identifier corresponds
to a record in the audio database that provides an audio data
file name (in this case, a feature file). A corresponding
entry in the transcription database, which is not used here,
can contain start and stop times in the audio file that
define the utterance to be processed.
This is described in more detail in
Section 2.4.2.
The parameter language_model specifies a hierarchy of language
models that includes a word-level grammar, which controls
what sequences of words are allowed, and a mapping of
words to acoustic models. This component of the system
actually merges acoustic and language modeling into
a hierarchy of finite state machines. Acoustic modeling
is described in more detail in
Section 5;
language modeling is described in more detail in
Section 6.
The final parameter, statistical_model_pool,
describes a set of statistical models, typically
Gaussian mixture models,
which represent the terminal nodes in the hierarchy
of language models and allow feature vectors to be converted
to likelihoods.
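As an illustration of how a Gaussian mixture model converts a feature vector to a likelihood, the Python sketch below evaluates the log-likelihood of one vector under a small mixture. It assumes diagonal covariances, which is a common choice but is not confirmed by this tutorial for the models in the statistical model pool; the function names and the numbers in the usage lines are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log of the weighted sum of component densities.

    Uses the log-sum-exp trick so that very small densities do not
    underflow to zero before the logarithm is taken.
    """
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Hypothetical two-component mixture over 2-dimensional feature vectors.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [1.0, -1.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]
print(gmm_log_likelihood([0.2, -0.1], weights, means, variances))
```

Working in the log domain matches the log-scale scores reported by the recognizer: per-frame log-likelihoods can be added along a path instead of multiplying many small probabilities.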
The acoustic modeling component of a speech recognition system
models the individual sounds in a speech signal. Our recognition
system, in the configuration demonstrated above, is based on
Hidden Markov Models
(HMMs), which include a temporal component that captures variations of the
sound in time. A typical HMM, as shown in the figure to the
right, has two
components: the underlying statistical model at each state
and the transition probabilities, which model the temporal dimension
(variation in time).
The underlying statistical models are contained in the
statistical model pool,
and the topology and transition
probabilities are part of the
language model file.
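To make the interplay of the two HMM components concrete, here is a minimal log-domain Viterbi search in Python. It is a textbook sketch, not the search implemented by isip_recognize: the function name viterbi, the two-state left-to-right topology, and all probabilities in the usage lines are hypothetical.

```python
import math

NEG_INF = float("-inf")  # log of a zero probability


def viterbi(log_init, log_trans, log_emit):
    """Find the best state path through an HMM.

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log-likelihood of frame t under state s
    Returns (best_path, best_log_score).
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s at frame t.
            best_prev = max(
                range(n_states), key=lambda p: score[p] + log_trans[p][s]
            )
            new_score.append(
                score[best_prev] + log_trans[best_prev][s] + log_emit[t][s]
            )
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical two-state left-to-right model and three frames.
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
log_emit = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.2), math.log(0.8)],
    [math.log(0.1), math.log(0.9)],
]
path, log_score = viterbi(log_init, log_trans, log_emit)
print(path, log_score)
```

In a real recognizer the per-frame values in log_emit would come from the statistical models, and the transition structure would come from the topology stored with the language model file.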
For a system that uses word acoustic models, each word is
modeled by an HMM. Word models are popular in small-vocabulary
tasks such as TIDIGITS, where the number of words is small. For
TIDIGITS, there are 11 word models (ONE through NINE, ZERO, and OH),
one for each word in the vocabulary. In addition to these 11
word models, there is a model called silence that represents the
non-speech portions of the utterance. The process of
modeling non-speech acoustics is known as silence modeling.
It is a subject of active research and is especially
challenging when detecting the start and end of utterances in
real-time systems.
In addition to the acoustic information, linguistic information is
extremely important in recognizing natural
speech. The language model component of the speech recognizer
describes the grammar, a set of rules that defines the permissible
structure of a language. The language model is represented in
either a graph or text format. Language models can be broadly
classified into two types: stochastic and non-stochastic. Good
examples of stochastic language models are the N-gram models that are
widely used in state-of-the-art recognizers. These language models
assign probabilities to sequences of words, and the probabilities
are taken into consideration when determining the sequence of words that
was actually spoken.
A loop grammar, popular in
digit recognition tasks, is an example of a non-stochastic
language model, as shown in the figure to the right.
This type of language model defines all permissible
word sequences through a graph.
The network shown to the right is described
in the
language model file.
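A loop grammar can be pictured as a small graph in which every word arc returns to the same state, so any non-empty sequence of vocabulary words is allowed. The Python sketch below builds such a graph for the 11 digit words and checks whether a word sequence is permitted. The state names, the dictionary layout, and the omission of the silence model are simplifying assumptions; this is not the format of the language model file.

```python
DIGIT_WORDS = (
    "ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX",
    "SEVEN", "EIGHT", "NINE", "ZERO", "OH",
)


def build_loop_grammar(words):
    """Build a two-state graph: START -> LOOP on any word, LOOP -> LOOP."""
    arcs = {
        "START": {word: "LOOP" for word in words},
        "LOOP": {word: "LOOP" for word in words},
    }
    return {"start": "START", "final": {"LOOP"}, "arcs": arcs}


def accepts(grammar, words):
    """Return True if the word sequence traces a path to a final state."""
    state = grammar["start"]
    for word in words:
        next_state = grammar["arcs"].get(state, {}).get(word)
        if next_state is None:
            return False  # no arc labeled with this word
        state = next_state
    return state in grammar["final"]


grammar = build_loop_grammar(DIGIT_WORDS)
print(accepts(grammar, "ONE ONE ONE".split()))  # True
print(accepts(grammar, "ONE ELEVEN".split()))   # False: ELEVEN is not in the vocabulary
```

Because no arc carries a probability, the grammar only says which sequences are possible; it does not prefer one permitted sequence over another, which is what makes it non-stochastic.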
The
configuration file
is also an important part of the specification
of the recognition system. Further details of this file,
along with editing and tuning
instructions, are given in
Section 4.2.7.
Any parameter that can be specified in the configuration file
can be overridden from the command line. Further, if no
configuration file is specified, the recognizer defaults to
a widely used set of values for key parameters. The
example parameter file shown above does not name a
configuration file (the console output reports "no configuration
file was specified"), so the recognizer ran with these default values.
Once the recognizer has produced its results, a
scoring
report can be generated.
Scoring is the process of comparing the results from the recognizer to
a true transcription of the utterances. The scoring report contains
useful information and statistics about recognition performance. Scoring is explained
in detail in
Section 4.3.
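As a rough idea of what comparing a hypothesis to a true transcription involves, the Python sketch below counts word-level substitutions, deletions, and insertions using a standard edit-distance computation and reports them as a word error rate. The scoring tools covered in Section 4.3 may compute and report their statistics differently; the function names here are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, deletions, and insertions
    needed to turn the hypothesis into the reference (edit distance)."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word errors divided by the number of reference words."""
    n_reference = len(reference.split())
    if n_reference == 0:
        raise ValueError("reference transcription is empty")
    return word_errors(reference, hypothesis) / n_reference


# Hypothetical case: the recognizer dropped one of three words.
print(word_error_rate("ONE ONE ONE", "ONE ONE"))
```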