-
Lexicon:
We need to define a pronunciation lexicon (dictionary)
for all words in the application. The pronunciations are
typically defined in terms of phones or syllables. These
are the units that we are going to try to model through
the training process. Multiple pronunciations are
allowed for words. However, multiple pronunciations for
the same word must be defined in succession while using
the ISIP recognizer.
When we plan to use context-dependent models, each word
needs one pronunciation that ends in the short pause
phone "sp". This is required to allow cross-word context
expansion, since short-pause is the only model that can
be skipped in a pronunciation.
The ISIP recognizer also requires a preset order for some
of the words that have no linguistic meaning but serve as
placeholders in utterances. "!SENT_END" and "!SENT_START"
must be defined, in that order, at the beginning of the
lexicon. Typically both of them are modeled by the "sil"
model.
For our example application, alphadigits, here is the
phone lexicon.
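The lexicon constraints described above can be checked mechanically before training. The following is a minimal sketch, not an ISIP utility. It assumes a hypothetical lexicon layout with one pronunciation per line (the word first, then its phones, separated by whitespace), and it assumes that the "!SENT_" placeholders are exempt from the "sp" requirement; the function name check_lexicon is made up for this example.

```python
def check_lexicon(lines):
    """Return a list of problems found in a lexicon given as text lines.

    Assumed (hypothetical) format: one pronunciation per line, the word
    first, then its phones, all separated by whitespace.
    """
    entries = [ln.split() for ln in lines if ln.strip()]
    words = [e[0] for e in entries]
    problems = []

    # The placeholder words must open the lexicon in this order.
    if words[:2] != ["!SENT_END", "!SENT_START"]:
        problems.append("lexicon must start with !SENT_END, !SENT_START")

    # Alternative pronunciations of one word must be on consecutive lines.
    finished = set()
    previous = None
    for word in words:
        if word != previous:
            if word in finished:
                problems.append(f"pronunciations of {word} are not consecutive")
            if previous is not None:
                finished.add(previous)
            previous = word

    # For context-dependent training, each word needs an "sp"-final variant.
    # Assumption: the !SENT_ placeholders are exempt from this rule.
    has_sp = {}
    for word, *phones in entries:
        ends_sp = bool(phones) and phones[-1] == "sp"
        has_sp[word] = has_sp.get(word, False) or ends_sp
    for word, ok in has_sp.items():
        if not ok and not word.startswith("!SENT_"):
            problems.append(f"{word} has no pronunciation ending in sp")
    return problems
```

Calling check_lexicon(open("lexicon.text")) on a lexicon file (the file name here is only a placeholder) returns an empty list when all three conditions hold.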
-
Model Structure:
Since we are training an HMM based phone system to
recognize alphadigits, we need to define the structure
of the models we want to train. This involves deciding
the number of states in each model. Typically in a
triphone system, each phone has 5 states (3 true states,
a dummy start state and a dummy stop state), but the
number of states is user-defined.
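To make the 5-state layout concrete, here is a minimal sketch of a left-to-right topology with a dummy start state, three true states and a dummy stop state. It is a generic illustration and not the ISIP model file format; the function name, the absence of skip transitions and the initial self-loop probability of 0.5 are all assumptions made for this example.

```python
def left_to_right_topology(num_states, self_loop=0.5):
    """Build a transition matrix for a simple left-to-right HMM.

    State 0 is the dummy start state and state num_states - 1 is the dummy
    stop state; the states in between are the true (emitting) states.
    The probabilities are arbitrary starting values, not trained ones.
    """
    if num_states < 3:
        raise ValueError("need a start state, a stop state and one true state")
    trans = [[0.0] * num_states for _ in range(num_states)]
    trans[0][1] = 1.0  # the start state only hands over to the first true state
    for s in range(1, num_states - 1):
        trans[s][s] = self_loop            # stay in the same state
        trans[s][s + 1] = 1.0 - self_loop  # or advance to the next one
    return trans  # the stop state's row stays all zero: nothing leaves it


topology = left_to_right_topology(5)  # 3 true states plus 2 dummy states
```

Training re-estimates the transition values; only the structure (which transitions exist) is fixed by a choice like this one.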
The states file defines the number of states that each
model begins with. Note that the "sp" model does not have
any states because "sp" is defined and trained in a later
stage after a basic silence model has been trained. It is
only a place holder for now.
Once this file is defined, we can use the create_models
utility to create the initial model definition file.

    create_models -states fs_num_states.list -output fs_models.text

We now have to generate a phones list that is redundant
in the current stage but will be useful for context
dependent model processing. This is done automatically
by using the create_triphone_map utility.

    create_triphone_map -mono fs_ci_models.text -clist fs_ci_models.text -context ci -models fs_models.text -output fs_phones.text

Note the order of the phones defined in the fs_ci_models.text
file. The ISIP recognizer requires that sp and sil be defined
as the first entries of the monophone file and in that order.
Note that "ci" was used as the context here because we are training
context-independent (ci) models.
Another file that needs to be defined up-front is the list of phones
that should not have context while training the context-dependent
models. In this tutorial these are the silence ("sil") and short-pause
("sp"). In the parameter file for hmm_train this is called
special_models.list.
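The role of the special models can be illustrated with a small sketch of context expansion. It is not the ISIP implementation: the "left-centre+right" naming convention and the function name are assumptions, and the sketch makes the simplifying assumption that special phones also act as context boundaries for their neighbours, which may differ from how the recognizer performs cross-word expansion.

```python
def expand_context(phones, special=("sil", "sp")):
    """Turn a phone sequence into context-dependent model names.

    Uses the "left-centre+right" naming convention as an assumption.
    Phones listed in `special` stay context-independent and, in this
    simplified sketch, do not provide context to their neighbours.
    """
    names = []
    for i, phone in enumerate(phones):
        if phone in special:
            names.append(phone)  # special models never receive context
            continue
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i + 1 < len(phones) else None
        if left in special:
            left = None
        if right in special:
            right = None
        name = phone
        if left:
            name = f"{left}-{name}"
        if right:
            name = f"{name}+{right}"
        names.append(name)
    return names


# Illustrative phone sequence; the phone symbols are placeholders.
models = expand_context(["sil", "w", "ah", "n", "sp", "t", "uw", "sil"])
```

In this example "sil" and "sp" keep their plain names while every other phone is renamed according to its neighbours.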
-
Label Files:
Word and model label files for each speech segment are required
for training an HMM system. The initial stages use model labels
for forced alignment during training and the later stages may
use word labels for alignment if the user so desires.
Here are a sample model label file and a sample word label file.
Initial and ending "sil" need not be added to the model
sequence explicitly.
-
Grammar:
The language model (LM) plays a vital role in improving the
performance of any speech recognition system. The complexity of
the language model is generally task-dependent. For small
tasks, like alphadigits, the language model can be specified in
the form of a regular expression grammar. For complex tasks,
statistical LMs, like bigrams and trigrams, are used.
The alphadigit grammar is conceptually very simple - any word can
follow any word. The ISIP speech recognition system expects
the LM to be specified either as an N-Gram or a word graph.
grammar_compiler is the utility we use to convert regular
expression grammars to word graph/lattice format. Here is the
grammar we use to define the alphadigit task.
The word graph is generated using the following command:

    grammar_compiler -input alphadigit_gram.text -output alphadigit_gram.lat

The above output file is the word graph used during recognition.
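The "any word can follow any word" structure can be pictured as a word loop. The sketch below enumerates the arcs of such a graph; the node names "START" and "END", the function name and the tuple representation are illustrative assumptions and do not reflect the lattice format that grammar_compiler writes.

```python
def word_loop_arcs(words):
    """Return the arcs of a graph in which any word can follow any word.

    Nodes are "START", "END" and one node per word; this node naming is
    illustrative only and is not the ISIP lattice format.
    """
    arcs = []
    for w in words:
        arcs.append(("START", w))   # an utterance may begin with any word
        arcs.append((w, "END"))     # and may finish after any word
        for nxt in words:
            arcs.append((w, nxt))   # any word (itself included) may follow
    return arcs


arcs = word_loop_arcs(["a", "b", "one"])
```

For a vocabulary of N words this yields N*N word-to-word arcs plus 2*N arcs to and from the boundary nodes, which is why such a grammar stays manageable only for small tasks like alphadigits.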