  • Lexicon:

    We need to define a pronunciation lexicon (dictionary) that covers every word in the application. Pronunciations are typically defined in terms of phones or syllables, which are the units we model during training. A word may have multiple pronunciations, but the ISIP recognizer requires that all pronunciations of the same word be listed consecutively.

    If context-dependent models will be used, each word needs one pronunciation that ends in the short-pause phone "sp". This is required for cross-word context expansion, since the short pause is the only model that can be skipped in a pronunciation.

    The ISIP recognizer also requires a fixed order for certain words that have no linguistic meaning but serve as placeholders in utterances. "!SENT_END" and "!SENT_START" must be defined, in that order, at the beginning of the lexicon. Both are typically modeled by the "sil" model.

    For our example application, alphadigits, here is the phone lexicon.
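
    The lexicon constraints described above can be checked mechanically before training. The Python sketch below is only an illustration. It assumes a hypothetical layout with one pronunciation per line (a word followed by its phones, separated by whitespace), which may differ from the actual ISIP lexicon format. The function name check_lexicon, the exempt parameter, and the phone symbols in the usage example are all made up for this sketch, and exempting the placeholder words from the "sp" check is an assumption.

```python
def check_lexicon(lines, exempt=("!SENT_START", "!SENT_END")):
    """Return a list of problems found in a lexicon given as text lines.

    Assumed (hypothetical) line format: WORD PHONE PHONE ...
    Words listed in `exempt` are not required to have an "sp" variant
    (this exemption is an assumption, not an ISIP rule).
    """
    problems = []
    order = []        # words in order of first appearance
    ends_in_sp = {}   # word -> True once a pronunciation ending in "sp" is seen
    previous = None
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, phones = fields[0], fields[1:]
        # A word that reappears after a different word breaks the
        # "pronunciations must be consecutive" rule.
        if word != previous and word in ends_in_sp:
            problems.append(
                f"line {number}: pronunciations of {word} are not consecutive"
            )
        if word not in ends_in_sp:
            order.append(word)
            ends_in_sp[word] = False
        if phones and phones[-1] == "sp":
            ends_in_sp[word] = True
        previous = word
    for word in order:
        if word not in exempt and not ends_in_sp[word]:
            problems.append(f"{word}: no pronunciation ends in sp")
    return problems


# Usage with made-up entries (not the real alphadigits lexicon).
example = [
    "!SENT_END sil",
    "!SENT_START sil",
    "a ey",
    "a ey sp",
    "b b iy sp",
    "a ax sp",   # "a" reappears after "b"
    "c s iy",    # "c" has no variant ending in "sp"
]
for problem in check_lexicon(example):
    print(problem)
```

    A check like this is cheap to run each time the lexicon is edited, which helps catch ordering mistakes before they surface during training.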

  • Model Structure:

    Since we are training an HMM-based phone system to recognize alphadigits, we need to define the structure of the models we want to train, which means deciding the number of states in each model. In a triphone system each phone typically has 5 states (3 true states plus a dummy start state and a dummy stop state), but the number of states is user-defined.

    The states file defines the number of states that each model begins with. Note that the "sp" model does not have any states yet, because "sp" is defined and trained in a later stage, after a basic silence model has been trained. For now it is only a placeholder.

    Once this file is defined, we can use the create_models utility to create the initial model definition file.

    create_models -states fs_num_states.list -output fs_models.text

    Next we generate a phones list. It is redundant at the current stage but will be needed for context-dependent model processing. The create_triphone_map utility generates it automatically.

    create_triphone_map -mono fs_ci_models.text -clist fs_ci_models.text -context ci -models fs_models.text -output fs_phones.text

    Note the order of the phones defined in the fs_ci_models.text file. The ISIP recognizer requires that sp and sil be defined as the first entries of the monophone file and in that order. Note that "ci" was used as the context here because we are training context-independent (ci) models.

    Another file that needs to be defined up front is the list of phones that should not be given context when training the context-dependent models. In this tutorial these are silence ("sil") and short pause ("sp"). In the parameter file for hmm_train this file is called special_models.list.
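
    The ordering requirement on the monophone file is easy to get wrong when the list is assembled by hand. The following Python sketch shows one way to verify or enforce it. It assumes the phones have already been read into a list of strings, one phone per entry; the function names and the sample phones are hypothetical and are not part of the ISIP tools.

```python
def has_required_order(phones):
    """True if the list starts with "sp" and then "sil", as the recognizer expects."""
    return phones[:2] == ["sp", "sil"]


def reorder_monophones(phones):
    """Return a copy with "sp" and "sil" moved to the front, in that order.

    The remaining phones keep their original relative order. If "sp" or
    "sil" is missing from the input, it is added.
    """
    rest = [p for p in phones if p not in ("sp", "sil")]
    return ["sp", "sil"] + rest


# Usage with a made-up phone list.
phones = ["ey", "sil", "b", "sp", "iy"]
print(has_required_order(phones))
fixed = reorder_monophones(phones)
print(fixed, has_required_order(fixed))
```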

  • Label Files:

    Word and model label files for each speech segment are required for training an HMM system. The initial stages use model labels for forced alignment during training and the later stages may use word labels for alignment if the user so desires.

    Here are a sample model label file and a sample word label file. Initial and ending "sil" need not be added to the model sequence explicitly.
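
    To make the relationship between the two kinds of labels concrete, the Python sketch below expands a word sequence into a model (phone) sequence by looking each word up in a lexicon. This is an illustration under stated assumptions, not the way the ISIP tools produce label files: the in-memory lexicon layout, the choice of always taking the first listed pronunciation, the function name, and the phone symbols are all hypothetical. In line with the note above, no initial or final "sil" is added.

```python
def words_to_models(words, lexicon):
    """Expand a list of words into a flat list of model (phone) labels.

    `lexicon` maps each word to a list of pronunciations, and each
    pronunciation is a list of phone names. The first pronunciation is
    used for every word (an assumption made for this sketch).
    """
    models = []
    for word in words:
        if word not in lexicon:
            raise KeyError(f"word not in lexicon: {word}")
        models.extend(lexicon[word][0])
    return models


# Usage with a made-up two-word lexicon.
lexicon = {
    "a": [["ey"], ["ey", "sp"]],
    "b": [["b", "iy"], ["b", "iy", "sp"]],
}
print(words_to_models(["b", "a"], lexicon))
```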

  • Grammar:

    The language model (LM) plays a vital role in improving the performance of any speech recognition system. The complexity of the language model is generally task-dependent. For small tasks, like alphadigits, the language model can be specified in the form of a regular expression grammar. For complex tasks, statistical LMs, like bigrams and trigrams, are used.

    The alphadigit grammar is conceptually very simple - any word can follow any word. The ISIP speech recognition system expects the LM to be specified either as an N-Gram or a word graph. grammar_compiler is the utility we use to convert regular expression grammars to word graph/lattice format.

    Here is the grammar we use to define the alphadigit task.

    The word graph is generated using the following command:

    grammar_compiler -input alphadigit_gram.text -output alphadigit_gram.lat

    The above output file is the word graph used during recognition.
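
    The statement that any word can follow any word corresponds to a simple loop over the vocabulary. The Python sketch below expresses that idea as an ordinary regular expression over space-separated words and tests a few word sequences against it. It uses Python's re syntax, not the input syntax of grammar_compiler, and the short word list is an illustrative subset rather than the full alphadigit vocabulary. The function names are hypothetical.

```python
import re


def build_loop_pattern(words):
    """Compile a pattern accepting one or more vocabulary words separated by single spaces."""
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(rf"(?:{alternatives})(?: (?:{alternatives}))*")


def is_accepted(pattern, utterance):
    """True if the whole utterance is a sequence of vocabulary words."""
    return pattern.fullmatch(utterance) is not None


# Usage with an illustrative subset of the vocabulary.
pattern = build_loop_pattern(["a", "b", "one", "two"])
print(is_accepted(pattern, "a one b"))      # every word is in the vocabulary
print(is_accepted(pattern, "a hello two"))  # "hello" is not in the vocabulary
```

    Because every word may be followed by every other word, a grammar of this kind constrains only the vocabulary and not the word order.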

