Guest Column
January 2006

Wesley Holland

XML Language Models

Language models are used extensively in speech recognition to provide a grammar for accepted utterances. Several industry-standard grammar specifications exist, such as the JSpeech Grammar Format (JSGF) and the XML Speech Recognition Grammar Specification (XML-SRGS, referred to here simply as XML). While these standards allow for the specification of context-free grammars, most language models have a regular grammar equivalent and can therefore be modeled as finite state machines. As finite state machines are considerably easier to process, the ISIP internal grammar format consists of a set of hierarchical or nested finite state machines collectively known as an ISIP Hierarchical Digraph (IHD).

Although the ISIP software natively processes only IHD format language models, it offers transparent conversion from JSGF and XML to IHD, and vice versa, through the isip_network_converter and isip_network_builder utilities. Consequently, one may use XML language models with ISIP software without knowing the details of IHD. The construction and use of such XML language models are the focus of this tutorial. This tutorial assumes a basic understanding of the ISIP environment and of speech recognition principles.

Language Model Creation

If a new XML language model is to be created, the simplest choice is to use the ISIP utility isip_network_builder. This Java application makes the tedious task of language model construction as simple as drawing a directed graph. The directed graph can then be saved as an IHD, JSGF, or XML language model. This method of construction is detailed in the IES Fundamentals of Speech Recognition tutorial and is not repeated here.

A more difficult task is manually constructing an XML language model in plain text. This is necessary when adapting an existing XML format grammar into an XML language model for use with ISIP software. In the ISIP environment, language models are stored in SOF files. To prepare an existing XML language model for use with ISIP software, the XML grammar must be encapsulated within an SOF file. An example XML language model is provided here to illustrate this encapsulation and to serve as a framework for creating XML language models. This example model is referenced throughout the rest of this section.

Upon examination, this language model can be divided into a header (a term loosely referring to the SOF header, an algorithm, and an implementation) and two levels. The levels themselves can be further divided into tags containing information about those levels. The information in each tag is encapsulated in a string containing an XML format grammar. While this may seem like a cumbersome way to specify a level name or a set of exclude symbols, it eases integration with web applications. Listed below are the allowable tags for each level and their descriptions.

  1. search_tag_(level) - This tag specifies the name of a given level through a single item contained in the root rule. This value is frequently "word", "phone", or "state". In the example, level0 is the word level and level1 is the state level.

  2. grammars_(level) - This tag contains a level's grammar(s) and is frequently the longest of a level's tags. In the example, the word level contains a single grammar that defines a sentence consisting of !SENT_DELIM, ONE, and SILENCE followed by TWO, THREE, or !DUMMY and concluding with another !SENT_DELIM. The state level grammars contain the states that serve as the interface between the language model and the statistical model pool.
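The word-level grammar described above can be pictured as a small directed graph. The Python sketch below is only an illustration of that idea: it assumes the sentence is read as !SENT_DELIM, ONE, SILENCE, then one of TWO, THREE, or !DUMMY, then !SENT_DELIM, and the node names, the WORD_GRAPH table, and both helper functions are hypothetical rather than part of the ISIP software.

```python
# Hypothetical sketch: the example word-level grammar as a directed graph.
# Node names and arc layout are assumptions, not output of any ISIP utility.

# Each node maps to (symbol emitted at the node, list of successor nodes).
WORD_GRAPH = {
    "start": ("!SENT_DELIM", ["one"]),
    "one": ("ONE", ["sil"]),
    "sil": ("SILENCE", ["two", "three", "dummy"]),
    "two": ("TWO", ["end"]),
    "three": ("THREE", ["end"]),
    "dummy": ("!DUMMY", ["end"]),
    "end": ("!SENT_DELIM", []),
}


def accepts(symbols, graph=WORD_GRAPH, start="start", final="end"):
    """Return True if the symbols trace a path from start to final."""
    candidates = {start}  # nodes that may emit the next symbol
    last_matched = set()  # nodes that emitted the most recent symbol
    for sym in symbols:
        matched = {n for n in candidates if graph[n][0] == sym}
        if not matched:
            return False
        last_matched = matched
        candidates = {nxt for n in matched for nxt in graph[n][1]}
    return final in last_matched


def sentences(graph=WORD_GRAPH, node="start", final="end"):
    """Yield every symbol sequence in the graph (assumes no cycles)."""
    symbol, successors = graph[node]
    if node == final:
        yield [symbol]
        return
    for nxt in successors:
        for tail in sentences(graph, nxt, final):
            yield [symbol] + tail


if __name__ == "__main__":
    for sentence in sentences():
        print(" ".join(sentence))
```

Under the assumed reading, the graph admits exactly three sentences, one for each of TWO, THREE, and !DUMMY in the fourth position.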

The two preceding tags are the only tags necessary for minimal SOF encapsulation. If, however, recognition is to be performed, the following tags allow for specification of additional parameters that will aid in training and recognition.

  1. search_non_speech_boundary_symbols_(level) - This tag specifies the non-speech boundary symbols through a list of items contained in the root rule. In the example, the only boundary symbol is the sentence delimiter !SENT_DELIM.

  2. search_non_speech_internal_symbols_(level) - This tag specifies the non-speech internal symbols through a list of items contained in the root rule.

  3. search_dummy_symbols_(level) - This tag specifies the dummy symbols through a list of items contained in the root rule. In the example, this tag contains the !DUMMY symbol. Designating a symbol as a dummy symbol allows the converter to equate that symbol with a NULL ruleref.

  4. search_exclude_symbols_(level) - This tag specifies the exclude symbols through a list of items contained in the root rule. In the example, this tag contains the !DUMMY, !SENT_DELIM, and SILENCE symbols as these symbols will not be counted in hypothesis scoring.

  5. search_spenalty_exclude_symbols_(level) - This tag specifies the spenalty exclude symbols through a list of items contained in the root rule.

  6. search_context_less_symbols_(level) - This tag specifies the context-less symbols through a list of items contained in the root rule. Unless specified as a context-less symbol, a given identifier will be assumed to reference a lower level grammar or a state in the statistical model pool.

  7. search_skip_symbols_(level) - This tag specifies the skip symbols through a list of items contained in the root rule.

  8. search_non_adaptation_symbols_(level) - This tag specifies the non-adaptation symbols through a list of items contained in the root rule.
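Each of the symbol tags above holds an XML grammar whose root rule carries a list of items. As a rough illustration, the Python sketch below parses a generic XML-SRGS grammar string and returns the items of its root rule. The grammar text, the rule id ROOT, and the function name are assumptions made for this example; the exact string that ISIP stores inside the SOF file may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C Speech Recognition Grammar Specification.
SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Hypothetical grammar string for an exclude-symbols tag. The symbols come
# from the example in the text; the layout and rule id are assumptions.
EXCLUDE_GRAMMAR = """
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en" root="ROOT">
  <rule id="ROOT">
    <item>!DUMMY</item>
    <item>!SENT_DELIM</item>
    <item>SILENCE</item>
  </rule>
</grammar>
""".strip()


def root_rule_items(grammar_text):
    """Return the text of each <item> directly inside the root rule."""
    grammar = ET.fromstring(grammar_text)
    root_id = grammar.get("root")
    for rule in grammar.findall(f"{{{SRGS_NS}}}rule"):
        if rule.get("id") == root_id:
            items = rule.findall(f"{{{SRGS_NS}}}item")
            return [(item.text or "").strip() for item in items]
    raise ValueError("grammar does not define its root rule")


if __name__ == "__main__":
    print(root_rule_items(EXCLUDE_GRAMMAR))
```

The same helper would apply to a level-name tag, where the root rule is expected to hold a single item such as "word" or "state".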

Once a language model is constructed in this fashion, it is a good idea to verify its IHD representation, as it is in this format that recognition is performed behind the scenes. Such verification can be accomplished by opening the language model in isip_network_builder and examining the flow of each directed graph.

Training and Recognition

Once an XML language model is constructed, a corresponding statistical model pool must be generated before training and recognition may be performed. Said statistical model pool can be generated automatically through the "Save All" function of isip_network_builder.

It is interesting to note that the above-mentioned statistical model pool is independent of the format of the language model with which it is used. The statistical model pool is concerned only with the physical characteristics of states. Therefore, it is possible to run a third of a network's training in IHD, a third in XML, and a third in JSGF, all the while using the same statistical model pool.

As to the actual process of recognition, isip_recognize, by default, will match its output language model format to the input language model format. This eliminates the need for manual conversions between training iterations and before decoding. Thus, training and recognition are performed exactly as outlined in the IES Fundamentals of Speech Recognition tutorial for IHD language models.

Results

In an effort to show that training and recognition provide identical results regardless of language model format, a series of tests was run. In each test, a language model was trained on a 941-file subset of the TIDigits speech database and recognition was performed on a 336-file subset. Below are the results. The left-most column indicates the format, and any conversion history, of the language model (e.g., IHD->XML->IHD indicates an IHD language model that underwent conversion to XML and conversion back to IHD). The middle column indicates the word-error rate (WER) obtained after 8-mix training for word models. The right-most column indicates the WER obtained after 8-mix training for monophone models.

Language Model     Word Model WER    Monophone Model WER
IHD                1.9%              0.4%
JSGF               1.9%              0.4%
XML                1.9%              0.5%
IHD->JSGF          1.9%              0.4%
IHD->JSGF->IHD     1.9%              0.4%
IHD->JSGF->XML     1.9%              0.4%
IHD->XML           1.9%              0.5%
IHD->XML->IHD      1.9%              0.4%
IHD->XML->JSGF     1.9%              0.5%

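As is evident, the three grammar formats produce equivalent results: the word model WER is identical in every case, and the monophone model WER differs by at most 0.1% absolute.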
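For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between the reference and the hypothesis divided by the number of reference words. The Python sketch below follows that conventional definition and drops exclude symbols before scoring. It is an assumption-laden illustration, not the scoring code used to produce the figures above; the function name and the example transcriptions are hypothetical.

```python
def word_error_rate(reference, hypothesis, exclude=()):
    """Conventional WER: word-level edit distance / reference length.

    Symbols listed in `exclude` are removed from both sequences first,
    mirroring the idea that exclude symbols are not counted in scoring.
    """
    ref = [w for w in reference if w not in exclude]
    hyp = [w for w in hypothesis if w not in exclude]
    if not ref:
        raise ValueError("reference is empty after exclusion")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcriptions built from the example symbols.
    reference = ["!SENT_DELIM", "ONE", "SILENCE", "TWO", "!SENT_DELIM"]
    hypothesis = ["!SENT_DELIM", "ONE", "SILENCE", "THREE", "!SENT_DELIM"]
    exclude = {"!DUMMY", "!SENT_DELIM", "SILENCE"}
    print(word_error_rate(reference, hypothesis, exclude))
```

Excluding the delimiter and silence symbols shrinks the reference length, so the same single substitution yields a higher rate than it would if every symbol were counted.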


Please direct questions or comments to Isip_help@ece.msstate.edu

Mississippi State University