Guest Column
January 2006
Wesley Holland
XML Language Models
Language models are used extensively in speech recognition to
provide a grammar for accepted utterances. Several industry standard
grammar specifications such as the JSpeech Grammar Format (JSGF) and
the XML Speech Recognition Grammar Specification (XML-SRGS or XML)
exist. While these standards allow for the specification of
context-free grammars, most language models have a regular grammar
equivalent and can therefore be modeled as finite state machines. As
finite state machines are considerably easier to process, the ISIP
internal grammar format consists of a set of hierarchical or nested
finite state machines collectively known as an ISIP Hierarchical
Digraph (IHD).
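To make the idea of nested finite state machines concrete, the following minimal sketch models a two-level hierarchy in Python. It is not the IHD file format or any part of the ISIP software: the class name Digraph, the node and state names, and the expansion scheme are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Digraph:
    """A finite state machine stored as a directed graph with labeled arcs."""
    start: str
    final: str
    # arcs[node] is a list of (symbol, next_node) pairs leaving that node
    arcs: dict = field(default_factory=dict)

    def accepts(self, symbols):
        """Return True if the symbols label a path from start to final."""
        current = {self.start}
        for symbol in symbols:
            current = {nxt
                       for node in current
                       for label, nxt in self.arcs.get(node, [])
                       if label == symbol}
            if not current:
                return False
        return self.final in current

    def paths(self, node=None, prefix=()):
        """Yield every symbol sequence from start to final (acyclic graphs only)."""
        node = self.start if node is None else node
        if node == self.final:
            yield list(prefix)
        for label, nxt in self.arcs.get(node, []):
            yield from self.paths(nxt, prefix + (label,))


# Upper level: a tiny word graph accepting "ONE TWO" or "ONE THREE".
word_level = Digraph("w0", "w2", {
    "w0": [("ONE", "w1")],
    "w1": [("TWO", "w2"), ("THREE", "w2")],
})

# Lower level: each word names its own graph of (hypothetical) states.
state_level = {
    "ONE": Digraph("s0", "s2", {"s0": [("S1", "s1")], "s1": [("S2", "s2")]}),
    "TWO": Digraph("s0", "s1", {"s0": [("S3", "s1")]}),
    "THREE": Digraph("s0", "s1", {"s0": [("S4", "s1")]}),
}


def expand(words):
    """Replace each word with the state sequence of its lower-level graph."""
    states = []
    for word in words:
        states.extend(next(state_level[word].paths()))
    return states


print(word_level.accepts(["ONE", "TWO"]))  # checks for a matching path
print(expand(["ONE", "THREE"]))            # word sequence rewritten as states
```

Because every level is an ordinary directed graph, acceptance reduces to following arcs, which is the property that makes finite state machines easier to process than general context-free grammars.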
Although the ISIP software natively processes only IHD format
language models, it offers transparent conversion from JSGF and
XML to IHD and vice versa through the isip_network_converter
and isip_network_builder utilities. Consequently, one may use XML
language models with ISIP software without knowing the details of IHD.
The construction and use of such XML language models are the focus of
this tutorial, which assumes a basic understanding of the
ISIP environment and of speech recognition principles.
Language Model Creation
If a new XML language model is to be created, the simplest
choice is to use the ISIP utility isip_network_builder. This Java
application makes the tedious task of language model construction as
simple as drawing a directed graph. The directed graph can then be
saved as an IHD, JSGF, or XML language model. This method of
construction is detailed in the IES Fundamentals of Speech Recognition
tutorial and requires no further explanation here.
A more difficult task is manually constructing an XML language
model in plain text. This is necessary when adapting an existing XML
format grammar into an XML language model for use with ISIP software.
In the ISIP environment, language models are stored in SOF
files. To prepare an existing XML language model for use with ISIP
software, the XML grammar must be encapsulated within an SOF
file. An example XML language model is provided here to aid in this task and to
provide a framework for the creation of XML language models. This example
model will be referenced throughout the rest of this section.
Upon examination, this language model can be divided into a
header (a term loosely referring to the SOF header, an algorithm, and
an implementation) and two levels. The levels themselves can be
further divided into tags containing information about those levels.
The information in each tag is encapsulated in a string containing an
XML format grammar. While this may seem like a cumbersome method for
specifying a level name or exclude symbols, it eases integration
with web applications. Listed below are the allowable tags
for each level and their descriptions.
search_tag_(level) - This tag specifies the name
of a given level through a single item contained in the root rule.
This value is frequently "word", "phone", or "state". In the example,
level0 is the word level and level1 is the state level.
grammars_(level) - This tag contains a level's
grammar(s) and is frequently the longest of a level's tags. In the
example, the word level contains a single grammar that defines a
sentence consisting of !SENT_DELIM, ONE, and SILENCE followed by TWO,
THREE, or !DUMMY and concluding with another !SENT_DELIM. The state
level grammars contain the states that serve as the interface between
the language model and the statistical model pool.
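As an illustration of the word-level grammar just described, the sketch below writes one possible reading of that sentence structure as a W3C SRGS XML grammar and expands it with Python's standard xml.etree.ElementTree module. The rule id "sentence" and the exact XML are assumptions made for this sketch, not a copy of the example file, and the expansion handles only item and one-of elements. It is not how the ISIP converter works internally.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/2001/06/grammar}"

# Hypothetical SRGS grammar: one reading of the word-level sentence structure.
WORD_GRAMMAR = """
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" root="sentence">
  <rule id="sentence">
    <item>!SENT_DELIM</item>
    <item>ONE</item>
    <item>SILENCE</item>
    <one-of>
      <item>TWO</item>
      <item>THREE</item>
      <item>!DUMMY</item>
    </one-of>
    <item>!SENT_DELIM</item>
  </rule>
</grammar>
"""


def expand(element):
    """Return every token sequence that a rule, item, or one-of can produce."""
    if element.tag == NS + "one-of":
        # Alternatives: collect the sequences of each child item.
        return [seq for child in element for seq in expand(child)]
    # Rule or item: concatenate its own tokens with those of its children.
    results = [(element.text or "").split()]
    for child in element:
        child_seqs = expand(child)
        results = [left + right for left in results for right in child_seqs]
        tail = (child.tail or "").split()  # tokens following the child element
        results = [seq + tail for seq in results]
    return results


grammar = ET.fromstring(WORD_GRAMMAR.strip())
root_rule = next(rule for rule in grammar.iter(NS + "rule")
                 if rule.get("id") == grammar.get("root"))
sentences = expand(root_rule)

for sentence in sentences:
    print(" ".join(sentence))
```

Under this reading the grammar yields three sentences, one for each alternative in the one-of element.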
The two preceding tags are the only tags necessary for
minimal SOF encapsulation. If, however, recognition is to be
performed, the following tags allow for specification of additional
parameters that will aid in training and recognition.
search_non_speech_boundary_symbols_(level) -
This tag specifies the non-speech boundary symbols through a list of
items contained in the root rule. In the example, the only boundary
symbol is the sentence delimiter !SENT_DELIM.
search_non_speech_internal_symbols_(level) -
This tag specifies the non-speech internal symbols through a list of
items contained in the root rule.
search_dummy_symbols_(level) - This tag
specifies the dummy symbols through a list of items contained in the
root rule. In the example, this tag contains the !DUMMY symbol.
Designating a symbol as a dummy symbol allows the converter to equate that
symbol with a NULL ruleref.
search_exclude_symbols_(level) - This tag
specifies the exclude symbols through a list of items contained in the
root rule. In the example, this tag contains the !DUMMY, !SENT_DELIM,
and SILENCE symbols, because these symbols are not counted in hypothesis
scoring.
search_spenalty_exclude_symbols_(level) - This
tag specifies the spenalty exclude symbols through a list of items
contained in the root rule.
search_context_less_symbols_(level) - This tag
specifies the context-less symbols through a list of items contained
in the root rule. Unless specified as a context-less symbol, a given
identifier will be assumed to reference a lower level grammar or a
state in the statistical model pool.
search_skip_symbols_(level) - This tag specifies
the skip symbols through a list of items contained in the root
rule.
search_non_adaptation_symbols_(level) - This tag
specifies the non-adaptation symbols through a list of items contained
in the root rule.
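Since each of the symbol tags above stores its symbols as items in the root rule of a small grammar, reading one back amounts to collecting those items. The sketch below does this for a hypothetical exclude-symbols grammar and then drops the excluded symbols from a hypothesis before scoring. The XML layout, the rule id "exclude", and the helper name root_rule_items are assumptions for illustration and are not taken from the ISIP software.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/2001/06/grammar}"

# Hypothetical content of an exclude-symbols tag, written as an SRGS grammar.
EXCLUDE_SYMBOLS_XML = """
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" root="exclude">
  <rule id="exclude">
    <item>!DUMMY</item>
    <item>!SENT_DELIM</item>
    <item>SILENCE</item>
  </rule>
</grammar>
"""


def root_rule_items(xml_text):
    """Return the text of every item found in the grammar's root rule."""
    grammar = ET.fromstring(xml_text.strip())
    root_id = grammar.get("root")
    for rule in grammar.iter(NS + "rule"):
        if rule.get("id") == root_id:
            return [item.text.strip()
                    for item in rule.iter(NS + "item")
                    if item.text and item.text.strip()]
    raise ValueError("grammar has no rule matching its root attribute")


exclude = set(root_rule_items(EXCLUDE_SYMBOLS_XML))

# Remove excluded symbols from a hypothesis before it is scored.
hypothesis = ["!SENT_DELIM", "ONE", "SILENCE", "TWO", "!SENT_DELIM"]
scored = [word for word in hypothesis if word not in exclude]
print(scored)
```

The same helper would apply to any of the list-valued tags, and a level-name tag would simply return a single item.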
Once a language model is constructed in this fashion, it is a
good idea to verify its IHD representation, as it is in this format
that recognition is performed behind the scenes. Such verification
can be accomplished by opening the language model in
isip_network_builder and examining the flow of each directed graph.
Training and Recognition
Once an XML language model is constructed, a corresponding
statistical model pool must be generated before training and
recognition may be performed. Said statistical model pool can be
generated automatically through the "Save All" function of
isip_network_builder.
It is interesting to note that the above-mentioned statistical
model pool is independent of the format of the language model with which
it is used. The statistical model pool is concerned only with the
physical characteristics of states. Therefore, it is possible to run
a third of a network's training in IHD, a third in XML, and a third in
JSGF, all the while using the same statistical model pool.
As for the actual process of recognition, isip_recognize matches
its output language model format to the input language model format
by default. This eliminates the need for manual
conversions between training iterations and before decoding. Thus,
training and recognition are performed exactly as outlined in the IES
Fundamentals of Speech Recognition tutorial for IHD language models.
Results
In an effort to show that training and recognition provide
equivalent results regardless of language model format, a series of
tests was run. In each test, a language model was trained and
recognition was performed on 941-file and 336-file subsets,
respectively, of the TIDigits speech database. Below are the results.
The left-most column gives the format of the language model, including
any conversions it underwent (e.g., IHD->XML->IHD
indicates an IHD language model that underwent conversion to XML and
conversion back to IHD). The middle column indicates the word-error
rate (WER) obtained after 8-mixture training for word models. The
right-most column indicates the WER obtained after 8-mixture training for
monophone models.
| Language Model | Word Model WER | Monophone Model WER |
| IHD | 1.9% | 0.4% |
| JSGF | 1.9% | 0.4% |
| XML | 1.9% | 0.5% |
| IHD->JSGF | 1.9% | 0.4% |
| IHD->JSGF->IHD | 1.9% | 0.4% |
| IHD->JSGF->XML | 1.9% | 0.4% |
| IHD->XML | 1.9% | 0.5% |
| IHD->XML->IHD | 1.9% | 0.4% |
| IHD->XML->JSGF | 1.9% | 0.5% |
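As is evident, the three grammar formats produce equivalent results: the
word model WER is 1.9% in every case, and the monophone model WER varies
only between 0.4% and 0.5%.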
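For readers unfamiliar with the metric used in the table, word-error rate is conventionally the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below is a generic edit-distance version of that definition. It is not the ISIP scoring tool, and the function name and sample sentences are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)


reference = "one two three four".split()
hypothesis = "one too three".split()
# One substitution (two -> too) and one deletion (four) over four words.
print(f"{word_error_rate(reference, hypothesis):.1%}")
```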