6.3.3 N-Gram Modeling: The N-Gram Language Model File
Open the TIDigits n-gram language model file and glance over it. The file is
in text format, so it is easy to read and edit by hand. After the main file
header, you'll see a sub-header called "\data\", and underneath it you'll see
the total number of n-grams of each order defined in this file.
\data\
ngram 1=13
ngram 2=141
ngram 3=1465
You'll see three different
totals. The first number is the total number of unigrams, the second is the
number of bigrams, and the third is the number of trigrams. This file is a
trigram language-model file, so why does it contain bigrams and unigrams? Look
at the number of unigrams. This number is the total number of words, including
!SENT_DELIM. With 13 words, the number of possible trigrams is
13*13*13, or 2,197. The file, however, lists only
1,465. If a trigram encountered in an utterance is not
included in the trigram language model file, a "back-off" occurs: instead
of using the probability of the trigram, the recognizer uses the probability
of a bigram. Likewise, the
total number of bigrams in this file is 141 instead of 13*13, or 169. If a
bigram encountered in test data is not included in the
language-model file, a second back-off occurs to the list of unigrams.
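The gap between possible and listed n-grams can be checked with a few lines
of arithmetic. The sketch below uses only the counts from the \data\ header;
the function name coverage is made up for illustration and is not part of
any toolkit.

```python
def coverage(vocab_size, order, listed):
    """Return (possible, missing) n-gram counts for a given order."""
    possible = vocab_size ** order   # every ordered combination of words
    missing = possible - listed      # sequences that would trigger a back-off
    return possible, missing


# Counts copied from the \data\ header: 13 unigrams, 141 bigrams, 1465 trigrams.
for order, listed in ((2, 141), (3, 1465)):
    possible, missing = coverage(13, order, listed)
    print(f"{order}-grams: {listed} of {possible} listed, {missing} not listed")
```

Any sequence counted as "not listed" has no entry of that order, so the
recognizer has to fall back to a shorter n-gram to score it.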
The rest of the file contains definitions of the unigrams, bigrams, and
trigrams.
\1-grams:
-0.6322352 !SENT_DELIM
-99 !SENT_DELIM -1.953935
-1.154977 EIGHT -99
-1.163263 FIVE -99
-1.15191 FOUR -99
-1.163263 NINE -99
......
\2-grams:
-1.04211 !SENT_DELIM EIGHT 0.5087579
-1.04211 !SENT_DELIM FIVE 0.5005441
-1.04211 !SENT_DELIM FOUR 0.5117977
-1.043239 !SENT_DELIM NINE 0.5016403
-1.040983 !SENT_DELIM OH 0.5050113
-1.043239 !SENT_DELIM ONE 0.5002577
......
\3-grams:
-0.5573978 !SENT_DELIM EIGHT !SENT_DELIM
-1.256368 !SENT_DELIM EIGHT EIGHT
-1.151632 !SENT_DELIM EIGHT FIVE
-1.151632 !SENT_DELIM EIGHT FOUR
-1.29776 !SENT_DELIM EIGHT NINE
-1.151632 !SENT_DELIM EIGHT OH
......
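Notice that the trigrams have only one number associated with them (the number
to the left of the trigram). This number represents the probability of the
trigram, stored as a logarithm, which is why the values are negative (base 10
in the standard ARPA layout, which this file appears to follow). The bigrams
and unigrams, however, have two numbers assigned to them. The number to the
left, as with the trigrams, is the n-gram's log probability. The number to the
right is the back-off weight. When a longer n-gram is missing and a back-off
occurs, this weight is combined with the score of the shorter n-gram. Trigrams
need no back-off weight because they are the longest n-grams in the file.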
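To make the role of the two numbers concrete, here is a minimal sketch of
reading such a file and scoring a word sequence with back-off. It assumes the
usual ARPA-style convention (the back-off weight of the history is added to
the log probability of the shorter n-gram); the recognizer's own
implementation may differ. The function names and all numbers in SAMPLE are
hypothetical toy values, not taken from the TIDigits file.

```python
def parse_ngram_file(text):
    """Return {order: {word_tuple: (log_prob, backoff_weight_or_None)}}."""
    model, order = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])   # "\2-grams:" -> 2
            model[order] = {}
        elif line.startswith("\\"):
            order = None                           # e.g. the \data\ header
        elif order is not None and line:
            fields = line.split()
            words = tuple(fields[1:1 + order])
            # A trailing field, if present, is the back-off weight.
            extra = fields[1 + order:]
            backoff = float(extra[0]) if extra else None
            model[order][words] = (float(fields[0]), backoff)
    return model


def log_prob(model, words):
    """Score the last word given the preceding ones, backing off if needed."""
    words = tuple(words)
    entry = model.get(len(words), {}).get(words)
    if entry is not None:
        return entry[0]                            # n-gram is listed: use it
    if len(words) == 1:
        raise KeyError(f"unknown word: {words[0]}")
    # Not listed: add the history's back-off weight (0 if the history is
    # itself unlisted) to the score of the n-gram with the oldest word dropped.
    history = model.get(len(words) - 1, {}).get(words[:-1])
    weight = history[1] if history and history[1] is not None else 0.0
    return weight + log_prob(model, words[1:])


# Hypothetical toy model in the same layout as the file shown above.
SAMPLE = """\
\\data\\
ngram 1=3
ngram 2=2
ngram 3=1

\\1-grams:
-0.6 !SENT_DELIM -0.3
-1.1 EIGHT -0.2
-1.2 FIVE -0.2

\\2-grams:
-1.0 !SENT_DELIM EIGHT 0.5
-0.9 EIGHT FIVE 0.4

\\3-grams:
-1.15 !SENT_DELIM EIGHT FIVE
"""

model = parse_ngram_file(SAMPLE)
print(log_prob(model, ["!SENT_DELIM", "EIGHT", "FIVE"]))  # listed trigram
print(log_prob(model, ["EIGHT", "FIVE", "EIGHT"]))        # backs off twice
```

In the second call neither the trigram nor the bigram FIVE EIGHT is listed,
so the score is the bigram history's weight plus the unigram history's weight
plus the unigram log probability of EIGHT.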
In the TIDigits examples we've seen thus far, the word sequence probabilities
remain virtually constant since sequences of digits have no real language
structure. In an LVRS, the language structure becomes much more complex.
Within several million words of English text, more than 50% of
trigrams occur only once and 80% of trigrams occur fewer than five
times. This sparseness of data causes a problem in N-gram modeling.
Hence, smoothing is sometimes necessary to give users a way of
generating broader language models. Smoothing is the process of
flattening the probability distribution implied by a language model so
that all reasonable word sequences can occur with some probability.
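As a rough illustration of what "flattening" means, the sketch below applies
add-k (Laplace) smoothing to bigram counts. This is one of the simplest
smoothing techniques and is shown only as an example; it is not necessarily
the method used by the toolkit. The function name and the tiny training
sequence are hypothetical.

```python
from collections import Counter


def smoothed_bigram_probs(tokens, vocab, k=1.0):
    """Add-k estimate of P(next | prev) for every word pair in vocab."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])
    return {
        # Adding k to every count gives unseen pairs a small nonzero share.
        (prev, nxt): (pair_counts[(prev, nxt)] + k)
        / (prev_counts[prev] + k * len(vocab))
        for prev in vocab
        for nxt in vocab
    }


tokens = "ONE TWO ONE THREE".split()
vocab = ["ONE", "TWO", "THREE"]
probs = smoothed_bigram_probs(tokens, vocab)

# ONE ONE never occurs in the data, yet it still receives some probability.
print(probs[("ONE", "ONE")], probs[("ONE", "TWO")])
```

Seen pairs keep a higher probability than unseen ones, but no pair of
vocabulary words is ruled out entirely.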