6.3.2 N-gram Modeling: Training an N-gram Language Model
We use N-gram models trained with the SRI Language Modeling toolkit
(SRILM), developed at SRI International (formerly the Stanford Research
Institute). In this section, you'll learn how to train an N-gram language
model. The SRI toolkit must be downloaded and installed before you can run
the following examples. The tools can be downloaded from the SRI Website.
Follow the installation instructions given here.
To train the N-gram language model, we'll use the tool ngram-count.
First, go to the directory:
$ISIP_TUTORIAL/sections/s06/s06_03_p02/
Then run the command:
ngram-count -text tidigits_trans_word_text.text -order 3 -lm tidigits_word_ngram.lm
Expected Output:
warning: no singleton counts
GT discounting disabled
warning: no singleton counts
GT discounting disabled
warning: discount coeff 1 is out of range: 1.41549
warning: discount coeff 6 is out of range: 1.36563
BOW denominator for context "NINE <s>" is zero; scaling probabilities to sum to 1
BOW denominator for context "THREE <s>" is zero; scaling probabilities to sum to 1
BOW denominator for context "SIX <s>" is zero; scaling probabilities to sum to 1
The -text parameter specifies the file containing the plain text
transcriptions. For this example, the transcription file is
tidigits_trans_word_text.text.
The -order parameter sets the maximum order of the N-gram model to
create. With -order 3, we're creating a trigram language model. The
final parameter, -lm, specifies the name of the trained language model
file to output, which in this case is tidigits_word_ngram.lm.
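If the training step needs to be driven from a script, the same command can be assembled programmatically. The following is only a minimal sketch: the function names build_ngram_count_command and train_language_model are hypothetical, and it assumes ngram-count is on the PATH. It uses only the three options described above.

```python
import shutil
import subprocess


def build_ngram_count_command(text_path, order, lm_path):
    """Return the ngram-count argument list for the options described above.

    Hypothetical helper, not part of the SRI toolkit or of ISIP.
    """
    return [
        "ngram-count",
        "-text", text_path,    # plain text transcriptions
        "-order", str(order),  # maximum N-gram order (3 = trigram)
        "-lm", lm_path,        # language model file to write
    ]


def train_language_model(text_path, order, lm_path):
    """Run ngram-count; assumes the SRI toolkit is installed and on PATH."""
    command = build_ngram_count_command(text_path, order, lm_path)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("ngram-count was not found on PATH")
    # check=True raises CalledProcessError if ngram-count exits with an error.
    subprocess.run(command, check=True)
```

Calling train_language_model("tidigits_trans_word_text.text", 3, "tidigits_word_ngram.lm") from the tutorial directory would then be equivalent to the command shown above.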
The N-gram models generated by this tool must be modified to follow the
ISIP format. Two modifications need to be made to the original file
generated by the SRI toolkit. First, add an Sof header at the beginning
of the file. The following example shows the file before and after the
addition of the header.
Before Header
\data\
ngram 1=13
ngram 2=141
ngram 3=682
After Header
@ Sof v1.0 @
@ NGramModel 0 @
format = "NGRAM_ARPA";
\data\
ngram 1=13
ngram 2=141
ngram 3=682
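The header step can also be done with a short script rather than by hand. This is a minimal sketch, assuming the three header lines shown above are the complete Sof header this file needs; the function name add_sof_header is hypothetical.

```python
# Header lines copied from the "After Header" example above.
SOF_HEADER = (
    "@ Sof v1.0 @\n"
    "@ NGramModel 0 @\n"
    'format = "NGRAM_ARPA";\n'
)


def add_sof_header(arpa_text):
    """Prepend the Sof header to the text of an SRI-generated model file.

    Hypothetical helper. If the text already starts with a Sof header it is
    returned unchanged, so running the function twice is harmless.
    """
    if arpa_text.startswith("@ Sof"):
        return arpa_text
    return SOF_HEADER + arpa_text
```

The function works on the file contents as a string, so the caller decides where the result is written.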
Next, replace the tags "<s>" and "</s>" in the original generated
file with !SENT_DELIM, as in the example below.
Before Modification
\1-grams:
-0.6322352    </s>
-99       <s>     -1.953935
After Modification
\1-grams:
-0.6322352       !SENT_DELIM
-99        !SENT_DELIM     -1.953935
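The tag replacement can be sketched the same way. This example assumes that every occurrence of the two tags in the file should be replaced, including those inside bigram and trigram entries; the function name replace_sentence_tags is hypothetical.

```python
import re

# Matches the sentence-start tag <s> and the sentence-end tag </s>.
SENTENCE_TAG = re.compile(r"</?s>")


def replace_sentence_tags(arpa_text):
    """Replace every <s> and </s> tag with the ISIP token !SENT_DELIM.

    Hypothetical helper; spacing around the tags is left untouched.
    """
    return SENTENCE_TAG.sub("!SENT_DELIM", arpa_text)
```

Only the tags themselves change; the probabilities and back-off weights on each line are preserved exactly as ngram-count wrote them.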
Save the modified file as tidigits_word_ngram.sof.
Take a look at the N-gram model file before and after its two
modifications. The file is now usable in ISIP's production system.