Text parsing is one of the more essential software components in
speech recognition research since we ultimately process human language
in both audio and text formats. Languages that allow easy
manipulation and parsing of text, such as PERL, have become very
popular in speech research. PERL's creator, Larry Wall, was trained
as a linguist, and the language was designed to be easy to use,
flexible and extensible. PERL supports powerful regular expressions
and provides a pattern-matching operator to find a pattern in a
string, a substitution operator to substitute one string for another,
and a split operator to parse a string based on a delimiter. Text
processing can be done quickly and efficiently in PERL, but such a
language is not necessarily optimal for computationally intensive
research tasks such as speech recognition.
The ISIP Foundation Classes (IFCs) provide extensive support for
string processing so that such code can be easily integrated at a
programming level with other speech recognition software.
Most text parsing and text processing functionality is implemented
within the SysString class, which belongs to the system library level
of the IFC class hierarchy. Users typically access this functionality
through the String class, which inherits from SysString.
Several important features of this interface are listed below:
- string tokenize methods: parse a string into smaller substrings
based on a user-defined delimiter;
- count token methods: count the number of tokens in
the given string;
- replace/insert methods: replace or insert substrings within a string;
- string search methods: search a given string for
the position of the first or the last occurrence of a character
or a substring;
- string/numeric concatenation methods: concatenate strings with
other strings or numeric values;
- trim methods: remove certain characters or substrings
from the input string.
Let us consider a few simple examples to contrast string processing
in PERL and the IFCs. These examples cover some of the
functionality listed above. Let us first consider
tokenization, one of the more important functions in language
processing. This function is the equivalent of split in
PERL. The PERL code to parse the words in the sentence "Jack
and Jill went up to hill" is given below:
#! /usr/local/bin/perl
# file: ./examples/example_01.pl
#
# sentence to be parsed
#
$sentence = "Jack and Jill went up to hill";
# split the words using runs of whitespace as a delimiter
#
@words = split(/\s+/, $sentence);
# get the count of the words
#
$count = scalar(@words);
# print the parsed words on the console, one per line
#
for ($i = 0; $i < $count; $i++) {
    print "word = $words[$i]\n";
}
Comparable code in the IFCs uses the tokenize function:
// file: ./examples/example_01.cc
//
// isip include files
//
#include <String.h>
#include <Vector.h>
// main program starts here
//
int main(int argc, const char **argv) {
    // declare the sentence as a String object
    //
    String sentence(L"Jack and Jill went up to hill");
    // get the count of the words
    //
    long count = sentence.countTokens(L" ");
    // declare the vector of words
    //
    Vector<String> words(count);
    // pos holds the current position in the string; each call to
    // tokenize updates it to the position of the next delimiter
    //
    long pos = 0;
    // get each word by tokenizing using multiple spaces as a delimiter
    //
    for (long i = 0; i < count; i++) {
        sentence.tokenize(words(i), pos, L" ");
    }
    // print the words on the console one at a time using the debug
    // method
    //
    for (long i = 0; i < count; i++) {
        words(i).debug(L"word");
    }
    // exit gracefully
    //
    Integral::exit();
}
The next example demonstrates pattern matching and substitution.
Consider an example in which we want to replace all occurrences of the
string five with the string ten. The following code,
written in PERL, uses the global substitution operator:
#! /usr/local/bin/perl
# file: ./examples/example_02.pl
#
# sentence to be modified
#
$sentence = "six one five four nine five three";
# replace every occurrence of "five" with "ten" using global
# substitution
#
$five = "five";
$ten = "ten";
$sentence =~ s/$five/$ten/g;
# print the modified sentence to the console
#
print "modified sentence: $sentence\n";
The same functionality implemented with the IFCs uses the
replaceAll function:
// file: ./examples/example_02.cc
//
// isip include files
//
#include <String.h>
// main program starts here
//
int main(int argc, const char **argv) {
    // declare the sentence as a String object
    //
    String sentence(L"six one five four nine five three");
    // replace every occurrence of "five" with "ten"
    //
    sentence.replaceAll(L"five", L"ten");
    // print the modified sentence to the console
    //
    sentence.debug(L"modified sentence");
    // exit gracefully
    //
    Integral::exit();
}
One of the most frequent uses of text parsing is to load parameter data
from files to configure programs. Several parsers are available
within the IFC environment for this purpose, so users rarely have to
write custom code to accomplish this task. We also reduce
the need for intensive text processing by avoiding the
use of unformatted data in our environment. Most data is stored using a
Signal Object File (Sof) representation that makes it easy to read and
write such data to files.