| |
A speech recognition system is an integration of knowledge across
several domains such as digital signal processing, natural language
processing and machine learning. With the evolution of technology, and
the ever- increasing complexity of speech recognition tasks, the
development of a state-of-the-art speech recognition system becomes a
time-consuming and infrastructure-intensive task. Since 1998, we have
focused on the development of a modular and flexible recognition
research environment which we refer to as the production system. The
toolkit contains many common features found in modern speech to text
(STT) systems: a front end that converts the signal to a sequence of
feature vectors, an HMM-based acoustic model trainer, and a
time-synchronous hierarchical Viterbi decoder. In the following
sections, I will describe how these major components are designed in
our toolkit so that it brings you competitive technology with maximum
flexibility.
Front End Builder
An acoustic front end refers to the portion of the recognition system
that extracts feature vectors from the speech data. The development of
a completely new front end is a software intensive task. The
re-implementation of existing algorithms from scratch has slowed down
many researchers' efforts. The goal of our front end builder is to
provide users an efficient environment for the evaluation of new
research ideas. The design of this tool included these requirements:
- rapid prototyping without programming;
- a block diagram approach to describing algorithms;
- a library of standard DSP algorithms and functions;
- an ability to plug in new classes/modules.
The front end builder is designed as a three-level structure (shown in
figure above) so that above requirements can be achieved. First of
all, at the lowest level, a number of core DSP algorithms are
implemented as C++ classes, providing high computational
efficiency. These algorithms include different types of windows,
filters, filter banks, etc. All the algorithms share a common virtual
function interface known as an interface contract. Users can plug in
their own algorithms into the front end builder by following the
predefined interface specification, without a need to write any new
GUI code. Similarly, new algorithm classes can be added to the
algorithm library without changing any of the existing code base.
A Java GUI tool was developed to provide users a block diagram
approach to designing acoustic front ends. When building a new front
end, a user can select from a library of existing algorithms, and can
configure all parameters associated with these algorithms from the
tool. The Java language was used so that the tool can run across a
wide range of platforms (including Microsoft Windows), and in order
that the tool could be built using an industry-standard look and feel.
The output of the GUI tool is a recipe that schedules the sequences of
required signal processing operations. A control program was developed
to process the configuration file, pass data through algorithms and
generate the final feature vectors. We have successfully built several
complex front ends with this tool, including an industry standard
front end based on Mel-frequency cepstrum coefficients (MFCCs). The
control program supports multi-pass processing, which allows
non-real-time research ideas involving complex normalization and
adaptation schemes to be easily implemented.
Hierarchical Decoder
Time-synchronous Viterbi beam search has been the dominant search
strategy for continuous speech recognition systems for the past 20
years. Search is a good example of an algorithm that is conceptually
simple, but extremely hard to implement in a general way in
practice. Worse yet, the slightest inefficiencies in search can result
in a system that cannot solve large-scale problems. Most search
implementations are restricted to one approach (e.g., time-synchronous
Viterbi search) and cannot be separated from the statistical modeling
approach (e.g., HMM). Further, the search structure is limited to
three levels (e.g., word, phone, and state).The goals for the design
of our generalized decoder were:
- an arbitrary number of independent levels;
- long-span context-dependent models at any level;
- lexical tree expansion at any level;
- posterior symbol probabilities at any level.
These goals are achieved by the abstraction of the generalized
hierarchical search space, shown in the figure above. In such a search
space, each level can be conceptually considered as the same. Phrases,
words or phones are simply symbols at different levels. Each level
contains a list of symbols Sij and a list of graphs Gik., where, i is
the index of the level, j is the index of the symbol and k is the
index of the graph. Each graph has at least two dummy vertices, the
start vertex and the terminal vertex both indicating the start and the
end points of the graph in a search space.
The lower level graph (such as the graph with WALK and
RUN) is the expansion of the symbol (such as the symbol with
VP) at the level above. This relationship connects all the
levels together, realizing the entire search space. There are two
exceptions for such a symbol-graph expansion. First, at the highest
level, only one graph can exist. This graph is called the master
grammar. It is a map for the decoder to iterate through the entire
search space. Second, at the lowest level, each symbol represents an
underlying statistical model. These symbols can not be expanded into a
sub-graph. Observations probabilities will be evaluated when the
search process reaches these symbols.
Acoustic Trainer
Our acoustic trainer is a supervised learning machine that estimates
the parameters of the acoustic models given the speech data and
transcriptions.In most HMM-based systems, a phonetic transcription is
required as input to the parameter reestimation process. This
transcription is derived at automatically and iterated upon as the
recognition system improves. Silence must also be hypothesized between
each word model, since we don't know where in the utterance silence
actually occurred. In essence, the trainer decides on the optimal
alignment of the hypothesized transcription as part of the training
process.
The trainer we implemented is based on a hierarchical search space
described in the previous section. It imposes no constraints on the
topology of any model in the hierarchical structure. With such a
flexible design, our network trainer, which is depicted in the figure
above, alleviates the need for a phonetic transcription. Instead, it
is capable of hypothesizing all possible expansions of a symbol list,
and performing parameter re-estimation across this expanded
network. The benefits are twofold: we remove the dependency on an
intermediate transcription, and we can inherently train statistical
pronunciation models. This data-driven approach often results in
better performance.
Programming Interfaces
All the utilities presented above were developed based on an extensive set of foundation classes (IFCs). IFCs are a set of C++ classes organized as libraries in a hierarchical structure. These classes are targeted for the needs of rapid prototyping and lightweight programming without sacrificing the efficiency. Some key features include:
- unicode support for multilingual applications;
- math classes that provide basic linear algebra and efficient matrix manipulations;
- memory management and tracking;
- system and i/o libraries that abstract users from details of the operating systems.
The software environment provides support for users to develop new approaches without rewriting common functions. The software interfaces are carefully designed to be generic and extensible.
Future Work
Our future work will follow two directions: (1) the development of
core algorithms to improve recognition accuracy and speed, and (2)
integration of natural language processing and dialogue modeling
tools. Over the next year, we will release our first real-time
dialogue systems application as well as many improvements to the core
speech recognition system. The complete software toolkit release is
available on-line at
http://www.cavs.msstate.ed./projects/speech/index.html.
|
| |
|