3.1.2 Overview: Frame-Based Processing
Each feature used by the recognizer must be calculated a user-specified
number of times per second and computed over some time interval.
The first quantity, which we define as the number of seconds between
feature vector calculations, is called the frame duration.
The second quantity, the number of seconds of data used to calculate
a feature, is called the window duration.
A typical frame duration in speech recognition is 10 ms, while a typical
window duration is 25 ms. This means that every 10 ms a set of features is
computed using a window of 25 ms of data centered on the current
frame. This process is demonstrated in the figure shown to the right.
The measurements are taken over a set of samples. The number of samples
used is determined by several factors. One factor is the sampling
rate. As an example, for an 8 kHz sampling rate with a frame duration
of 10 ms, measurements would be taken over 80 samples to produce
one feature vector.
Note that this assumes only the frame duration determines the number
of samples used in the measurements.
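
The arithmetic above can be sketched in a few lines of Python. This is
only an illustration: the function name samples_per_interval is
hypothetical and is not part of isip_transform_builder. It converts a
duration in seconds to a sample count at a given sampling rate, using the
8 kHz rate and the 10 ms and 25 ms durations quoted in this section.

```python
def samples_per_interval(sample_rate_hz, duration_s):
    """Return the number of samples spanned by duration_s seconds of audio.

    Hypothetical helper for illustration; not part of the ISIP tools.
    """
    # round() guards against floating-point error in the product
    return int(round(sample_rate_hz * duration_s))


# 10 ms frame duration at an 8 kHz sampling rate
frame_samples = samples_per_interval(8000, 0.010)

# 25 ms window duration at the same sampling rate
window_samples = samples_per_interval(8000, 0.025)

print(frame_samples, window_samples)  # prints: 80 200
```

The frame value matches the 80 samples worked out above; the window value
shows how many samples contribute to a feature vector once a 25 ms window
is used instead of the bare frame.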
In practice, however, to get a smoother representation of the speech
data, a window of samples surrounding the frame is incorporated in the
measurements. Since the window incorporates samples from
surrounding frames, the window size determines the number of
samples used to produce a feature vector. The frame duration,
however, determines how often we produce a feature
vector. Our feature extraction tool, isip_transform_builder,
provides an easy-to-use interface that alleviates the need to program
these interactions between the data and the analysis window by hand.
This chapter is primarily a tutorial on how to use this tool.
The samples incorporated in the window can be taken from a frame
preceding the current frame (left alignment), following the current
frame (right alignment), or on either side of the current frame
(center alignment). Center alignment is the most commonly used. The image
shown above illustrates a frame duration of 10 ms with a center-aligned
window that includes samples from 5 ms of frame data on either side of the
current frame.
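
The three alignments can be sketched as an index calculation. The code
below is a minimal illustration under stated assumptions, not the
behavior of isip_transform_builder: the function name window_bounds and
its parameters are hypothetical, lengths are given in samples, and no
attempt is made to handle the edges of the signal (a caller would have to
pad or clip indices that fall outside the data). The example values
assume an 8 kHz sampling rate, so a 10 ms frame is 80 samples and 5 ms on
either side adds 40 samples per side, for a 160-sample window.

```python
def window_bounds(frame_index, frame_len, window_len, alignment="center"):
    """Return (start, end) sample indices of the window for one frame.

    The end index is exclusive. Hypothetical helper for illustration only;
    indices may fall outside the signal near its edges.
    """
    if window_len < frame_len:
        raise ValueError("window must be at least as long as the frame")

    frame_start = frame_index * frame_len
    frame_end = frame_start + frame_len
    extra = window_len - frame_len  # samples borrowed from neighboring frames

    if alignment == "left":
        # extra samples come from the data preceding the current frame
        return frame_start - extra, frame_end
    if alignment == "right":
        # extra samples come from the data following the current frame
        return frame_start, frame_end + extra
    if alignment == "center":
        # extra samples are split between both sides of the current frame
        before = extra // 2
        return frame_start - before, frame_end + (extra - before)
    raise ValueError("alignment must be 'left', 'right', or 'center'")


# Third frame (index 2): the frame itself covers samples 160 to 239
print(window_bounds(2, 80, 160, "center"))  # prints: (120, 280)
print(window_bounds(2, 80, 160, "left"))    # prints: (80, 240)
print(window_bounds(2, 80, 160, "right"))   # prints: (160, 320)
```

In every case the window spans 160 samples; only its position relative to
the frame changes.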
A speech waveform for a set of samples is shown at the bottom of the
figure. Each 10 ms frame is delimited by vertical bars. The
window surrounding each frame is highlighted by a box with dashed
lines. Note that adjacent windows overlap, so they share data from
frame to frame. The window size chosen depends on the mathematical technique
used in the calculation. Some common windowing techniques include
Hamming, Hanning, and Gaussian. For more details on windowing
techniques, see the Window class
in the algorithm library of our foundation classes.
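
As an illustration of one of these techniques, the sketch below computes
the coefficients of a symmetric Hamming window,
w[n] = 0.54 - 0.46 cos(2 pi n / (N - 1)), and applies them to the samples
of one window. It is a standalone example and does not reproduce the
Window class: the function names hamming and apply_window are
hypothetical, and the 200-sample length assumes a 25 ms window at an
8 kHz sampling rate.

```python
import math


def hamming(n_points):
    """Return symmetric Hamming coefficients for a window of n_points samples.

    Hypothetical helper for illustration; not the ISIP Window class.
    """
    if n_points < 1:
        raise ValueError("n_points must be positive")
    if n_points == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def apply_window(samples, coefficients):
    """Multiply each sample by the matching window coefficient."""
    if len(samples) != len(coefficients):
        raise ValueError("samples and coefficients must have equal length")
    return [s * w for s, w in zip(samples, coefficients)]


# 25 ms window at 8 kHz (assumed values): 200 samples
coefficients = hamming(200)

# A constant signal makes the taper easy to see: the output is the window
tapered = apply_window([1.0] * 200, coefficients)
```

The coefficients start and end near 0.08 and rise toward 1.0 in the
middle, so samples at the edges of the window contribute less to the
feature calculation than samples near the center.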