2.4.2 Auxiliary Resources: Audio and Transcription Databases
One of the more time-consuming aspects of speech recognition research
is preparation and coordination of speech audio data and speech
transcriptions. Often, experiments are aborted because the list of
audio files does not match the list of transcriptions. Unless these
two are tied together in some way, it is difficult to avoid such
problems. Therefore, in our system, we provide a unique method for
storing and accessing speech data and transcriptions through two related
database representations,
AudioDatabase
and
TranscriptionDatabase.
These databases are created and manipulated using a single tool called
isip_make_db.
AudioDatabase
Storage and access to speech data files is managed through an
internally defined database format, AudioDatabase. This database
manages a set of records. A record typically contains 1) a unique
identifier, which we refer to as the id, and 2) the location of the
speech file on disk. To obtain a record from the audio database,
the id must be referenced.
Consider a collection of three files:
ae_12a.sof,
ae_1a.sof, and
ae_2789385a.sof.
We need to arrange these in a single file, called a list file,
with corresponding ids. An example of such a file is
audio_list.text.
Go to the directory:
$ISIP_TUTORIAL/sections/s02/s02_04_p02/
We can convert this list file to an audio database using
isip_make_db:
isip_make_db -db audio -audio audio_list.text -name TIDigits -type text audio_db.sof
The first option, "-db", indicates the type of database you want.
Currently available choices are "audio", "transcription" (which is
described below), and "both". In this case, we selected "audio" since
we want an audio database.
The second option, "-audio", provides the name of the list file.
Each line of this file typically contains a filename followed by a key
(the id). You can create such a file fairly easily using Unix commands
such as "ls" and a programmable editor such as "emacs". The key is
optional; if it is omitted, a unique key is generated automatically.
An example of a list file is
audio_list.text.
This file contains the three filenames mentioned above and
the corresponding ids (based on each file's basename in this example).
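As an illustration of how such a list file might be produced with standard Unix tools, the sketch below writes one line per .sof file, using the file's basename as the id. The helper name make_audio_list is hypothetical and not part of the ISIP tools, and the single space between filename and id is an assumption; check audio_list.text for the exact layout.

```shell
#!/bin/sh
# Sketch only: make_audio_list is a hypothetical helper, not an ISIP tool.
# Assumption: each list line is "<filename> <id>" separated by one space.
make_audio_list() {
    dir=$1
    for f in "$dir"/*.sof; do
        # If no .sof files exist the pattern stays unexpanded; skip it.
        [ -e "$f" ] || continue
        # The id is the basename without the .sof extension.
        printf '%s %s\n' "$f" "$(basename "$f" .sof)"
    done
}
```

Running "make_audio_list . > audio_list.text" in the directory that holds the three example files would then yield one line for each of them.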
The third option, "-name", should be set to the name of the database
(TIDigits in this example).
The fourth option, "-type", selects whether a text or binary Sof file
is generated. In this case we use "text" so we can view the
output file by simply listing it. The last entry, which is the
first positional argument, is the name of the output file that will
contain the audio database. See
audio_db.sof
for the output from the example given above.
The database file contains four Sof objects: an AudioDatabase object
and three Filename objects, which contain the names of the files
included in this example. The AudioDatabase object encapsulates the
database name (e.g., TIDigits), a list of ids, and a mapping from ids to
Filename object numbers. The ids link filenames to the transcriptions
described below.
Since the audio files are often stored somewhere other than
the current working directory, it is useful to build
these databases using filenames that work from any directory.
The obvious way to do this is to use a
fully qualified filename.
For example, "ae_12a.sof" could be represented as
"/isi./data/corpora/tidigits/ae_12a.sof".
Another convenient way to do this is to use an environment variable.
For example,
the file named "ae_12a.sof" can be represented as "$TUTORIAL/ae_12a.sof"
in the file audio_list.text. If the environment variable
"$TUTORIAL" is properly set to "/isi./data/corpora/tidigits",
then this file will be accessible from any location. The advantage
of an environment variable is that the database can be moved to
a new location and the only thing that needs to be updated is
the environment variable.
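The environment-variable form of a list file can be sketched as follows. The helper name make_portable_list is hypothetical, and the "<filename> <id>" line layout is an assumption; the point is only that the variable reference is written to the file unexpanded.

```shell
#!/bin/sh
# Sketch only: make_portable_list is a hypothetical helper, not an ISIP tool.
# Assumption: each list line is "<filename> <id>" separated by one space.
make_portable_list() {
    var=$1        # variable name without the dollar sign, e.g. TUTORIAL
    dir=$2        # directory that currently holds the .sof files
    for f in "$dir"/*.sof; do
        [ -e "$f" ] || continue
        base=$(basename "$f")
        # The single-quoted format keeps "$" literal, so the variable
        # reference is written as text rather than expanded here.
        printf '$%s/%s %s\n' "$var" "$base" "${base%.sof}"
    done
}
```

For example, "make_portable_list TUTORIAL . > audio_list.text" writes entries such as "$TUTORIAL/ae_12a.sof ae_12a", which stay valid after the data is moved as long as the variable is updated.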
TranscriptionDatabase
Transcriptions for the speech files in an audio database
are managed by a TranscriptionDatabase. This database
uses
annotation graphs
to represent the transcriptions, which typically consist of
strings of words (though they can be much more complicated
than that). The transcriptions are
organized using the same key value used in the audio database.
To obtain a transcription of a particular speech file in an audio database,
the key for that particular data file must be referenced.
Continuing the example described above, we can create a
transcription list file in many different ways using standard Unix commands
and editors. For applications such as TIDigits, this is particularly
simple because the transcriptions are encoded in the filename.
An example of a transcription list file is provided in
trans_list.text.
Each entry in this file has the form:
key [start_time] [stop_time] [channel]: ... transcription ...
The key should match the corresponding audio file described above.
The start and stop times are optional, and denote where the speech
data begins and ends in the corresponding audio file. The channel index
is used in the event that the audio file contains multiple channels
(e.g., stereo). The field after ":" contains the desired transcription
of the utterance.
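Because the TIDigits transcription is encoded in the filename, an entry in the form above can be derived from the id alone. The sketch below is one way this might be done; it is not the procedure used to produce trans_list.text. It assumes the id has the shape speaker, underscore, digit string, production letter (as in ae_12a), that "z" stands for zero and "o" for oh, and that lowercase words are acceptable. The helper name digits_to_words is hypothetical, and the optional start time, stop time, and channel fields are omitted.

```shell
#!/bin/sh
# Sketch only: digits_to_words is a hypothetical helper, not an ISIP tool.
# Assumptions: ids look like ae_12a (speaker, "_", digits, one production
# letter); "z" means zero and "o" means oh; word spelling and case are ours.
digits_to_words() {
    code=${1#*_}      # drop the speaker prefix:      ae_12a -> 12a
    code=${code%?}    # drop the production letter:   12a    -> 12
    words=
    # Split the digit string into single characters.
    for c in $(printf '%s' "$code" | sed 's/./& /g'); do
        case $c in
            1) w=one ;;   2) w=two ;;   3) w=three ;;
            4) w=four ;;  5) w=five ;;  6) w=six ;;
            7) w=seven ;; 8) w=eight ;; 9) w=nine ;;
            z) w=zero ;;  o) w=oh ;;
            *) echo "unknown symbol '$c' in id $1" >&2; return 1 ;;
        esac
        words="${words:+$words }$w"
    done
    # Emit "key: transcription" with the optional fields left out.
    printf '%s: %s\n' "$1" "$words"
}
```

Calling "digits_to_words ae_12a" prints "ae_12a: one two"; looping over the ids of the audio list file would produce a complete transcription list file.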
The command to create a transcription database file from this data is:
isip_make_db -db transcription -trans trans_list.text -level word -name TIDigits -type text trans_db.sof
We have introduced two new options here: (1) "-trans" provides the name
of the transcription list file, just as "-audio" provides the name of the
audio list file, and (2) "-level" assigns a tag
to this transcription. The level tag will be discussed later when
we introduce acoustic training (see
Section 5)
and recognition scoring (see
scoring in
Section 4).
The resulting transcription database can be viewed in
trans_db.sof.
This file contains a TranscriptionDatabase object and three
AnnotationGraph objects. The latter contain the actual transcription
along with the timing information. The former contains the ids
used to reference individual AnnotationGraphs. The format of this object
is the same as described above in
audio_db.sof.
Note that both of these databases could have been built using a single
command:
isip_make_db -db both -audio audio_list.text -trans trans_list.text \
-level word -name TIDigits -type text audio_db.sof trans_db.sof
This is the preferred way to run the command since it makes clear
that the key is the bridge between the two types of information.
The beauty of our database approach to handling file lists is that
important subsets of a database are now simply referenced
using lists of ids. In this way, we avoid the
problem of mismatches between audio files and transcriptions.
The audio and transcription databases are created once for the
entire database, and users simply need to operate on the appropriate
lists of ids. Common problems such as a missing transcription or
an incorrect ordering of files, which cause mismatches between
simple listing files, are alleviated because there is just one file,
a list of ids, that needs to be maintained.
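Before building the databases, it can be worth confirming that the two list files agree on their ids. The following is a minimal sketch of such a check, not a feature of isip_make_db. The helper name check_ids is hypothetical, and it assumes audio list lines of the form "<filename> <id>" and transcription list lines whose key is the text before the first space or colon.

```shell
#!/bin/sh
# Sketch only: check_ids is a hypothetical helper, not an ISIP tool.
# Assumptions: audio list lines are "<filename> <id>"; transcription list
# lines start with the key, followed by a space or a colon.
check_ids() {
    a=$(mktemp) && t=$(mktemp) || return 2
    # Ids from the audio list (lines without an explicit id are skipped).
    awk 'NF >= 2 { print $2 }' "$1" | sort -u > "$a"
    # Keys from the transcription list (blank lines are dropped).
    sed -e '/^[[:space:]]*$/d' -e 's/[ :].*//' "$2" | sort -u > "$t"
    missing_trans=$(comm -23 "$a" "$t")   # ids only in the audio list
    missing_audio=$(comm -13 "$a" "$t")   # ids only in the transcription list
    rm -f "$a" "$t"
    status=0
    for id in $missing_trans; do
        echo "no transcription for id: $id"; status=1
    done
    for id in $missing_audio; do
        echo "no audio file for id: $id"; status=1
    done
    return $status
}
```

"check_ids audio_list.text trans_list.text" prints nothing and returns 0 when every id appears in both files; otherwise it reports each unmatched id and returns 1.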
For a more detailed explanation of isip_make_db, see our
on-line documentation.