The annotation graph represents the linguistic annotation of recorded
speech data. The linguistic annotation, in the case of speech
recognition, is simply an orthographic annotation of speech data,
which may or may not be time-aligned to an audio recording [1]. The
orthographic annotation, generally referred to as a transcription, is
a label associated with the audio recording. The transcription, along
with the audio recording, is used to train the speech recognition
system in a supervised learning framework. The annotated transcription
may include a hierarchy of linguistic, syntactic and semantic
knowledge sources that need to be conveniently represented.
The annotation graph provides a convenient means for representing a
hierarchy of knowledge sources. An annotation graph may be used to
represent a single transcription or an entire conversation depending
on how the speech database is organized. This alleviates the problem
of having multiple copies of the same transcription for each knowledge
source, and it also provides an application programmer interface (API)
to tag and query the various knowledge sources.
Framework
The design of the annotation graph framework follows the design
specification given by S. Bird and M. Liberman at the Linguistic Data
Consortium [2]. The following is a contrived example that will be used
to demonstrate the essential elements of an annotation graph.
The annotation graph represents the knowledge sources for the
orthographic transcription "the boy ran." The transcription is
time-aligned to the audio recording, i.e., the audio recording
contains the words in the transcription and has a duration of 1.4
seconds. The syntactic structure of the transcription (i.e., noun and
verb phrases in this case) is represented as a separate layer in the
annotation graph. The noun phrase, "the boy," lies in the interval
[0.0, 0.2], and the verb phrase, "ran," lies in the interval [0.8,
1.4]. The phonetic structure, phones realized by the words, is also
represented as a separate layer in the annotation graph. The word,
"the," that constitutes the noun phrase of the transcription, is
represented by the phone /dh/ followed by /ax/. The phones
corresponding to the words in the transcription can be strung together
to form a phonetic transcription.
Each layer in the annotation graph described above represents a
different knowledge source in the linguistic annotation. The arcs in
the annotation graph represent the linguistic notations applied to the
raw language data. The nodes in the annotation graph represent the
time offsets corresponding to each linguistic notation. The nodes need
not contain time offsets; they can be empty to indicate notations that
are not time-aligned. This framework allows the various linguistic
knowledge sources to coexist within the same structure. The ability to
represent the different knowledge sources, and the flexibility of
tagging and querying the notations, are the primary reasons why we
chose to incorporate the annotation graph framework in our system.
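The arc-and-node structure described above can be sketched in a few lines of standard C++. This is only an illustration under stated assumptions: the types Anchor, Annotation and Graph and the function buildThe are hypothetical and are not the library's AnnotationGraph API. The sketch also chooses to share anchors between the word and its phones, which is a simplification made here for clarity.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical types for illustration; not the library's AnnotationGraph API.

// A node: a time offset that may be absent for notations that are not
// time-aligned.
struct Anchor {
    std::optional<double> offset;  // seconds, or empty if not time-aligned
};

// An arc: a linguistic notation spanning two anchors.
struct Annotation {
    std::size_t start;  // index of the start anchor
    std::size_t end;    // index of the end anchor
    std::string label;  // e.g. "the" or "dh"
    std::map<std::string, std::string> features;  // optional tags
};

struct Graph {
    std::vector<Anchor> anchors;
    std::vector<Annotation> annotations;

    // Add a node; omit the offset for a non-time-aligned node.
    std::size_t addAnchor(std::optional<double> offset = std::nullopt) {
        anchors.push_back(Anchor{offset});
        return anchors.size() - 1;
    }

    // Add an arc between two existing nodes.
    std::size_t addAnnotation(std::size_t start, std::size_t end,
                              const std::string& label) {
        assert(start < anchors.size() && end < anchors.size());
        annotations.push_back(Annotation{start, end, label, {}});
        return annotations.size() - 1;
    }
};

// Build the word "the" over [0.0, 0.2] together with its phones
// /dh/ over [0.0, 0.1] and /ax/ over [0.1, 0.2].
Graph buildThe() {
    Graph g;
    std::size_t a0 = g.addAnchor(0.0);
    std::size_t a1 = g.addAnchor(0.1);
    std::size_t a2 = g.addAnchor(0.2);
    g.addAnnotation(a0, a2, "the");  // word layer
    g.addAnnotation(a0, a1, "dh");   // phone layer
    g.addAnnotation(a1, a2, "ax");   // phone layer
    return g;
}
```

Calling g.addAnchor() with no argument on such a graph yields a node without a time offset, which is how this sketch models a notation that is not time-aligned.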
One way we use annotation graphs is to represent the hypotheses
generated by the decoder during recognition. The decoder is
hierarchical in nature, i.e., each level of the hierarchy represents a
separate knowledge source. A hypothesis generated during recognition
can represent each level of the hierarchy as a separate layer in the
annotation graph. Therefore, each hypothesis generated is easily
output as an annotation graph.
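As a rough sketch of this idea, the code below flattens a two-level hypothesis (words and their phones) into arcs that carry a layer tag. Everything here is an assumption made for illustration: the structures PhoneHyp, WordHyp and Arc, and the functions hypothesisToArcs and exampleHypothesis, are hypothetical and do not describe the decoder's actual output format or the library's API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical decoder output: one entry per word, each with its phones.
struct PhoneHyp {
    std::string phone;
    double start;  // seconds
    double end;    // seconds
};

struct WordHyp {
    std::string word;
    double start;  // seconds
    double end;    // seconds
    std::vector<PhoneHyp> phones;
};

// Hypothetical arc: a labeled time span tagged with the layer it belongs to.
struct Arc {
    double start;
    double end;
    std::string label;
    std::string level;
};

// Emit one arc per word and one arc per phone, so that each level of the
// hierarchy becomes its own layer.
std::vector<Arc> hypothesisToArcs(const std::vector<WordHyp>& hyp) {
    std::vector<Arc> arcs;
    for (const WordHyp& w : hyp) {
        assert(w.start <= w.end);
        arcs.push_back(Arc{w.start, w.end, w.word, "ORTHOGRAPHIC"});
        for (const PhoneHyp& p : w.phones) {
            arcs.push_back(Arc{p.start, p.end, p.phone, "PHONETIC"});
        }
    }
    return arcs;
}

// A one-word hypothesis for "the" with the phones /dh/ and /ax/.
std::vector<WordHyp> exampleHypothesis() {
    WordHyp the{"the", 0.0, 0.2, {}};
    the.phones.push_back(PhoneHyp{"dh", 0.0, 0.1});
    the.phones.push_back(PhoneHyp{"ax", 0.1, 0.2});
    return {the};
}
```

Each arc produced this way could then be added to a graph as one annotation, with the level stored as a feature.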
Example
The following is an example of how to build an annotation graph that
contains the orthographic transcription "the" and its corresponding
phones /dh/ and /ax/:
#include <AnnotationGraph.h>

int main(int argc, const char** argv) {

    String tmp_str;
    String name(L"CONTRIVED EXAMPLE");
    String type(L"TRANSCRIPTION");
    String unit(L"seconds");

    Float offset_00(0.0);
    Float offset_01(0.1);
    Float offset_02(0.2);

    AnnotationGraph angr(name, type);

    // annotation for the word "the"
    //
    tmp_str.assign(L"the");
    angr.createAnnotation(name,
        angr.getAnchorById(angr.createAnchor(name, offset_00, unit)),
        angr.getAnchorById(angr.createAnchor(name, offset_02, unit)),
        tmp_str);

    // annotation for the phone "dh"
    //
    tmp_str.assign(L"dh");
    angr.createAnnotation(name,
        angr.getAnchorById(angr.createAnchor(name, offset_00, unit)),
        angr.getAnchorById(angr.createAnchor(name, offset_01, unit)),
        tmp_str);

    // annotation for the phone "ax"
    //
    tmp_str.assign(L"ax");
    angr.createAnnotation(name,
        angr.getAnchorById(angr.createAnchor(name, offset_01, unit)),
        angr.getAnchorById(angr.createAnchor(name, offset_02, unit)),
        tmp_str);

    // exit gracefully
    //
    Integral::exit();
}
The example above creates an annotation graph for the orthographic
transcription "the" and the corresponding phonemic transcription,
i.e., the phone /dh/ followed by /ax/. However, the example does not
tag the different levels of the annotation graph. We must tag the
different levels if we intend to extract the orthographic and phonemic
transcriptions from the graph. The following example shows how to tag
the different levels in the annotation graph:
#include <AnnotationGraph.h>

int main(int argc, const char** argv) {

    String new_id;
    String tmp_str;
    String key;
    String value;
    String name(L"EXAMPLE");
    String type(L"TRANSCRIPTION");
    String unit(L"SECONDS");

    Float offset_00(0.0);
    Float offset_01(0.1);
    Float offset_02(0.2);

    AnnotationGraph angr(name, type);

    // annotation for the word "the"
    //
    tmp_str.assign(L"the");
    new_id = angr.createAnnotation(name,
        angr.getAnchorById(angr.createAnchor(name, offset_00, unit)),
        angr.getAnchorById(angr.createAnchor(name, offset_02, unit)),
        tmp_str);

    // tag annotation with the value "ORTHOGRAPHIC"
    //
    key.assign(L"level");
    value.assign(L"ORTHOGRAPHIC");
    angr.setFeature(new_id, key, value);

    // annotation for the phone "dh"
    //
    tmp_str.assign(L"dh");
    new_id = angr.createAnnotation(name,
        angr.getAnchorById(angr.createAnchor(name, offset_00, unit)),
        angr.getAnchorById(angr.createAnchor(name, offset_01, unit)),
        tmp_str);

    // tag annotation with the value "PHONETIC"
    //
    value.assign(L"PHONETIC");
    angr.setFeature(new_id, key, value);

    // annotation for the phone "ax"
    //
    tmp_str.assign(L"ax");
    new_id = angr.createAnnotation(name,
        angr.getAnchorById(angr.createAnchor(name, offset_01, unit)),
        angr.getAnchorById(angr.createAnchor(name, offset_02, unit)),
        tmp_str);

    // tag annotation with the value "PHONETIC"
    //
    value.assign(L"PHONETIC");
    angr.setFeature(new_id, key, value);

    // exit gracefully
    //
    Integral::exit();
}
The example above allows us to extract the different levels of the
annotation graph by referring to the tag. In the example above we can
extract all annotations that have the level called "PHONETIC" or
"ORTHOGRAPHIC". In the latter case the annotation with the notation
"the" is returned, and in the former case the annotations with the
notations /dh/ and /ax/ are returned.
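The extraction step described above amounts to filtering annotations by a key/value feature. The following is a minimal sketch of that filter in standard C++, not the library's query interface: the type TaggedAnnotation and the functions makeTagged, taggedExample and labelsWithFeature are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an annotation that carries key/value features.
struct TaggedAnnotation {
    std::string label;
    std::map<std::string, std::string> features;
};

// Create an annotation tagged with the feature "level".
TaggedAnnotation makeTagged(const std::string& label,
                            const std::string& level) {
    TaggedAnnotation a;
    a.label = label;
    a.features["level"] = level;
    return a;
}

// The three annotations from the tagging example: "the", "dh" and "ax".
std::vector<TaggedAnnotation> taggedExample() {
    return {makeTagged("the", "ORTHOGRAPHIC"),
            makeTagged("dh", "PHONETIC"),
            makeTagged("ax", "PHONETIC")};
}

// Return the labels of all annotations whose feature `key` equals `value`,
// in the order in which the annotations were stored.
std::vector<std::string> labelsWithFeature(
        const std::vector<TaggedAnnotation>& annotations,
        const std::string& key, const std::string& value) {
    assert(!key.empty());
    std::vector<std::string> labels;
    for (const TaggedAnnotation& a : annotations) {
        auto it = a.features.find(key);
        if (it != a.features.end() && it->second == value) {
            labels.push_back(a.label);
        }
    }
    return labels;
}
```

With the three annotations above, querying the key "level" for "PHONETIC" selects the labels dh and ax, and querying for "ORTHOGRAPHIC" selects the.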
Resources
The following links point to resources that will be helpful in
learning how to build and use our annotation graph library. These
links contain documentation and APIs related to the annotation graph
toolkit. More information on the Linguistic Data Consortium's
Annotation Graph Toolkit can be found at http://agtk.sourceforge.net/.
References
- S. Bird and M. Liberman, "A Formal Framework for Linguistic
  Annotation," Linguistic Data Consortium, University of Pennsylvania,
  Philadelphia, Pennsylvania, USA, 2000.
- K. Maeda, X. Ma, H. Lee, and S. Bird, "The Annotation Graph Toolkit:
  Application Developer's Manual (Draft)," Linguistic Data Consortium,
  University of Pennsylvania, Philadelphia, Pennsylvania, USA, 2001
  (see http://agtk.sourceforge.net/).