4.3.6 Scoring: Significance Testing
Although hypothesis scoring gives us a good idea of how well a recognition
system performs on a set of data, it is not the best way to compare the
performance of two different recognition systems to determine which one is
better. For this task, significance testing is often used.
Instead of looking at an entire utterance transcription at one time,
significance testing usually splits the transcriptions into segments
consisting of several words. The segments are specific to the pair
of systems being compared. They are bounded
on both sides by words correctly recognized by both systems (or by the
beginning or end of utterance). See the figure below:
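The segmentation rule can also be sketched in a few lines of Python. This is only an illustration, not the code used by the scoring tools: it assumes both hypotheses have already been aligned to the same reference, so that each aligned position carries an error count for each system (0 means the word was recognized correctly). The function name split_into_segments and the input layout are hypothetical.

```python
from typing import List, Tuple


def split_into_segments(errors_a: List[int],
                        errors_b: List[int]) -> List[Tuple[int, int]]:
    """Return (errors of system A, errors of system B) for each segment.

    errors_a[i] and errors_b[i] are the error counts of the two systems at
    aligned position i. A position where both counts are 0 (both systems
    recognized the word correctly) acts as a segment boundary, as do the
    beginning and end of the utterance.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both systems must be aligned to the same positions")

    segments = []
    total_a = total_b = 0
    in_segment = False
    for err_a, err_b in zip(errors_a, errors_b):
        if err_a == 0 and err_b == 0:
            # Boundary word: close the segment that was open, if any.
            if in_segment:
                segments.append((total_a, total_b))
                total_a = total_b = 0
                in_segment = False
        else:
            # At least one system made an error here: extend the segment.
            total_a += err_a
            total_b += err_b
            in_segment = True
    if in_segment:
        # The end of the utterance also closes a segment.
        segments.append((total_a, total_b))
    return segments


# Made-up error counts for one eight-word utterance.
system_a = [0, 1, 1, 0, 0, 0, 1, 0]
system_b = [0, 0, 1, 0, 0, 1, 0, 0]
print(split_into_segments(system_a, system_b))  # [(2, 1), (1, 1)]
```

Each tuple holds the error counts of the two systems inside one segment. These per-segment counts are the quantities the significance test works with.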
The significance test involves the difference in the numbers of errors of
the two systems in each segment. The mean of these differences is used
along with a control parameter called the "significance level" to determine
through an experiment if one recognition system is significantly better
than another. For a more technical definition of this test, see
this report.
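To make the role of the mean difference and the significance level more concrete, here is a minimal Python sketch of a matched-pairs test on per-segment error counts. It is not the sc_stats implementation. It assumes the usual large-sample form of the test, in which the mean difference divided by its standard error is compared against a standard normal distribution. The function name matched_pairs_test and the default significance level of 0.05 are assumptions made for this illustration.

```python
import math
from typing import Sequence, Tuple


def matched_pairs_test(segments: Sequence[Tuple[int, int]],
                       significance_level: float = 0.05):
    """Two-tailed matched-pairs test on per-segment error counts.

    segments holds one (errors of system A, errors of system B) pair per
    segment. Returns (statistic, p_value, is_significant).
    The default significance level is an assumption, not a tool setting.
    """
    diffs = [err_a - err_b for err_a, err_b in segments]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two segments")

    mean = sum(diffs) / n
    # Unbiased sample variance of the per-segment differences.
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)

    if variance == 0.0:
        # All differences are identical: either no difference at all,
        # or a perfectly consistent one.
        statistic = 0.0 if mean == 0 else math.copysign(math.inf, mean)
    else:
        statistic = mean / math.sqrt(variance / n)

    # Two-tailed p-value under the normal approximation:
    # 2 * (1 - Phi(|w|)) equals erfc(|w| / sqrt(2)).
    p_value = math.erfc(abs(statistic) / math.sqrt(2.0))
    return statistic, p_value, p_value < significance_level


# Made-up segment counts in which system A makes more errors than system B.
example = [(1, 0)] * 9 + [(2, 0)]
statistic, p_value, significant = matched_pairs_test(example)
print(statistic, p_value, significant)
```

A small p-value means that the observed mean difference would be unlikely if the two systems really performed equally well, so the null hypothesis of no difference is rejected at the chosen significance level. The normal approximation is only reasonable when the number of segments is large.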
Now that you have a basic understanding of significance testing, let's run
through a simple example. This example will use the results from the
experiments in
Section 4.2.4,
word-internal models, and
Section 4.2.5,
cross-word models.
Go to the following directory:
cd $ISIP_TUTORIAL/sections/s04/s04_03_p06/
This directory contains several files including hypotheses generated by the
two different experiments, and a script called
isip_eval_sgml.sh.
The following test will attempt to determine if one
system is significantly better than the other. Run the command:
isip_eval_sgml.sh score $ISIP_TUTORIAL/research/isip/databases/lists/identifiers_test.sof reference.score results_01.score
Expected Output:
./isip_eval_sgml.sh> converting from isip_word format to score format .....
./isip_eval_sgml.sh> evaluating using sclite .....
/usr/local/sctk/bin/sclite -F -i swb -r reference.score -h results_01.score.score -o sgml
sclite: 2.2 TK Version 1.2
Begin alignment of Ref File: 'reference.score' and Hyp File: 'results_01.score'
Alignment# 18 for speaker ah
Alignment# 17 for speaker ar
Alignment# 17 for speaker at
Alignment# 17 for speaker bc
Alignment# 17 for speaker be
Alignment# 17 for speaker bm
Alignment# 17 for speaker bn
....
This command aligns the hypothesis file to the reference file and splits the
utterances into segments of the type described above. Two files are
created: results_01.score.report and results_01.score.sgml. The
results_01.score.report is empty, and we will ignore it. The file
results_01.score.sgml is an sgml score file and will be used later with the
score file of the second system to test the two systems. Now that we have
the alignments for the results of the first system, we need to extract the
alignments for the results of the second system.
Run the command:
isip_eval_sgml.sh score $ISIP_TUTORIAL/research/isip/databases/lists/identifiers_test.sof reference.score results_02.score
This command generates two more files: results_02.score.report and
results_02.score.sgml. Once again, the file results_02.score.sgml
is the sgml score file for the second system. We can now use these
two sgml score files to compare both systems.
Run the command:
cat results_01.score.sgml results_02.score.sgml | sc_stats -p -t mapsswe -v -u -n result_sys_01_sys_02
Expected output:
sc_stats: 1.2
Beginning Multi-System comparisons and reports
Performing the Matched Pair Sentence Segment (Word Error) Test
Output written to 'result_sys_01_sys_02.stats.mapsswe'
Printing Unified Statistical Test Reports
Output written to 'result_sys_01_sys_02.stats.unified'
Successful Completion
This command uses NIST's sc_stats tool to perform a two-tailed significance
test with the null hypothesis that there is no performance difference
between the two systems. Two files are generated:
result_sys_01_sys_02.stats.mapsswe and result_sys_01_sys_02.stats.unified.
The file ending with .unified contains the report. The other file is
empty and we will ignore it. The report consists of a detailed explanation
of how to read the significance findings between the two systems.
Click here
to see an example of this report.
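If you repeat this comparison often, the three commands used in this section can be collected into a small driver. The Python sketch below only assembles the command lines exactly as they appear above. The helper names build_commands and run_all are hypothetical, and run_all assumes that isip_eval_sgml.sh and sc_stats are on your path and that ISIP_TUTORIAL is set in your environment.

```python
import os
import subprocess
from typing import List, Sequence

# Identifier list used by both scoring runs in this section.
IDENTIFIERS = "$ISIP_TUTORIAL/research/isip/databases/lists/identifiers_test.sof"


def build_commands(hyp_files: Sequence[str],
                   reference: str = "reference.score",
                   report_name: str = "result_sys_01_sys_02") -> List[str]:
    """Return the shell commands for scoring each system and comparing them."""
    # One scoring run per hypothesis file; each produces <hyp>.sgml.
    commands = [
        f"isip_eval_sgml.sh score {IDENTIFIERS} {reference} {hyp}"
        for hyp in hyp_files
    ]
    # Concatenate the SGML score files and pipe them into sc_stats.
    sgml_files = " ".join(f"{hyp}.sgml" for hyp in hyp_files)
    commands.append(
        f"cat {sgml_files} | sc_stats -p -t mapsswe -v -u -n {report_name}"
    )
    return commands


def run_all(commands: Sequence[str], workdir: str) -> None:
    """Run the commands in order, stopping at the first failure."""
    workdir = os.path.expandvars(workdir)
    for command in commands:
        subprocess.run(command, shell=True, check=True, cwd=workdir)


# Print the commands without running them.
for command in build_commands(["results_01.score", "results_02.score"]):
    print(command)
```

To execute the commands, call run_all(build_commands([...]), "$ISIP_TUTORIAL/sections/s04/s04_03_p06/"). Printing the commands first makes it easy to check them against the ones shown in this section.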