Evaluation and Annotation Benchmark
As mentioned before, we evaluate the performance of the pipeline from two perspectives:
- Latency
- Accuracy
Latency
For latency, if you log the start time and duration of each task within the Agent module, the latency of each round of conversation can be calculated automatically (a minimal sketch of this kind of timing log is shown after the figures below). The results are presented in two ways:
- Summary Latency Benchmark
- Detailed Latency Benchmark
The figure above is the Detailed Latency Benchmark, which shows the latency of each round of conversation.
The figure below is the Summary Latency Benchmark, which shows summary statistics of the latency.
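As a rough illustration of the timing log that feeds these benchmarks, the per-task start time and duration could be captured with a small decorator like the sketch below. The `track_latency` helper and the printed record are hypothetical, not part of any specific library; in the real pipeline the record would be persisted per conversation round and task name rather than printed.

```python
import time
from functools import wraps

def track_latency(task_name):
    """Hypothetical decorator: records the start time and duration of a task."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            duration = time.time() - start
            # In the real pipeline this record would be stored per
            # conversation round and task name instead of printed.
            print(f"{task_name}: start={start:.3f}, duration={duration:.3f}s")
            return result
        return wrapper
    return decorator

@track_latency("speech2text")
def speech2text(audio_path):
    ...  # run the Speech2Text model here
```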
Accuracy
For the accuracy part, some metrics can be calculated automatically, such as WER (word error rate) for Speech2Text. However, most metrics currently require human annotation, and research into automating them is still ongoing and worth investigating.
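For the automatic case, here is a minimal sketch of how WER could be computed once an annotator has supplied the reference transcript, using the third-party jiwer package as one option; the example strings are illustrative.

```python
from jiwer import wer  # third-party package: pip install jiwer

reference = "turn on the living room lights"   # annotator-provided transcript
hypothesis = "turn on the living room light"   # Speech2Text output

print(f"WER: {wer(reference, hypothesis):.2%}")
```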
To address the metrics that need human annotation, we built annotation functionality for the accuracy part.
Annotation System
Conversation Annotation
We have a table (a Model in Django) called conversation that records each round of conversation.
The conversation will be associated with the input
- Data Audio
- Data Video (associated with the Data Audio)
- Data Text (Converted by Speech2Text Task)
and the output
- Response Text (Generated by the Agent module, Optional)
- Generated Audio (generated by the Text2Speech module or produced directly; optional)
The annotation is based on this input and output, as shown in the figure below and in the model sketch that follows.
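As a rough sketch, the conversation table and its relations could be expressed as a Django model along the following lines. The model and field names (DataAudio, DataVideo, DataText, ResponseText, GeneratedAudio) are assumptions for illustration and may differ from the actual schema.

```python
from django.db import models

class Conversation(models.Model):
    """One round of conversation: the input data plus the optional outputs."""
    data_audio = models.ForeignKey("DataAudio", on_delete=models.CASCADE)
    data_video = models.ForeignKey("DataVideo", null=True, blank=True,
                                   on_delete=models.SET_NULL)
    # Text converted from the audio by the Speech2Text task
    data_text = models.ForeignKey("DataText", null=True, blank=True,
                                  on_delete=models.SET_NULL)
    # Response generated by the Agent module (optional)
    response_text = models.ForeignKey("ResponseText", null=True, blank=True,
                                      on_delete=models.SET_NULL)
    # Audio generated by Text2Speech or produced directly (optional)
    generated_audio = models.ForeignKey("GeneratedAudio", null=True, blank=True,
                                        on_delete=models.SET_NULL)
    created_at = models.DateTimeField(auto_now_add=True)
```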
The built-in evaluation metric is a score in the range [0, 5], where 5 means the response is perfect and 0 means it is completely wrong. In this way, we can calculate a quantitative score for the performance of each component within the pipeline.
For each conversation, you can annotate:
- Speech2Text Score: Whether it is perfect, excellent, good, fair, poor, or bad
- Speech2Text Correct Text: provide the correct transcription of the audio
- Text Generation Score: Same as the score above, evaluate the response text
- Text Generation Proper Text: provide a proper response text
- Text2Speech Score: Same as the score above, evaluate the generated audio
- Overall Score: Overall score for the conversation
- Comments: any comments you want to add, which will also be shown on the benchmark page
One conversation can be annotated by multiple people, and the final score will be the average of all the scores.
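A sketch of what such an annotation record and the averaging could look like is given below, again with hypothetical model and field names, built on the Conversation model sketched above.

```python
from django.db import models
from django.db.models import Avg

class ConversationAnnotation(models.Model):
    """One annotator's scores for one conversation round (hypothetical fields)."""
    conversation = models.ForeignKey("Conversation", on_delete=models.CASCADE)
    annotator = models.ForeignKey("auth.User", on_delete=models.CASCADE)
    speech2text_score = models.IntegerField()        # 0-5
    speech2text_correct_text = models.TextField(blank=True)
    text_generation_score = models.IntegerField()    # 0-5
    text_generation_proper_text = models.TextField(blank=True)
    text2speech_score = models.IntegerField()        # 0-5
    overall_score = models.IntegerField()            # 0-5
    comments = models.TextField(blank=True)

def final_overall_score(conversation):
    """Average the overall score across all annotators of one conversation."""
    return ConversationAnnotation.objects.filter(
        conversation=conversation
    ).aggregate(Avg("overall_score"))["overall_score__avg"]
```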
The figure below shows the default annotation score:
And here is an overview of the annotation for a conversation:
The annotated details are shown at the bottom.
Customised Component Annotation
For a specific component within the pipeline that does not fit the conversation table above, we use a separate table to record the annotation. For example, emotion detection is a customised task we defined and developed, so it has its own annotation table.
Compared to the setup above, the context (input and output) is the same, but the annotation metrics are different:
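As an illustration, such a component-specific table could look like the hypothetical model below; the field names and the 0-5 scale are assumptions based on the conversation annotation above, not the actual schema.

```python
from django.db import models

class EmotionDetectionAnnotation(models.Model):
    """Annotation for the custom emotion detection task (hypothetical fields).

    The context (the conversation's input and output) stays the same;
    only the metric fields differ from the conversation annotation.
    """
    conversation = models.ForeignKey("Conversation", on_delete=models.CASCADE)
    annotator = models.ForeignKey("auth.User", on_delete=models.CASCADE)
    detected_emotion = models.CharField(max_length=32)  # label the pipeline produced
    correct_emotion = models.CharField(max_length=32)   # label the annotator assigns
    score = models.IntegerField()                       # 0-5, same scale as above
    comments = models.TextField(blank=True)
```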
Multi-turn Conversation Annotation
Conversations are actually multi-turn, which means we also want to be able to annotate a group of turns as a whole.
This is also supported.
You can assign a "Tag" to a group of conversations; the last conversation within the group then gets an extra annotation field called "Multi turn annotation overall score and comment" (see the sketch below).
During and after the annotation process, you can track the progress on the Accuracy|Detail page. For example:
After all annotation is done, you can view the accuracy summary on the Accuracy|Benchmark page.
Your multi-turn conversation results can be checked on the Accuracy|Multi-Turn Conversation page.
In summary, for the evaluation benchmark, latency can be calculated automatically, while accuracy requires human annotation. By collecting this kind of annotated data, our tool can also help advance the development of automatic or semi-automatic accuracy metrics.