Automated speech recognition (ASR) technology has made conversations more accessible with live captions in remote conferencing software, mobile applications, and head-worn displays. However, to maintain real-time responsiveness, live caption systems often display interim predictions that are updated as new utterances are received. This can cause text instability (a "flicker" where previously displayed text is updated, shown in the captions on the left in the video below), which can impair users' reading experience through distraction, fatigue, and difficulty following the conversation.
In "Modeling and Improving Text Stability in Live Captions", presented at ACM CHI 2023, we formalize this problem of text stability through a few key contributions. First, we quantify text instability with a vision-based flicker metric that uses luminance contrast and the discrete Fourier transform. Second, we introduce a stability algorithm that stabilizes the rendering of live captions via tokenized alignment, semantic merging, and smooth animation. Finally, we conducted a user study (N=123) to understand viewers' experience with live captioning. Our statistical analysis demonstrates a strong correlation between our proposed flicker metric and viewers' experience. Moreover, it shows that our proposed stabilization techniques significantly improve viewers' experience (e.g., the captions on the right in the video above).
Raw ASR captions vs. stabilized captions
Metric
Inspired by previous work, we propose a flicker-based metric to quantify text stability and objectively evaluate the performance of live captioning systems. Specifically, our goal is to quantify the flicker in a grayscale live caption video. We achieve this by comparing the difference in luminance between the individual frames (shown in the figures below) that constitute the video. Large visual changes in luminance are obvious (e.g., the addition of the word "bright" in the figure on the bottom), but subtle changes (e.g., an update from "… this gold. Good.." to "… this. Gold is good") may be difficult for readers to discern. However, converting the change in luminance to its constituent frequencies exposes both the obvious and the subtle changes.
Thus, for each pair of contiguous frames, we convert the difference in luminance into its constituent frequencies using the discrete Fourier transform. We then sum over the low and high frequencies to quantify the flicker in that pair. Finally, we average over all of the frame pairs to get a per-video flicker score.
As an illustration, we can see below that two identical frames (top) yield a flicker of 0, while two non-identical frames (bottom) yield a non-zero flicker. Note that higher values of the metric indicate more flicker in the video and, thus, a worse user experience than lower values.
Illustration of the flicker metric between two identical frames.
Illustration of the flicker metric between two non-identical frames.
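This computation is simple enough to sketch in a few lines of Python. The following is a minimal sketch under stated assumptions (grayscale frames as a NumPy array, an unweighted sum over all frequencies); the exact windowing and normalization of our implementation may differ.

```python
import numpy as np

def flicker_metric(frames: np.ndarray) -> float:
    """Per-video flicker score for a grayscale caption video.

    frames: array of shape (num_frames, height, width) holding
    per-pixel luminance values.
    """
    per_pair = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        # Difference in luminance between two contiguous frames.
        diff = curr.astype(np.float64) - prev.astype(np.float64)
        # Convert the change in luminance into its constituent
        # frequencies with a 2D discrete Fourier transform.
        spectrum = np.fft.fft2(diff)
        # Sum the magnitude over all (low and high) frequencies;
        # identical frames give a zero spectrum, hence zero flicker.
        per_pair.append(np.abs(spectrum).sum())
    # Average over all frame pairs for the per-video score.
    return float(np.mean(per_pair))
```

Because identical frames produce an all-zero difference, they contribute nothing to the score, while even a subtle re-captioning spreads energy across the spectrum and is picked up.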
Stability algorithm
To improve the stability of live captions, we propose an algorithm that takes as input the already rendered sequence of tokens (e.g., "Previous" in the figure below) and the new sequence of ASR predictions, and outputs an updated, stabilized text (e.g., "Updated text (with stabilization)" below). It considers both the natural language understanding (NLU) aspect and the ergonomic aspect (display, layout, etc.) of the user experience in deciding when and how to produce stable updated text. Specifically, our algorithm performs tokenized alignment, semantic merging, and smooth animation to achieve this goal. In what follows, a token is defined as a word or punctuation mark produced by ASR.
We show (a) the previously rendered text, (b) the baseline layout of updated text without our merging algorithm, and (c) the updated text as generated by our stabilization algorithm.
Our algorithm addresses the challenge of producing stabilized updated text by first identifying three classes of changes (highlighted in red, green, and blue below):
Red: Addition of tokens to the end of previously rendered captions (e.g., "How about").
Green: Addition or deletion of tokens in the middle of already rendered captions.
B1: Addition of tokens (e.g., "I" and "friends"). These may or may not affect the overall comprehension of the captions, but may lead to layout changes. Such layout changes are not desired in live captions as they cause significant jitter and a poorer user experience. Here "I" doesn't add to the comprehension but "friends" does. Thus, it is important to balance updates with stability, especially for B1-type tokens.
B2: Removal of tokens, e.g., "in" is removed in the updated sentence.
Blue: Re-captioning of tokens. This includes token edits that may or may not affect the overall comprehension of the captions.
C1: Proper nouns, e.g., "disney land" is updated to "Disneyland".
C2: Grammatical shorthands, e.g., "it's" is updated to "It was".
Classes of changes between previously displayed and updated text.
Alignment, merging, and smoothing
To maximize text stability, our goal is to align the old sequence with the new sequence using updates that make minimal changes to the existing layout while ensuring accurate and meaningful captions. To achieve this, we leverage a variant of the Needleman-Wunsch algorithm with dynamic programming to merge the two sequences, depending on the class of tokens as defined above (see the sketch after this list):
Case A tokens: We directly add case A tokens, inserting line breaks as needed to fit the updated captions.
Case B tokens: Our preliminary studies showed that users preferred stability over accuracy for previously displayed captions. Thus, we only update case B tokens if the updates do not break the existing line layout.
Case C tokens: We compare the semantic similarity of case C tokens by transforming the original and updated sentences into sentence embeddings, measuring their dot product, and updating the tokens only if they are semantically different (similarity < 0.85) and the update will not cause new line breaks.
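The Python sketch below illustrates the overall shape of this merge under stated assumptions: a textbook Needleman-Wunsch alignment over tokens, followed by a policy pass that applies the case A/B/C rules. The scoring constants and the `fits_layout` and `similarity` helpers are illustrative placeholders (the real system checks line layout against the renderer and compares sentence-level embeddings), not our production implementation.

```python
MATCH, MISMATCH, GAP = 1, -1, -1  # assumed scoring constants

def align(old, new):
    """Needleman-Wunsch alignment of rendered vs. updated token lists.

    Returns (old_token, new_token) pairs, where None marks a gap,
    i.e., an addition or a removal.
    """
    n, m = len(old), len(new)
    # Dynamic-programming score table.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = MATCH if old[i - 1] == new[j - 1] else MISMATCH
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + GAP,
                              score[i][j - 1] + GAP)
    # Traceback from the bottom-right corner of the table.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        diag_ok = i > 0 and j > 0
        sub = MATCH if diag_ok and old[i - 1] == new[j - 1] else MISMATCH
        if diag_ok and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((old[i - 1], new[j - 1]))
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            pairs.append((old[i - 1], None))   # removal (case B2)
            i -= 1
        else:
            pairs.append((None, new[j - 1]))   # addition (case A or B1)
            j -= 1
    return pairs[::-1]

def merge(old, new, fits_layout, similarity):
    """Apply the case A/B/C policy over the token alignment."""
    pairs = align(old, new)
    # Pairs after the last old token are appends to the end (case A).
    last_old = max((k for k, (o, _) in enumerate(pairs) if o is not None),
                   default=-1)
    merged = []
    for k, (o, t) in enumerate(pairs):
        if o == t:                      # unchanged token: keep as-is
            merged.append(o)
        elif k > last_old:              # case A: append at the end
            merged.append(t)
        elif o is None or t is None:    # case B: mid-text addition/removal
            # Users preferred stability: apply only if the existing
            # line layout is preserved, otherwise keep the old text.
            if fits_layout(merged + ([t] if t is not None else [])):
                if t is not None:
                    merged.append(t)
            elif o is not None:
                merged.append(o)
        else:                           # case C: re-captioned token
            # Update only if semantically different (similarity < 0.85)
            # and the update causes no new line breaks. The real check
            # compares embeddings of whole sentences, not single tokens.
            if similarity(o, t) < 0.85 and fits_layout(merged + [t]):
                merged.append(t)
            else:
                merged.append(o)
    return merged
```

A `fits_layout` that always returns False would never touch previously rendered text, while one that always returns True would reproduce the unstable baseline; the actual layout check sits between those extremes.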
Finally, we leverage animations to reduce visual jitter. We implement smooth scrolling and fading in of newly added tokens to further stabilize the overall layout of the live captions (a minimal sketch of this step follows).
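As a rough illustration of the smoothing step, the sketch below eases the caption block's scroll offset toward its target each frame and fades a new token in over a fixed duration; the rate and duration constants are assumptions for illustration, not our renderer's actual parameters.

```python
import math

def ease_scroll(current: float, target: float, dt: float,
                rate: float = 10.0) -> float:
    """Frame-rate-independent exponential easing of the scroll offset,
    so new caption lines slide in instead of jumping."""
    return target + (current - target) * math.exp(-rate * dt)

def token_opacity(age_s: float, fade_s: float = 0.3) -> float:
    """Linearly fade a newly added token from transparent to opaque."""
    return min(1.0, max(0.0, age_s / fade_s))
```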
User evaluation
We conducted a user study with 123 participants to (1) examine the correlation of our proposed flicker metric with viewers' experience of the live captions, and (2) assess the effectiveness of our stabilization techniques.
We manually selected 20 YouTube videos to obtain broad coverage of topics, including video conferences, documentaries, academic talks, tutorials, news, comedy, and more. For each video, we selected a 30-second clip with at least 90% speech.
We prepared four types of live caption renderings to compare:
Raw ASR: raw speech-to-text results from a speech-to-text API.
Raw ASR + thresholding: only display an interim speech-to-text result if its confidence score is higher than 0.85.
Stabilized captions: captions using our algorithm described above with alignment and merging.
Stabilized and smooth captions: stabilized captions with smooth animation (scrolling + fading) to assess whether a softened display experience helps improve the user experience.
We collected user ratings by asking the participants to watch the recorded live captions and rate their assessments of comfort, distraction, ease of reading, ease of following the video, fatigue, and whether the captions impaired their experience.
Correlation between flicker metric and user experience
We calculated Spearman's coefficient between the flicker metric and each of the behavioral measurements (values range from -1 to 1, where negative values indicate a negative relationship between the two variables, positive values indicate a positive relationship, and zero indicates no relationship). Shown below, our study demonstrates statistically significant (p < 0.001) correlations between our flicker metric and users' ratings. The absolute values of the coefficients are around 0.3, indicating a moderate relationship.
Behavioral measurement        Correlation to flicker metric*
Comfort                       -0.29
Distraction                    0.33
Easy to read                  -0.31
Easy to follow videos         -0.29
Fatigue                        0.36
Impaired experience            0.31
Spearman correlation tests of our proposed flicker metric. *p < 0.001.
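For reference, correlations of this kind can be computed with SciPy's `spearmanr`; the values below are placeholders, not the study data.

```python
from scipy.stats import spearmanr

# Per-clip flicker scores paired with mean user ratings for one
# behavioral measurement (illustrative values, not the study data).
flicker = [3.1, 5.4, 2.2, 7.9, 4.6, 6.3]
comfort = [5.8, 4.1, 6.3, 2.9, 4.8, 3.7]

rho, p = spearmanr(flicker, comfort)
print(f"Spearman's rho = {rho:.2f}, p = {p:.3g}")
```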
Stabilization of live captions
Our proposed technique (stabilized smooth captions) received consistently better ratings, significant as measured by the Mann-Whitney U test (p < 0.01 in the figure below), on five of the six aforementioned survey statements. That is, users considered the stabilized captions with smoothing to be more comfortable and easier to read, while feeling less distraction, fatigue, and impairment to their experience than with other types of rendering.
User ratings from 1 (Strongly Disagree) to 7 (Strongly Agree) on survey statements. (**: p<0.01; ***: p<0.001; ****: p<0.0001; ns: non-significant)
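Such pairwise condition comparisons can likewise be run with SciPy's `mannwhitneyu`; the ratings below are placeholders, not the collected data.

```python
from scipy.stats import mannwhitneyu

# 1-7 Likert ratings for one survey statement under two renderings
# (placeholder values, not the collected data).
raw_asr = [3, 2, 4, 3, 5, 2, 3, 4]
stabilized_smooth = [6, 5, 7, 6, 5, 6, 7, 6]

u_stat, p = mannwhitneyu(raw_asr, stabilized_smooth, alternative="two-sided")
print(f"U = {u_stat}, p = {p:.3g}")
```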
Conclusion and future directions
Text instability in live captioning significantly impairs users' reading experience. This work proposes a vision-based metric to model caption stability that correlates with users' experience in a statistically significant way, and an algorithm to stabilize the rendering of live captions. Our proposed solution can potentially be integrated into existing ASR systems to enhance the usability of live captions for a variety of users, including those with translation needs and those with hearing accessibility needs.
Our work represents a substantial step towards measuring and improving text stability. It can be extended with language-based metrics that focus on the consistency of the words and phrases used in live captions over time. Such metrics could reflect user discomfort as it relates to language comprehension and understanding in real-world scenarios. We are also interested in conducting eye-tracking studies (e.g., the videos shown below) to track viewers' gaze patterns, such as eye fixations and saccades, allowing us to better understand which types of errors are most distracting and how to improve text stability for those.
Illustration of tracking a viewer's gaze when reading raw ASR captions.
Illustration of tracking a viewer's gaze when reading stabilized and smoothed captions.
By improving text stability in live captions, we can create more effective communication tools and improve how people connect in everyday conversations in familiar or, through translation, unfamiliar languages.
Acknowledgements
This work is a collaboration across multiple teams at Google. Key contributors include Xingyu "Bruce" Liu, Jun Zhang, Leonardo Ferrer, Susan Xu, Vikas Bahirwani, Boris Smus, Alex Olwal, and Ruofei Du. We would like to extend our thanks to our colleagues who provided assistance, including Nishtha Bhatia, Max Spear, and Darcy Philippon. We would also like to thank Lin Li, Evan Parker, and the CHI 2023 reviewers.