Detecting Scene Changes in Audiovisual Content | by Netflix Technology Blog | Jun, 2023

Avneesh Saluja, Andy Yao, Hossein Taghavi

When watching a movie or an episode of a TV show, we experience a cohesive narrative that unfolds before us, often without giving much thought to the underlying structure that makes it all possible. However, movies and episodes are not atomic units, but rather composed of smaller elements such as frames, shots, scenes, sequences, and acts. Understanding these elements and how they relate to each other is crucial for tasks such as video summarization and highlights detection, content-based video retrieval, dubbing quality assessment, and video editing. At Netflix, such workflows are performed hundreds of times a day by many teams around the world, so investing in algorithmically-assisted tooling around content understanding can reap outsized rewards.

While segmentation of more granular units like frames and shot boundaries is either trivial or can primarily rely on pixel-based information, higher order segmentation¹ requires a more nuanced understanding of the content, such as the narrative or emotional arcs. In addition, some cues can be better inferred from modalities other than the video, e.g. the screenplay or the audio and dialogue track. Scene boundary detection, in particular, is the task of identifying the transitions between scenes, where a scene is defined as a continuous sequence of shots that take place in the same time and location (often with a relatively static set of characters) and share a common action or theme.

In this blog post, we present two complementary approaches to scene boundary detection in audiovisual content. The first method, which can be seen as a form of weak supervision, leverages auxiliary data in the form of a screenplay by aligning screenplay text with timed text (closed captions, audio descriptions) and assigning timestamps to the screenplay's scene headers (a.k.a. sluglines). In the second approach, we show that a relatively simple, supervised sequential model (bidirectional LSTM or GRU) that uses rich, pretrained shot-level embeddings can outperform the current state-of-the-art baselines on our internal benchmarks.

Figure 1: a scene consists of a sequence of shots.

Screenplays are the blueprints of a movie or show. They are formatted in a specific way, with each scene beginning with a scene header, indicating attributes such as the location and time of day. This consistent formatting makes it possible to parse screenplays into a structured format. At the same time, a) changes made on the fly (directorial or actor discretion) or b) in post production and editing are rarely reflected in the screenplay, i.e. it is not rewritten to reflect the changes.

Figure 2: screenplay elements, from The Witcher S1E1.
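As a rough illustration of how regular this format is, the sketch below pulls scene headers out of a screenplay with a single regular expression. The pattern and the sample lines are simplified, illustrative assumptions rather than our production parser.

```python
import re

# Simplified slugline pattern: scene headers conventionally start with INT./EXT.
# (or both), followed by a location and an optional time of day.
SLUGLINE_RE = re.compile(
    r"^(?P<int_ext>INT\.|EXT\.|INT\./EXT\.)\s+"
    r"(?P<location>.+?)"
    r"(?:\s+-\s+(?P<time_of_day>DAY|NIGHT|DAWN|DUSK|CONTINUOUS))?$"
)

def parse_scene_headers(screenplay_lines):
    """Return (line_index, parsed_header) pairs for every slugline found."""
    headers = []
    for i, line in enumerate(screenplay_lines):
        match = SLUGLINE_RE.match(line.strip())
        if match:
            headers.append((i, match.groupdict()))
    return headers

# Hypothetical lines, not from an actual screenplay:
lines = [
    "INT. TAVERN - NIGHT",
    "Geralt sits alone at a corner table.",
    "EXT. FOREST ROAD - DAY",
]
print(parse_scene_headers(lines))
# [(0, {'int_ext': 'INT.', 'location': 'TAVERN', 'time_of_day': 'NIGHT'}),
#  (2, {'int_ext': 'EXT.', 'location': 'FOREST ROAD', 'time_of_day': 'DAY'})]
```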

In order to leverage this noisily aligned data source, we need to align time-stamped text (e.g. closed captions and audio descriptions) with screenplay text (dialogue and action² lines), taking into account a) the on-the-fly changes that can result in semantically similar but not identical line pairs and b) the possible post-shoot changes which are more significant (reordering, removing, or inserting entire scenes). To handle the first challenge, we use pretrained sentence-level embeddings, e.g. from an embedding model optimized for paraphrase identification, to represent text in both sources. For the second challenge, we use dynamic time warping (DTW), a method for measuring the similarity between two sequences that may vary in time or speed. While DTW assumes a monotonicity condition on the alignments³ which is frequently violated in practice, it is robust enough to recover from local misalignments, and the vast majority of salient events (like scene boundaries) are well-aligned.
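A minimal sketch of this alignment idea follows. It assumes an off-the-shelf paraphrase sentence encoder (the sentence-transformers checkpoint named here is just one example) and a plain NumPy implementation of DTW over cosine distances; the sample lines are illustrative, and our internal pipeline differs in its models and details.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any paraphrase-oriented sentence encoder works here; this checkpoint is illustrative.
encoder = SentenceTransformer("paraphrase-MiniLM-L6-v2")

def cosine_distance_matrix(a, b):
    """Pairwise cosine distances between two sets of sentence embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def dtw_alignment(cost):
    """Classic DTW: accumulate costs, then backtrack the monotonic warping path."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    # Backtrack from the bottom-right corner to recover aligned index pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# screenplay_lines: dialogue/action lines in script order
# caption_lines:    timed-text cues in playback order (each with a known timestamp)
screenplay_lines = ["Toss a coin to your witcher.", "He draws his sword."]
caption_lines = ["Toss a coin to your witcher", "[He unsheathes his sword]"]

cost = cosine_distance_matrix(
    encoder.encode(screenplay_lines), encoder.encode(caption_lines)
)
print(dtw_alignment(cost))  # e.g. [(0, 0), (1, 1)]
```

Once action and dialogue lines are aligned to timed-text cues this way, each scene header can inherit a timestamp from the first aligned line that follows it.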

As a result of DTW, the scene headers have timestamps that can indicate possible scene boundaries in the video. The alignments can also be used to, e.g., augment audiovisual ML models with screenplay information like scene-level embeddings, or transfer labels assigned to audiovisual content to train screenplay prediction models.

Figure 3: alignments between screenplay and video via time-stamped text for The Witcher S1E1.

The alignment approach above is a great way to get up and running with the scene change task, since it combines easy-to-use pretrained embeddings with a well-known dynamic programming technique. However, it presupposes the availability of high-quality screenplays. A complementary approach (which, in fact, can use the above alignments as a feature) that we present next is to train a sequence model on annotated scene change data. Certain workflows in Netflix capture this information, and that is our primary data source; publicly-released datasets are also available.

From an architectural perspective, the model is relatively simple: a bidirectional GRU (biGRU) that ingests shot representations at each step and predicts if a shot is at the end of a scene.⁴ The richness in the model comes from these pretrained, multimodal shot embeddings, a preferable design choice in our setting given the difficulty of obtaining labeled scene change data and the relatively larger scale at which we can pretrain various embedding models for shots.
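The sketch below captures this architecture in PyTorch, assuming precomputed shot embeddings; the dimensions and hyperparameters are illustrative placeholders, not our production values.

```python
import torch
import torch.nn as nn

class SceneBoundaryBiGRU(nn.Module):
    """Bidirectional GRU over a sequence of shot embeddings; one scene-end
    logit per shot. Dimensions are illustrative placeholders."""

    def __init__(self, shot_dim=512, hidden_dim=256):
        super().__init__()
        self.bigru = nn.GRU(
            input_size=shot_dim,
            hidden_size=hidden_dim,
            batch_first=True,
            bidirectional=True,
        )
        # 2 * hidden_dim because forward and backward states are concatenated.
        self.classifier = nn.Linear(2 * hidden_dim, 1)

    def forward(self, shot_embeddings):
        # shot_embeddings: (batch, num_shots, shot_dim)
        hidden, _ = self.bigru(shot_embeddings)     # (batch, num_shots, 2 * hidden_dim)
        return self.classifier(hidden).squeeze(-1)  # (batch, num_shots) logits

# Toy usage: one episode with 200 shots and 512-dim pretrained shot embeddings.
model = SceneBoundaryBiGRU()
logits = model(torch.randn(1, 200, 512))
labels = torch.zeros(1, 200)  # 1.0 wherever a shot ends a scene
loss = nn.BCEWithLogitsLoss()(logits, labels)
```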

For video embeddings, we leverage an in-house model pretrained on aligned video clips paired with text (the aforementioned "timestamped text"). For audio embeddings, we first apply source separation to try to separate foreground (speech) from background (music, sound effects, noise), embed each separated waveform separately using wav2vec2, and then concatenate the results. Both early and late-stage fusion approaches are explored; in the former (Figure 4a), the audio and video embeddings are concatenated and fed into a single biGRU, and in the latter (Figure 4b) each input modality is encoded with its own biGRU, after which the hidden states are concatenated prior to the output layer.

Figure 4a: Early Fusion (concatenate embeddings at the input).
Figure 4b: Late Fusion (concatenate prior to the prediction output).
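For concreteness, here is a sketch of the two fusion variants in the same PyTorch style as above; the embedding dimensions are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class EarlyFusionBiGRU(nn.Module):
    """Figure 4a: concatenate audio and video shot embeddings, then one biGRU."""
    def __init__(self, video_dim=512, audio_dim=768, hidden_dim=256):
        super().__init__()
        self.bigru = nn.GRU(video_dim + audio_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 1)

    def forward(self, video, audio):
        fused, _ = self.bigru(torch.cat([video, audio], dim=-1))
        return self.classifier(fused).squeeze(-1)

class LateFusionBiGRU(nn.Module):
    """Figure 4b: one biGRU per modality; concatenate hidden states before the output layer."""
    def __init__(self, video_dim=512, audio_dim=768, hidden_dim=256):
        super().__init__()
        self.video_gru = nn.GRU(video_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.audio_gru = nn.GRU(audio_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(4 * hidden_dim, 1)

    def forward(self, video, audio):
        v, _ = self.video_gru(video)
        a, _ = self.audio_gru(audio)
        return self.classifier(torch.cat([v, a], dim=-1)).squeeze(-1)

# video: (batch, num_shots, 512), audio: (batch, num_shots, 768)
video, audio = torch.randn(1, 200, 512), torch.randn(1, 200, 768)
print(EarlyFusionBiGRU()(video, audio).shape, LateFusionBiGRU()(video, audio).shape)
```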

We find:

  • Our results match and sometimes even outperform the state-of-the-art (benchmarked using the video modality only and on our evaluation data). We evaluate the outputs using F-1 score for the positive label, and also relax this evaluation to consider "off-by-n" F-1, i.e., whether the model predicts scene changes within n shots of the ground truth (see the sketch after this list). This is a more realistic measure for our use cases due to the human-in-the-loop setting that these models are deployed in.
  • As with previous work, adding audio features improves results by 10–15%. A significant driver of variation in performance is late vs. early fusion.
  • Late fusion is consistently 3–7% better than early fusion. Intuitively, this result makes sense: the temporal dependencies between shots are likely modality-specific and should be encoded separately.
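Below is a minimal sketch of the relaxed, off-by-n matching mentioned in the first bullet, using a greedy one-to-one matching between predicted and ground-truth boundary shots; the exact matching rules we use internally may differ.

```python
def off_by_n_f1(predicted_shots, true_shots, n=2):
    """Relaxed F-1: a predicted scene-change shot counts as correct if it lies
    within n shots of some still-unmatched ground-truth boundary."""
    unmatched = set(true_shots)
    true_positives = 0
    for p in sorted(predicted_shots):
        candidates = [t for t in unmatched if abs(p - t) <= n]
        if candidates:
            unmatched.discard(min(candidates, key=lambda t: abs(p - t)))
            true_positives += 1
    precision = true_positives / len(predicted_shots) if predicted_shots else 0.0
    recall = true_positives / len(true_shots) if true_shots else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: a prediction that is off by a single shot still counts when n >= 1.
print(off_by_n_f1(predicted_shots=[10, 41, 90], true_shots=[10, 40, 75], n=2))
```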

We have presented two complementary approaches to scene boundary detection that leverage a variety of available modalities: screenplay, audio, and video. Logically, the next steps are to a) combine these approaches and use screenplay features in a unified model and b) generalize the outputs across multiple shot-level inference tasks, e.g. shot type classification and memorable moments identification, as we hypothesize that this direction will be useful for training general purpose video understanding models for longer-form content. Longer-form content also contains more complex narrative structure, and we envision this work as the first in a series of projects that aim to better integrate narrative understanding into our multimodal machine learning models.

Special thanks to Amir Ziai, Anna Pulido, and Angie Pollema.