RNA: Video Editing with ROI-based Neural Atlas

Abstract

With the recent growth of video-based Social Network Service (SNS) platforms, the demand for video editing among everyday users has increased. However, video editing can be challenging due to temporally varying factors such as camera movement and moving objects. While modern atlas-based video editing methods have addressed these issues, they often fail to edit videos that include complex motion or multiple moving objects, and they incur excessive computational cost even for very simple edits. In this paper, we propose a novel region-of-interest (ROI)-based video editing framework: ROI-based Neural Atlas (RNA). Unlike prior work, RNA allows users to specify editing regions, which simplifies the editing process by removing the need for foreground separation and atlas modeling for foreground objects. However, this simplification presents a unique challenge: acquiring a mask that effectively handles occlusions in the edited area caused by moving objects, without relying on an additional segmentation model. To tackle this, we propose a novel mask refinement approach designed for this specific challenge. Moreover, we introduce a soft neural atlas model for video reconstruction to ensure high-quality editing results. Extensive experiments show that RNA offers a more practical and efficient editing solution that is applicable to a wider range of videos and delivers superior quality compared to prior methods.

Method

Overall framework of RNA. For video editing, (a) a user selects a reference frame from an input video and specifies the ROI they want to edit. (b) For the specified ROI, our method estimates a 2D atlas representing its temporally invariant appearance. (c) Then, the user edits the 2D atlas. (d) Finally, an edited video is reconstructed from the edited atlas and the input video.

Specifically, our video editing framework has three main components: atlas estimation, mask refinement, and video reconstruction using a soft neural atlas model. We first estimate the mappings 𝕄, 𝔸, 𝕋, and 𝕃 for a given video in an end-to-end supervised manner. Then, we perform an additional mask refinement process to account more accurately for occlusions caused by moving foreground objects. Following this, the edited video is reconstructed using a video reconstruction method based on a novel soft neural atlas model.
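As a rough illustration of step (d), the sketch below blends an edited atlas back into a single frame with a soft per-pixel mask. It is a generic alpha-compositing example, not the RNA reconstruction model itself: the function name `composite_frame`, the array layouts, the normalized `uv` coordinates, and the nearest-neighbour atlas lookup are all assumptions made for illustration, and the learned mappings 𝕄, 𝔸, 𝕋, and 𝕃 are not modeled here.

```python
import numpy as np

def composite_frame(frame, uv, edited_atlas, soft_mask):
    """Blend an edited atlas into one video frame (illustrative only).

    frame:        (H, W, 3) float array, original frame in [0, 1]
    uv:           (H, W, 2) float array, per-pixel atlas coordinates (x, y) in [0, 1]
    edited_atlas: (Ha, Wa, 3) float array, the user-edited 2D atlas
    soft_mask:    (H, W) float array in [0, 1]; 1 means the ROI is visible,
                  0 means the pixel is outside the ROI or occluded
    """
    ha, wa = edited_atlas.shape[:2]
    # Nearest-neighbour atlas lookup; a real system would interpolate.
    cols = np.clip(np.rint(uv[..., 0] * (wa - 1)).astype(int), 0, wa - 1)
    rows = np.clip(np.rint(uv[..., 1] * (ha - 1)).astype(int), 0, ha - 1)
    atlas_rgb = edited_atlas[rows, cols]
    # Soft blend: edited content where the mask is high, original elsewhere.
    alpha = soft_mask[..., None]
    return alpha * atlas_rgb + (1.0 - alpha) * frame
```

With a mask of this kind, pixels covered by a moving foreground object (mask near 0) keep their original color, so the edit appears to pass behind the occluder.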

Results

Qualitative comparisons

Previous methods produce unnatural editing results because they fail to model the atlases of foreground objects, owing to the complex relationships between those objects or their complex movements. In contrast, RNA achieves natural video editing results in these challenging scenarios.

Additional editing results with rendering of atlas

The rendered atlas shows the scene content without the occluding foreground objects. We exploit this property in both our mask refinement and our soft neural atlas model to achieve high-quality editing results.

Reconstruction Quality and Efficiency

RNA achieves PSNR values comparable to those of previous methods, with a generally smaller computational overhead that stays constant regardless of the number of moving objects.
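For reference, PSNR can be computed as in the minimal sketch below, assuming the standard definition 10·log10(MAX² / MSE) with pixel values in [0, 1]. The helper name `psnr` and the `max_value` parameter are illustrative; this page does not specify how the reported values were aggregated (for example, per frame or over the whole video).

```python
import numpy as np

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        # Identical inputs: PSNR is unbounded.
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Higher values indicate a reconstruction closer to the input video.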

Ablation Study

We conduct an ablation study by sequentially applying our mask refinement and soft neural atlas model after atlas estimation.

BibTeX


@article{lee2024rna,
  title={RNA: Video Editing with ROI-based Neural Atlas},
  author={Lee, Jaekyeong and Kim, Geonung and Cho, Sunghyun},
  journal={arXiv preprint arXiv:2410.07600},
  year={2024}
}

Acknowledgements

The website template was borrowed from Text2Live.