Benchmarking Audio Visual Segmentation for Long-Untrimmed Videos

Publisher:
IEEE
Publication Type:
Conference Proceeding
Citation:
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 22712-22722
Issue Date:
2024-09-16
File:
1762136.pdf (Published version, Adobe PDF, 1.52 MB)
Existing audio-visual segmentation (AVS) datasets typically focus on short, trimmed videos with only one pixel-map annotation per one-second video clip. In contrast, for untrimmed videos, the sound duration, the start and end sounding time positions, and the visual deformation of audible objects vary significantly. Therefore, we observed that current AVS models trained on trimmed videos might struggle to segment sounding objects in long videos. To investigate the feasibility of grounding audible objects in videos along both the temporal and spatial dimensions, we introduce the Long-Untrimmed Audio-Visual Segmentation dataset (LU-AVS), which includes precise frame-level annotations of sound emission times and provides exhaustive mask annotations for all frames. Considering that pixel-level annotations are difficult to achieve in some complex scenes, we also provide bounding boxes to indicate the sounding regions. Specifically, LU-AVS contains 10M mask annotations across 6.6K videos and 11M bounding-box annotations across 7K videos. Compared with the existing datasets, LU-AVS videos are on average 4.8 times longer, with the silent duration being 3.15 times greater. Furthermore, we adapt several baseline models originally designed for related audio-visual tasks to examine the challenges of our newly curated LU-AVS. Through comprehensive evaluation, we demonstrate that LU-AVS is more challenging than datasets containing only trimmed videos. LU-AVS therefore provides an ideal yet challenging platform for evaluating audio-visual segmentation and localization on long, untrimmed videos. The dataset is publicly available at https://yenanliu.github.io/LU-AVS/.
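Since LU-AVS grounds sounding objects both in time (sound emission intervals) and in space (per-frame masks or boxes), evaluation on such a benchmark naturally combines a temporal overlap measure with a spatial one. The sketch below is not the official LU-AVS toolkit; it only illustrates the two generic quantities involved, temporal IoU over sound-emission intervals and per-frame mask IoU, and all function names and data formats are illustrative assumptions.

```python
# Minimal sketch of temporal and spatial overlap measures (not the
# official LU-AVS evaluation code; formats are assumptions).
import numpy as np


def temporal_iou(pred, gt):
    """IoU of two [start, end] intervals (seconds or frame indices)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mask_iou(pred_mask, gt_mask):
    """IoU of two boolean segmentation masks of identical shape."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 1.0  # both empty: treat as agreement


if __name__ == "__main__":
    # Hypothetical example: an object annotated as sounding from 2.0 s to
    # 7.5 s but predicted from 3.0 s to 8.0 s, plus a pair of toy 4x4 masks.
    print(temporal_iou((3.0, 8.0), (2.0, 7.5)))       # ~0.75
    pred = np.zeros((4, 4), bool); pred[1:3, 1:3] = True
    gt = np.zeros((4, 4), bool);   gt[1:4, 1:4] = True
    print(mask_iou(pred, gt))                         # ~0.44
```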