Sounding that Object: Interactive Object-Aware Image to Audio Generation

摘要

Generating accurate sounds for complex audiovisual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project site: https://tinglok.netlify.app/files/avobject/.

作者

Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang

期刊/会议

ICML 2025

模型成果

Seed2.0 Seedance 2.0 Seedream 5.0 Lite Seeduplex Seed GR-RL

研究团队

LLM Infrastructures Vision Speech Multimodal Interaction & World Model AI for Science Robotics Responsible AI

了解更多

博客 Seed Edge 校园招聘