TY - JOUR
T1 - Effectively Obtaining Acoustic, Visual, and Textual Data from Videos
AU - León, Jorge E.
AU - Carrasco, Miguel
N1 - Publisher Copyright:
© 2025 by the authors.
PY - 2025/12
Y1 - 2025/12
N2 - Featured Application: The proposed method for generating large-scale audio–image–text datasets from videos addresses the critical scarcity of high-quality multimodal data, enabling advancements in audio-conditioned image-to-image generation and related tasks. By extracting semantically aligned audio–image pairs and augmenting them with descriptive texts, this work facilitates the training of more robust multimodal models, such as those for enhancing low-resolution recordings, creating dynamic video content like music videos or virtual assistant interactions, and developing augmented reality systems that incorporate real-time environmental audio for immersive user experiences. Ultimately, it promotes the democratization of AI by providing accessible, diverse datasets that support transfer learning and reduce reliance on modality conversions, paving the way for innovative applications in fields such as creative media production, remote sensing, and deep audiovisual learning. The increasing use of machine learning models has amplified the demand for high-quality, large-scale multimodal datasets. However, the availability of such datasets, especially those combining acoustic, visual, and textual data, remains limited. This paper addresses this gap by proposing a method of extracting related audio–image–text observations from videos. We detail the process of selecting suitable videos, extracting relevant data pairs, and generating descriptive texts using image-to-text models. Our approach ensures a robust semantic connection between modalities, enhancing the utility of the created datasets for various applications. We also explore the obtained data, discuss the challenges encountered, and propose solutions to improve data quality. The resulting datasets, which are publicly available, aim to support and advance research in multimodal data analysis and machine learning.
KW - audio
KW - data generation
KW - image
KW - multimodal data
KW - text
KW - video
UR - https://www.scopus.com/pages/publications/105024706432
U2 - 10.3390/app152312654
DO - 10.3390/app152312654
M3 - Article
AN - SCOPUS:105024706432
SN - 2076-3417
VL - 15
JO - Applied Sciences (Switzerland)
JF - Applied Sciences (Switzerland)
IS - 23
M1 - 12654
ER -