1. Difficulties in data collection and processing: Multimodal research draws on several kinds of media data, such as text, images, audio, and video. Each modality has its own characteristics and calls for its own processing methods, which makes collecting and processing the data complicated.
2. Difficulties in model design and training: Designing models and algorithms suited to multimodal data, such as deep learning models, feature extraction algorithms, and fusion algorithms, is very difficult.