You are developing a multimodal model that combines time-series data from sensor readings with natural language descriptions of events. The time-series data has varying sampling rates, and the text descriptions are often vague and ambiguous. How would you best address the challenge of aligning and fusing these two modalities to improve model performance?
Correct Answer: C
Dynamic time warping (DTW) aligns sequences that differ in length, sampling rate, or local timing, so it can bring the irregularly sampled sensor data onto a common timeline with the events the text describes. Cross-modal attention then fuses the aligned modalities by letting the model weight the parts of one modality that are most relevant to the other, which can also help when the descriptions are vague. Resampling followed by direct concatenation (A) does not account for temporal distortions. Ignoring data (B) is counterproductive. Averaging (D) loses temporal information. Averaging the outputs of separate models (E) is a form of late fusion, which is less effective than joint learning after alignment.
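The two steps named in the explanation can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a production pipeline: the helper names `dtw_path` and `cross_attention`, the synthetic sine-wave sensors, the random "text embeddings", and the random projection matrices `W_q`, `W_k`, `W_v` are all hypothetical stand-ins. In a real model the text embeddings would come from a language encoder and the projections would be learned.

```python
import numpy as np

def dtw_path(a, b):
    """Classic O(n*m) dynamic time warping.
    Returns (total alignment cost, list of (i, j) index pairs)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    # Pairwise Euclidean distance between every step of a and every step of b.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the end to recover the warping path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = int(np.argmin((acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])))
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return acc[n, m], path

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

# Assumed toy data: two sensors observing the same event at different rates,
# with the slow sensor's timeline locally distorted.
rng = np.random.default_rng(0)
fast = np.sin(2 * np.pi * np.linspace(0, 1, 50))
slow = np.sin(2 * np.pi * np.linspace(0, 1, 20) ** 1.3)

# Step 1: DTW maps every slow-sensor step to one or more fast-sensor steps.
cost, path = dtw_path(slow, fast)
aligned = np.zeros(len(slow))
counts = np.zeros(len(slow))
for i, j in path:
    aligned[i] += fast[j]
    counts[i] += 1
aligned /= counts  # mean of the fast samples matched to each slow step
sensor_feats = np.stack([slow, aligned], axis=1)  # shape (20, 2)

# Step 2: text tokens (random placeholders here) attend over aligned sensors.
text_emb = rng.normal(size=(5, 8))
W_q = rng.normal(size=(8, 4))
W_k = rng.normal(size=(2, 4))
W_v = rng.normal(size=(2, 4))
fused, weights = cross_attention(
    text_emb @ W_q, sensor_feats @ W_k, sensor_feats @ W_v
)
```

Here `fused` holds one sensor-informed vector per text token, and each row of `weights` shows which time steps that token attended to. Because the fusion happens inside the model rather than by averaging the outputs of separate models, the text and sensor representations can be trained jointly, which is the contrast with late fusion drawn in the explanation.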