You are working with a multimodal dataset containing images and corresponding text descriptions. You want to train a model to generate text descriptions for new images. You decide to use a transformer-based architecture with separate encoders for images and text. How should you effectively fuse the image and text representations to enable cross-modal interaction?
Correct Answer: C
Cross-attention allows the decoder to selectively attend to relevant parts of both the image and text representations, enabling fine-grained interaction between the modalities. Concatenation or averaging simply combines the representations without allowing for selective attention. Training the encoders separately and then combining their outputs doesn't allow for cross-modal interaction during training. Element-wise multiplication is not a standard fusion strategy and provides no selective, per-token interaction between the modalities.
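To make the cross-attention fusion concrete, here is a minimal PyTorch sketch of a decoder layer in this spirit. The class name, dimensions, and pre-norm layout are illustrative choices, not part of the question: text tokens act as queries, while the image encoder's patch features supply keys and values, so each generated token can attend to the image regions most relevant to it.

```python
import torch
import torch.nn as nn

class CrossModalDecoderLayer(nn.Module):
    """Illustrative decoder layer fusing modalities via cross-attention."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, text, image_feats):
        # Self-attention: text tokens attend to other text tokens.
        x = self.norm1(text)
        text = text + self.self_attn(x, x, x, need_weights=False)[0]
        # Cross-attention: text tokens are the queries; image patch
        # features are the keys and values, so each token can focus
        # on the image regions relevant to the word being generated.
        x = self.norm2(text)
        text = text + self.cross_attn(x, image_feats, image_feats,
                                      need_weights=False)[0]
        # Position-wise feed-forward network.
        return text + self.ffn(self.norm3(text))

# Toy shapes: batch of 2, 10 text tokens, 49 image patches, d_model=256.
layer = CrossModalDecoderLayer()
out = layer(torch.randn(2, 10, 256), torch.randn(2, 49, 256))
print(tuple(out.shape))  # (2, 10, 256)
```

Note the contrast with the rejected options: concatenation or averaging would merge the two sequences with fixed weights, whereas the attention weights here are computed per token, per head, at every step.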