You have a large dataset of images and text descriptions. You want to train a model that can perform both image captioning (generating text from images) and text-to-image generation (generating images from text). What architectural approach is best suited for this multimodal bi-directional task?
Correct Answer: C
Separate encoders for images and text allow specialized feature extraction for each modality: a vision encoder for images and a language encoder for text. A shared attention mechanism enables cross-modal interaction, letting the model attend to the relevant parts of both the image and text representations regardless of which direction the task runs. Separate decoders then produce outputs in each modality: token logits for captioning, pixels or image tokens for generation. The alternatives fall short: training two separate models is less efficient and does not leverage shared cross-modal knowledge; a single shared encoder can struggle to capture modality-specific features; a single monolithic transformer handling both directions is computationally expensive and harder to train; and a GAN is suited to image generation alone, not to a bidirectional image-text task.
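As a concrete illustration (not part of the original question), the sketch below shows one way the architecture in option C could look in PyTorch: an `ImageEncoder` and `TextEncoder` with modality-specific weights, a single `SharedCrossAttention` block reused in both task directions, and separate output heads for caption logits and a toy pixel reconstruction. All class names, dimensions, and the learned-latent-query trick for text-to-image generation are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn


class ImageEncoder(nn.Module):
    """Projects image patches to embeddings (ViT-style); image-specific weights."""
    def __init__(self, d_model=256, patch=16, img_size=64):
        super().__init__()
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, (img_size // patch) ** 2, d_model))

    def forward(self, images):                               # (B, 3, H, W)
        return self.proj(images).flatten(2).transpose(1, 2) + self.pos


class TextEncoder(nn.Module):
    """Contextualizes token ids with a small transformer; text-specific weights."""
    def __init__(self, vocab=10000, d_model=256, layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tokens):                               # (B, T)
        return self.encoder(self.emb(tokens))                # (B, T, d)


class SharedCrossAttention(nn.Module):
    """A single attention block reused for both task directions."""
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, queries, context):
        out, _ = self.attn(queries, context, context)        # queries attend to context
        return out


class BiModalModel(nn.Module):
    """Separate encoders and decoders around one shared cross-attention block."""
    def __init__(self, vocab=10000, d_model=256, img_size=64, num_latents=16):
        super().__init__()
        self.img_enc = ImageEncoder(d_model=d_model, img_size=img_size)
        self.txt_enc = TextEncoder(vocab=vocab, d_model=d_model)
        self.cross = SharedCrossAttention(d_model=d_model)
        self.txt_head = nn.Linear(d_model, vocab)                     # text decoder: caption logits
        self.img_head = nn.Linear(d_model, 3 * img_size * img_size)   # toy image decoder
        self.latents = nn.Parameter(torch.randn(1, num_latents, d_model))
        self.img_size = img_size

    def caption(self, images, tokens):
        """Image -> text: text features query the image features."""
        fused = self.cross(self.txt_enc(tokens), self.img_enc(images))
        return self.txt_head(fused)                                   # (B, T, vocab)

    def generate_image(self, tokens):
        """Text -> image: learned latent queries attend to the text features."""
        queries = self.latents.expand(tokens.size(0), -1, -1)
        fused = self.cross(queries, self.txt_enc(tokens))
        return self.img_head(fused.mean(dim=1)).view(-1, 3, self.img_size, self.img_size)


if __name__ == "__main__":
    model = BiModalModel()
    images = torch.randn(2, 3, 64, 64)
    tokens = torch.randint(0, 10000, (2, 12))
    print(model.caption(images, tokens).shape)        # torch.Size([2, 12, 10000])
    print(model.generate_image(tokens).shape)         # torch.Size([2, 3, 64, 64])
```

Sharing the cross-attention weights is what lets gradients from both tasks shape the same image-text alignment, which is the efficiency argument made above against training two separate models.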