You are working with a multimodal dataset containing images and corresponding text descriptions. You want to train a model to generate text descriptions for new images. You decide to use a transformer-based architecture with separate encoders for images and text. How should you effectively fuse the image and text representations to enable cross-modal interaction?
Correct Answer: C
Cross-attention allows the decoder to selectively attend to relevant parts of both the image and text representations, enabling fine-grained interaction between the modalities. Concatenation or averaging simply combines the representations without allowing for selective attention. Training the encoders separately and then combining their outputs doesn't allow for cross-modal interaction during training. Element-wise multiplication is not a standard fusion strategy and provides no selective, per-token interaction between the modalities.
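To make the cross-attention fusion concrete, here is a minimal PyTorch sketch of a decoder layer in this spirit. The class name, dimensions, and pre-norm layout are illustrative choices, not part of the question: text tokens act as queries, while the image encoder's patch features supply keys and values, so each generated token can attend to the image regions most relevant to it.

```python
import torch
import torch.nn as nn

class CrossModalDecoderLayer(nn.Module):
    """Illustrative decoder layer fusing modalities via cross-attention."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, text, image_feats):
        # Self-attention: text tokens attend to other text tokens.
        x = self.norm1(text)
        text = text + self.self_attn(x, x, x, need_weights=False)[0]
        # Cross-attention: text tokens are the queries; image patch
        # features are the keys and values, so each token can focus
        # on the image regions relevant to the word being generated.
        x = self.norm2(text)
        text = text + self.cross_attn(x, image_feats, image_feats,
                                      need_weights=False)[0]
        # Position-wise feed-forward network.
        return text + self.ffn(self.norm3(text))

# Toy shapes: batch of 2, 10 text tokens, 49 image patches, d_model=256.
layer = CrossModalDecoderLayer()
out = layer(torch.randn(2, 10, 256), torch.randn(2, 49, 256))
print(tuple(out.shape))  # (2, 10, 256)
```

Note the contrast with the rejected options: concatenation or averaging would merge the two sequences with fixed weights, whereas the attention weights here are computed per token, per head, at every step.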