NCA-GENM Exam Dumps | You are tasked with building a multimodal AI system that can generate video descriptions from video footage.


Question 118/192

You are tasked with building a multimodal AI system that can generate video descriptions from video footage. You have experimented with several architectures, including a combination of CNNs for visual feature extraction and LSTMs for sequence generation. However, the model struggles to capture long-range dependencies in the video. Which of the following architectural modifications or training techniques is MOST likely to address this issue?

A. Increasing the number of layers in the CNN to extract more detailed visual features.

B. Incorporating a Transformer-based architecture, such as a Vision Transformer (ViT) for visual feature extraction and a standard Transformer for sequence generation.

C. Reducing the frame rate of the input video to reduce the temporal complexity.

D. Using a smaller batch size during training to reduce memory consumption.
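
To make the term "long-range dependencies" concrete, the following is a minimal sketch of the scaled dot-product self-attention used in the Transformer architectures named in option B. It is an illustration only, not a captioning model: the per-frame feature vectors are hypothetical, and the learned query/key/value projections are replaced by the identity for brevity. It shows that the first frame can attend to the last frame in a single step, whereas an LSTM must carry that information through every intermediate time step.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention over per-frame feature vectors.

    Assumption: queries, keys, and values are the raw features
    (identity projections), which a real Transformer would learn.
    """
    dim = len(frames[0])
    all_weights, outputs = [], []
    for query in frames:
        # Every frame is compared with every other frame directly.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        # Output is the attention-weighted average of all frame features.
        outputs.append([
            sum(w * frame[i] for w, frame in zip(weights, frames))
            for i in range(dim)
        ])
    return outputs, all_weights


# Hypothetical features: the first and last frames show similar content.
frames = [[4.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [4.0, 0.0]]
outputs, weights = self_attention(frames)

# Attention paid by frame 0 to each frame, including the distant frame 4.
print([round(w, 3) for w in weights[0]])
```

In this toy input, the weight that frame 0 assigns to frame 4 does not depend on how many frames lie between them, which is the property the question is probing.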
