MSMO: Multimodal Summarization with Multimodal Output
2023.07.25
This paper can be accessed from here.
This paper is the first work in the Multimodal Summarization with Multimodal Output (MSMO) field. The authors collected a dataset for MSMO research and proposed a model to handle the task.
Model Structure
Text Encoder
A BiLSTM is used to encode the source text into contextual word features.
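A minimal PyTorch sketch of such an encoder; the vocabulary size and hidden dimensions here are illustrative placeholders, not the paper's settings.

```python
import torch.nn as nn

class TextEncoder(nn.Module):
    """Single-layer BiLSTM over word embeddings (dimensions are illustrative)."""
    def __init__(self, vocab_size=50000, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        emb = self.embedding(token_ids)          # (batch, seq_len, emb_dim)
        outputs, (h_n, c_n) = self.bilstm(emb)   # outputs: (batch, seq_len, 2*hidden_dim)
        return outputs, (h_n, c_n)
```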
Image Encoder
VGG19 is used to extract global and local image features. The global features come from the last layer of the VGG19 network, and the local features are the feature map after the last pooling layer (before the fully connected layers).
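A hedged sketch of how these two feature sets could be pulled from torchvision's pretrained VGG19 (assuming a recent torchvision; using the penultimate fc layer for the global feature is my assumption, the paper's exact layer choice may differ):

```python
import torch
import torchvision.models as models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
vgg.eval()

def extract_features(images):
    """images: (batch, 3, 224, 224), already normalized for ImageNet."""
    with torch.no_grad():
        # Local features: 7x7 grid of 512-d vectors after the last pooling layer.
        fmap = vgg.features(images)                  # (batch, 512, 7, 7)
        local = fmap.flatten(2).transpose(1, 2)      # (batch, 49, 512)
        # Global feature: activations from the fully connected part
        # (penultimate fc layer here, a common choice; an assumption, not the paper's spec).
        x = vgg.avgpool(fmap).flatten(1)
        global_feat = vgg.classifier[:-1](x)         # (batch, 4096)
    return global_feat, local
```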
Visual Attention
Consists of three parts (a minimal sketch follows the list):
- Attention on global features
- Attention on local features
- Hierarchical visual attention on local features
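As a rough illustration of attention on local features, here is a minimal additive-attention sketch over one image's 7x7 feature grid, conditioned on the decoder state; the hierarchical variant would apply a second attention over the resulting per-image context vectors. Names and dimensions are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class LocalVisualAttention(nn.Module):
    """Additive attention over the 49 local VGG19 features of one image."""
    def __init__(self, feat_dim=512, state_dim=512, attn_dim=256):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)
        self.proj_state = nn.Linear(state_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, local_feats, dec_state):
        # local_feats: (batch, 49, feat_dim), dec_state: (batch, state_dim)
        energy = torch.tanh(self.proj_feat(local_feats)
                            + self.proj_state(dec_state).unsqueeze(1))
        weights = torch.softmax(self.score(energy), dim=1)   # (batch, 49, 1)
        context = (weights * local_feats).sum(dim=1)         # (batch, feat_dim)
        return context, weights
```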
Multimodal Attention Layer
To fuse the textual and visual context information, a multimodal attention layer is added. It calculates an attention weight for the visual context vector and the text context vector respectively, then weights the two context vectors accordingly and adds them together to obtain the final fused context.
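A minimal sketch of this fusion step, assuming scalar attention scores computed from each context vector together with the decoder state (the exact scoring function in the paper may differ):

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Weights the text and visual context vectors and sums them."""
    def __init__(self, ctx_dim=512, state_dim=512):
        super().__init__()
        self.score_txt = nn.Linear(ctx_dim + state_dim, 1)
        self.score_img = nn.Linear(ctx_dim + state_dim, 1)

    def forward(self, txt_ctx, img_ctx, dec_state):
        # txt_ctx, img_ctx: (batch, ctx_dim); dec_state: (batch, state_dim)
        e_txt = self.score_txt(torch.cat([txt_ctx, dec_state], dim=-1))
        e_img = self.score_img(torch.cat([img_ctx, dec_state], dim=-1))
        alphas = torch.softmax(torch.cat([e_txt, e_img], dim=-1), dim=-1)  # (batch, 2)
        fused = alphas[:, :1] * txt_ctx + alphas[:, 1:] * img_ctx
        return fused, alphas
```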
Decoder
A Pointer-Generator Network is used as the decoder; its copy mechanism lets the model copy words directly from the source, and the accompanying coverage mechanism helps reduce repetition.
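For reference, a small sketch of the pointer-generator mixture that forms the final word distribution (shapes and names are illustrative):

```python
import torch

def final_distribution(p_gen, vocab_dist, attn_weights, src_ids):
    """Generate from the vocabulary with prob p_gen, copy from the source
    with prob (1 - p_gen).
    p_gen: (batch, 1), vocab_dist: (batch, vocab_size),
    attn_weights: (batch, src_len), src_ids: (batch, src_len)."""
    gen = p_gen * vocab_dist
    copy = torch.zeros_like(vocab_dist).scatter_add_(
        1, src_ids, (1.0 - p_gen) * attn_weights)
    return gen + copy
```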
Contribution
- Present the MSMO task
- Propose a multimodal summarization model for it
Possible Improvement
- Some components (VGG19, BiLSTM, Pointer-Generator Network) could be replaced with more recent state-of-the-art models.
- The model does not consider cross-modal interaction when fusing the multimodal information; cross-modal attention could be added to the multimodal attention layer.