Upload sample-sigplan.tex

#2
by keycharon - opened
Files changed (1)
  1. sample-sigplan.tex +744 -0
sample-sigplan.tex ADDED
@@ -0,0 +1,744 @@
%%
%% This is file `sample-sigplan.tex',
%% generated with the docstrip utility.
%%
%% The original source files were:
%%
%% samples.dtx (with options: `sigplan')
%%
%% IMPORTANT NOTICE:
%%
%% For the copyright see the source file.
%%
%% Any modified versions of this file must be renamed
%% with new filenames distinct from sample-sigplan.tex.
%%
%% For distribution of the original source see the terms
%% for copying and modification in the file samples.dtx.
%%
%% This generated file may be distributed as long as the
%% original source files, as listed above, are part of the
%% same distribution. (The sources need not necessarily be
%% in the same archive or directory.)
%%
%%
%% Commands for TeXCount
%TC:macro \cite [option:text,text]
%TC:macro \citep [option:text,text]
%TC:macro \citet [option:text,text]
%TC:envir table 0 1
%TC:envir table* 0 1
%TC:envir tabular [ignore] word
%TC:envir displaymath 0 word
%TC:envir math 0 word
%TC:envir comment 0 0
%%
%%
%% The first command in your LaTeX source must be the \documentclass
%% command.
%%
%% For submission and review of your manuscript please change the
%% command to \documentclass[manuscript, screen, review]{acmart}.
%%
%% When submitting camera ready or to TAPS, please change the command
%% to \documentclass[sigconf]{acmart} or whichever template is required
%% for your publication.
%%
%%
\documentclass[sigplan,screen]{acmart}

%%
%% \BibTeX command to typeset BibTeX logo in the docs
\AtBeginDocument{%
\providecommand\BibTeX{{%
Bib\TeX}}}

%% Rights management information. This information is sent to you
%% when you complete the rights form. These commands have SAMPLE
%% values in them; it is your responsibility as an author to replace
%% the commands and values with those provided to you when you
%% complete the rights form.
\setcopyright{acmcopyright}
\copyrightyear{2023}
\acmYear{2023}
\acmDOI{XXXXXXX.XXXXXXX}

%% These commands are for a PROCEEDINGS abstract or paper.
\acmConference[MM '23]{Make sure to enter the correct
conference title from your rights confirmation email}{October 29--November 03,
2023}{Ottawa, Canada}
\acmPrice{15.00}
\acmISBN{978-1-4503-XXXX-X/18/06}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{setspace}
\usepackage{color}
\usepackage{subfigure}
%%
%% Submission ID.
%% Use this when submitting an article to a sponsored event. You'll
%% receive a unique submission ID from the organizers
%% of the event, and this ID should be used as the parameter to this command.
%%\acmSubmissionID{123-A56-BU3}

%%
%% For managing citations, it is recommended to use bibliography
%% files in BibTeX format.
%%
%% You can then either use BibTeX with the ACM-Reference-Format style,
%% or BibLaTeX with the acmnumeric or acmauthoryear styles, that include
%% support for advanced citation of software artefacts from the
%% biblatex-software package, also separately available on CTAN.
%%
%% Look at the sample-*-biblatex.tex files for templates showcasing
%% the biblatex styles.
%%

%%
%% The majority of ACM publications use numbered citations and
%% references. The command \citestyle{authoryear} switches to the
%% "author year" style.
%%
%% If you are preparing content for an event
%% sponsored by ACM SIGGRAPH, you must use the "author year" style of
%% citations and references.
%% Uncommenting
%% the next command will enable that style.
%%\citestyle{acmauthoryear}


%%
%% end of the preamble, start of the body of the document source.
\begin{document}

%%
%% The "title" command has an optional parameter,
%% allowing the author to define a "short title" to be used in page headers.
\title{Boosting Sentence Representation with Visually-supervised Multimodal Pre-training}

%%
%% The "author" command and its associated commands are used to define
%% the authors and their affiliations.
\author{Zhe Li}
\email{keycharon0122@gmail.com}
\orcid{1234-5678-9012}
\affiliation{%
  \institution{Huazhong University of Science and Technology}
  \city{Wuhan}
  \country{China}
  \postcode{430000}
}

%%
%% By default, the full list of authors will be used in the page
%% headers. Often, this list is too long, and will overlap
%% other information printed in the page headers. This command allows
%% the author to define a more concise list
%% of authors' names for this purpose.
\renewcommand{\shortauthors}{Zhe Li et al.}

%%
%% The abstract is a short summary of the work to be presented in the
%% article.
\begin{abstract}
Large-scale pre-trained language models have garnered significant attention owing to their applicability to extracting sentence representations. Most pre-trained models use a transformer-based encoder over a single modality and perform well on natural language inference and question answering. However, multimodal data can provide more effective features from different modalities. Unfortunately, existing multimodal pre-trained models either lack modality alignment, fail to exploit complementary information between modalities, or cannot distinguish highly similar negative and positive samples. To alleviate these issues, we propose a Visually-supervised Pre-trained Multimodal Model (ViPMM) for sentence representation. We design diverse label-free multimodal proxy tasks to embed visual information into language. Comprehensive downstream experiments on natural language understanding and sentiment classification show that ViPMM outperforms both existing unimodal and multimodal pre-trained models.
\end{abstract}

\begin{CCSXML}
<ccs2012>
<concept>
<concept_id>10010147.10010178</concept_id>
<concept_desc>Computing methodologies~Artificial intelligence</concept_desc>
<concept_significance>500</concept_significance>
</concept>
<concept>
<concept_id>10010147.10010178.10010179</concept_id>
<concept_desc>Computing methodologies~Natural language processing</concept_desc>
<concept_significance>500</concept_significance>
</concept>
</ccs2012>
\end{CCSXML}

\ccsdesc[500]{Computing methodologies~Artificial intelligence}
\ccsdesc[500]{Computing methodologies~Natural language processing}
%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
\keywords{multimodal pre-training, visually-supervised, sentence representation, proxy task}

%%
%% This command processes the author and affiliation and title
%% information and builds the first part of the formatted document.
\maketitle

\section{Introduction}
Learning and acquiring knowledge in the human world often rely on supervision through labeled data. However, there exist certain tasks, such as natural language inference (NLI), where humans perform well without supervision. For instance, a person can tell that “\textit{a woman is brushing her teeth}” and “\textit{a woman is playing the piano}” describe contradictory events without ever seeing NLI labels. Interestingly, this unsupervised learning process still ensures the acquisition of knowledge.

Natural language understanding (NLU) is a critical component of many language tasks, including NLI and question answering. While supervised pre-trained language models trained on large-scale datasets have demonstrated competitive performance on such tasks, it is important to note that humans predominantly acquire knowledge in an unsupervised manner. Despite the lack of supervision, other modalities, particularly the visual modality, can provide supervisory signals. For example, we possess the commonsense knowledge that brushing teeth and playing the piano are mutually exclusive events, but such knowledge is not easy to acquire through text alone. However, the presence of images can readily convey this contradictory relationship. Therefore, it is reasonable to posit that combining vision and language is advantageous for NLU.
\begin{figure}
\centering
\includegraphics[scale=0.35]{image/motivation.pdf}
\caption{Our goal is to jointly exploit the information of text and image, where the text can be influenced by the image through back-propagation along the red path. In addition, we want to pull positive samples closer and push negative samples apart in feature space.}
\label{Fig.main}
\end{figure}
Previous studies \cite{kenton2019bert, liu2019roberta, yang2019xlnet} have demonstrated the efficacy of proxy tasks, such as masked language modeling in Bert \cite{kenton2019bert}, for single-modality pre-training without labeled data, leading to improved sentence representations for downstream tasks. More recent studies \cite{zhang2020neural, zhao2020visually, bordes2019incorporating} have found that incorporating vision as a modality can improve language-model grounding and enhance performance on NLU tasks. Additionally, studies \cite{radford2021learning, jia2021scaling, yuan2021florence} have shown that using both language and vision as inputs can adapt pre-trained models to a wider range of downstream tasks, including image-text retrieval and zero-shot image classification. \cite{cui2020unsupervised} proposes using pre-trained models with vision and language as inputs for downstream pure-text tasks to supervise language with vision, although it lacks fusion information and the ability to distinguish difficult samples. In another study \cite{zhang2022mcse}, a dropout-mask method for data augmentation is employed to effectively identify indistinguishable positive and negative samples, but it does not consider image-text alignment. In contrast, our approach aims to alleviate these problems by aligning different modalities and pulling positive samples closer while pushing negative samples apart, as illustrated in Fig. \ref{Fig.main}.

In this work, we present a visually-supervised pre-trained multimodal model, which incorporates both text and image inputs during pre-training and targets plain-text tasks downstream. The main objective of the model is to force the text representation to be close to its corresponding image and to align each token with a homologous patch in the image. To achieve this, the model employs global and local image-text contrastive learning, a text swapping task, an image-text matching task and data-based contrastive learning. To improve the discriminative ability of the text encoder in identifying positive samples from highly similar negative samples, we apply an augmented-data contrastive learning method. Rather than performing real data augmentation, we input the same data into different encoders to obtain highly similar but not identical features. The contributions of this study are summarized as follows: (i) we propose a visually-supervised pre-trained multimodal model, termed ViPMM, which outperforms single-modality pre-trained models and multimodal pre-trained models on seven pure-text tasks and a sentiment classification task; (ii) we employ a data-augmented contrastive learning method that uses both a dynamically updated momentum encoder and tucker fusion to learn effective features, enhancing the discriminative ability of the text encoder in identifying positive samples from highly similar negative samples; (iii) we propose a text swapping task, which exchanges the text in an image-text pair with another text with a certain probability to compute a bidirectional hinge similarity loss; (iv) we maximize the relevance score of aligned token-image pairs over unaligned pairs, thus optimizing the model parameters to align tokens with their corresponding image patches. We perform the pre-training on the image-text dataset Flickr30k. To verify the effectiveness of our method, we choose seven natural language understanding tasks and one multimodal sentiment classification task as downstream tasks for unsupervised evaluation. Experimental results demonstrate the effectiveness of our approach, which achieves competitive performance on all tasks.

\section{Related Work}

\subsection{Language Pre-training}
In the field of natural language processing, language pre-trained models have been widely studied as a means of improving sentence representation learning. These models can be broadly classified into two categories: supervised \cite{wieting2020bilingual, reimers2019sentence, cer2018universal, conneau2017supervised} and unsupervised \cite{gao2021simcse, kenton2019bert, liu2019roberta, yang2019xlnet} approaches. While supervised methods rely on labeled natural language data, unsupervised methods leverage intrinsic semantic information to learn sentence representations. Recently, self-supervised pre-trained language models have garnered much attention. Such models rely on various proxy tasks to bring positive samples closer and push negative samples apart. For instance, Bert \cite{kenton2019bert} performs masked word prediction and next sentence prediction tasks. Other work, such as \cite{liu2021self}, categorizes self-supervised learning into contrastive, generative, and generative-contrastive methods. Meanwhile, \cite{gao2021simcse} proposes the use of a dropout mask to augment data without real augmentation, thereby achieving a contrastive objective. However, it is worth noting that these approaches only consider language, without taking into account the potential auxiliary effects of other modalities on language.

\subsection{Visually-supervised Pre-training}
\begin{figure*}
\centering
\includegraphics[scale=0.70]{image/glo.pdf}
\caption{Detailed illustration of our model architecture. All features are involved in the multimodal proxy tasks module.}
\label{global}
\end{figure*}
In the realm of representation learning, multimodality has emerged as a significant trend. In the context of natural language self-supervised learning, sentence representations can be effectively learned with supervision from other modalities, including visual and audio information. To this end, various novel approaches have been proposed. \cite{tan2020vokenization} introduces the concept of “voken”, which is akin to a token, and utilizes visually-supervised methods to learn sentence representations. \cite{radford2021learning} proposes image-text contrastive learning, while \cite{zhang2022mcse} builds upon the work of \cite{gao2021simcse} and designs a multimodal contrastive objective. \cite{cui2020unsupervised}, on the other hand, puts forward a decoupled text-image encoder, which encodes text and image separately while ensuring full interaction between the two modalities through local mutual information maximization. Lastly, \cite{bordes2019incorporating} embeds text and image into a common space and employs a technique that pulls matched pairs closer and pushes mismatched pairs apart. However, while these methods offer promising results, they either fail to align language with vision or do not fully leverage the complementary information between the modalities, and thus struggle to distinguish positive samples from highly similar negative samples.

\subsection{Momentum Encoder}

Various pioneering approaches have been proposed to address the problem of difficult samples in contrastive learning. \cite{he2020momentum} puts forward a technique called the momentum encoder, which manually updates the encoder's parameters and encodes augmented data as negative samples. Moreover, \cite{li2021align} improves learning under noisy supervision by employing momentum distillation with a momentum encoder. In our approach, we introduce a dynamic parameter-update mechanism that adjusts the weight of parameter updates based on the similarity between the data and the augmented data. This makes the augmented features more similar to the original features and enhances the model's discriminative ability.

\section{The Proposed Approach}

In this section, we present our proposed approach and the multimodal proxy tasks we designed. Section 3.1 describes the use of Image-Text Contrastive learning (ITC) to maximize global mutual information (MI). Section 3.2 introduces a local MI maximization method for local contrastive learning. Section 3.3 elaborates on Text Swapping (TS) and Image-Text Matching (ITM) as proxy tasks. In Section 3.4, we discuss the utilization of augmented data contrastive learning. We compute the final loss in Section 3.5. The model architecture is depicted in Fig. \ref{global}.

\subsection{Image-Text Contrastive Learning}
In this section, we introduce the InfoNCE loss \cite{van2018representation} for image-text contrastive learning from the perspective of mutual information maximization.
Mutual information has become a popular way to measure the correspondence between modalities. In cross-modal self-supervised learning (SSL) \cite{hjelmlearning}, the intuition is that the higher the mutual information between text and image, the better they match. Multimodal representation learning therefore aims to maximize the mutual information $ {\mathcal I(X,Y)} $ between one modality $X$ and another modality $Y$,

\begin{equation}\label{1}
{\mathcal I(X,Y)={\sum\limits}_{y\in Y}{\sum\limits}_{x\in X}P(x,y)\log \frac {P(x|y)}{P(x)}},
\end{equation}
where $x$ and $y$ are modality features from $X$ and $Y$, respectively.

From Eqn. \ref{1}, we see that if $x$ and $y$ are incompatible with each other, $ \frac{P(x \mid y)}{P(x)}$ is 0. Hence, $ \frac{P(x \mid y)}{P(x)}$ is proportional to the similarity of $x$ and $y$. The term “$glo$” denotes the representation learning process that pertains to an entire text or image, as a means of distinguishing it from local structures. Because computing $ \frac{P(x \mid y)}{P(x)}$ directly is hard, we use the function $ \phi_{glo}(x,y) $ to model it in Eqn. \ref{1},

\begin{equation}\label{2}
\phi_{glo}(x,y)\propto\frac{P(x|y)}{P(x)},
\end{equation}
where $\phi_{glo}(x,y)$ is an unnormalized measure of similarity.

In Eqn. \ref{2}, we propose a proportionality between $\phi_{glo}(x,y)$ and $ \frac{P(x \mid y)}{P(x)}$. We aim to model $ \frac{P(x \mid y)}{P(x)}$ using cross-modal similarity, wherein the text and image encoders are utilized to encode $x$ and $y$ into $\mathcal X$ and $\mathcal Y$. The encoded representations are then normalized through an L2-normalization layer, resulting in $\mathcal X_{norm}$ and $\mathcal Y_{norm}$. We adopt the approach proposed in \cite{misra2020self} to compute $\phi_{glo}(x,y)$ as the exponential of the cosine similarity,
\begin{equation}\label{5}
\begin{split}
\phi_{glo}(x,y)&=d(\frac{\mathcal X}{||\mathcal X||_2},\frac{\mathcal Y}{||\mathcal Y||_2})\\
&=\exp (\frac{\text{cosine}(\mathcal X_{norm},\mathcal Y_{norm})}{\tau_{\phi}}),
\end{split}
\end{equation}
where $\tau_{\phi}$ is a temperature hyper-parameter.

To maximize mutual information, we adopt InfoNCE, based on Noise-Contrastive Estimation (NCE) \cite{van2018representation}, with $\phi_{glo}(x,y)$ as the score function, which is defined as,
\begin{equation}\label{3}
\begin{split}
\mathcal L^\text{NCE}_{P(y|x)}(x,y;\phi_{glo})=-\mathbb{E}_{f(x,y)}\log(\\
\frac{\phi_{glo}(x, y)}{{\sum\limits}_{y^{'} \sim P(y)}\phi_{glo}(x, y^{'})}),
\end{split}
\end{equation}
where $f(x,y)$ denotes [$x,y \sim P(y|x)P(x)$].
Here, we define $P(x)$ as the real distribution of $x$, $P(y|x)P(x)$ as the distribution of $y$ given $x$, and $P(y)$ as the distribution of negative samples of $y$. Eqn. \ref{3} describes identifying the positive image $y \sim P(y|x)$ for a given $x$ among negative images.

The crux of maximizing the mutual information is to maximize its lower bound instead,
\begin{equation}
{\mathcal I(X,Y)}\geq \log N^{'}-\mathcal L^\text{NCE}(x,y;\phi),
\end{equation}
where $N^{'}$ is the number of negative samples. According to \cite{van2018representation}, minimizing $\mathcal L^\text{NCE}$ is equivalent to maximizing the lower bound of the mutual information.

Symmetrically, we also need to identify positive texts $x \sim P(x|y)$ given positive images $y$ from negative texts $x^{'} \sim P(x)$,
\begin{equation}\label{4}
\begin{split}
\mathcal L^\text{NCE}_{P(x|y)}(x,y;\phi_{glo})\!\!=\!\!-\mathbb{E}_{f(y,x)}\log(\\
\frac{\phi_{glo}(x, y)}{{\sum\limits}_{x^{'} \sim P(x)}\phi_{glo}(x^{'}, y)}),
\end{split}
\end{equation}
where $f(y,x)$ denotes [$x, y\sim P(x|y)P(y)$].

Combining Eqn. \ref{3} and Eqn. \ref{4}, we obtain the image-text contrastive loss for global mutual information maximization,
\begin{equation}\label{6}
\mathcal L_{itc}\!=\!\mathcal L^\text{NCE}_{P(y|x)}(x,y;\phi_{glo})+\mathcal L^\text{NCE}_{P(x|y)}(x,y;\phi_{glo}).
\end{equation}

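For concreteness, a minimal PyTorch-style sketch of the symmetric in-batch form of Eqn. \ref{6} is given below (illustrative only, not the released implementation; the exponential of Eqn. \ref{5} is folded into the softmax inside the cross-entropy):
\begin{verbatim}
# Illustrative sketch: symmetric in-batch image-text InfoNCE.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(text_feat, image_feat, tau=0.1):
    # text_feat, image_feat: (N, D) global features X, Y
    x = F.normalize(text_feat, dim=-1)   # X_norm
    y = F.normalize(image_feat, dim=-1)  # Y_norm
    logits = x @ y.t() / tau             # cosine similarity / tau
    labels = torch.arange(x.size(0), device=x.device)
    # L_NCE(y|x) + L_NCE(x|y); other batch items act as negatives
    return F.cross_entropy(logits, labels) + \
           F.cross_entropy(logits.t(), labels)
\end{verbatim}
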
\subsection{Local Image-Text Contrastive Learning}
In Section 3.1, we elaborated the definition of $\phi_{glo}$; similarly, in this section we introduce $\phi_{loc}$, which deals with local structures. In addition, we propose a technique that maximizes local mutual information.

The misalignment between image and text represents a substantial challenge in representation learning. To address this challenge, we leverage the attention mechanism to align modalities and foster interaction among local structures, as depicted in Fig. \ref{local}.

For a sentence comprising $V$ words $(s^{(1)}, ..., s^{(V)})$, we use $(\mathcal S^{(1)}, ..., \mathcal S^{(V)})$ to represent the local features encoded by the text encoder. Similarly, for an image consisting of $ M \times M$ patches $(p^{(1)}, ..., p^{(M^{2})})$, the local features are represented as $(\mathcal P^{(1)}, ..., \mathcal P^{(M^{2})})$. By computing an attention map between the local features of text and image, we aim to achieve inter-modal alignment,
\begin{figure*}[!htb]
\centering
\subfigure[MI maximization for local structures. $V$ denotes the length of the sentence. $M\times M$ denotes the number of image patches.]{
\includegraphics[width=0.64\linewidth]{image/local.pdf}
\label{local}
}
\quad
\subfigure[The structure of tucker fusion.]{
\includegraphics[width=0.31\linewidth]{image/tucker_fusion.pdf}
\label{tucker}
}
\caption{The structure of patch-word alignment and tucker fusion.}
\vspace{0.2in}
\end{figure*}
\begin{subequations}
\begin{align}
attn_{i,j}=\frac{\exp({\mathcal S^{(i)}}^T\cdot \mathcal P^{(j)})}{{\sum\limits}_k\exp({\mathcal S^{(i)}}^T\cdot \mathcal P^{(k)})},\\
attn^{'}_{i,j}=\frac{\exp({\mathcal S^{(i)}}^T\cdot \mathcal P^{(j)})}{{\sum\limits}_k\exp({\mathcal S^{(k)}}^T\cdot \mathcal P^{(j)})},
\end{align}
\end{subequations}
where $k$ is the index over which the softmax normalizes (patches in the first case, words in the second). $attn_{i,j}$ denotes the attention of the $i$-th word to the $j$-th patch and $attn^{'}_{i,j}$ denotes the attention of the $j$-th patch to the $i$-th word.

According to the attention maps above, we assign weights to the features,

\begin{subequations}
\begin{align}
\mathcal S^{'}=\frac{\exp(attn_{i,j}/\tau_{1})}{{\sum\limits}_k \exp(attn_{k,j}/\tau_{1})}\mathcal S,\\
\mathcal P^{'}=\frac{\exp(attn^{'}_{i,j}/\tau_{1})}{{\sum\limits}_k \exp(attn^{'}_{i,k}/\tau_{1})}\mathcal P,
\end{align}
\end{subequations}
where $\tau_1$ denotes the temperature hyper-parameter. Then we compute the alignment score by Eqn. \ref{5},
\begin{subequations}
\begin{align}
\phi_{loc}(s,p^{(j)})=d(\mathcal {P^{'}}^{(j)}, \mathcal S^{'}),\\
\phi_{loc}(s^{(i)},p)=d(\mathcal {S^{'}}^{(i)}, \mathcal P^{'}).
\end{align}
\end{subequations}

Combining $\phi_{loc}(s,p^{(j)})$ and $\phi_{loc}(s^{(i)},p)$, we obtain the $\phi_{loc}(s,p)$ function,
\begin{equation}
\begin{split}
\phi_{loc}(s,p)=&\sum\limits_{i=1}^{V}\phi_{loc}(s^{(i)},p)\\
&+\sum\limits_{j=1}^{M^2}\phi_{loc}(s,p^{(j)}).
\end{split}
\end{equation}

Following Eqn. \ref{3}, Eqn. \ref{4} and Eqn. \ref{6}, we compute the local image-text contrastive loss $\mathcal L_{loc}$ for local mutual information maximization,
\begin{equation}
\mathcal L_{loc}\!=\!\mathcal L^\text{NCE}_{P(p|s)}(s,p;\phi_{loc})+\mathcal L^\text{NCE}_{P(s|p)}(s,p;\phi_{loc}).
\end{equation}

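To make the alignment step concrete, the following sketch (ours, for illustration; the exact broadcasting of the weighting step is left implicit in the text, so this shows one plausible reading) computes the two attention maps and the attention-weighted local features:
\begin{verbatim}
# Illustrative sketch: word-patch attention maps and
# attention-weighted local features (one plausible reading).
import torch
import torch.nn.functional as F

def local_alignment(S, P, tau1=0.1):
    # S: (V, D) word features, P: (M*M, D) patch features
    sim = S @ P.t()                        # (V, M*M) dot products
    attn = F.softmax(sim, dim=1)           # word i over patches k
    attn_p = F.softmax(sim, dim=0)         # patch j over words k
    # temperature-sharpened weights, then weighted local features
    w_s = F.softmax(attn / tau1, dim=0)    # normalize over words
    S_prime = w_s.t() @ S                  # (M*M, D) text per patch
    w_p = F.softmax(attn_p / tau1, dim=1)  # normalize over patches
    P_prime = w_p @ P                      # (V, D) image per word
    return S_prime, P_prime
\end{verbatim}
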
\subsection{Image-Text Matching and Text Swapping}
With the aim of minimizing the semantic gap between vision and language in feature space, we present two tasks for their alignment: (1) random text swapping with a predetermined probability, followed by an assessment of the model's discriminative ability, and (2) computation of relevance scores between each token and the image, treating the sentence as context, which tightens the text-image relationship at a finer granularity.\\
\textbf{Image-Text Matching}.
Inspired by \cite{tan2020vokenization}, we propose a proxy task for aligning images with their corresponding textual content at the token level. To this end, we first tokenize the text and subsequently compute the relevance score between each token and the corresponding image. Because image and text features are mapped into a shared space, the relevance score is modeled as an inner product between the global image feature representation $\mathcal J_y$ and the local text feature representation $\mathcal J_s$,
\begin{equation}\label{7}
r(\mathcal S,\mathcal Y)=\mathcal J_y^T\cdot \mathcal J_s.
\end{equation}

Before computing the relevance score, we apply a multi-layer perceptron (MLP) $\text{MLP}_t$ to down-project the local text feature $\mathcal S$, followed by an L2-normalization layer. Similarly, for the image feature $\mathcal Y$ we perform the same operation with $\text{MLP}_v$,
\begin{subequations}
\begin{align}
\mathcal J_s=\frac{\text{MLP}_t(\mathcal S)}{||\text{MLP}_t(\mathcal S)||_2},\\
\mathcal J_y=\frac{\text{MLP}_v(\mathcal Y)}{||\text{MLP}_v(\mathcal Y)||_2}.
\end{align}
\end{subequations}

According to Eqn. \ref{7}, we calculate the relevance score $r(\mathcal S,\mathcal Y)$ for a positive token-image pair. Subsequently, we randomly select another image $\mathcal Y^{'}$ and obtain its corresponding relevance score $r(\mathcal S,\mathcal Y^{'})$. To ensure that the difference between $r(\mathcal S,\mathcal Y)$ and $r(\mathcal S,\mathcal Y^{'})$ is greater than a pre-specified margin $M$, we use the hinge loss to compute the image-text matching loss,
\begin{equation}
\mathcal L_{itm}=\max(0, M-r(\mathcal S,\mathcal Y)+r(\mathcal S,\mathcal Y^{'})).
\end{equation}
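
As an illustration (not the released code; the margin value and the MLP shapes are placeholders), the token-level relevance score and the hinge objective above can be written as:
\begin{verbatim}
# Illustrative sketch: token-image relevance score and the
# hinge-based image-text matching loss. Margin is a placeholder.
import torch
import torch.nn.functional as F

def itm_hinge_loss(S, Y, Y_neg, mlp_t, mlp_v, margin=0.5):
    # S: (V, D) token features; Y, Y_neg: (D,) image features
    Js = F.normalize(mlp_t(S), dim=-1)         # (V, d)
    Jy = F.normalize(mlp_v(Y), dim=-1)         # (d,)
    Jy_neg = F.normalize(mlp_v(Y_neg), dim=-1)
    r_pos = Js @ Jy                            # relevance per token
    r_neg = Js @ Jy_neg
    # the matched image should beat the random image by the margin
    return torch.clamp(margin - r_pos + r_neg, min=0.0).mean()
\end{verbatim}
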
\textbf{Text Swapping}. We present an additional objective for attaining image-text matching, which we refer to as “Text Swapping”. It randomly substitutes the text of a pair with a predefined probability and employs a bidirectional similarity hinge loss to penalize the model for inadequate discrimination.
Initially, we use an L2-normalization layer to normalize both the image feature $\mathcal Y$ and the text feature $\mathcal X$,
\begin{subequations}
\begin{align}
\mathcal X_{norm}=\frac{\mathcal X}{||\mathcal X||_2},\\
\mathcal Y_{norm}=\frac{\mathcal Y}{||\mathcal Y||_2}.
\end{align}
\end{subequations}

We then fuse features from different modalities using cross-modal attention,
\begin{subequations}
\begin{align}
&attn_{x2y}=\frac{\mathcal Y_{norm}\cdot \mathcal X_{norm}^T}{\sqrt{d}},\\
\mathcal F_x&=\text{Softmax}(attn_{x2y})\cdot \mathcal X_{norm},\\
&attn_{y2x}=\frac{\mathcal X_{norm}\cdot \mathcal Y_{norm}^T}{\sqrt{d}},\\
\mathcal F_y&=\text{Softmax}(attn_{y2x})\cdot \mathcal Y_{norm},
\end{align}
\end{subequations}
where $d$ is the feature dimension of $\mathcal X_{norm}$ and $\mathcal Y_{norm}$.

To measure the similarity between text and image, we adopt a two-way linear similarity combination,
\begin{subequations}
\begin{align}
g=(\mathcal X_{norm})^T\mathcal Y_{norm}+\alpha\cdot (\mathcal F_x)^T\mathcal F_y, \label{8}\\
g^{'}=(\mathcal X^{'}_{norm})^T\mathcal Y_{norm}+\alpha\cdot (\mathcal F^{'}_x)^T\mathcal F_y \label{9},
\end{align}
\end{subequations}
where $\mathcal X^{'}_{norm}$ is a normalized textual feature that does not align with the image feature $\mathcal Y$, and $\mathcal F^{'}_x$ denotes the fused feature of $\mathcal X^{'}_{norm}$ and $\mathcal Y$. The scalar $\alpha$ is a constant that balances the two terms.

Using Eqn. \ref{8} and Eqn. \ref{9}, we compute the text swapping loss,
\begin{equation}
\mathcal L_{ts}=\max(0, M-g+g^{'}).
\end{equation}

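A minimal sketch of this objective (ours, for illustration; $\alpha$, the swap probability and the margin are placeholder values not reported in this section) is:
\begin{verbatim}
# Illustrative sketch: two-way similarity and text-swapping hinge
# loss. alpha, p_swap and margin are placeholder values.
import random
import torch

def two_way_score(x, y, fx, fy, alpha=0.5):
    # x, y: L2-normalized global text/image features (D,)
    # fx, fy: their cross-attended (fused) counterparts (D,)
    return (x * y).sum() + alpha * (fx * fy).sum()

def maybe_swap(texts, p_swap=0.15):
    # with probability p_swap, pair an image with another text
    swapped = list(texts)
    for i in range(len(swapped)):
        if random.random() < p_swap:
            swapped[i] = texts[random.randrange(len(texts))]
    return swapped

def text_swapping_loss(g, g_swapped, margin=0.5):
    # the matched pair should beat the swapped pair by the margin
    return torch.clamp(margin - g + g_swapped, min=0.0)
\end{verbatim}
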
\subsection{Augmented Data Contrastive Learning}
To focus on effective and complementary features, we apply augmented data contrastive learning, inspired by \cite{li2022clmlf}.

In contrast to previous approaches, we focus on proxy tasks that center on text, since our downstream tasks involve plain text. To augment the data, we adopt a momentum-encoder approach inspired by \cite{gao2021simcse, he2020momentum} rather than conventional methods such as back translation. Specifically, we encode the original data with a momentum encoder to generate augmented data. To handle negative samples, we use a first-in-first-out queue. For image data, we apply random transformations and encode them with a momentum encoder, and we likewise maintain a randomly initialized queue to store negative samples. To take advantage of the complementary information across modalities, we perform modal fusion through a tensor approach that keeps the information loss small.\\
\textbf{Data-based Contrastive Learning by Momentum Encoder}.
Rather than using conventional data augmentation methods, we adopt an approach similar to the dropout mask, inspired by \cite{gao2021simcse}. We construct two momentum encoders initialized by our image encoder and text encoder, respectively. The parameters of these momentum encoders are not affected by the back-propagation of gradients; instead, they are manually combined linearly with the corresponding modality encoder's parameters, using either a static or a dynamic update. For static updates, we predefine a momentum $m$; for dynamic updates, we use the similarity between the data and the augmented data as the momentum. To avoid the time-consuming nature of back-translation, we input the same data into both the modality encoder and the corresponding momentum encoder, whose parameters differ, resulting in similar but distinct representations of the data.

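The parameter update of the momentum encoders can be sketched as follows (illustrative only; it follows the update rule of Algorithm~\ref{alg}, where $m$ weights the online encoder's parameters, and the dynamic variant sets $m$ to the batch-averaged cosine similarity described in the implementation details):
\begin{verbatim}
# Illustrative sketch: static / dynamic momentum update of the
# momentum encoder f_m from the online encoder f_o.
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_momentum_encoder(f_o, f_m, feats=None, feats_aug=None,
                            m=0.999):
    if feats is not None and feats_aug is not None:
        # dynamic momentum: average cosine similarity between the
        # original and the augmented features over the batch
        m = F.cosine_similarity(feats, feats_aug,
                                dim=-1).mean().item()
    for p_o, p_m in zip(f_o.parameters(), f_m.parameters()):
        # f_m.param = m * f_o.param + (1 - m) * f_m.param
        p_m.data.mul_(1.0 - m).add_(p_o.data, alpha=m)
\end{verbatim}
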
Drawing inspiration from \cite{he2020momentum}, we reduce memory usage by creating two first-in-first-out queues, one for text and one for image. These queues are randomly initialized and store negative samples. The features produced by the momentum encoder serve as negative samples and are enqueued at the tail of the queue, while the earliest batch of samples is dequeued from the head. Algorithm~\ref{alg} provides a detailed overview of this process. We compute the text and augmented-image contrastive loss $\mathcal L_{t2v}$ and the image and augmented-text contrastive loss $\mathcal L_{v2t}$ according to Algorithm~\ref{alg}.
\begin{algorithm}
% \begin{spacing}{0.8}
\caption{Augmented data contrastive learning} \label{alg}
\begin{algorithmic}
\STATE \textcolor[RGB]{0,90,20}{\# f\_o, f\_m: original and momentum encoder }
\STATE \textcolor[RGB]{0,90,20}{\# q: a queue storing negative samples }
\STATE \textcolor[RGB]{0,90,20}{\# m, $\tau$: momentum and temperature }
\STATE \textcolor[RGB]{0,90,20}{\# k: size of queue}
\STATE f\_m.param = f\_o.param \quad \textcolor[RGB]{0,90,20}{\# initialize}
\STATE $f$ = f\_o($x$) \quad \textcolor[RGB]{0,90,20}{\# $f$: encoded feature, $x$: data}
\STATE $f_{aug}$ = f\_m($x$) \quad \textcolor[RGB]{0,90,20}{\# $f_{aug}$: augmented feature}
\STATE l\_pos = $f \times f_{aug}$ \quad \textcolor[RGB]{0,90,20}{\# positive logits (dx1)}
\STATE l\_neg = $f \times$ q \quad \textcolor[RGB]{0,90,20}{\# negative logits (dxk)}
\STATE logits = cat([l\_pos, l\_neg], dim=-1) \quad \textcolor[RGB]{0,90,20}{\# dx(k+1)}
\STATE label = zeros(d) \quad \textcolor[RGB]{0,90,20}{\# positives are the 0-th}
\STATE loss = CrossEntropy(logits/$\tau$, label) \quad \textcolor[RGB]{0,90,20}{\# contrastive loss}
\STATE loss.backward() \quad \textcolor[RGB]{0,90,20}{\# backward}
\STATE \textcolor[RGB]{0,90,20}{\# parameter updating}
\STATE f\_m.param = m*f\_o.param+(1-m)*f\_m.param
\STATE enqueue(q, $f_{aug}$) \quad \textcolor[RGB]{0,90,20}{\# enqueue current batch of keys}
\STATE dequeue(q) \quad \textcolor[RGB]{0,90,20}{\# dequeue from head}
\end{algorithmic}
% \end{spacing}
\end{algorithm}\\
\textbf{Data-based Contrastive Learning by Tucker Fusion}. We also present a fused-data contrastive objective. Modal fusion has the following advantages: (1) it is more robust than a single modality, (2) the modal information is complementary, and (3) it can still operate when some modal information is missing. Inspired by \cite{ben2017mutan}, we apply Multimodal Tucker Fusion, shown in Fig. \ref{tucker}, to merge text and image for further contrastive learning,
\begin{equation}
{\mathcal T}=(({\mathcal T_c}\times_{1} \mathcal W_e)\times_{2}\mathcal W_{u})\times_3\mathcal W_{o}, \label{tk}
\end{equation}
where ${\mathcal T_c}$ is the core tensor, and $\mathcal W_e$, $\mathcal W_u$ and $\mathcal W_o$ are factor matrices.

Bilinear models, as described in \cite{fukui2016multimodal, kim2016hadamard}, offer a solution to the fusion problem by encoding bilinear interactions between any two feature vectors $z$ and $h$,
\begin{equation}
\omega=(({\mathcal T_c}\times_{1}z)\times_{2}h).
\end{equation}

Following Eqn. \ref{tk}, we treat $\mathcal W_o$ as a trainable mapping matrix that maps the fused features to a fixed-dimensional space,
\begin{subequations}
\begin{align}
&\mathcal Q=(({\mathcal T_c}\times_{1}{\mathcal X})\times_{2}{\mathcal Y})\times_{3}\mathcal W_o,\\
\mathcal Q_{aug}&=(({\mathcal T_c}\times_{1}{\mathcal X_{aug}})\times_{2}{\mathcal Y_{aug}})\times_{3}\mathcal W_o,
\end{align}
\end{subequations}
where ${\mathcal X_{aug}}$ and ${\mathcal Y_{aug}}$ are the augmented features, $\mathcal Q$ is the fused feature, and $\mathcal Q_{aug}$ denotes the augmented fused feature.

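For illustration, a Tucker-style fusion module consistent with Eqn. \ref{tk} can be sketched as follows (ours, not the released code; the ranks and dimensions are placeholders):
\begin{verbatim}
# Illustrative sketch: Tucker fusion of text and image features.
# Ranks (r_t, r_i, r_o) and dimensions are placeholder values.
import torch
import torch.nn as nn

class TuckerFusion(nn.Module):
    def __init__(self, d_text=768, d_img=768,
                 r_t=160, r_i=160, r_o=160, d_out=128):
        super().__init__()
        self.W_e = nn.Linear(d_text, r_t)   # text factor matrix
        self.W_u = nn.Linear(d_img, r_i)    # image factor matrix
        self.core = nn.Parameter(torch.randn(r_t, r_i, r_o) * 0.01)
        self.W_o = nn.Linear(r_o, d_out)    # output factor matrix

    def forward(self, x, y):
        # x: (N, d_text) text features, y: (N, d_img) image features
        xt = self.W_e(x)                    # mode-1 projection
        yt = self.W_u(y)                    # mode-2 projection
        q = torch.einsum('nt,tio,ni->no', xt, self.core, yt)
        return self.W_o(q)                  # fused feature Q
\end{verbatim}
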
Then, we compute the contrastive loss via cross-entropy,
\begin{equation}
\mathcal L_q \!=\!\text{CrossEntropy}(\mathcal Q\cdot{(\mathcal Q_{aug})}^T/\tau_2,label),
\end{equation}
where $\tau_2$ is a temperature hyper-parameter, $label \in [0, N-1]$, and $N$ is the batch size. The final loss $\mathcal L_{adcl}$ is computed as,
\begin{equation}
{\mathcal L_{adcl}}={\mathcal L_{t2v}} + {\mathcal L_{v2t}} + {\mathcal L_q}.
\end{equation}
\subsection{Model Training}
All the aforementioned losses are summed to form the final loss,
\begin{equation}
\mathcal L = \beta\mathcal L_{itc}+\gamma\mathcal L_{itm}+\eta\mathcal L_{ts}+\mu\mathcal L_{adcl} + \sigma\mathcal L_{loc},
\end{equation}
where $\beta$, $\gamma$, $\eta$, $\mu$ and $\sigma$ are hyper-parameters that balance the loss terms.

\section{Experimental Settings}
\subsection{Downstream Tasks}
For downstream tasks, we choose seven natural language understanding tasks (six drawn from GLUE, plus SNLI) for evaluation as follows:\\
\textbf{MRPC} (The Microsoft Research Paraphrase Corpus), a similarity and paraphrase task, is a corpus of sentence pairs automatically extracted from online news sources and manually annotated for whether the sentences in each pair are semantically equivalent.\\
\textbf{QQP} (The Quora Question Pairs), a similarity and paraphrase task, is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent.\\
\textbf{MNLI} (The Multi-Genre Natural Language Inference Corpus), a natural language inference task: given a premise and a hypothesis, the task is to predict whether the premise entails the hypothesis (entailment), contradicts it (contradiction), or neither (neutral). MNLI-mm is MNLI with the mismatched validation set.\\
\textbf{QNLI} (Question-answering NLI), a natural language inference task: the goal is to judge whether a sentence contains the answer to (i.e., entails) the paired question, a binary classification.\\
\textbf{RTE} (The Recognizing Textual Entailment datasets), a natural language inference task: given a sentence pair, the task is to judge whether one sentence entails the other, a binary classification.\\
\textbf{SNLI} (The Stanford Natural Language Inference) is a collection of human-written English sentence pairs manually labeled as entailment, contradiction, or neutral. The statistics of these datasets are shown in Table \ref{data}.\\
\begin{table}
\centering
\setlength{\tabcolsep}{1mm}{
\resizebox{\linewidth}{!}{
\begin{tabular}{ccccc}
\toprule
Dataset & Type & Train & Dev & Test\\
\midrule
Flickr30k & Text2Image & 111,240 & 47,675 & -\\
RTE & NLI & 2,491 & 277 & 3,000\\
SNLI & NLI & 550,152 & 10,000 & 10,000\\
MRPC & NLI & 3,668 & 408 & 1,725\\
MNLI & NLI & 392,702 & 9,815 & 9,796\\
MNLI-mm & NLI & 392,702 & 9,832 & 9,847\\
QQP & QA & 363,870 & 40,431 & 390,965\\
QNLI & QA & 104,743 & 5,463 & 5,461\\
\bottomrule
\end{tabular}}
}
\caption{Statistics of datasets.}
\label{data}
\end{table}
In addition, we conduct multimodal sentiment classification on the MOSEI dataset with one or two modalities. This dataset includes 23,453 utterance videos and encompasses two distinct tasks: sentiment analysis and emotion recognition.

\subsection{Implementation Details}

Our experiments run on a machine with an NVIDIA GeForce RTX 3090 GPU (24 GB VRAM). For text encoding, we leverage the \textit{bert-base-uncased} model from the Hugging Face Transformers library, while we use \textit{ResNet50} as the image encoder. The local text feature $\mathcal S^{(i)}$ corresponds to the feature vector of the $i$-th word, whereas the local image feature $\mathcal P^{(i)}$ corresponds to the $i$-th patch, i.e., the feature before the final pooling layer. To ensure that both the text and image modalities are embedded into the same feature space, we uniformly map the feature dimension to 768. We employ separate projection heads for text and image, each consisting of two MLP layers with a Tanh activation in the middle. This structure maps features to a shared 128-dimensional feature space, which is normalized prior to the computation of the multimodal objectives. After tucker fusion, the fused feature dimension is 128.
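
A minimal sketch of such a projection head (ours, for illustration; the hidden width is an assumption) is:
\begin{verbatim}
# Illustrative sketch: projection head with two linear layers and a
# Tanh in between, followed by L2 normalization. Hidden width is a
# placeholder.
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    def __init__(self, d_in=768, d_hidden=768, d_out=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.Tanh(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)
\end{verbatim}
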
Moreover, we utilize a momentum encoder to encode the same data for data augmentation. This encoder is initialized from the original data encoder; its momentum $m$ is set dynamically, and the queue size $K$ is 65536. Specifically, $m$ is set to the average similarity between the text and the augmented text over the batch,
\begin{equation}
m = \frac{1}{N}\sum_{i=1}^{N}\text{cosine}(\mathcal X_i, \mathcal X_i^{aug}).
\end{equation}

\subsection{Hyper-parameter Settings}

In this section, we provide details of the hyper-parameters used in our approach. Specifically, we set the temperature of the energy function $\tau_{\phi}$ to 0.1; the weights of the objective are mainly set to 1, while $\mu$ is set to 0.04. We further explore the influence of the parameter $\mu$ on the effectiveness of the model. During self-supervised learning, we set the number of training epochs $n$ to 10 and the batch size $N$ to 128. For the learning rate ($lr$), we employ a warm-up strategy with an initial learning rate of 1e-4 and a warm-up ratio $wr$ of 0.1.

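For reference, the warm-up schedule can be sketched as follows (illustrative only; the paper specifies the warm-up ratio but not the decay, so the linear decay here is one common choice):
\begin{verbatim}
# Illustrative sketch: linear warm-up followed by linear decay.
# Only the warm-up ratio comes from the paper.
import torch

def warmup_scheduler(optimizer, total_steps, warmup_ratio=0.1):
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        return max(0.0, (total_steps - step)
                   / max(1, total_steps - warmup_steps))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
\end{verbatim}
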
\section{Experimental Results}
\subsection{Main Results}

In this section, we present a comparative evaluation of our proposed pre-trained model against mainstream pre-trained NLU models, as well as multimodal pre-trained models, across eight downstream tasks. We evaluate our proposed model on MOSEI using only language as input, and compare it with single-modality pre-trained models that use language and vision as inputs. Additionally, we perform experiments with language and speech as inputs for the multimodal pre-trained models. We simply concatenate the multimodal features and map them to predictions through a linear classification layer. Our results, presented in Table \ref{mosei}, reveal that our proposed model with only language outperforms single-modality pre-trained models with language and vision, and outperforms multimodal pre-trained models (e.g. MCSE, MACD and i-Code) under all conditions, thereby emphasizing the efficacy of our approach. Moreover, we investigate the performance of the aforementioned models on seven pure-language tasks without fine-tuning for unsupervised representation. As shown in Table \ref{un}, we employ both static and dynamic momentum updating, and both methods achieve competitive performance.
\begin{table}[!]
\centering
\resizebox{\linewidth}{!}{
\begin{tabular}{c|ccc|c}
\hline
Model&Language&Speech&Vision&Acc\\
\hline
Bert(NAACL'2019)&\checkmark& &\checkmark& 54.61\\
Roberta(2019)& \checkmark& &\checkmark& 55.19\\
XLNet(NIPS'2019)& \checkmark& &\checkmark& 53.23\\
i-Code(2022)$\mathcal y$ & \checkmark& &\checkmark& 49.0\\
\hline
Bert(NAACL'2019)&\checkmark& & & 53.18\\
Roberta(2019)& \checkmark& & & 55.19\\
XLNet(NIPS'2019)& \checkmark& & & 54.09\\
MACD(EMNLP'2020)& \checkmark& & & 55.7\\
MCSE(NAACL'2022)& \checkmark& & & 46.06\\
i-Code(2022)$\mathcal y$ & \checkmark& & & 46.3\\
ViPMM (Ours)& \checkmark& & & \textbf{55.86}\\
\hline
Bert(NAACL'2019)& \checkmark& \checkmark& & 49.0\\
Roberta(2019)& \checkmark& \checkmark& & 38.61\\
XLNet(NIPS'2019)& \checkmark& \checkmark& & 53.87\\
MACD(EMNLP'2020)& \checkmark& \checkmark& & 55.7\\
MCSE(NAACL'2022)& \checkmark& \checkmark& & 46.06\\
i-Code(2022)$\mathcal y$ & \checkmark& \checkmark& & 49.2\\
ViPMM (Ours)& \checkmark& \checkmark& & \textbf{57.75}\\
\hline
\end{tabular}}
\caption{Unsupervised multimodal sentiment classification results on MOSEI. $\mathcal y$: results from \cite{yang2022code}.}
\label{mosei}
\end{table}
\begin{table}[h]
\centering
\resizebox{\linewidth}{!}{
\begin{tabular}{cccccccc}
\toprule
Model & MNLI & MNLI-mm & QNLI & SNLI & QQP & MRPC & RTE \\
\midrule
Bert(NAACL'2019) & 35.9 & 36.6 & 49.5 & 34.4 & 47.0 & 67.5 & 52.7 \\
Roberta(NIPS'2019) & 35.4 & 35.2 & 49.4 & 33.7 & 36.8 & 67.4 & 52.7 \\
XLNet(NIPS'2019) & 35.7 & 36.0 & 49.7 & 34.0 & 60.3 & 67.4 & 53.4 \\
MACD(EMNLP'2020) & 41.6 & 41.3 & 58.6 & 50.69 & 67.0 & 65.1 & 53.1\\
MCSE(NAACL'2022) & 41.1 & 42.0 & 56.1 & 41.2 & 67.3 & 68.6 & 55.7\\
ViPMM(Ours, static) & \textbf{42.3} & 42.0 & \textbf{60.2} & \textbf{51.2} & 67.5 & \textbf{68.4} & 56.6\\
ViPMM(Ours, dynamic) & 42.0 & \textbf{42.1} & 59.2 & 50.3 & \textbf{68.7} & \textbf{68.4} & \textbf{65.0}\\
\bottomrule
\end{tabular}}
\caption{Unsupervised natural language understanding results.}
\label{un}
\end{table}

\begin{table}[h]
\centering
\resizebox{\linewidth}{!}{
\begin{tabular}{cccccccc}
\toprule
& MNLI & MNLI-mm & QNLI & SNLI & QQP & MRPC & RTE \\
\midrule
w/o ADCL & 35.5 & 35.3 & 49.5 & 33.7 & 38.7 & 67.5 & 55.6 \\
w/o tucker fusion & 42.0 & 42.0 & 57.7 & 50.8 & 64.8 & 68.4 & 53.1\\
w/o text swapping & 41.7 & \textbf{42.2} & 58.3 & 50.6 & 65.2 & 67.5 & 53.4 \\
w/o momentum encoder & 42.0 & 41.8 & 56.8 & 50.5 & 67.0 & 66.8 & 53.0 \\
w/o image-text matching & 41.8 & 41.8 & 58.9 & 50.6 & 67.0 & 67.5 & 52.7\\
w/o local MI maximization & 41.8 & 42.1 & 58.5 & 50.0 & 65.7 & 68.1 & 55.9\\
ViPMM (Full)& \textbf{42.3} & 42.0 & \textbf{60.2} & \textbf{51.2} & \textbf{67.5} & \textbf{68.6} & \textbf{56.6}\\
\bottomrule
\end{tabular}}
\caption{Ablation results of our ViPMM.}
\label{tab:example}
\end{table}
\begin{figure*}
\centering
\subfigure[Experimental results for the core tensor's rank.]{
\includegraphics[width=0.35\linewidth]{image/cl.pdf}
\label{rank}
}
\quad
\subfigure[Experimental results for the weight of augmented data contrastive learning.]{
\includegraphics[width=0.35\linewidth]{image/mu.pdf}
\label{cl}
}
\caption{Experimental results for hyper-parameters related to tucker fusion.}
\vspace{0.2in}
\end{figure*}

\subsection{Ablation Study}

To validate the effectiveness of ViPMM and our designed proxy tasks, we conduct ablation studies on the influence of the various components, including tucker fusion instead of cross-modal attention-based fusion, global and local image-text contrastive learning with the momentum encoder, text swapping, image-text matching, and augmented data contrastive learning (ADCL), on the performance of our model across seven tasks, as reported in Table \ref{tab:example}. Our results demonstrate that the full ViPMM outperforms the ablated variants on almost all tasks. ADCL plays a key role in improving our model's performance by enabling tucker fusion to fuse features with less missing information. Text swapping and image-text matching both contribute to reducing the distance between image and text. The momentum encoder with augmented data empowers the model to identify similar but different text-image pairs. Local image-text contrastive learning assists the model in aligning text and image, thereby preventing the learning of irrelevant information. Tucker fusion reduces the loss incurred by information fusion to some extent and successfully utilizes complementary information between modalities, achieving better performance than cross-modal attention fusion.

\subsection{Visualization}

To explore the impact of the tensor fusion method, we show the influence of the core tensor's rank and of the related hyper-parameter $\mu$ on the downstream tasks in Fig. \ref{rank} and Fig. \ref{cl}. As a further illustration of our approach, we apply t-SNE to cluster and visualize the features encoded by MACD and ViPMM, using the RTE task as a test case. Fig. \ref{tsne} displays the resulting cluster diagrams. From these diagrams, we observe that the distribution of the features encoded by our model tends to be circular, while that of MACD tends to be elliptical. This suggests that the features encoded by ViPMM as a pre-trained encoder are more isotropic, i.e., largely invariant to direction.
\begin{figure}[H]
\centering
\includegraphics[scale=0.32]{image/VPMTM.pdf}
\caption{t-SNE visualization on the RTE dataset encoded by ViPMM (left) and MACD (right).}
\label{tsne}
\end{figure}

\section{Conclusion}

We propose a novel multimodal training framework for learning sentence representations. Our approach includes both intra-modal and inter-modal paths with proxy tasks for the visual supervision of text. Our experiments demonstrate competitive performance on seven natural language understanding tasks and on multimodal sentiment classification without fine-tuning. Our dynamic momentum updating method improves the accuracy of discovering similarities and differences between modalities. We use local mutual information maximization to align text and image. The text swapping task improves the model's reliability, and augmented data contrastive learning enhances its discriminative ability. Our model can be directly used as a text encoder in any natural language understanding task.

%\begin{acks}
%To Robert, for the bagels and explaining CMYK and color spaces.
%\end{acks}

%%
%% The next two lines define the bibliography style to be used, and
%% the bibliography file.
\bibliographystyle{ACM-Reference-Format}
\bibliography{reference.bib}

\end{document}
\endinput
%%
%% End of file `sample-sigplan.tex'.