Computing a matrix of cosine similarities versus a matrix of transformer embeddings
It seems the authors of CLIP opted for a DSSM-like approach of embedding the image and the caption in parallel pathways and then taking dot products. At the same time, BERT's approach of concatenating the two token sequences and making the prediction from an additional [CLS]-style token concatenated to them (a cross-encoder) has been shown to achieve superior ranking quality.
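For concreteness, here is a minimal sketch of the dual-encoder idea, not CLIP's actual code: the two towers below are placeholder linear layers standing in for CLIP's vision and text transformers, and the loss shown is a symmetric InfoNCE-style contrastive loss over the cosine-similarity matrix.

```python
import torch
import torch.nn.functional as F

# Placeholder towers; in CLIP these are a ViT/ResNet and a text transformer.
image_encoder = torch.nn.Linear(2048, 512)
text_encoder = torch.nn.Linear(768, 512)

image_features = torch.randn(8, 2048)  # batch of pooled image features (assumed shapes)
text_features = torch.randn(8, 768)    # batch of pooled caption features

# Each modality is embedded independently, then L2-normalized so that
# dot products become cosine similarities.
img_emb = F.normalize(image_encoder(image_features), dim=-1)
txt_emb = F.normalize(text_encoder(text_features), dim=-1)

# Batch-by-batch similarity matrix: entry (i, j) is the cosine similarity
# between image i and caption j.
logits = img_emb @ txt_emb.t()

# Symmetric cross-entropy with the matching pairs on the diagonal.
labels = torch.arange(logits.size(0))
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```

The key property is that each image and each caption is encoded exactly once; all pairwise scores then come from a single matrix multiplication.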
Hence two questions:
- Are there any specific reasons why one would opt for the DSSM-like architecture, as the CLIP authors did, other than reduced compute?
- Are there any subsequent works that set up experiments training CLIP-like models with the BERT-like transformer approach of concatenating the embeddings for the image and the caption (roughly as sketched below)?
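To make the second question concrete, here is a hypothetical cross-encoder sketch, not taken from any published model: image patch tokens and caption tokens are concatenated into one sequence, a joint transformer fuses them, and a head predicts the match score from a [CLS]-like slot. It also illustrates the compute cost: ranking N images against M captions requires N×M joint forward passes, whereas the dual-encoder above embeds each item once.

```python
import torch

# Hypothetical single-tower cross-encoder over a concatenated sequence of
# image tokens and caption tokens (BERT-style early fusion).
d_model = 512
encoder_layer = torch.nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
joint_encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=4)
score_head = torch.nn.Linear(d_model, 1)

def match_score(image_tokens: torch.Tensor, caption_tokens: torch.Tensor) -> torch.Tensor:
    """Score one (image, caption) pair with a single joint forward pass."""
    cls = torch.zeros(image_tokens.size(0), 1, d_model)       # [CLS]-like slot
    joint = torch.cat([cls, image_tokens, caption_tokens], dim=1)
    fused = joint_encoder(joint)
    return score_head(fused[:, 0])                             # predict from the fused [CLS] position

# Toy inputs with assumed shapes: 3 images with 49 patch tokens, 3 captions with 16 tokens.
images = [torch.randn(1, 49, d_model) for _ in range(3)]
captions = [torch.randn(1, 16, d_model) for _ in range(3)]

# Every image-caption pair needs its own forward pass through the joint encoder.
scores = torch.tensor([[match_score(img, cap).item() for cap in captions] for img in images])
```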
Sources
Source: Stack Overflow (content licensed under CC BY-SA 3.0, attribution requirements apply).
