Why does the pooler use tanh as the activation function in BERT, rather than GELU?

import torch.nn as nn

class BERTPooler(nn.Module):
    def __init__(self, config):
        super(BERTPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
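
For context, a minimal usage sketch; the dummy config and the random hidden states below are made up for illustration and assume the BERTPooler class above:

import torch
from types import SimpleNamespace

config = SimpleNamespace(hidden_size=768)               # stand-in for the real BERT config
hidden_states = torch.randn(2, 8, config.hidden_size)   # (batch, seq_len, hidden) from the encoder

pooler = BERTPooler(config)
pooled = pooler(hidden_states)
print(pooled.shape)  # torch.Size([2, 768]) -- one pooled vector per sequence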


Solution 1:[1]

The author of the original BERT paper answered it (kind of) in a comment on GitHub.

The tanh() thing was done early to try to make it more interpretable but it probably doesn't matter either way.

I agree this doesn't fully explain why tanh was preferred, but from the looks of it, the pooler will probably work with any activation.
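
To make that concrete, here is a sketch (not the released BERT code) of the same pooler with GELU substituted for tanh; the class name BERTPoolerGELU is made up for this example:

import torch.nn as nn

class BERTPoolerGELU(nn.Module):
    def __init__(self, config):
        super(BERTPoolerGELU, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.GELU()  # the only change: nn.Tanh() -> nn.GELU()

    def forward(self, hidden_states):
        # Pooling is unchanged: take the hidden state of the first ([CLS]) token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

Only training and evaluation would show whether this makes any practical difference; the GitHub comment quoted above suggests it would not.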

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Daisuke Shimamoto