Why does the pooler use tanh as the activation function in BERT, rather than GELU?

import torch.nn as nn

class BERTPooler(nn.Module):
    def __init__(self, config):
        super(BERTPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
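
For context, a minimal usage sketch; the dummy config and the random hidden states below are made up for illustration and assume the BERTPooler class above:

import torch
from types import SimpleNamespace

config = SimpleNamespace(hidden_size=768)               # stand-in for the real BERT config
hidden_states = torch.randn(2, 8, config.hidden_size)   # (batch, seq_len, hidden) from the encoder

pooler = BERTPooler(config)
pooled = pooler(hidden_states)
print(pooled.shape)  # torch.Size([2, 768]) -- one pooled vector per sequence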


Solution 1:[1]

The author of the original BERT paper answered it (kind of) in a comment on GitHub.

The tanh() thing was done early to try to make it more interpretable but it probably doesn't matter either way.

I agree this doesn't fully explain why tanh was preferred, but from the looks of it, the pooler will probably work with any activation.
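
To make that concrete, here is a sketch (not the released BERT code) of the same pooler with GELU substituted for tanh; the class name BERTPoolerGELU is made up for this example:

import torch.nn as nn

class BERTPoolerGELU(nn.Module):
    def __init__(self, config):
        super(BERTPoolerGELU, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.GELU()  # the only change: nn.Tanh() -> nn.GELU()

    def forward(self, hidden_states):
        # Pooling is unchanged: take the hidden state of the first ([CLS]) token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

Only training and evaluation would show whether this makes any practical difference; the GitHub comment quoted above suggests it would not.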

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Daisuke Shimamoto