Is the recursive form in a transformer efficient for long sequences?
I am reading the paper *Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention*, where the authors propose the linear transformer, which seems to be very efficient in practice. However, I am confused about the recursive operations in Algorithm 1.
For tasks that require causal masking:
- Are the forward and backward passes presented in Algorithm 1 used during both training and inference?
- According to my understanding, the major computational bottleneck of an RNN is its recursive (sequential) operation, which makes it very slow on very long sequences. Will Algorithm 1 suffer from the same problem, given that it is also computed recursively (see the sketch after this list)? (Apart from the fact that we can implement the recursion in C++ to make it faster.)
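
For context, here is a minimal NumPy sketch of the causal recurrence I am referring to, written from my reading of the paper. It is not the authors' code: the `elu(x) + 1` feature map is the one they describe, but the single-head setup, shapes, and the `eps` stabilizer are my own simplifications.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, the kernel feature map described in the paper
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V, eps=1e-6):
    """Recurrent (RNN-style) causal linear attention.

    Q, K: (seq_len, d_k), V: (seq_len, d_v).
    The state S is a d_k x d_v matrix and z is a d_k vector, so the
    per-step cost is O(d_k * d_v), independent of sequence length.
    """
    seq_len, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))      # running sum of phi(k_i) v_i^T
    z = np.zeros(d_k)             # running sum of phi(k_i)
    out = np.zeros((seq_len, d_v))
    for i in range(seq_len):
        phi_q, phi_k = feature_map(Q[i]), feature_map(K[i])
        S += np.outer(phi_k, V[i])            # constant-size state update
        z += phi_k
        out[i] = (phi_q @ S) / (phi_q @ z + eps)
    return out

# toy usage: a 1000-step sequence; memory stays constant in seq_len
rng = np.random.default_rng(0)
Q = rng.standard_normal((1000, 64))
K = rng.standard_normal((1000, 64))
V = rng.standard_normal((1000, 64))
print(causal_linear_attention(Q, K, V).shape)  # (1000, 64)
```

My question is essentially whether this per-step loop is what runs in practice, and whether its sequential nature becomes the bottleneck for very long sequences.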
Can someone please provide some insight into these two questions? I would be really grateful!
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
