Aug 29, 2022
Thanks for your post, Frank. It's my understanding that with multi-head attention, each head only attends to 1/num_heads of the embedding dimension. In your code, each head seems to attend to the full embedding. What are your thoughts on this?
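For context, here's a minimal sketch of the split I have in mind (names like `split_heads`, `num_heads`, and `head_dim` are my own, not from your post):

```python
import torch

def split_heads(x: torch.Tensor, num_heads: int) -> torch.Tensor:
    # x: (batch, seq_len, embed_dim); each head gets embed_dim // num_heads dims
    batch, seq_len, embed_dim = x.shape
    head_dim = embed_dim // num_heads
    # (batch, seq_len, num_heads, head_dim) -> (batch, num_heads, seq_len, head_dim)
    return x.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

# Example: 8 heads over a 512-dim embedding -> each head sees only 64 dims
x = torch.randn(2, 10, 512)
print(split_heads(x, num_heads=8).shape)  # torch.Size([2, 8, 10, 64])
```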