Emil Rijcken
Aug 29, 2022


Thanks for your post, Frank. My understanding is that in multi-head attention, each head attends to only 1/number_of_heads of the embedding dimension. In your code, each head seems to attend to the full embedding. What are your thoughts on this?
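
To make the question concrete, here is a minimal NumPy sketch of the split I have in mind. It assumes the common convention in which the already-projected queries, keys, and values are reshaped into num_heads slices of width d_model // num_heads. The names split_heads and per_head_attention and all shapes are hypothetical, chosen for illustration, and are not taken from the post.

```python
import numpy as np


def split_heads(x, num_heads):
    """Reshape (seq_len, d_model) into (num_heads, seq_len, d_head)."""
    seq_len, d_model = x.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    # Cut the last axis into num_heads contiguous chunks of width d_head.
    return x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)


def per_head_attention(q, k, v, num_heads):
    """Scaled dot-product attention computed separately on each head's slice.

    q, k, v are assumed to be already projected, each of shape (seq_len, d_model).
    """
    qh, kh, vh = (split_heads(t, num_heads) for t in (q, k, v))
    d_head = qh.shape[-1]
    # Each head only sees its own d_head-wide slice here.
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ vh                                      # (heads, seq, d_head)
    # Concatenate the heads back into (seq_len, d_model).
    return out.transpose(1, 0, 2).reshape(q.shape[0], -1)


x = np.arange(24, dtype=float).reshape(3, 8)  # seq_len=3, d_model=8
heads = split_heads(x, num_heads=2)
print(heads.shape)  # (2, 3, 4): two heads, each with 8 / 2 = 4 dimensions
```

In this convention the learned projections that produce q, k, and v still take the full embedding as input; only the projected vectors are split across heads. So the two views may be closer than they first appear, depending on where in the code the split happens.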


Written by Emil Rijcken

PhD candidate in Natural Language Processing
