Aug 29, 2022
Thanks for your post, Frank. It's my understanding that with multi-head attention, each head only attends to 1/num_heads of the embedding dimension. In your code, each head seems to attend to the full embedding. What are your thoughts on this?
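For context, here's a minimal sketch of the split I have in mind (names like `split_heads`, `num_heads`, and `head_dim` are my own, not from your post):

```python
import torch

def split_heads(x: torch.Tensor, num_heads: int) -> torch.Tensor:
    # x: (batch, seq_len, embed_dim); each head gets embed_dim // num_heads dims
    batch, seq_len, embed_dim = x.shape
    head_dim = embed_dim // num_heads
    # (batch, seq_len, num_heads, head_dim) -> (batch, num_heads, seq_len, head_dim)
    return x.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

# Example: 8 heads over a 512-dim embedding -> each head sees only 64 dims
x = torch.randn(2, 10, 512)
print(split_heads(x, num_heads=8).shape)  # torch.Size([2, 8, 10, 64])
```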