Vanilla Transformer Learning Note
Published:
Learning note for the vanilla Transformer.
Transformer
RNN
Defects:
- Poor parallelism (time steps must be processed sequentially)
- Early hidden states h_t are gradually forgotten over long sequences
LayerNorm vs BatchNorm
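A minimal sketch (PyTorch assumed, toy shapes) of the difference: BatchNorm normalizes each feature channel across the batch, while LayerNorm normalizes each token's feature vector on its own, which is why Transformers use LayerNorm.

```python
import torch
import torch.nn as nn

# Toy shapes: batch of 2 sequences, 3 tokens each, 4 features per token.
x = torch.randn(2, 3, 4)

# LayerNorm: normalizes over the last (feature) dimension, per token.
layer_norm = nn.LayerNorm(4)
y_ln = layer_norm(x)           # each x[b, t, :] gets ~zero mean, unit variance

# BatchNorm1d expects (batch, channels, length): normalizes each feature
# channel over the batch and sequence positions together.
batch_norm = nn.BatchNorm1d(4)
y_bn = batch_norm(x.transpose(1, 2)).transpose(1, 2)

print(y_ln.shape, y_bn.shape)  # both torch.Size([2, 3, 4])
```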
Word Embedding
- Brief intro of embedding: https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca
- Video: https://www.youtube.com/watch?v=D-ekE-Wlcds
- Demo: https://ronxin.github.io/wevi/
- Slides: https://docs.google.com/presentation/d/1yQWN1CDWLzxGeIAvnGgDsIJr5xmy4dB0VmHFKkLiibo/edit#slide=id.ge79682746_0_501
Word embeddings map words or phrases from a vocabulary to corresponding vectors of real numbers.
They are a means of building a low-dimensional vector representation from a corpus of text that preserves the contextual similarity of words.
Two things to focus on:
- Dimensionality Reduction — it is a more efficient representation
- Contextual Similarity — it is a more expressive representation
The weight matrix is usually called the embedding matrix, and can be queried as a look-up table.
Two ways to train word2vec: CBOW (continuous bag-of-words) and skip-gram (SG).
Model architecture:
How is the one-hot vector decomposed into a smaller dense vector? By a linear transformation: multiplying the one-hot input by the weight matrix simply selects one row of the embedding matrix.
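A small sketch (PyTorch assumed, toy sizes) showing that multiplying a one-hot vector by the weight matrix is the same as looking up one row of the embedding table:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10, 4            # toy sizes, not from the paper
embedding = nn.Embedding(vocab_size, embed_dim)

token_id = torch.tensor([3])
one_hot = torch.zeros(1, vocab_size)
one_hot[0, token_id] = 1.0

# One-hot times the weight matrix == row lookup in the embedding table.
via_matmul = one_hot @ embedding.weight   # (1, embed_dim)
via_lookup = embedding(token_id)          # (1, embed_dim)
assert torch.allclose(via_matmul, via_lookup)
```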
Self-attention
Each self-attention layer has four learnable weight matrices: WQ, WK, WV, and WO.
What are queries, keys, and values?
Think of a search engine: you have a query Q and search with it. The engine holds many documents V, each with a title K that summarizes its content. The engine matches your query Q against the titles K to measure relevance (QK gives the attention score): a small angle between Q and K means strong positive correlation, a large angle means negative correlation. You then represent your query using the retrieved documents by taking a weighted sum of the documents V with those relevance scores, which gives a new Q'. This Q' absorbs more information from the highly relevant documents and less from the weakly relevant ones. That is the attention mechanism: pay strong attention (large weights) to the parts most relevant to what you want, and only slight attention (small weights) to the weakly relevant parts.
How are Q, K, and V generated?
They all come from the same input X, by applying linear transformations (WQ, WK, WV) to the original input vectors.
The output of the self-attention layer (d_k equals the head size): Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
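As a concrete illustration, here is a sketch in PyTorch (toy dimensions, not the exact code from any reference): Q, K, V are linear projections of X, and the output is softmax(QKᵀ/√d_k)V.

```python
import math
import torch
import torch.nn as nn

d_model, d_k = 8, 8              # toy sizes; the paper uses d_model = 512
x = torch.randn(1, 5, d_model)   # (batch, seq_len, d_model)

w_q = nn.Linear(d_model, d_k, bias=False)   # WQ
w_k = nn.Linear(d_model, d_k, bias=False)   # WK
w_v = nn.Linear(d_model, d_k, bias=False)   # WV

q, k, v = w_q(x), w_k(x), w_v(x)

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (1, 5, 5) attention scores
weights = scores.softmax(dim=-1)                    # rows sum to 1
output = weights @ v                                # (1, 5, d_k)
```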
Why multi-headed self-attention?
It allows the network to learn different representations for different tasks.
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
How to apply multi-headed self-attention?
The paper uses 8 attention heads. Each head performs the same calculation as the single-head case, which produces 8 output matrices.
This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.
How do we do that?
We concatenate the matrices and then multiply them by an additional weight matrix WO.
The whole process is shown below:
The output shape is identical to that of the input X, so it is easy to apply a residual connection.
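A compact sketch (PyTorch, paper-sized dimensions, written here only for illustration) of the multi-head step: run 8 heads, concatenate their outputs, and project with WO so the result has the same shape as the input X.

```python
import math
import torch
import torch.nn as nn

batch, seq_len, d_model, num_heads = 1, 5, 512, 8
d_k = d_model // num_heads                       # 64 per head, as in the paper

x = torch.randn(batch, seq_len, d_model)
w_q = nn.Linear(d_model, d_model, bias=False)    # all heads' WQ stacked together
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)
w_o = nn.Linear(d_model, d_model, bias=False)    # WO

def split_heads(t):
    # (B, T, d_model) -> (B, num_heads, T, d_k)
    return t.view(batch, seq_len, num_heads, d_k).transpose(1, 2)

q, k, v = split_heads(w_q(x)), split_heads(w_k(x)), split_heads(w_v(x))

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (B, h, T, T)
heads = scores.softmax(dim=-1) @ v                  # (B, h, T, d_k)

# Concatenate the 8 head outputs and apply WO.
concat = heads.transpose(1, 2).reshape(batch, seq_len, d_model)
out = w_o(concat)
assert out.shape == x.shape                         # same shape as the input X
```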
Positional Encoding
Generated with sin and cos. (This matrix is not learned as a parameter, because the sin/cos heuristic is used instead.)
Find the needed position in the following plot:
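A sketch of the sinusoidal table from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); it is computed once and added to the word embeddings, not learned.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed (not learned) positional encoding table of shape (max_len, d_model)."""
    position = torch.arange(max_len).unsqueeze(1)                     # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sin
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cos
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
# pe is added to the word embeddings before the first encoder/decoder layer.
```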
Mask
- Fill the masked positions of QKᵀ with -inf
- After the softmax, those attention weights become 0
- This prevents the model from seeing future tokens when predicting (the decoder's causal mask)
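A minimal sketch of the causal mask (PyTorch, toy sizes): positions that would look into the future are filled with -inf before the softmax, so their attention weights come out as 0.

```python
import math
import torch

seq_len, d_k = 5, 8
q = torch.randn(1, seq_len, d_k)
k = torch.randn(1, seq_len, d_k)

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)            # (1, T, T)

# Upper-triangular True entries mark "future" positions to hide.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))

weights = scores.softmax(dim=-1)
# Each row attends only to itself and earlier positions; masked weights are 0.
print(weights[0])
```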
Decoder
The encoder-decoder attention uses K and V from the encoder output and Q from the decoder tokens:
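A sketch of that encoder-decoder attention step (toy sizes, illustrative only): Q is projected from the decoder states, K and V from the encoder output, then the usual scaled dot-product attention is applied.

```python
import math
import torch
import torch.nn as nn

d_model = 8                                   # toy size
encoder_output = torch.randn(1, 6, d_model)   # from the encoder stack
decoder_states = torch.randn(1, 4, d_model)   # from the decoder's masked self-attention

w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)

q = w_q(decoder_states)       # queries from the decoder tokens
k = w_k(encoder_output)       # keys from the encoder tokens
v = w_v(encoder_output)       # values from the encoder tokens

attn = (q @ k.transpose(-2, -1) / math.sqrt(d_model)).softmax(dim=-1) @ v
# attn: (1, 4, d_model) -- one context vector per decoder position
```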
Loss
- Label smoothing = 0.1
- Cross entropy
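A sketch of the loss, assuming a recent PyTorch where nn.CrossEntropyLoss accepts a label_smoothing argument:

```python
import torch
import torch.nn as nn

vocab_size = 100                                    # toy vocabulary
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(8, vocab_size)                 # (num_target_tokens, vocab)
targets = torch.randint(0, vocab_size, (8,))        # gold token ids
loss = criterion(logits, targets)
```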