Vanilla Transformer Learning Note


Learning note for the vanilla transformer.


Transformer

Detailed introduction video

Paper URL

image-20210329234607247

RNN

Defects:

  • Poor parallelism — the sequence must be processed step by step, one token at a time.
  • Early hidden states h_t get forgotten over long sequences, so long-range information is lost.

LayerNorm vs BatchNorm

image-20220314213253415
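As a rough sketch of my own (not from the original note), assuming PyTorch: BatchNorm normalizes each feature across the batch, while LayerNorm normalizes across the features of each individual token, which is one reason it suits variable-length sequences.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 5, 8)  # (batch, seq_len, d_model)

# LayerNorm: normalize over the last dim (d_model), independently for every token.
layer_norm = nn.LayerNorm(8)
y_ln = layer_norm(x)

# BatchNorm1d expects (batch, channels, length): each channel is normalized
# over the batch and sequence positions.
batch_norm = nn.BatchNorm1d(8)
y_bn = batch_norm(x.transpose(1, 2)).transpose(1, 2)

print(y_ln.shape, y_bn.shape)  # both torch.Size([2, 5, 8])
```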

Word Embedding

Brief intro of embedding: https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca

Video: https://www.youtube.com/watch?v=D-ekE-Wlcds

Demo: https://ronxin.github.io/wevi/

PPT: https://docs.google.com/presentation/d/1yQWN1CDWLzxGeIAvnGgDsIJr5xmy4dB0VmHFKkLiibo/edit#slide=id.ge79682746_0_501

Word embedding maps words or phrases from a vocabulary to corresponding vectors of real numbers.

It’s a means of building a low-dimensional vector representation from a corpus of text, which preserves the contextual similarity of words.

Two things to focus on:

  • Dimensionality Reduction — it is a more efficient representation
  • Contextual Similarity — it is a more expressive representation

The weight matrix is usually called the embedding matrix, and can be queried as a look-up table.

Two ways to train word2vec: CBOW and Skip-gram (SG).

image-20210328120825180
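A toy sketch of my own (the sentence and window size are made up for illustration) of how the two differ: CBOW predicts the center word from its context, while skip-gram (SG) predicts each context word from the center word.

```python
# Build training pairs from one sentence with a context window of 1.
sentence = ["the", "cat", "sat", "on", "mat"]
window = 1

cbow_pairs, sg_pairs = [], []
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, center))           # CBOW: context -> center word
    sg_pairs.extend((center, c) for c in context)  # SG: center word -> each context word

print(cbow_pairs[1])  # (['the', 'sat'], 'cat')
print(sg_pairs[:2])   # [('the', 'cat'), ('cat', 'the')]
```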

Model architecture:

image-20210328122522699

How is the one-hot vector decomposed into a smaller dense vector? By a linear transformation: the one-hot vector simply selects the corresponding row of the weight (embedding) matrix.

image-20210328122649235
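A small sketch of my own, assuming PyTorch: multiplying the one-hot vector by the weight matrix gives exactly the same result as looking up the corresponding row of the embedding matrix.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10, 4
embedding = nn.Embedding(vocab_size, d_model)   # the embedding (weight) matrix

token_id = torch.tensor([3])
one_hot = torch.zeros(1, vocab_size)
one_hot[0, token_id] = 1.0

via_lookup = embedding(token_id)                # query as a look-up table
via_matmul = one_hot @ embedding.weight         # one-hot times weight matrix

print(torch.allclose(via_lookup, via_matmul))   # True
```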

Self-attention

Each attention layer has 4 learnable matrices: WQ, WK, WV, WO.

What are queries, keys and values?

You have a query Q and you search for it in a search engine. The search engine contains many articles; each article V has a title K that represents its content. The search engine matches your query Q against the titles K of those articles V to measure their relevance (QK → the attention score). A small angle between Q and K means positive correlation; a large angle means negative correlation. Then, since you want to represent your query with these retrieved articles of different relevance, you take a weighted sum of the articles V using those relevance scores, which gives you a new Q'. This Q' absorbs more information from the strongly relevant articles V and less from the weakly relevant ones. That is the attention mechanism: with different amounts of attention, it focuses heavily (large weights) on the parts strongly related to what you want, and only slightly (small weights) on the weakly related parts.

How are Q, K and V generated?

The input for all three is X: Q, K and V are obtained by applying linear transformations (WQ, WK, WV) to the original input vectors.

image-20210330002224397

image-20210330001337240

The output of the self-attention layer (d_k equals the head size):

image-20210329004340003
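A minimal single-head sketch of my own, assuming PyTorch: project X into Q, K, V with WQ, WK, WV, then compute softmax(QK^T / sqrt(d_k)) V.

```python
import math
import torch
import torch.nn as nn

d_model, d_k = 8, 8              # single head: head size d_k equals d_model here
x = torch.randn(1, 5, d_model)   # (batch, seq_len, d_model)

w_q = nn.Linear(d_model, d_k, bias=False)   # WQ
w_k = nn.Linear(d_model, d_k, bias=False)   # WK
w_v = nn.Linear(d_model, d_k, bias=False)   # WV

q, k, v = w_q(x), w_k(x), w_v(x)

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (1, 5, 5) relevance of each query to each key
weights = scores.softmax(dim=-1)                   # attention weights sum to 1 over the keys
output = weights @ v                               # weighted sum of the values

print(output.shape)  # torch.Size([1, 5, 8])
```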

Why multi-headed self-attention?

It allows the network to learn different representations for different tasks.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

image-20220316105752916

How to apply multi-headed self-attention?

The paper uses 8 attention heads. Each head performs the same calculation as the single-head case.

This generates 8 output matrices.

This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.

How do we do that?

We concatenate the matrices and then multiply them by an additional weight matrix WO.

image-20210329005844128

The whole process is shown below:

image-20210329010049082

The output shape is identical to that of the input X, so it is easy to apply the residual connection.
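A sketch of my own, assuming PyTorch, of the multi-head step (here WQ, WK, WV for all heads are fused into one linear layer for brevity): run the heads in parallel, concatenate their outputs, project with WO, and add the residual since the result has the same shape as X.

```python
import math
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
d_k = d_model // n_heads                 # head size = 64, as in the paper
x = torch.randn(1, 5, d_model)

w_qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # WQ, WK, WV for all heads at once
w_o = nn.Linear(d_model, d_model, bias=False)        # WO

q, k, v = w_qkv(x).chunk(3, dim=-1)
# split d_model into (n_heads, d_k) and move the heads to their own dimension
q, k, v = (t.view(1, 5, n_heads, d_k).transpose(1, 2) for t in (q, k, v))

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (1, 8, 5, 5)
heads = scores.softmax(dim=-1) @ v                    # (1, 8, 5, 64)

concat = heads.transpose(1, 2).reshape(1, 5, d_model) # concatenate the 8 heads
out = w_o(concat)                                     # same shape as x ...
out = out + x                                         # ... so the residual is a simple add

print(out.shape)  # torch.Size([1, 5, 512])
```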

Positional Encoding

image-20210329010952390

Generated by sin and cos. (The parameters of this matrix are not learned, because the heuristic sin/cos formulation is used.)

The encoding for a needed position can be read off from the following plot:

image-20210329235014053
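A sketch of my own of how such a fixed sin/cos matrix can be built, following the paper's formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); nothing here is learned.

```python
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    position = torch.arange(max_len).unsqueeze(1)                  # (max_len, 1)
    div_term = 10000.0 ** (torch.arange(0, d_model, 2) / d_model)  # 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)   # even dimensions use sin
    pe[:, 1::2] = torch.cos(position / div_term)   # odd dimensions use cos
    return pe                                      # fixed matrix, no learned parameters

pe = positional_encoding(max_len=50, d_model=8)
print(pe.shape)  # torch.Size([50, 8])
```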

Mask

Fill the masked positions of QK^T with -inf, so that after the softmax their attention weights become 0.

This prevents the model from seeing future values when predicting.
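A small sketch of my own, assuming PyTorch: fill the upper triangle of the score matrix with -inf before the softmax, so each position can only attend to itself and earlier positions.

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)          # QK^T / sqrt(d_k) for one head

# True above the diagonal = future positions that must be hidden
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = scores.softmax(dim=-1)
print(weights)  # upper-triangular entries are exactly 0
```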

Decoder

The encoder-decoder attention uses K and V from the encoder tokens and Q from the decoder tokens:

image-20210330004844569
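A sketch of my own, assuming PyTorch, of this encoder-decoder attention: Q is projected from the decoder tokens while K and V are projected from the encoder output, so the output length follows the decoder.

```python
import math
import torch
import torch.nn as nn

d_model = 8
encoder_out = torch.randn(1, 6, d_model)  # encoder tokens
decoder_x = torch.randn(1, 3, d_model)    # decoder tokens

w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)

q = w_q(decoder_x)      # Q from the decoder
k = w_k(encoder_out)    # K from the encoder
v = w_v(encoder_out)    # V from the encoder

weights = (q @ k.transpose(-2, -1) / math.sqrt(d_model)).softmax(dim=-1)
out = weights @ v
print(out.shape)  # torch.Size([1, 3, 8]): one output per decoder token
```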

Loss

Cross-entropy loss with label smoothing = 0.1.
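A one-line sketch of my own, assuming PyTorch 1.10+ where nn.CrossEntropyLoss accepts a label_smoothing argument:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # cross entropy with label smoothing 0.1

logits = torch.randn(5, 1000)           # (num_tokens, vocab_size)
targets = torch.randint(0, 1000, (5,))  # ground-truth token ids
loss = criterion(logits, targets)
print(loss.item())
```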