Vanilla Transformer Learning Note
Published:
Learning note for the vanilla Transformer.
Transformer
RNN
Defects:
- Poor parallelism (time steps must be processed sequentially)
- Early hidden states h_t are gradually forgotten over long sequences
LayerNorm vs BatchNorm
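A minimal sketch (PyTorch assumed, toy shapes) of the difference: BatchNorm normalizes each feature channel across the batch, while LayerNorm normalizes each token's feature vector on its own, which is why Transformers use LayerNorm.

```python
import torch
import torch.nn as nn

# Toy shapes: batch of 2 sequences, 3 tokens each, 4 features per token.
x = torch.randn(2, 3, 4)

# LayerNorm: normalizes over the last (feature) dimension, per token.
layer_norm = nn.LayerNorm(4)
y_ln = layer_norm(x)           # each x[b, t, :] gets ~zero mean, unit variance

# BatchNorm1d expects (batch, channels, length): normalizes each feature
# channel over the batch and sequence positions together.
batch_norm = nn.BatchNorm1d(4)
y_bn = batch_norm(x.transpose(1, 2)).transpose(1, 2)

print(y_ln.shape, y_bn.shape)  # both torch.Size([2, 3, 4])
```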
Word Embedding
- Brief intro of embedding: https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca
- Video: https://www.youtube.com/watch?v=D-ekE-Wlcds
- Demo: https://ronxin.github.io/wevi/
- Slides: https://docs.google.com/presentation/d/1yQWN1CDWLzxGeIAvnGgDsIJr5xmy4dB0VmHFKkLiibo/edit#slide=id.ge79682746_0_501
Word embeddings map words or phrases from a vocabulary to corresponding vectors of real numbers.
They are a means of building a low-dimensional vector representation from a corpus of text that preserves the contextual similarity of words.
Two things to focus on:
- Dimensionality Reduction — it is a more efficient representation
- Contextual Similarity — it is a more expressive representation
The weight matrix is usually called the embedding matrix, and can be queried as a look-up table.
Two ways to train word2vec: CBOW (continuous bag-of-words) and skip-gram (SG).
Model architecture:
How is the one-hot vector decomposed into a smaller dense vector? By a linear transformation: multiplying the one-hot input by the weight matrix simply selects one row of the embedding matrix.
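A small sketch (PyTorch assumed, toy sizes) showing that multiplying a one-hot vector by the weight matrix is the same as looking up one row of the embedding table:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10, 4            # toy sizes, not from the paper
embedding = nn.Embedding(vocab_size, embed_dim)

token_id = torch.tensor([3])
one_hot = torch.zeros(1, vocab_size)
one_hot[0, token_id] = 1.0

# One-hot times the weight matrix == row lookup in the embedding table.
via_matmul = one_hot @ embedding.weight   # (1, embed_dim)
via_lookup = embedding(token_id)          # (1, embed_dim)
assert torch.allclose(via_matmul, via_lookup)
```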
Self-attention
Each self-attention layer has four learnable weight matrices: WQ, WK, WV, and WO.
What are queries, keys, and values?
Think of a search engine: you have a query Q and search with it. The engine holds many documents V, each with a title K that summarizes its content. The engine matches your query Q against the titles K to measure relevance (QK gives the attention score): a small angle between Q and K means strong positive correlation, a large angle means negative correlation. You then represent your query using the retrieved documents by taking a weighted sum of the documents V with those relevance scores, which gives a new Q'. This Q' absorbs more information from the highly relevant documents and less from the weakly relevant ones. That is the attention mechanism: pay strong attention (large weights) to the parts most relevant to what you want, and only slight attention (small weights) to the weakly relevant parts.
How are Q, K, and V generated?
They all come from the same input X, by applying linear transformations (WQ, WK, WV) to the original input vectors.
The output of the self-attention layer (d_k equals the head size): Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
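As a concrete illustration, here is a sketch in PyTorch (toy dimensions, not the exact code from any reference): Q, K, V are linear projections of X, and the output is softmax(QKᵀ/√d_k)V.

```python
import math
import torch
import torch.nn as nn

d_model, d_k = 8, 8              # toy sizes; the paper uses d_model = 512
x = torch.randn(1, 5, d_model)   # (batch, seq_len, d_model)

w_q = nn.Linear(d_model, d_k, bias=False)   # WQ
w_k = nn.Linear(d_model, d_k, bias=False)   # WK
w_v = nn.Linear(d_model, d_k, bias=False)   # WV

q, k, v = w_q(x), w_k(x), w_v(x)

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (1, 5, 5) attention scores
weights = scores.softmax(dim=-1)                    # rows sum to 1
output = weights @ v                                # (1, 5, d_k)
```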
Why multi-headed self-attention?
It allows the network to learn different representations for different tasks.
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
How to apply multi-headed self-attention?
The paper uses 8 attention heads. Each head performs the same calculation as the single-head case, which produces 8 output matrices.
This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.
How do we do that?
We concatenate the matrices and then multiply them by an additional weight matrix WO.
The whole process is shown below:
The output shape is identical to that of the input X, so it is easy to apply a residual connection.
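A compact sketch (PyTorch, paper-sized dimensions, written here only for illustration) of the multi-head step: run 8 heads, concatenate their outputs, and project with WO so the result has the same shape as the input X.

```python
import math
import torch
import torch.nn as nn

batch, seq_len, d_model, num_heads = 1, 5, 512, 8
d_k = d_model // num_heads                       # 64 per head, as in the paper

x = torch.randn(batch, seq_len, d_model)
w_q = nn.Linear(d_model, d_model, bias=False)    # all heads' WQ stacked together
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)
w_o = nn.Linear(d_model, d_model, bias=False)    # WO

def split_heads(t):
    # (B, T, d_model) -> (B, num_heads, T, d_k)
    return t.view(batch, seq_len, num_heads, d_k).transpose(1, 2)

q, k, v = split_heads(w_q(x)), split_heads(w_k(x)), split_heads(w_v(x))

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (B, h, T, T)
heads = scores.softmax(dim=-1) @ v                  # (B, h, T, d_k)

# Concatenate the 8 head outputs and apply WO.
concat = heads.transpose(1, 2).reshape(batch, seq_len, d_model)
out = w_o(concat)
assert out.shape == x.shape                         # same shape as the input X
```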
Positional Encoding
Generated with sin and cos. (This matrix is not learned as a parameter, because the sin/cos heuristic is used instead.)
Find the needed position in the following plot:
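A sketch of the sinusoidal table from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); it is computed once and added to the word embeddings, not learned.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed (not learned) positional encoding table of shape (max_len, d_model)."""
    position = torch.arange(max_len).unsqueeze(1)                     # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sin
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cos
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
# pe is added to the word embeddings before the first encoder/decoder layer.
```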
Mask
- Fill the masked positions of QKᵀ with -inf
- After the softmax, those attention weights become 0
- This prevents the model from seeing future tokens when predicting (the decoder's causal mask)
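A minimal sketch of the causal mask (PyTorch, toy sizes): positions that would look into the future are filled with -inf before the softmax, so their attention weights come out as 0.

```python
import math
import torch

seq_len, d_k = 5, 8
q = torch.randn(1, seq_len, d_k)
k = torch.randn(1, seq_len, d_k)

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)            # (1, T, T)

# Upper-triangular True entries mark "future" positions to hide.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))

weights = scores.softmax(dim=-1)
# Each row attends only to itself and earlier positions; masked weights are 0.
print(weights[0])
```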
Decoder
The encoder-decoder attention uses K and V from the encoder output and Q from the decoder tokens:
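A sketch of that encoder-decoder attention step (toy sizes, illustrative only): Q is projected from the decoder states, K and V from the encoder output, then the usual scaled dot-product attention is applied.

```python
import math
import torch
import torch.nn as nn

d_model = 8                                   # toy size
encoder_output = torch.randn(1, 6, d_model)   # from the encoder stack
decoder_states = torch.randn(1, 4, d_model)   # from the decoder's masked self-attention

w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)

q = w_q(decoder_states)       # queries from the decoder tokens
k = w_k(encoder_output)       # keys from the encoder tokens
v = w_v(encoder_output)       # values from the encoder tokens

attn = (q @ k.transpose(-2, -1) / math.sqrt(d_model)).softmax(dim=-1) @ v
# attn: (1, 4, d_model) -- one context vector per decoder position
```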
Loss
- Label smoothing = 0.1
- Cross entropy
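A sketch of the loss, assuming a recent PyTorch where nn.CrossEntropyLoss accepts a label_smoothing argument:

```python
import torch
import torch.nn as nn

vocab_size = 100                                    # toy vocabulary
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(8, vocab_size)                 # (num_target_tokens, vocab)
targets = torch.randint(0, vocab_size, (8,))        # gold token ids
loss = criterion(logits, targets)
```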