Computer Architecture
2024 Spring
Final Project Part 2
Overview
Tutorial
● Gem5 Introduction
● Environment Setup
Projects
● Part 1 (5%)
○ Write a C++ program to analyze the specification of the L1 data cache.
● Part 2 (5%)
○ Given the hardware specifications, achieve the best possible performance on a more complicated program.
Project 2
Description
In this project, we use a computer system with a two-level cache. Your task is to write a ViT (Vision Transformer) in C++ and optimize it. The next page gives the details of the system specification.
System Specifications
● ISA: X86
● CPU: TimingSimpleCPU (no pipeline, CPU stalls on every memory request)
● Caches
○ The L1 I cache and L1 D cache connect to the same L2 cache
● Memory size: 8192MB
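When tuning for this memory hierarchy, it helps to turn the cache parameters (16KB 4-way L1 D cache, 16KB 8-way L1 I cache, 1MB 16-way L2, all with 32B blocks and LRU) into set counts and address-bit splits. The C++ sketch below does that arithmetic; `CacheLevel`, `bits_for`, and the `kL1I`/`kL1D`/`kL2` constants are illustrative names, not part of the project code.

```cpp
#include <cassert>
#include <cstdint>

// Parameters of one set-associative cache level (values from the spec).
struct CacheLevel {
    std::uint64_t size_bytes;
    std::uint64_t ways;
    std::uint64_t block_bytes;

    // Number of sets = size / (associativity * block size).
    std::uint64_t sets() const {
        assert(size_bytes % (ways * block_bytes) == 0);
        return size_bytes / (ways * block_bytes);
    }
    // Set that a byte address maps to (block number modulo set count).
    std::uint64_t set_of(std::uint64_t addr) const {
        return (addr / block_bytes) % sets();
    }
    // Addresses exactly this many bytes apart map to the same set.
    std::uint64_t conflict_stride() const { return sets() * block_bytes; }
};

// Number of address bits needed to index n items (n must be a power of two).
unsigned bits_for(std::uint64_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    unsigned b = 0;
    while (n > 1) {
        n >>= 1;
        ++b;
    }
    return b;
}

const CacheLevel kL1I{16 * 1024, 8, 32};
const CacheLevel kL1D{16 * 1024, 4, 32};
const CacheLevel kL2{1024 * 1024, 16, 32};
```

For example, the L1 D cache has 16384 / (4 × 32) = 128 sets, so addresses 128 × 32 = 4096 bytes apart compete for the same 4 ways. Assuming 4-byte `float` data (the slides do not state the element type), one 32B block holds 8 values and one token row of C = 768 values spans 3072 bytes (96 blocks).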
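              I cache size   I cache assoc.   D cache size   D cache assoc.   Policy   Block size
L1 cache      16KB           8-way            16KB           4-way            LRU      32B
L2 cache      –              –                1MB            16-way           LRU      32B
(The L2 cache is shared by instructions and data; its size and associativity are listed in the D cache columns.)
ViT(Vision Transformer) – Transformer Overview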
● A basic transformer block consists of
○ Layer Normalization
○ MultiHead Self-Attention (MHSA)
○ Feed Forward Network (FFN)
○ Residual connection (Add)
● You only need to focus on how to implement the functions in the red box
● If you only want to complete the project without understanding the full ViT algorithm, you can skip the sections marked in red
ViT(Vision Transformer) – Image Pre-processing
● Normalize, resize to (300,300,3), and center-crop to (224,224,3)
ViT(Vision Transformer) – Patch Encoder
● In this project, we use a Conv2D layer as the patch encoder, with kernel_size = (16,16), stride = (16,16), and output_channel = 768
● (224,224,3) -> (14,14, 16*16*3) -> (196, 768)
ViT(Vision Transformer) – Class Token
● Now we have 196 tokens, and each token has 768 features
● To record global information, we need to concatenate one learnable class token with the 196 tokens
● (196,768) -> (197,768)
ViT(Vision Transformer) – Position Embedding
● Add the learnable position embedding to the patch embedding
● (197,768) + position_embedding(197,768) -> (197,768)
ViT(Vision Transformer) – Layer Normalization
T: # of tokens, C: embedded dimension
● Normalize each token
● You need to normalize with the formula
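The slide's formula image is not recoverable here. A minimal sketch, assuming the standard LayerNorm definition $y_c = \frac{x_c - \mu}{\sqrt{\sigma^2 + \epsilon}}\,\gamma_c + \beta_c$, where $\mu$ and $\sigma^2$ are the mean and (biased) variance over the C features of one token and $\gamma$, $\beta$ are learned weights. The default $\epsilon = 10^{-5}$ (PyTorch's `nn.LayerNorm` default), the float element type, and the name and signature of `layernorm_ref` are assumptions to check against layernorm_tb.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Reference LayerNorm over a [T*C] array (token-major: C values per token).
// Assumes learnable gamma/beta of length C and a PyTorch-style epsilon;
// the project's actual signature and eps may differ.
void layernorm_ref(const float* in, float* out, const float* gamma,
                   const float* beta, int T, int C, float eps = 1e-5f) {
    assert(T > 0 && C > 0);
    for (int t = 0; t < T; ++t) {
        const float* x = in + static_cast<std::size_t>(t) * C;
        float* y = out + static_cast<std::size_t>(t) * C;

        // Mean over the C features of this token.
        float mean = 0.0f;
        for (int c = 0; c < C; ++c) mean += x[c];
        mean /= C;

        // Biased variance (divide by C, not C - 1).
        float var = 0.0f;
        for (int c = 0; c < C; ++c) {
            const float d = x[c] - mean;
            var += d * d;
        }
        var /= C;

        // Normalize, then scale and shift per feature.
        const float inv_std = 1.0f / std::sqrt(var + eps);
        for (int c = 0; c < C; ++c)
            y[c] = (x[c] - mean) * inv_std * gamma[c] + beta[c];
    }
}
```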
ViT(Vision Transformer) – MultiHead Self Attention (1)
● $W_k, W_q, W_v \in \mathbb{R}^{C \times C}$
● $b_q, b_k, b_v \in \mathbb{R}^{C}$
● $W_o \in \mathbb{R}^{C \times C}$
● $b_o \in \mathbb{R}^{C}$
[Figure: X → input linear projection ($W_k, W_q, W_v$; $b_q, b_k, b_v$) → split into heads → attention → merge heads → output linear projection ($W_o$; $b_o$) → Y]
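All the linear projections in MHSA have the form $Y = XW^T + b$. A minimal row-major C++ sketch, assuming float data and that each weight matrix is stored as [OC × C] with one row per output feature (the layout PyTorch's `nn.Linear` uses for its `weight`); `linear_ref` is a hypothetical name, not the project's matmul interface.

```cpp
#include <cassert>
#include <cstddef>

// Y = X * W^T + b
//   X: [T x C]   (token-major, C contiguous values per token)
//   W: [OC x C]  (row o holds the weights of output feature o)
//   b: [OC]      (may be nullptr for no bias)
//   Y: [T x OC]
// Because W is applied transposed, each output is the dot product of two
// contiguous rows, so the innermost loop walks memory sequentially.
void linear_ref(const float* X, const float* W, const float* b, float* Y,
                int T, int C, int OC) {
    assert(T > 0 && C > 0 && OC > 0);
    for (int t = 0; t < T; ++t) {
        const float* x = X + static_cast<std::size_t>(t) * C;
        for (int o = 0; o < OC; ++o) {
            const float* w = W + static_cast<std::size_t>(o) * C;
            float acc = b ? b[o] : 0.0f;
            for (int c = 0; c < C; ++c) acc += x[c] * w[c];
            Y[static_cast<std::size_t>(t) * OC + o] = acc;
        }
    }
}
```

One way to produce the [T*3*C] qkv array in a single call is to stack $W_q$, $W_k$, $W_v$ row-wise into a [3C × C] matrix and $b_q$, $b_k$, $b_v$ into a [3C] bias; each output row is then q, k, v of one token, which matches the MHSA qkv layout in the Shape of array table. A common next optimization step is blocking the t and o loops so a tile of W rows stays in cache while it is reused across several tokens.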
ViT(Vision Transformer) – MultiHead Self Attention (2)
T: # of tokens, C: embedded dimension, H: hidden dimension, NH: # of heads, with C = H * NH
● Get $Q, K, V \in \mathbb{R}^{T \times (NH \cdot H)}$ after the input linear projection
● Split Q, K, V into $Q_1, Q_2, \ldots, Q_{NH}$, $K_1, K_2, \ldots, K_{NH}$, and $V_1, V_2, \ldots, V_{NH} \in \mathbb{R}^{T \times H}$
[Figure: Linear Projection and split into heads]
● Linear projection: $Q = XW_q^T + b_q$, $K = XW_k^T + b_k$, $V = XW_v^T + b_v$
ViT(Vision Transformer) – MultiHead Self Attention (2)
● For each head i, compute $S_i = Q_i K_i^T / \sqrt{H} \in \mathbb{R}^{T \times T}$
● $P_i = \mathrm{Softmax}(S_i) \in \mathbb{R}^{T \times T}$, where Softmax is applied row-wise
● $O_i = P_i V_i \in \mathbb{R}^{T \times H}$
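A direct (unoptimized) sketch of the per-head computation above, reading from the MHSA qkv array ([T*3*C]: q, k, v of token 1, then of token 2, and so on) and writing O as [T*C]. It assumes float data and that head i uses feature columns i*H through i*H + H - 1 of q, k, and v; `attention_ref` is a hypothetical name. The softmax subtracts each row's maximum before `exp`, a standard trick that leaves the result unchanged but avoids overflow.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Multi-head attention core on the [T*3*C] qkv layout:
//   row t = [ q_t (C values) | k_t (C values) | v_t (C values) ]
// Head h is assumed to use features h*H .. h*H+H-1 of q, k and v.
// Writes O as [T*C]; head h fills columns h*H .. h*H+H-1 (merged heads).
void attention_ref(const float* qkv, float* O, int T, int C, int NH) {
    assert(T > 0 && NH > 0 && C % NH == 0);
    const int H = C / NH;
    const std::size_t row = 3 * static_cast<std::size_t>(C);
    const float scale = 1.0f / std::sqrt(static_cast<float>(H));
    std::vector<float> p(T);  // one row of P_h, reused

    for (int h = 0; h < NH; ++h) {
        for (int i = 0; i < T; ++i) {
            const float* q = qkv + i * row + h * H;
            // S_h[i][j] = q_i . k_j / sqrt(H), tracking the row maximum.
            float mx = -INFINITY;
            for (int j = 0; j < T; ++j) {
                const float* k = qkv + j * row + C + h * H;
                float s = 0.0f;
                for (int d = 0; d < H; ++d) s += q[d] * k[d];
                p[j] = s * scale;
                mx = std::max(mx, p[j]);
            }
            // Row-wise softmax (max-subtracted for stability).
            float sum = 0.0f;
            for (int j = 0; j < T; ++j) {
                p[j] = std::exp(p[j] - mx);
                sum += p[j];
            }
            // O_h[i] = sum_j P_h[i][j] * v_j
            float* o = O + static_cast<std::size_t>(i) * C + h * H;
            for (int d = 0; d < H; ++d) o[d] = 0.0f;
            for (int j = 0; j < T; ++j) {
                const float* v = qkv + j * row + 2 * C + h * H;
                const float w = p[j] / sum;
                for (int d = 0; d < H; ++d) o[d] += w * v[d];
            }
        }
    }
}
```

Because head i writes columns i*H to i*H + H - 1 of O, the merge-heads step is just this layout, and O can go straight into the output projection.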
[Figure: $Q_i$, $K_i$ → matrix multiplication and scale → Softmax → matrix multiplication with $V_i$ → $O_i$]
ViT(Vision Transformer) – MultiHead Self Attention (3)
T: # of tokens, C: embedded dimension, H: hidden dimension, NH: # of heads
● $O_i \in \mathbb{R}^{T \times H}$, $O = [O_1, O_2, \ldots, O_{NH}] \in \mathbb{R}^{T \times C}$
[Figure: merge heads and Linear Projection]
● Linear projection: $\mathrm{output} = OW_o^T + b_o$
ViT(Vision Transformer) – Feed Forward Network
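● Input linear projection: $\mathbb{R}^{T \times C} \rightarrow \mathbb{R}^{T \times OC}$, where OC is the hidden dimension
● Apply GeLU element-wise to the $T \times OC$ result
● Output linear projection: $\mathbb{R}^{T \times OC} \rightarrow \mathbb{R}^{T \times C}$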
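The FFN data flow can be sketched as two linear layers around an element-wise GeLU, with the [T*OC] feedforward gelu array as the intermediate buffer. Assumptions: float data, weights stored as [OC × C] and [C × OC] (one row per output feature), and the exact erf-based GeLU; `feedforward_ref`, `linear`, and `gelu_exact` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Exact GeLU: x * Phi(x), Phi = standard normal CDF (assumed variant).
static float gelu_exact(float x) {
    return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// Y[T x N] = X[T x K] * W^T + b, with W stored as [N x K].
static void linear(const float* X, const float* W, const float* b, float* Y,
                   int T, int K, int N) {
    for (int t = 0; t < T; ++t) {
        for (int n = 0; n < N; ++n) {
            float acc = b[n];
            for (int k = 0; k < K; ++k)
                acc += X[static_cast<std::size_t>(t) * K + k] *
                       W[static_cast<std::size_t>(n) * K + k];
            Y[static_cast<std::size_t>(t) * N + n] = acc;
        }
    }
}

// FFN: [T x C] -> linear (C -> OC) -> GeLU -> linear (OC -> C) -> [T x C].
// 'hidden' plays the role of the [T*OC] feedforward gelu buffer.
void feedforward_ref(const float* in, float* out,
                     const float* W1, const float* b1,  // W1: [OC x C]
                     const float* W2, const float* b2,  // W2: [C x OC]
                     int T, int C, int OC) {
    assert(T > 0 && C > 0 && OC > 0);
    std::vector<float> hidden(static_cast<std::size_t>(T) * OC);
    linear(in, W1, b1, hidden.data(), T, C, OC);
    for (float& v : hidden) v = gelu_exact(v);
    linear(hidden.data(), W2, b2, out, T, OC, C);
}
```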
[Figure: input (T tokens × C embedded dimension) → input linear projection → T × OC (OC: hidden dimension) → GeLU → output linear projection → output (T × C)]
ViT(Vision Transformer) – GeLU
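The slide's GeLU formula is not recoverable here. GeLU is commonly defined as $\mathrm{GeLU}(x) = x\,\Phi(x) = \tfrac{1}{2}x\left(1 + \mathrm{erf}(x/\sqrt{2})\right)$, and a tanh-based approximation is also widely used; PyTorch's `nn.GELU` uses the erf form by default and the tanh form with `approximate='tanh'`. Which variant the reference model uses is an assumption to verify with gelu_tb. A minimal sketch of both (function names are hypothetical):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Exact GeLU: x * Phi(x) = 0.5 * x * (1 + erf(x / sqrt(2))).
float gelu_exact(float x) {
    return 0.5f * x * (1.0f + std::erf(x * 0.70710678f));  // 1/sqrt(2)
}

// Common tanh approximation of GeLU.
float gelu_tanh(float x) {
    const float k = 0.7978845608f;  // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

// In-place GeLU over a [T*OC] buffer; element-wise, so layout does not matter.
void gelu_inplace(float* buf, std::size_t n, bool use_tanh = false) {
    assert(buf != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i)
        buf[i] = use_tanh ? gelu_tanh(buf[i]) : gelu_exact(buf[i]);
}
```

The two variants differ by roughly 1.5e-4 at x = 1, so using the wrong one can show up as small numeric mismatches in gelu_tb.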
ViT(Vision Transformer) – Classifier
● Contains a linear layer that maps 768 features to 200 classes
○ (197, 768) -> (197, 200)
● Only the first token (class token) is used
○ (197, 200) -> (1, 200)
ViT(Vision Transformer) – Work Flow
[Figure: Pre-processing → Embedder → Transformer ×12 → Classifier → Argmax → "Black footed Albatross"; also labeled m5_dump_init, Load_weight, m5_dump_stat]
[Figure: transformer block: layernorm → MHSA → + (residual) → layernorm → FFN → + (residual); MHSA: matmul → attention → matmul; FFN: matmul → gelu → matmul]
Testbench commands:
$ make gelu_tb
$ make matmul_tb
$ make layernorm_tb
$ make MHSA_tb
$ make feedforward_tb
$ make transformer_tb
$ run_all.sh
ViT(Vision Transformer) – Shape of array
Array                      Size      Layout in memory
layernorm input/output     [T*C]     token 1 | token 2 | …… | token T (C values per token)
MHSA input/output/o        [T*C]
MHSA qkv                   [T*3*C]   q token 1 | k token 1 | v token 1 | …… | q token T | k token T | v token T (C values each)
feedforward input/output   [T*C]
feedforward gelu           [T*OC]    token 1 | token 2 | …… | token T (OC values per token)
Common problem
● Segmentation fault
○ Ensure that you are not accessing an invalid memory address
○ Enter the command $ ulimit -s unlimited to remove the stack size limit
All you have to do is
● Download the TA's Gem5 image
○ docker pull yenzu/ca_final_part2:2024
● Write the C++ code in the ./layer folder based on your understanding of the algorithm, then build and run each testbench:
○ make clean
○ make <layer>_tb
○ ./<layer>_tb
All you have to do is
● Ensure the ViT successfully classifies the bird
○ python3 embedder.py --image_path images/Black_Footed_Albatross_0001_796111.jpg --embedder_path weights/embedder.pth --output_path embedded_image.bin
○ g++ -static main.cpp layer/*.cpp -o process
○ ./process
○ python3 run_model.py --input_path result.bin --output_path torch_pred.bin --model_path weights/model.pth
○ python3 classifier.py --prediction_path torch_pred.bin --classifier_path weights/classifier.pth
○ After running the above commands, you will get the top-5 predictions.
● Evaluate the performance of only part of the ViT, namely layernorm + MHSA + residual
○ The simulation takes about 3.5 hours to finish
○ Check stat.txt
Grading Policy
● (50%) Verification
○ (10%) matmul_tb
○ (10%) layernorm_tb
○ (10%) gelu_tb
○ (10%) MHSA_tb
○ (10%) transformer_tb
● (50%) Performance
○ min(sigmoid((27.74 - student latency) / student latency) * 70, 50)
● You will get 0 performance points if your design is not verified.
Submission
● Please submit code on E3 before 23:59 on June 20, 2024.
● Late submission is not allowed.
● Plagiarism is forbidden; any plagiarized work will get 0 points!
● Format
○ Code: please put your code in a folder
named FP2_team<ID>_code and compress
it into a zip file.
FP2_team<ID>_code folder
● The folder should contain the following files
○ matmul.cpp
○ layernorm.cpp
○ gelu.cpp
○ attention.cpp
○ residual.cpp