[Paper Review; Transformer Inference] Transformer Model Workload Analysis

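Part 2 of a review series on ‘Full Stack Optimization of Transformer Inference: a Survey’

Section 2.2 of the paper analyzes the model workload. It assumes an idealized setting and derives the theoretical peak performance (upper bound) of each Transformer model. Along the way, we get a feel for the characteristics of each model.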

Models

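The paper analyzes the workload using the BERT-Base, BERT-Large, and GPT-2 models, all of which are Transformer-based. Their main characteristics and parameter configurations are listed below.

Model	Architecture	Directionality	Objective	Main uses
12-layer BERT-Base	Encoder-only	Bidirectional	Masked word prediction (MLM)	Sentence understanding (classification, question answering, etc.)
24-layer BERT-Large	Encoder-only	Bidirectional	Scaled-up BERT-Base	High-accuracy sentence understanding
12-layer GPT-2	Decoder-only	Unidirectional (left to right)	Next-word prediction (causal LM)	Text generation (summarization, translation, dialogue, etc.)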

Symbol	Parameter	BERT-Base	BERT-Large	GPT-2
$N$	# Layers	12	24	12
$d$	Model dimension	768	1024	768
$h$	# Attention Heads	12	16	12
$d_{FFN}$	FFN dimension	3072	4096	3072
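
To get a feel for what these configurations imply, here is a minimal sketch that derives the per-head dimension and the number of weights in the MHA and FFN matrices. The names `CONFIGS`, `layer_weights`, and `summarize` are hypothetical helpers of mine, not from the paper, and the count deliberately ignores biases, embeddings, and LayerNorm parameters.

```python
# Configurations copied from the table above.
CONFIGS = {
    "BERT-Base": {"N": 12, "d": 768, "h": 12, "d_ffn": 3072},
    "BERT-Large": {"N": 24, "d": 1024, "h": 16, "d_ffn": 4096},
    "GPT-2": {"N": 12, "d": 768, "h": 12, "d_ffn": 3072},
}


def layer_weights(d, d_ffn):
    """Weights in one layer's matrices (assumption: no biases, no LayerNorm)."""
    mha = 4 * d * d      # query, key, value, and output projections
    ffn = 2 * d * d_ffn  # the two FFN projections
    return mha + ffn


def summarize(name):
    cfg = CONFIGS[name]
    per_layer = layer_weights(cfg["d"], cfg["d_ffn"])
    return {
        "head_dim": cfg["d"] // cfg["h"],  # dimension handled by each head
        "per_layer": per_layer,
        "total": cfg["N"] * per_layer,     # all layers, embeddings excluded
    }


for name in CONFIGS:
    print(name, summarize(name))
```

Under these assumptions, all three models use a head dimension of 64, and BERT-Base and GPT-2 end up with the same matrix weight count, since their configurations are identical.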

Assumptions

Ignore the BERT models' maximum input sequence length $l$ of 512
8-bit (1 byte) precision for all operations
Assume unlimited memory, to obtain upper-bound performance

Arithmetic Intensity

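A model's execution efficiency can be described by the following two metrics.

FLOPs: the number of floating-point operations
MOPs: the number of memory operations (bytes accessed)

Models are compared using the ratio of the two, the arithmetic intensity: the number of floating-point operations that can be performed per byte of memory accessed.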

$$ \text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{MOPs}} $$

Arithmetic intensity tells us whether a model's performance is compute-bound or memory-bound. For two models with the same FLOPs, the one with the higher arithmetic intensity loses less performance to memory bottlenecks, so it can achieve similar or better performance.
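
As a rough illustration of the definition, the sketch below computes the arithmetic intensity of a single matmul and compares it against a hardware balance point. It assumes each operand is read once and the result is written once at 1 byte per element (matching the 8-bit assumption above). The function names and the hardware numbers are hypothetical, chosen only for illustration.

```python
def matmul_intensity(m, k, n, bytes_per_elem=1):
    """FLOPs per byte for an (m x k) @ (k x n) matmul.

    Assumption: both inputs are read once and the output is written once.
    """
    flops = 2 * m * k * n                            # one multiply + one add per term
    mops = bytes_per_elem * (m * k + k * n + m * n)  # inputs + output
    return flops / mops


def classify(intensity, peak_flops, peak_bandwidth):
    """Compare intensity with the hardware balance point (FLOPs per byte)."""
    balance = peak_flops / peak_bandwidth
    return "compute-bound" if intensity >= balance else "memory-bound"


# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 1e12

for size in (64, 1024):
    ai = matmul_intensity(size, size, size)
    print(size, round(ai, 2), classify(ai, PEAK_FLOPS, PEAK_BANDWIDTH))
```

On this made-up hardware the balance point is 100 FLOPs per byte, so the 1024-sized matmul lands on the compute-bound side while the 64-sized one is memory-bound.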

E2E workload characteristics

FLOPs and MOPs

[Figure: FLOPs and MOPs]

FLOPs and MOPs grow super-linearly with sequence length.
As the sequence length increases, FLOPs and MOPs rise sharply, because the act-to-act matmuls (query x key, attention score x value) are quadratic in sequence length (see the previous post).
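
The super-linear growth can be sketched by splitting one layer's matmul FLOPs into a part that is linear in sequence length and a part that is quadratic. This is a simplified count of my own, not the paper's exact accounting: it covers only the matmuls (no Softmax, LayerNorm, or activation functions), and `layer_flops` is a hypothetical helper name.

```python
def layer_flops(l, d, d_ffn):
    """Matmul FLOPs of one Transformer layer at sequence length l.

    Assumption: only matmuls are counted; 2 FLOPs per multiply-accumulate.
    """
    projections = 4 * (2 * l * d * d)  # query, key, value, output projections
    ffn = 2 * (2 * l * d * d_ffn)      # the two FFN matmuls
    # query x key and attention score x value, summed over all heads:
    # each is 2 * l * l * (d / h) per head, times h heads.
    act_to_act = 2 * (2 * l * l * d)
    return {"linear": projections + ffn, "quadratic": act_to_act}


# BERT-Base / GPT-2 dimensions from the configuration table.
for l in (128, 512, 2048, 8192):
    f = layer_flops(l, d=768, d_ffn=3072)
    share = f["quadratic"] / (f["linear"] + f["quadratic"])
    print(l, f"act-to-act share of matmul FLOPs: {share:.1%}")
```

Doubling `l` doubles the linear part but quadruples the act-to-act part, so the share of the act-to-act matmuls keeps growing with sequence length.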

Arithmetic Intensity

[Figure: arithmetic intensity]

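For the BERT-* models, arithmetic intensity increases up to a sequence length of 512 and decreases beyond that.
- As the sequence length grows, the matmuls have larger dimensions and more operations are performed per parameter load, so arithmetic intensity naturally increases.
- Beyond a sequence length of 512, however, the MHA module starts to dominate over the FFN module: the act-to-act matmuls and the Softmax in MHA become prominent. The next section covers this in more detail.
Compared with BERT, the GPT model shows a markedly lower arithmetic intensity.
- This is because decoder inference, which generates one token at a time, consists only of matrix-vector operations rather than matrix-matrix operations, which lowers data reuse.
- Performance therefore becomes memory-bandwidth-bound.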
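
The gap between matrix-matrix and matrix-vector operations can be made concrete with a small worked example. This is my own simplification rather than the paper's derivation: take a single $d \times d$ weight matrix applied to $l$ input vectors, with 1 byte per element, and count one read of the input, one read of the weights, and one write of the output.

$$ \text{Arithmetic Intensity}(l) = \frac{2 l d^2}{l d + d^2 + l d} = \frac{2 l d}{2 l + d} $$

With $d = 768$, an encoder processing $l = 512$ tokens at once gets about $439$ FLOPs per byte, while a decoder processing a single token ($l = 1$) gets $\frac{1536}{770} \approx 2$ FLOPs per byte. Each weight is loaded once and used for only one multiply-add, which is the low data reuse described above.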

Per-layer characteristics

The following are per-layer profiling results on an Intel Gold 6242 CPU.

[Figure: latency profiling]

In BERT-Base, FFN dominates at short sequence lengths and MHA dominates at long ones.
In the GPT model, MHA accounts for a large share of the latency, and that share grows with sequence length (though not to the extent seen in BERT-Base).

[Figure: normalized latency]

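Figure 9, which shows each model's normalized latency, illustrates the difference between the BERT and GPT models: the model with higher arithmetic intensity (BERT) runs faster, and decoder inference is a memory-bound problem.

The per-layer execution characteristics can be summarized as follows.

Module	Share of compute	Share of memory accesses	Bottleneck
Attention (QKV projection + attention score)	High	Medium	Memory bottleneck in QKᵀ and softmax
FFN (2-layer MLP)	Most FLOPs	Relatively low MOPs	MatMul-centric, efficient
LayerNorm + residuals	Few FLOPs	Many MOPs	Memory-access bottleneck, low arithmetic intensity

FFN does account for most of the total compute, but what determines execution speed is the bottleneck created by low-intensity operations such as softmax and normalization. In other words, “the real bottleneck arises not where FLOPs are highest, but where MOPs are highest.”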

MatMul (GEMM)
- High FLOPs, low MOPs → high arithmetic intensity
- Runs efficiently on hardware
Softmax, LayerNorm, GELU
- Low FLOPs, high MOPs → low arithmetic intensity
- Memory-access bottleneck
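
A back-of-the-envelope estimate shows why the second group sits so low. As a simplifying assumption of mine (not a figure from the paper), treat such an operation as a single pass that reads each of $n$ elements once, performs $c$ FLOPs on it, and writes one result, all at 1 byte per element:

$$ \text{Arithmetic Intensity} = \frac{c \cdot n}{n + n} = \frac{c}{2} $$

The result does not depend on $n$, so a larger tensor does not help, whereas the intensity of a matmul grows with its dimensions. Softmax and LayerNorm also need reductions and may take more than one pass over the data, which adds memory traffic on top of this estimate.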

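In this post, we reviewed the paper's analysis of the workload characteristics and bottlenecks of three Transformer models. Actual performance is governed not by the operations with the most FLOPs but by the operations with the most memory accesses. In particular, GPT-2, with its decoder-only architecture, is strongly memory-bound because of its low arithmetic intensity, which is an important consideration for inference optimization. The next post will look at the characteristics of typical DNN accelerators.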