
[Hardware] What comes after Piledriver?

Quote:
Originally posted by qcmadness at 2012-4-15 20:01

Decoupling INT and FP is the best proof.
That adds latency; if it weren't for offloading FP to the GPU, I can't think of any reason to insist on a shared FP unit.

http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333
...
Their different hardware characteristics already mean the GPU/CU will not be integrated with the CPU core.
Bulldozer's modular design has little to do with the GPU... it is about area/power efficiency and maximizing throughput.
Decoupled INT/FP is not proof either; that design has been there since the K7... the David Kanter passage you quote is only his personal guess.


For example, take this piece of pseudocode:
Code:
load file A from disk to memory location A
repeat 2048 times, with index idx starting from 0:
         load from memory location [A + idx] into register A
         add 1 to register A
         write the value of register A to memory location [B + idx]
end
From the CPU's point of view this is just an I/O transaction plus a read-execute-write sequential loop WITHIN the same thread, regardless of what the values are. The code has a single program counter and a single codepath. To run it as vector code on a CPU, the vectorization must be explicit.

e.g. add on a GPR (64-bit integer) / paddw on an xmm register (128-bit packed integers) / vpaddw on a ymm register (256-bit packed integers)
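
(For illustration only: a minimal C sketch of the explicit-vector CPU version, assuming the elements are 16-bit integers; the function name and buffer types are assumptions, not from the post above.)
Code:
/* One thread, one program counter; the vector width is spelled out in the
   instruction itself (_mm_add_epi16 compiles to paddw). */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

void add_one_explicit(const int16_t *a, int16_t *b, size_t n)
{
    const __m128i one = _mm_set1_epi16(1);
    for (size_t i = 0; i + 8 <= n; i += 8) {          /* 8 x 16-bit lanes per xmm */
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        v = _mm_add_epi16(v, one);                     /* packed add, i.e. paddw */
        _mm_storeu_si128((__m128i *)(b + i), v);
    }
}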

But written for CPU + GPU, the pseudocode becomes:
Code:
load file A from disk to memory location A
task on a 32-CU GPU, working over 2048 work-items in 64-wide workgroups:
         load from memory location [A + getIndex()] into register A
         add 1 to register A
         write the value of register A to memory location [B + getIndex()]
end
It looks the same, but this "Task" is in fact 32 64-wide hardware GPU threads running across the 32 CUs of the GPU. Altogether that is 32 program counters (32 codepaths) plus 32 sets of thread state... and the vectorization is mostly implicit. In effect it behaves like 32 mini co-processor cores, except they run a different ISA...
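
(For illustration only: the same operation written as an OpenCL C kernel, roughly what the pseudocode above maps to; the kernel name and argument types are assumptions. get_global_id(0) plays the role of getIndex(), and the host would enqueue it over 2048 work-items with a workgroup size of 64.)
Code:
/* No loop and no explicit vector width in the source: each of the 2048
   work-items runs this body once, and the hardware packs them into 64-wide
   wavefronts, each with its own program counter and thread state. */
__kernel void add_one(__global const int *A, __global int *B)
{
    size_t idx = get_global_id(0);   /* the getIndex() of the pseudocode */
    B[idx] = A[idx] + 1;
}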



The difference is now plain: this cannot be expressed with the x86 CPU ISA/programming model.

The most the CPU can do is control GPU execution, for example throwing a high-priority kernel into the GPU's workgroup queue so that the GPU context-switches some of its Compute Units to run that kernel first. Directly offloading the work to the GPU is simply not possible. In other words, the GPU still handles its own branching, makes its own syscalls, and decodes its own instructions.
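
(For illustration only: a host-side sketch of that control relationship using the OpenCL API; the helper name, and the assumption that the context, queue, kernel and buffers already exist, are not from the post above. The CPU merely hands the kernel and its ND-range to the device queue; from there the GPU front end fetches, decodes and branches on its own ISA.)
Code:
#include <CL/cl.h>

/* Hypothetical helper: the CPU's only role here is to enqueue the kernel
   over 2048 work-items in 64-wide workgroups; creating the context, queue,
   kernel and buffers is assumed to have happened elsewhere. */
cl_int launch_add_one(cl_command_queue queue, cl_kernel add_one,
                      cl_mem bufA, cl_mem bufB)
{
    const size_t global_size = 2048;   /* work-items, as in the pseudocode */
    const size_t local_size  = 64;     /* 64-wide workgroups */

    clSetKernelArg(add_one, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(add_one, 1, sizeof(cl_mem), &bufB);
    return clEnqueueNDRangeKernel(queue, add_one, 1, NULL,
                                  &global_size, &local_size, 0, NULL, NULL);
}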

[ Last edited by Puff at 2012-4-15 20:18 ]


Quote:
Originally posted by Puff at 2012-4-15 20:08

Their different hardware characteristics already mean the GPU/CU will not be integrated with the CPU core.
Bulldozer's modular design has little to do with the GPU... it is about area/power efficiency and maximizing throughput.
Decoupled INT/FP is not ...
The problem is that your code puts different loads on the CPU and the GPU:
CPU: check cache hierarchy and conflicts > load to L1 cache > execute > write to memory
GPU: load to cache > execute > write to memory

Decoupling the CPU and GPU queues has always been AMD's approach, but that is not today's unified scheduler.


Quote:
Originally posted by qcmadness at 2012-4-15 20:16

The problem is that your code puts different loads on the CPU and the GPU:
CPU: check cache hierarchy and conflicts > load to L1 cache > execute > write to memory
GPU: load to cache > execute > write to memory

Decoupling the CPU and GPU queues has ...
The point is not the loading but their execution models.
Think about it: once you replace the Flex FP with a bunch of Compute Units, how are CPU vector instructions or GPU applications supposed to run at all?

The GPU model is many hardware threads/wavefronts with many code streams; essentially each CU is already an independent code stream by itself.
The CPU is a sequential, single code stream (SMT aside). Even the Flex FP works for a single code stream and follows a single codepath.

[ Last edited by Puff at 2012-4-15 20:27 ]


Quote:
Originally posted by Puff at 2012-4-15 20:21

The point is not the loading but their execution models.
Think about it: once you replace the Flex FP with a bunch of Compute Units, how are CPU vector instructions or GPU applications supposed to run at all?

The GPU model is many hardware threads/wavefronts ...
By then it will be a shared scheduler.


Quote:
Originally posted by qcmadness at 2012-4-15 20:27

By then it will be a shared scheduler.
I don't think this is about the scheduler.
Whether it sits behind a unified scheduler or a decoupled co-processor model, the Flex FP or the CPU's own vector unit cannot possibly be replaced by the GPU.

Replacing the Flex FP with a bunch of Compute Units is like stuffing the execution resources of 32 CPU cores into a single 4-wide core.

[ Last edited by Puff at 2012-4-15 20:30 ]


Quote:
Originally posted by Puff at 2012-4-15 20:29

I don't think this is about the scheduler.
Whether it sits behind a unified scheduler or a decoupled co-processor model, the Flex FP or the CPU's own vector unit cannot possibly be replaced by the GPU.

Replacing the Flex FP with a bunch of Compute Units ...
It can be done, just not done yet; otherwise how would they share resources?


http://developer.amd.com/afds/assets/presentations/2901_final.pdf
P.17

Actually I don't even need to explain it.
Quote:
These two approaches suit different algorithm designs. We cannot, unfortunately, have both in a single core


Quote:
Originally posted by Puff at 2012-4-15 20:38
http://developer.amd.com/afds/assets/presentations/2901_final.pdf
P.17

Actually I don't even need to explain it.
Of course, but you can run it on a different core or execution unit on the same piece of silicon.


Quote:
Originally posted by qcmadness at 2012-4-15 20:50

Of course, but you can run it on a different core or execution unit on the same piece of silicon.
I keep feeling we are talking past each other. What do you mean by "scheduler"? A scheduler of what? Scheduling what, exactly?

[ Last edited by Puff at 2012-4-15 21:27 ]


Quote:
Originally posted by Puff at 2012-4-15 21:25

I keep feeling we are talking past each other. What do you mean by "scheduler"? A scheduler of what? Scheduling what, exactly?
Use x86 instructions and, at the same time, utilize the GPU as a co-processor.


Quote:
Originally posted by qcmadness at 2012-4-15 21:31

Use x86 instructions and utilize the GPU as a co-processor.
Well then the question is exactly how this "scheduler" works and where it sits in the pipeline.


Quote:
Originally posted by Puff at 2012-4-15 21:32

Well then the question is exactly how this "scheduler" works and where it sits in the pipeline.
replace the current FP scheduler and GPU scheduler


Quote:
Originally posted by qcmadness at 2012-4-15 21:33

replace the current FP scheduler and GPU scheduler
Okay. Is there any new ISA extension for this?


Quote:
Originally posted by Puff at 2012-4-15 21:33

Okay. Is there any new ISA extension for this?
no


Quote:
Originally posted by qcmadness at 2012-4-15 21:33

no
How would you define a GPU scheduler?
