
[Hardware] What comes after Piledriver?

Puff


#31 Posted on 2012-4-15 20:08

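Quote:

Originally posted by qcmadness at 2012-4-15 20:01

Decoupling INT and FP is the best proof.
That adds latency. If it isn't meant for handing FP over to the GPU, I can't think of any reason why the FP has to be shared.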

http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333
...

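Different HW characteristics already dictate that the GPU/CU will not be integrated with the CPU.
Bulldozer's modular design has little to do with the GPU either... it is about area/power efficiency & throughput maximizing.
Decoupled INT/FP is no proof either; that has been the design since K7... The David Kanter passage you cite is only his personal guess.

Take this pseudo code, for example: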

Code:

load file A from disks to memory location A
do 2048 times, index = idx, start from 0:
         load from memory location [A + idx] to register A
         add 1 to register A
         write to memory location [B + idx] with the value of register A
end

From the CPU's point of view, this is just one I/O transaction plus a read-execute-write sequential loop WITHIN a single thread, whatever the values are. This code has only one program counter and only one code path. To run vector code on a CPU, it has to be explicit vector.

e.g. add on a GPR (64-bit integer) / paddw on an xmm register (128-bit packed integer) / vpaddw on a ymm register (256-bit packed integer)
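
The CPU view above can be sketched in Python. This is only an illustration under stated assumptions: the lane count `LANES = 8` (for example, eight 16-bit lanes in a 128-bit register) and the function names `scalar_loop` and `explicit_vector_loop` are hypothetical and do not come from the post. Both versions are a single code stream; the vector one simply has the program name a whole fixed-width chunk at each step.

```python
N = 2048        # iteration count from the pseudo code above
LANES = 8       # assumed lane count, e.g. eight 16-bit lanes in a 128-bit register


def scalar_loop(a):
    """One code stream, one element per iteration (like a scalar add)."""
    b = [0] * len(a)
    for idx in range(len(a)):      # a single program counter walks this loop
        reg = a[idx]               # load from [A + idx]
        reg += 1                   # add 1
        b[idx] = reg               # write to [B + idx]
    return b


def explicit_vector_loop(a, lanes=LANES):
    """Still one code stream, but each step handles a fixed-width chunk."""
    b = [0] * len(a)
    for base in range(0, len(a), lanes):
        chunk = a[base:base + lanes]           # vector load
        chunk = [x + 1 for x in chunk]         # one packed add over all lanes
        b[base:base + len(chunk)] = chunk      # vector store
    return b


a = list(range(N))
assert scalar_loop(a) == explicit_vector_loop(a)
```

In both functions the vector width is visible in the program itself, which is what "explicit vector" refers to here.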

But if it is written for CPU + GPU, the pseudo code would be

Code:

load file A from disks to memory location A
Task on a 32-CU GPU working over 2048 work-items with 64-wide workgroups:
         load from memory location [A + getIndex()] to register A
         add 1 to register A
         write to memory location [B + getIndex()] with the value of register A
end

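It looks the same, but this "Task" is in fact 32 64-wide hardware GPU threads running on a 32-CU GPU. Altogether there are 32 program counters (32 code paths) and 32 sets of thread state... and it is mostly implicitly vector. It is like having 32 mini co-processor cores, except that they run a different ISA...

The difference is already clear: the x86 CPU ISA/programming model cannot do this.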
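
As a rough sketch of how the 2048 work-items map onto 32 wavefronts, the Python below runs a kernel body once per work-item. The names `kernel` and `launch` are hypothetical stand-ins, and the nested loops only emulate in sequence what the hardware would run in parallel: the outer loop corresponds to wavefronts spread over CUs, the inner loop to lanes in one wavefront. The kernel body itself never mentions a vector width, which is the "implicitly vector" part.

```python
NUM_ITEMS = 2048     # work-items, from the pseudo code above
WAVE_WIDTH = 64      # lanes per workgroup/wavefront, from the pseudo code above


def kernel(a, b, index):
    """Body seen by one work-item; `index` plays the role of getIndex()."""
    b[index] = a[index] + 1


def launch(a, b, num_items=NUM_ITEMS, width=WAVE_WIDTH):
    """Run the kernel once per work-item, grouped into wavefronts.

    Returns the number of wavefronts, i.e. independent hardware threads.
    """
    num_waves = (num_items + width - 1) // width
    for wave in range(num_waves):          # on hardware: spread over the CUs
        for lane in range(width):          # on hardware: lanes of one wavefront
            index = wave * width + lane
            if index < num_items:
                kernel(a, b, index)
    return num_waves


a = list(range(NUM_ITEMS))
b = [0] * NUM_ITEMS
waves = launch(a, b)
print(waves)  # 32
```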

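At most, the CPU controls GPU execution. For example, it throws a high-priority kernel into the GPU workgroup queue, and the GPU context-switches some of its Compute Units to run that high-priority kernel first. Offloading directly to the GPU is impossible. In other words, the GPU still has to handle its own branching, make its own syscalls, and decode its own instructions.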
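
The "CPU only controls GPU execution" idea can be illustrated with a toy queue. This is a sketch under assumptions, not a model of any real driver or hardware queue: the `KernelQueue` class, its methods, and the kernel names are all hypothetical, and it only shows ordering by priority, not the context switching of Compute Units.

```python
import heapq
import itertools


class KernelQueue:
    """Toy model of a GPU work queue that the CPU only submits to."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()   # tie-breaker: FIFO within a priority

    def submit(self, name, priority=1):
        """CPU side: enqueue a kernel; a lower number means higher priority."""
        heapq.heappush(self._heap, (priority, next(self._order), name))

    def next_kernel(self):
        """GPU side: pick the most urgent kernel to run next."""
        _priority, _seq, name = heapq.heappop(self._heap)
        return name


queue = KernelQueue()
queue.submit("background_a")
queue.submit("background_b")
queue.submit("urgent", priority=0)
run_order = [queue.next_kernel() for _ in range(3)]
print(run_order)  # ['urgent', 'background_a', 'background_b']
```

The CPU side never executes the kernels; it only decides what goes into the queue and with which priority.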

[ Last edited by Puff at 2012-4-15 20:18 ]

qcmadness


#32 Posted on 2012-4-15 20:16

Quote:

Originally posted by Puff at 2012-4-15 20:08

Different HW characteristics already dictate that the GPU/CU will not be integrated with the CPU.
Bulldozer's modular design has little to do with the GPU either... it is about area/power efficiency & throughput maximizing.
Decoupled INT/FP is no ...

The problem is that your code loads differently on the CPU and on the GPU.
CPU: check cache hierarchy and conflicts > load to L1 cache > execute > write to memory
GPU: load to cache > execute > write to memory

Decoupled CPU and GPU queues have always been AMD's approach, but that is not the current unified scheduler.

http://bbs.hk-spot.com


Puff


#33 Posted on 2012-4-15 20:21

Quote:

Originally posted by qcmadness at 2012-4-15 20:16

The problem is that your code loads differently on the CPU and on the GPU.
CPU: check cache hierarchy and conflicts > load to L1 cache > execute > write to memory
GPU: load to cache > execute > write to memory

Decoupled CPU and GPU queues have ...

The point is not the loading but their execution models.
Think about it: once you replace Flex FP with a bunch of Compute Units, how are either CPU vector instructions or GPU applications supposed to run?

The GPU model is many hardware threads/wavefronts with many code streams. Basically each CU is already an independent code stream by itself.
The CPU is a sequential/single code stream (SMT aside). Even Flex FP works for a single code stream and follows a single code path.
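
The contrast between a single code stream and many independent code streams can be sketched with a tiny simulation. Everything here is hypothetical: the three-step `PROGRAM`, the `Stream` class, and the `stagger` parameter are invented purely to make the streams visibly sit at different program counters at the same tick.

```python
PROGRAM = ["load", "add", "store"]   # hypothetical three-step kernel body


class Stream:
    """One code stream: its own program counter and its own state."""

    def __init__(self, name):
        self.name = name
        self.pc = 0

    def done(self):
        return self.pc >= len(PROGRAM)

    def step(self):
        op = PROGRAM[self.pc]
        self.pc += 1
        return (self.name, op)


def run_single_stream():
    """CPU-like: one stream, executed from start to finish."""
    s = Stream("cpu")
    trace = []
    while not s.done():
        trace.append(s.step())
    return trace


def run_many_streams(count, stagger=1):
    """GPU-like: `count` streams advance independently.

    Stream i starts i * stagger ticks late, so at a given tick the
    streams sit at different program counters.
    """
    streams = [Stream(f"wave{i}") for i in range(count)]
    trace = []
    tick = 0
    while not all(s.done() for s in streams):
        for i, s in enumerate(streams):
            if tick >= i * stagger and not s.done():
                trace.append(s.step())
        tick += 1
    return trace
```

With `run_many_streams(2)`, the trace interleaves `wave0` and `wave1` at different steps of the same program, whereas `run_single_stream()` only ever has one position in the code.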

[ Last edited by Puff at 2012-4-15 20:27 ]

qcmadness


#34 Posted on 2012-4-15 20:27

Quote:

Originally posted by Puff at 2012-4-15 20:21

The point is not the loading but their execution models.
Think about it: once you replace Flex FP with a bunch of Compute Units, how are either CPU vector instructions or GPU applications supposed to run?

The GPU model is many hardware threads/wa ...

By then it will be a shared scheduler.


Puff


#35 Posted on 2012-4-15 20:29

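Quote:

Originally posted by qcmadness at 2012-4-15 20:27

By then it will be a shared scheduler.

I don't think this has anything to do with the scheduler.

Flex FP, or the CPU's own vector unit, cannot be replaced by the GPU, no matter whether you have a unified scheduler or a decoupled co-processor model.

Replacing Flex FP with a bunch of Compute Units is like stuffing the execution resources of 32 CPU cores into a single 4-wide core.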

[ Last edited by Puff at 2012-4-15 20:30 ]

qcmadness


#36 Posted on 2012-4-15 20:35

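Quote:

Originally posted by Puff at 2012-4-15 20:29

I don't think this has anything to do with the scheduler.
Flex FP, or the CPU's own vector unit, cannot be replaced by the GPU, no matter whether you have a unified scheduler or a decoupled co-processor model.

Replacing Flex FP with a bunch of Compute Units ...

It can be done; it just hasn't been done yet. Otherwise, how would the resources be shared?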


Puff


#37 Posted on 2012-4-15 20:38

http://developer.amd.com/afds/assets/presentations/2901_final.pdf
p. 17

I don't really need to explain it myself.

Quote:

These two approaches suit different algorithm designs. We cannot, unfortunately, have both in a single core


qcmadness


#38 Posted on 2012-4-15 20:50

Quote:

Originally posted by Puff at 2012-4-15 20:38
http://developer.amd.com/afds/assets/presentations/2901_final.pdf
p. 17

I don't really need to explain it myself.

Of course, but you can load it onto different cores or execution units on the same silicon.


Puff


#39 Posted on 2012-4-15 21:25

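Quote:

Originally posted by qcmadness at 2012-4-15 20:50

of course, but you can load it in different core or execution units in a silicon

I get the feeling we are talking about completely different things. What do you mean by "scheduler"? A scheduler of what? A scheduler that schedules what?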

[ Last edited by Puff at 2012-4-15 21:27 ]

qcmadness


#40 Posted on 2012-4-15 21:31

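Quote:

Originally posted by Puff at 2012-4-15 21:25

I get the feeling we are talking about completely different things. What do you mean by "scheduler"? A scheduler of what? A scheduler that schedules what?

Use x86 instructions and, at the same time, utilize the GPU as a co-processor.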


Puff


#41 Posted on 2012-4-15 21:32

Quote:

Originally posted by qcmadness at 2012-4-15 21:31

use x86 instruction utilize GPU as co-processor

Then the question is exactly how this "scheduler" works and where in the pipeline it sits.


qcmadness


#42 Posted on 2012-4-15 21:33

Quote:

Originally posted by Puff at 2012-4-15 21:32

Then the question is exactly how this "scheduler" works and where in the pipeline it sits.

It replaces the current FP scheduler and GPU scheduler.


Puff


#43 Posted on 2012-4-15 21:33

Quote:

Originally posted by qcmadness at 2012-4-15 21:33

replace the current FP scheduler and GPU scheduler

Okay. Is there any new ISA extension for this?


qcmadness


#44 Posted on 2012-4-15 21:33

Quote:

Originally posted by Puff at 2012-4-15 21:33

Okay. Is there any new ISA extension for this?

no


Puff


#45 Posted on 2012-4-15 21:34

Quote:

Originally posted by qcmadness at 2012-4-15 21:33

no

How would you define a GPU scheduler?
