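The different HW characteristics alone already dictate that the GPU/CU won't be integrated into the CPU.
Bulldozer's modular design has little to do with the GPU either... it's about area/power efficiency and maximizing throughput.
Decoupled INT/FP isn't proof either; that design has been around from K7 until now... The David Kanter passage you quoted is only his personal guess.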
Take this pseudo code, for example:
Code:
load file A from disk to memory location A
do 2048 times, with index idx starting from 0:
    load from memory location [A + idx] to register A
    add 1 to register A
    write to memory location [B + idx] with the value of register A
end
From the CPU's point of view, this is just one I/O transaction plus a sequential read-execute-write loop WITHIN the same thread, regardless of what the values are. This code has only one program counter and only one codepath. If you want to run vector on a CPU, it has to be explicit vector.
e.g. add on a GPR (64-bit integer) / paddw on xmm (128-bit packed integer) / vpaddw on ymm (256-bit packed integer)
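To make "explicit vector" a bit more concrete, here is a rough Python model of what a packed 16-bit add does to a register, next to the plain scalar loop. It is only a sketch of the semantics, not real SIMD: the helper names (`pack`, `unpack`, `packed_add_words`, `add_one_scalar`, `add_one_packed`) are made up for illustration, and the 16-bit element width is my assumption, since the pseudo code doesn't specify one.

```python
LANE_BITS = 16                       # assumed element width: 16-bit "words"
LANE_MASK = (1 << LANE_BITS) - 1


def pack(words):
    """Pack a list of 16-bit values into one integer, lane 0 in the low bits."""
    reg = 0
    for i, w in enumerate(words):
        reg |= (w & LANE_MASK) << (i * LANE_BITS)
    return reg


def unpack(reg, lanes):
    """Split an integer 'register' back into its 16-bit lanes."""
    return [(reg >> (i * LANE_BITS)) & LANE_MASK for i in range(lanes)]


def packed_add_words(a, b, lanes):
    """Model of a packed 16-bit add: every lane wraps on its own,
    and no carry crosses from one lane into the next."""
    sums = [(x + y) & LANE_MASK
            for x, y in zip(unpack(a, lanes), unpack(b, lanes))]
    return pack(sums)


def add_one_scalar(src):
    """Scalar version: one element per loop iteration."""
    return [(v + 1) & LANE_MASK for v in src]


def add_one_packed(src, lanes=8):
    """Explicit-vector version: the programmer chooses the width (8 lanes
    of 16 bits = 128 bits here) and issues one packed add per chunk."""
    ones = pack([1] * lanes)
    out = []
    for i in range(0, len(src), lanes):
        chunk = src[i:i + lanes]
        result = packed_add_words(pack(chunk), ones, lanes)
        out.extend(unpack(result, len(chunk)))
    return out
```

Either way it is still one thread with one program counter; the only thing that changes is how many elements each instruction touches, and the code has to spell that width out itself.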
But if you write it for CPU + GPU, the pseudo code becomes:
Code:
load file A from disk to memory location A
Task on a 32-CU GPU working over 2048 work-items with 64-wide workgroups:
    load from memory location [A + getIndex()] to register A
    add 1 to register A
    write to memory location [B + getIndex()] with the value of register A
end
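It looks the same, but this "Task" is in fact 32 64-wide hardware GPU threads running on a 32-CU GPU. Altogether that is 32 program counters (32 codepaths) and 32 sets of thread state... and on top of that it is mostly implicitly vector. In other words, it behaves like 32 mini co-processor cores, except they run a different ISA...
The difference is already obvious: the x86 CPU ISA/programming model can't do this.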
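A toy Python simulation may help picture the "32 program counters" point. This is only a sketch: the `Wavefront` class, the three-step `KERNEL` tuple and `run_task` are hypothetical names I'm making up, and the round-robin interleaving stands in for 32 CUs actually running in parallel.

```python
WAVE_WIDTH = 64          # lanes per hardware thread
NUM_ITEMS = 2048         # total work-items

# The kernel body from the pseudo code, as three abstract steps.
KERNEL = ("load", "add1", "store")


class Wavefront:
    """One 64-wide hardware thread with its own program counter and state."""

    def __init__(self, wave_id):
        self.wave_id = wave_id
        self.pc = 0                          # private program counter
        self.reg_a = [0] * WAVE_WIDTH        # one "register A" value per lane

    def get_index(self, lane):
        # Implicit vector: each lane derives its own index, nobody loops.
        return self.wave_id * WAVE_WIDTH + lane

    def done(self):
        return self.pc >= len(KERNEL)

    def step(self, mem_a, mem_b):
        """Execute one kernel instruction across all 64 lanes."""
        op = KERNEL[self.pc]
        for lane in range(WAVE_WIDTH):
            idx = self.get_index(lane)
            if op == "load":
                self.reg_a[lane] = mem_a[idx]
            elif op == "add1":
                self.reg_a[lane] += 1
            elif op == "store":
                mem_b[idx] = self.reg_a[lane]
        self.pc += 1


def run_task(mem_a):
    """Run the task over NUM_ITEMS elements of mem_a; returns (mem_b, waves)."""
    mem_b = [0] * NUM_ITEMS
    waves = [Wavefront(w) for w in range(NUM_ITEMS // WAVE_WIDTH)]
    # Advance every wavefront by one instruction per round, so all of them
    # are in flight at once, each tracking its own pc.
    while not all(w.done() for w in waves):
        for w in waves:
            if not w.done():
                w.step(mem_a, mem_b)
    return mem_b, waves
```

Calling `run_task(list(range(2048)))` creates 32 `Wavefront` objects, and the kernel body never mentions a loop or a vector width; both fall out of how the task was launched.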
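At most, the CPU controls GPU execution: for example, it throws a high-priority kernel into the GPU workgroup queue, and the GPU context-switches some of its Compute Units to run that high-priority kernel first. Having the CPU offload directly to the GPU is not possible. In other words, the GPU still has to handle its own branching, make its own syscalls, and decode its own instructions.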
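The "CPU only controls, GPU executes" split can be sketched as a toy scheduler in Python. Everything here is an assumption for illustration: `schedule` is a made-up function, every workgroup is taken to last exactly one tick, and re-prioritisation therefore happens at workgroup boundaries rather than through a real mid-kernel context switch.

```python
import heapq


def schedule(num_cus, submissions, ticks):
    """Toy model of a GPU workgroup queue.

    submissions maps a tick to a list of (priority, kernel_name, workgroups)
    tuples that the CPU enqueues at that tick; a lower priority number runs
    first. Returns, for every tick, the kernel names running on the CUs.
    """
    pending = []      # heap of (priority, submission order, kernel name)
    seq = 0
    timeline = []
    for t in range(ticks):
        # CPU side: all it does is push work into the queue.
        for prio, name, groups in submissions.get(t, []):
            for _ in range(groups):
                heapq.heappush(pending, (prio, seq, name))
                seq += 1
        # GPU side: its own scheduler hands the most urgent workgroups
        # to the Compute Units for this tick.
        busy = min(num_cus, len(pending))
        timeline.append([heapq.heappop(pending)[2] for _ in range(busy)])
    return timeline
```

For instance, with 4 CUs, a 12-workgroup "background" kernel submitted at tick 0 and a 2-workgroup "urgent" kernel submitted at tick 1 with a higher priority, two of the CUs switch to "urgent" at tick 1 while the other two keep working on "background".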
[Last edited by Puff on 2012-4-15 20:18]