
[Hardware] What comes after Piledriver?

throughput =/= real world performance


Quote:
Originally posted by Puff on 2012-4-15 02:07

Errr... Did I mention anything related to throughput?
Bandwidth mainly affects real-world performance; it doesn't have much effect on throughput (the theoretical max).
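
To make the distinction concrete, here is a minimal roofline-style sketch in Python. It is not from the thread: the function name `attainable_gflops` and every number in it are made up for illustration, and it assumes performance is bounded either by peak compute or by bandwidth times the work done per byte moved, whichever is lower.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline-style bound: compute-limited or memory-limited, whichever is lower."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Hypothetical chip; the figures are illustrative assumptions, not measurements.
PEAK = 1000.0  # theoretical max throughput in GFLOP/s

for bw in (50.0, 100.0, 200.0):  # memory bandwidth in GB/s
    streaming = attainable_gflops(PEAK, bw, 0.5)   # low arithmetic intensity
    compute = attainable_gflops(PEAK, bw, 50.0)    # high arithmetic intensity
    print(f"bandwidth={bw:6.1f} GB/s  peak={PEAK}  "
          f"streaming={streaming}  compute-heavy={compute}")
```

In this toy model the theoretical max never moves, while the low-intensity kernel scales directly with bandwidth.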


Quote:
Originally posted by Puff on 2012-4-15 13:28

Yep. That sentence was replying about graphics/3D (real-world) performance.
No, the same goes for CPUs.


Quote:
Originally posted by Puff on 2012-4-15 18:55

The GPU can directly run standard C/C++ code.

At present most GPUs can only run shader assembly/IL (HLSL, GLSL, Cg, IL...) and modified high-level languages, such as OpenCL C or CUDA C. Besides the modified languages there are also ...
That will become power-inefficient.


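Quote:
Originally posted by Puff on 2012-4-15 18:59

How come? Coding with the same full set of language features doesn't mean that you code them together.
The more power-efficient you want to be, the smaller the instruction set you need to support.
The main reason x86 burns more power than ARM is that it carries too much legacy baggage.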


Quote:
Originally posted by Puff on 2012-4-15 19:06

It's only paging, syscalls and virtual functions. The target ISA is already different anyway.
Virtual functions... are already trouble enough on their own.
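
One commonly cited reason virtual functions are awkward on a GPU is that lanes running in lockstep may resolve the same call to different targets. The toy Python model below is only a sketch under stated assumptions, not a description of any real GPU: `simt_passes` and the shape classes are hypothetical names, and it assumes the hardware serializes one pass per distinct call target within a group of lanes.

```python
class Shape:
    def area(self):
        raise NotImplementedError


class Circle(Shape):
    def area(self):
        return 3.14159


class Square(Shape):
    def area(self):
        return 1.0


class Triangle(Shape):
    def area(self):
        return 0.5


def simt_passes(lanes):
    """Toy cost model: one serialized pass per distinct resolved call target."""
    # type(obj).area is the function the virtual call would dispatch to.
    targets = {type(obj).area for obj in lanes}
    return len(targets)


# Eight lanes that all dispatch to the same override.
uniform_warp = [Circle() for _ in range(8)]
# Eight lanes that dispatch to three different overrides.
mixed_warp = [Circle(), Square(), Circle(), Triangle()] * 2

print("uniform:", simt_passes(uniform_warp))
print("mixed:  ", simt_passes(mixed_warp))
```

Under this assumption, a group whose lanes all hit the same override costs one pass, while mixed object types multiply the work.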


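Quote:
Originally posted by Puff on 2012-4-15 19:10

But I don't think it has anything to do with power efficiency; it only relates to execution time. In other words, x86 and ARM both support standard C/C++ just the same.
But the performance you get from the same number of transistors is very different.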


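Quote:
Originally posted by Puff on 2012-4-15 19:18
But I really don't think it's related. Supporting C/C++0x doesn't mean the GPU has to support all legacy instructions, or the low-level instructions that were originally run by the CPU. It's just like malloc: you finish allocating first, and only then start running the kernel.
For example, virtual ...
The problem is that a certain two vendors want to run everything on the GPU.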


Quote:
Originally posted by Puff on 2012-4-15 19:20

Only one.
Two.

NVIDIA: the GPU also runs CPU code
Intel: a CPU architecture used as a GPU, running both CPU and GPU code


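Quote:
Originally posted by Puff on 2012-4-15 19:25
Intel... Larrabee. Hmmm... That fits the condition. But it isn't a GPU with a CPU ISA; it's many streamlined CPU cores running a graphics pipeline, which is fundamentally different from AMD/Nvidia. GenX Graphics takes the same route as Nvidia/AMD.
As for ...
So Intel's GPU problems are far more serious than NVIDIA's...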


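Quote:
Originally posted by Puff on 2012-4-15 19:30

But I'm talking about AMD.
AMD really shouldn't bother with so much of this kind of thing; just leave INT to the CPU.

The more a GPU has to support, the more inefficient it becomes.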


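Quote:
Originally posted by Puff on 2012-4-15 19:36

What AMD is working on is better programmability on the GPU: virtual functions, exception handling, syscalls, x86 paging support and so on.

The "integer" workloads you mention, or rather low-level system features, e.g. memory managem ...
None of this needs to be done on the GPU. Bulldozer is designed precisely so that the CPU handles one part and the GPU handles another.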


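Quote:
Originally posted by Puff on 2012-4-15 19:42

I object. The GPU has no choice but to do these things if you want to run standard code as you run it on a CPU.
Integrating the GPU with the CPU is yet another thing that's impossible.
Decoupling INT and FP is the best proof.
That adds latency. Unless the goal is to hand FP off to the GPU, I can't think of any reason why the FP unit has to be shared.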

http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333
Quote:
Like all previous designs from AMD (and in contrast to Intel), Bulldozer separates the integer and floating point schedulers, register files and execution units. In proof that Sutherland’s Wheel of Reincarnation applies to more than just graphics, Bulldozer employs a co-processor model for floating point and SIMD execution that is shared by both cores in a module – reminiscent of the days when x87 floating point co-processors would reside on a separate chip altogether. One advantage of this more formalized separation is that the floating point cluster might eventually be replaced or supplemented by a GPU shader array, an evolution of Bulldozer to fit the ‘Fusion’ mold. This co-processor model is an example of a substantial change that is also familiar from previous AMD CPUs, the resemblance is clear from Figure 4 below.


Quote:
Originally posted by Puff on 2012-4-15 20:08

The different HW characteristics alone already guarantee that the GPU/CU won't be integrated with the CPU.
Bulldozer's modular design doesn't have much to do with the GPU either... it's about area/power efficiency and maximizing throughput.
Decoupled INT/FP also isn't ...
The problem is that your code puts a different load on the CPU and the GPU:
CPU: check cache hierarchy and conflicts > load to L1 cache > execute > write to memory
GPU: load to cache > execute > write to memory

Decoupling the CPU and GPU queues is AMD's usual approach, but that is not the unified scheduler we have now.


Quote:
Originally posted by Puff on 2012-4-15 20:21

The point isn't the loading; it's their execution model.
Think about it: once you replace the Flex FP with a bunch of Compute Units, how are CPU vector instructions or GPU applications supposed to run?

The GPU's model is many hardware threads/wa ...
By then it will be a shared scheduler.
