
[Hardware] What comes after Piledriver?


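It is pretty much a done deal that the Kaveri and Kabini generation of APUs will bring substantial changes to the Fusion architecture.
If you like riddles, the clue is "2P". Help yourself. I'm not sure my own answer is right either, but guess away if you enjoy it; think of it as buying a Mark Six lottery ticket that won't be drawn until Fusion12, half a year from now.
There are also some quite interesting rumors. For example, Charlie once implied that Kaveri is not that bandwidth constrained. Interpret that however you like.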



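Guessing what changes in Steamroller is hard too. Counting on my fingers, though, it arrives three years after AMD's concubine-born eldest prince first showed up in the harem, so it probably shouldn't be too bad.
It should at least be fit to be crown prince, right?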

Quote:
after Piledriver, there will be substantial changes in both cores and system architecture from Steamroller onwards, that should help make AMD competitive closer to the top.
Read more: http://vr-zone.com/articles/amd-to-survive-and-thrive-still-/15564.html#ixzz1s1XxjITb

[ Last edited by Puff on 2012-4-14 22:25 ]

Quote:
Originally posted by qcmadness on 2012-4-15 00:30
throughput =/= real world performance
Errr... Did I mention anything related to throughput?


Quote:
Originally posted by qcmadness on 2012-4-15 02:25

Bandwidth mainly affects real-world performance; it doesn't much affect throughput (theoretical max).
Yep. That sentence was a reply about graphics/3D (real-world) performance.


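Quote:
Originally posted by qcmadness on 2012-4-15 13:28

No, the same goes for the CPU.
I was only talking about that charlieeee line, although it's true that throughput has little to do with memory bandwidth.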


Oh well, it's all guesswork anyway, so I'll just spell it out.

Speculation of Future Fusion Architecture... Kaveri/Kabini: (First Gen of HSA compatible APU...)


1. Unified Virtual Address Space
2. Memory Coherence = GPU and CPU access the same piece of memory with the same address
3. Unified Physical Address Space with NUMA (as GPU can access all system memory and page fault)


4. x86 pointer will work for GPU = GPU uses x86 page tables...
5. Can over-allocate memory... = (3) + NUMA memory management
6. GPU will page fault = GPU will syscall CPU for OS page fault handling = Graphics memory can be accessed and managed by the OS...

http://www.lanl.gov/orgs/hpc/salishan/salishan2011/3moore.pdf
7. P.5 Stacked Memory & Main Memory Co-existing
It does say "Concept represents engineering capability only", but if they drew it, they have probably at least thought about it or plan to do it.
8. P.7 Single unified & coherent address space
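
To make points 1, 2 and 4 concrete, here is a minimal Python sketch. It is a toy model and not HSA code: memory is a plain dict from address to node, and `build_list`, `walk` and `copy_to_device` are hypothetical names made up for this illustration. It only shows why a pointer-chasing structure works unchanged when both sides share one address space, and why a plain copy into a separate address space leaves stale pointers behind.

```python
# Toy model only: memory is a dict from integer address to a node,
# and a node is (value, address_of_next_node or None).

def build_list(memory, base, values):
    """Store `values` as a linked list in `memory`, starting at address `base`."""
    for i, value in enumerate(values):
        is_last = (i == len(values) - 1)
        next_addr = None if is_last else base + i + 1
        memory[base + i] = (value, next_addr)
    return base  # head pointer


def walk(memory, head):
    """Follow the embedded next-pointers from `head` and sum the values."""
    total, addr = 0, head
    while addr is not None:
        value, addr = memory[addr]  # KeyError if the pointer is not valid here
        total += value
    return total


def copy_to_device(host_memory, device_base):
    """Discrete-GPU style copy: same node contents, different addresses."""
    return {device_base + i: node
            for i, (_, node) in enumerate(sorted(host_memory.items()))}


if __name__ == "__main__":
    system_memory = {}
    head = build_list(system_memory, 0x1000, [1, 2, 3, 4])

    # Unified address space: the "GPU" walks the very same memory.
    print(walk(system_memory, head))

    # Separate address space: the embedded pointers still hold host addresses.
    device_memory = copy_to_device(system_memory, 0x9000)
    try:
        walk(device_memory, 0x9000)
    except KeyError as stale:
        print("stale host pointer on device:", hex(stale.args[0]))
```

In the separate-address-space case the copied nodes would need every embedded pointer rewritten before the device could use them, which is the marshalling step a unified virtual address space is meant to remove.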

... Conclusion of this piece of crap
= CPU + GPU working together in a seamless, unified memory. Allocate memory like a NUMA system (?)
+ GDDR5 memory on package (???) GDDR5 memory can be accessed by both CPU and GPU using the same address. Managed by OS (??)
+ Limited Cache Coherency between CPU and GPU??*
+ Large Capacity System Memory vs High-bandwidth GDDR5 memory (will be replaced by Stacked Memory). (??)
+ Discrete GPUs join in as well over Coherent PCIe (the earlier HSA roadmap did list Coherent PCIe)



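I don't think I used any sensitive terms, hmm. After months of carpet-sweeping for scraps of information, I have now spat out everything I pieced together. Next up: waiting for the Fusion12 lottery draw.
GDDR5 memory on package really has very little basis; it is just a guess based on my own imagination, point (7), and rumours.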


* Wild guess: it would be really bad if every GPU access had to probe the CPU caches. Probably only GPU accesses to system memory will probe the CPU caches. In other words, access could be unsafe if the piece of memory (in graphics memory) is shared between CPU and GPU.

[ Last edited by Puff on 2012-4-15 17:53 ]

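Quote:
Originally posted by XT on 2012-4-15 18:13

What does it mean for a GPU to support C and C++?
The GPU can directly run code written in standard C/C++0x, or even Java or Ruby, as long as a compiler exists.

Today most GPUs can only run shader assembly/IL (HLSL, GLSL, Cg, IL...) and modified high-level languages such as OpenCL C or CUDA C. Besides modified languages there are also libraries, for example C++ AMP. But C++ AMP has restrictions too: not every C++ syntax or feature can be used.

In other words, the GPU will support all the hardware features that high-level languages need.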

[ Last edited by Puff on 2012-4-15 18:58 ]

Quote:
Originally posted by qcmadness on 2012-4-15 18:58

that will become power inefficient
How come? Coding with the same full set of language features doesn't mean that you code them together.

[ Last edited by Puff on 2012-4-15 19:03 ]

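Quote:
Originally posted by qcmadness on 2012-4-15 19:03

The more power-efficient you want to be, the smaller the instruction set you should support.
x86 burns more power than ARM mainly because it carries too much legacy stuff.
It's only paging, syscalls and virtual functions. The target ISA is already different anyway.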

[ Last edited by Puff on 2012-4-15 19:07 ]

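Quote:
Originally posted by qcmadness on 2012-4-15 19:06

Virtual functions... are already trouble enough.
But I don't think that has anything to do with power efficiency; it only affects execution time. Put another way, x86 and ARM both support standard C/C++ just the same.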

[ Last edited by Puff on 2012-4-15 19:12 ]

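Quote:
Originally posted by qcmadness on 2012-4-15 19:12

But the performance you get from the same number of transistors differs greatly.
But I really don't think it's related. Supporting C/C++0x doesn't mean the GPU has to support every legacy instruction, or the low-level instructions that were originally run by the CPU. It's just like malloc: you finish allocating first, and only then start running the kernel.

Take virtual functions as an example: the only troublesome part is the OOP implementation in assembly. GCN already has a scalar unit to offload that to. Unless all your work-items jump to different paths, in which case it really is extremely inefficient. But in that kind of case... you'd run it on the CPU anyway.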
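
The divergence argument can be sketched with a small Python toy model. It is not a description of how GCN dispatches virtual calls; the assumption (labeled in the code) is the usual SIMD simplification that one wavefront needs one serialized pass per distinct call target. `Shape`, `Square`, `Circle`, `dispatch_passes` and `lane_utilization` are hypothetical names for this illustration.

```python
class Shape:
    def area(self):
        return 0.0


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return 3.14159 * self.radius * self.radius


def dispatch_passes(wavefront):
    """Count the distinct `area` implementations called by one wavefront.

    Assumption of this toy model: the SIMD unit needs one serialized pass
    per distinct call target, with all other lanes masked off.
    """
    return len({type(item).area for item in wavefront})


def lane_utilization(wavefront):
    """Fraction of lane-slots doing useful work under that assumption."""
    passes = dispatch_passes(wavefront)
    return 1.0 / passes if passes else 0.0


if __name__ == "__main__":
    uniform = [Square(2)] * 64              # every work-item takes the same path
    mixed = [Square(2), Circle(1)] * 32     # two targets inside one wavefront
    print(dispatch_passes(uniform), lane_utilization(uniform))
    print(dispatch_passes(mixed), lane_utilization(mixed))
```

Under this model a wavefront whose work-items all resolve to the same override costs one pass, and the cost grows only with the number of distinct targets, which is why the fully divergent case is the one better left to the CPU.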


Quote:
Originally posted by qcmadness on 2012-4-15 19:20

The problem is that two certain vendors want to run everything on the GPU.
Only one, actually.


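Quote:
Originally posted by qcmadness on 2012-4-15 19:22

Two.

NVIDIA: a GPU that also runs CPU code
Intel: a CPU architecture used as a GPU, running CPU/GPU code
Intel... Larrabee. Hmmm... that one does qualify. But it is not a GPU with a CPU ISA; it is many streamlined CPU cores running a graphics pipeline, which is fundamentally different from AMD/Nvidia. GenX Graphics, on the other hand, takes the same route as Nvidia/AMD.

As for Nvidia: yes, Nvidia wants to hand everything to the GPU, but what they actually want is to offload the parallelizable parts of the vast majority of normal applications to the GPU. Otherwise, why develop Denver, and why would Echelon still have latency-optimized cores?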

[ Last edited by Puff on 2012-4-15 19:28 ]

Quote:
Originally posted by qcmadness on 2012-4-15 19:28

So Intel's GPU problem is far more serious than NVIDIA's...
But I'm talking about AMD.


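Quote:
Originally posted by qcmadness on 2012-4-15 19:32

AMD really shouldn't bother with so much of this stuff; just leave INT to the CPU.

The more things a GPU has to support, the more inefficient it gets.
What AMD is working on is better programmability on the GPU: virtual functions, exception handling, syscalls, x86 paging support and so on.

The "integer" workloads you mention, or rather low-level system features such as memory management and I/O transactions, are things the GPU probably won't do itself. It can simply syscall the CPU to do them. Page faults are likewise handled by the OS, and the OS still runs on the CPU.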
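
The division of labour described here can be sketched as a toy model in Python. Nothing below reflects a real driver or OS interface; `ToyOS`, `ToyGPU`, `PAGE_SIZE` and the method names are hypothetical, chosen only to show the GPU translating through the shared page table and handing any miss back to an OS that runs on the CPU.

```python
PAGE_SIZE = 4096  # assumed page size for the toy model


class ToyOS:
    """Runs 'on the CPU': owns the page table and services page faults."""

    def __init__(self):
        self.page_table = {}      # virtual page number -> physical frame number
        self.next_frame = 0
        self.faults_serviced = 0

    def handle_page_fault(self, vpn):
        """Back the faulting virtual page with the next free frame."""
        self.page_table[vpn] = self.next_frame
        self.next_frame += 1
        self.faults_serviced += 1
        return self.page_table[vpn]


class ToyGPU:
    """Translates through the same page table; forwards misses to the OS."""

    def __init__(self, toy_os):
        self.toy_os = toy_os

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        frame = self.toy_os.page_table.get(vpn)
        if frame is None:
            # The GPU does no memory management itself: it asks the OS.
            frame = self.toy_os.handle_page_fault(vpn)
        return frame * PAGE_SIZE + offset


if __name__ == "__main__":
    toy_os = ToyOS()
    gpu = ToyGPU(toy_os)
    for vaddr in (3 * PAGE_SIZE + 5, 3 * PAGE_SIZE + 9, 0):
        print(hex(vaddr), "->", hex(gpu.translate(vaddr)))
    print("faults serviced by the OS:", toy_os.faults_serviced)
```

Because frames are only assigned on first touch, the same sketch also hints at how memory could be over-allocated: virtual pages cost nothing until the GPU or CPU actually accesses them.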

Integer on the GPU isn't useless either; video transcoding such as x264 can put integer on the GPU to good use.



[ Last edited by Puff on 2012-4-15 19:39 ]

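Quote:
Originally posted by qcmadness on 2012-4-15 19:39

None of this needs to be done on the GPU; Bulldozer was designed so that the CPU handles one part and the GPU handles another.
I object. The GPU has no choice but to do these things if you want to run standard code the way you run it on the CPU.
And integrating the GPU with the CPU is, again, an impossible thing.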

[ Last edited by Puff on 2012-4-15 19:43 ]