
[Hardware] Looking back at the Netburst Architecture


http://www.intel.com/Assets/en_US/PDF/manual/248966.pdf

This is the fatal flaw
Quote:
Port 0. In the first half of the cycle, port 0 can dispatch either one floating-point
move μop (a floating-point stack move, floating-point exchange or floating-point
store data) or one arithmetic logical unit (ALU) μop (arithmetic, logic, branch or store
data). In the second half of the cycle, it can dispatch one similar ALU μop.

Port 1. In the first half of the cycle, port 1 can dispatch either one floating-point
execution (all floating-point operations except moves, all SIMD operations) μop or
one normal-speed integer (multiply, shift and rotate) μop or one ALU (arithmetic)
μop. In the second half of the cycle, it can dispatch one similar ALU μop.
What it can do in one clock:
a. 2x 2 similar ADD/SUB
b. 2 similar ADD/SUB + 2 similar ADD/SUB/Logic Ops/LS Ops
c. 1 FP/SIMD Op + 1 FP LS/Move Op
d. 2 similar ADD/SUB/Logic Ops/LS Ops + Shift/Rotate

No wonder it was so slow... with code that mixes different types of operations, you get only 2 operations in total...
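
To make the per-clock cases a to d concrete, here is a rough Python sketch of the port 0 / port 1 slots under one reading of the quoted text. The function name and the µop labels (`alu`, `fp_move`, `fp_exec`, `slow_int`) are made up for illustration. It assumes that only a fast ALU µop can be followed by a similar one in the second half-cycle, and that anything else counts as one µop for the whole clock. All other ports and pipeline limits are ignored.

```python
# µop kinds each port accepts in the first half of the clock (from the quote).
PORT0 = {"fp_move", "alu"}              # alu here: arithmetic, logic, branch, store data
PORT1 = {"fp_exec", "slow_int", "alu"}  # alu here: arithmetic only

def uops_per_clock(port0_kind, port1_kind):
    """Peak µops dispatched on ports 0 and 1 in one clock (assumed model)."""
    if port0_kind not in PORT0 or port1_kind not in PORT1:
        raise ValueError("that port does not take this kind of µop")
    # Assumption: only a fast ALU µop gets a second, similar µop in the
    # second half-cycle; anything else counts as one µop for the clock.
    return sum(2 if kind == "alu" else 1 for kind in (port0_kind, port1_kind))

print(uops_per_clock("alu", "alu"))          # cases a/b: all fast ALU work
print(uops_per_clock("fp_move", "fp_exec"))  # case c: FP move + FP/SIMD op
print(uops_per_clock("alu", "slow_int"))     # case d: ALU pair + shift/rotate
```

In this toy model only the all-ALU mix reaches the peak; once FP or slow integer work occupies a port, that port drops to one µop per clock.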

Intel's optimization manual even explicitly says to use ADD in place of IMUL...
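
As an illustration of that kind of substitution (not the manual's own wording, which isn't quoted here), a multiply by a small constant can be built from additions alone. The helper name `times_const_with_adds` is hypothetical, and Python only shows the arithmetic; on real hardware this would be a compiler or assembly-level choice.

```python
def times_const_with_adds(x, k):
    """Compute x * k for a non-negative integer constant k, adding x-derived values only."""
    if k < 0:
        raise ValueError("k must be non-negative")
    result = 0
    addend = x            # holds x * 2**i on the i-th pass
    while k:
        if k & 1:         # this power of two is part of the constant
            result += addend
        addend += addend  # doubling done as an ADD, not a shift or multiply
        k >>= 1
    return result

print(times_const_with_adds(7, 10))
```

For a constant such as 3 this comes down to two dependent ADDs (x + x, then + x), which a double-pumped ALU can work through quickly, while a multiply has to go to the normal-speed integer unit on port 1.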


Quote:
Originally posted by Henry on 2011-9-2 15:57

But Nehalem is to some extent Netburst-based.
Obviously not...
The major execution design follows Core

http://www.realworldtech.com/pag ... 40208182719&p=6



You can see it's almost identical to Core


Quote:
Originally posted by Henry on 2011-9-2 16:01

What about Atom?
Bonnell is a completely new design


Bonnell


Bobcat


Clearly both Bonnell and Bobcat have only 2-way wide execution


Quote:
Originally posted by Henry on 2011-9-2 16:15

2-way wide = 2x FP + 2x INT?
So Barcelona/Core 2/Nehalem = 3-way?
no...

Bonnell and Bobcat can each issue only 2 instructions per cycle
K10 does 3
Core / Nehalem are both 3 simple + 1 complex (the complex one can be split into 2 simple)
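
A back-of-envelope way to read those widths: with N independent instructions and an issue width of W per cycle, the best case is ceil(N / W) cycles. The sketch below is only that lower bound. The function name is made up, and it ignores dependencies, decode restrictions and cache misses.

```python
from math import ceil

def min_issue_cycles(n_instructions, width):
    """Best-case cycles to issue n independent instructions, `width` per cycle."""
    if width < 1:
        raise ValueError("width must be at least 1")
    return ceil(n_instructions / width)

# Same 12-instruction stream on a 2-wide and a 3-wide design.
for name, width in [("Bonnell/Bobcat", 2), ("K10", 3)]:
    print(name, min_issue_cycles(12, width))
```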


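Quote:
Originally posted by Henry on 2011-9-2 16:26

Does "instruction" here mean uOps?
K10 => 3 INT + 3 FP, in parallel
Core/Nehalem => 3 INT/FP mixed (not sure whether one port can do two things at once; probably not)
No... ordinary instructions such as ADD / MUL

K7/K8/K10 all have 3 parallel groups
But that actually wastes a lot; even Bulldozer no longer does it that way

Core/Nehalem/SB are all 4+1; after all, some instructions are Load/Store
On K7/K8/K10/Bulldozer the load/store units sit inside the integer cluster
whereas on Core/Nehalem/SB they are separate from the integer cluster (ports 2/3/4, not shown in the diagram)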


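Quote:
Originally posted by Henry on 2011-9-2 16:44

I see Nehalem and Core both queue everything in a single RS, with all INT/FP together.
So ports 2/3/4 do exist, they're just not shown......

Does Bulldozer's FP queue up like Intel's, while INT has 3 separate queues?
FP is queued, and INT is queued too, but symmetrically, with 2 execution / L&S pipes per core/thread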


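Quote:
Originally posted by Henry on 2011-9-2 16:56

Can one Intel port do two things at the same time?
I see that only three of the six do computation.

For BD, what I meant by queueing is: does one queue feed three FP units, or is it one queue per FP unit?
no... one port can only do one thing at a time
That's why Intel CPUs from Core onward have decent memory performance

BD has a unified FP scheduler, one queue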
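
The difference between one shared queue and one queue per unit can be sketched with a toy model. Everything here is an assumption for illustration: the function names are made up, every µop is independent and takes one cycle, and in the split case µops stay in the queue they were assigned to. It is not a description of how Bulldozer or K10 actually schedules.

```python
from math import ceil

def cycles_unified(uops, units):
    """One shared queue: any free unit takes the next µop."""
    return ceil(uops / units)

def cycles_split(per_queue):
    """One private queue per unit: the longest queue sets the finish time."""
    return max(per_queue)

# Six µops on three units, evenly spread versus bunched onto one unit.
print(cycles_unified(6, 3))
print(cycles_split([2, 2, 2]))
print(cycles_split([4, 1, 1]))
```

With an even spread both layouts finish together; when the work bunches onto one queue, the split layout leaves the other units idle.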


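Quote:
Originally posted by Henry on 2011-9-2 17:12

Back in the P/G965 era, RAM was still very slow. (Dual DDR2 667/800)
Of course, CPU inter-core speed was very fast.

But AMD should be able to do a lot more at the same time, so why is it still so slow?
Are L1/L2/L3 too slow to fetch in time? ...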
prefetch, cache speed, branch prediction (from Netburst)...


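Quote:
Originally posted by Henry on 2011-9-2 17:20

3 ALU/FP + 3 DATA: one more or one fewer throws it off balance, and then it gets slow.
But if prediction kept even Netburst alive, having only 3+3 shouldn't be much of a problem.

But imagine Intel's prediction/cache speed + AMD BD's ALU/FP arrangement..........:devlau ...
A pity

Same goes for Bonnell / Bobcat against ARM when it comes to power consumption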


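Quote:
Originally posted by Henry on 2011-9-2 17:29

ARM.....

Personally, I think that with Android multitasking this unrestrained, it's hard to run smoothly without something around i3 level.
Of course, dual-core ARM is somewhat better now, but the power draw is really worrying.

Bonnell, it seems, comes down to process technology.
Unless there's a chip that ...
Bonnell went as far as in-order
Bobcat went as far as a half-speed L2

Next time, add core-gating and TurboBoost/Core as well
and they might be competitive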


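Quote:
Originally posted by Henry on 2011-9-2 17:36

core-gating?
TurboBoost/Core isn't a long-term solution, and besides, x86 shouldn't have to worry about performance.
It's just the strategy at idle that.....
Even private cars do start-stop now; if x86 could keep the CPU off at idle (with just a timer running), it should be competitive. ...
power gating

For low idle power, you have to push NB/Uncore power down too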


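Quote:
Originally posted by Henry on 2011-9-2 17:41

I mean simply turning the CPU/NB off, leaving only the CPU clock and receiver on.
When a call comes in, power on the CPU and the screen immediately.
I think Intel previously developed an ultra-low-power many-core CPU; take one of its cores, put it in a phone, done.
x86 can't shut down completely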
