
[Hardware] First Details of Steamroller


http://www.anandtech.com/show/62 ... roller-architecture

Major improvements:
Quote:


  • Duplicating the decode hardware in each module. Now each core has its own 4-wide instruction decoder, and both decoders can operate in parallel rather than alternating every other cycle. The penalties are pretty obvious: area goes up as does power consumption.

  • Steamroller inherits the perceptron branch predictor from Piledriver, but in an improved form for better performance (mostly in server workloads). The branch target buffer is also larger, which contributes to a reduction in mispredicted branches by up to 20%.

  • AMD streamlined the large, shared floating point unit in each Steamroller module. There’s no change in the execution capabilities of the FPU, but there’s a reduction in overall area. The MMX unit now shares some hardware with the 128-bit FMAC pipes. AMD wouldn’t offer too many specifics, just to say that the shared hardware only really applied for mutually exclusive MMX/FMA/FP operations and thus wouldn’t result in a performance penalty.

  • The integer and floating point register files are bigger in Steamroller, although AMD isn’t being specific about how much they’ve grown. Load operations (two operands) are also compressed so that they only take a single entry in the physical register file, which helps increase the effective size of each RF.

  • The scheduling windows also increased in size, which should enable greater utilization of existing execution resources.

  • Store-to-load forwarding sees an improvement. Steamroller is better than its predecessors at detecting interlocks, cancelling the load, and forwarding the data directly from the store.

  • The shared L1 instruction cache grew in size with Steamroller, although AMD isn’t telling us by how much. Bulldozer featured a 2-way 64KB L1 instruction cache, with each “core” using one of the ways. This approach gave Bulldozer less cache per core than previous designs, so the increase here makes a lot of sense. AMD claims the larger L1 can reduce i-cache misses by up to 30%.

  • Although AMD doesn’t like to call it a cache, Steamroller now features a decoded micro-op queue. As x86 instructions are decoded into micro-ops, the address and decoded op are both stored in this queue. Should a fetch come in for an address that appears in the queue, Steamroller’s front end will power down the decode hardware and simply service the fetch request out of the micro-op queue. This is similar in nature to Sandy Bridge’s decoded uop cache, however it is likely smaller. AMD wasn’t willing to disclose how many micro-ops could fit in the queue, other than to say that it’s big enough to get a decent hit rate.

  • The L1 to L2 interface has also been improved: some queues have grown in size, and the control logic is improved.

  • Finally on the caching front, Steamroller introduces a dynamically resizable L2 cache. Based on workload and hit rate in the cache, a Steamroller module can choose to resize its L2 cache (powering down the unused slices) in 1/4 intervals. AMD believes this is a huge power win for mobile client applications such as video decode (not so much for servers), where the CPU only has to wake up for short periods of time to run minor tasks that don’t have large L2 footprints. The L2 cache accounts for a large chunk of AMD’s core leakage, so shutting half or more of it down can definitely help with battery life. The resized cache is no faster (same access latency); it just consumes less power.

  • Steamroller brings no significant reduction in L2/L3 cache latencies. According to AMD, they’ve isolated the reason for the unusually high L3 latency in the Bulldozer architecture, however fixing it isn’t a top priority. Given that most consumers (read: notebooks) will only see L3-less processors (e.g. Llano, Trinity), and many server workloads are less sensitive to latency, AMD’s stance makes sense.
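The perceptron branch predictor mentioned above is a published design (Jiménez and Lin). AMD has not disclosed Piledriver's or Steamroller's table size, history length, or training threshold, so every parameter in this Python sketch is an illustrative placeholder, not the real configuration:

```python
# Illustrative sketch of a perceptron branch predictor in the style of
# Jimenez & Lin's published design. Table size, history length, and
# threshold are placeholders; AMD's actual values are not public.

class PerceptronPredictor:
    def __init__(self, n_entries=64, history_len=8):
        self.n_entries = n_entries
        self.history_len = history_len
        # One bias weight plus one weight per history bit, per entry.
        self.table = [[0] * (history_len + 1) for _ in range(n_entries)]
        # Training threshold from the original paper: 1.93*h + 14.
        self.theta = int(1.93 * history_len + 14)
        self.history = [1] * history_len  # +1 = taken, -1 = not taken

    def _weights(self, pc):
        return self.table[pc % self.n_entries]  # trivial PC hash

    def predict(self, pc):
        w = self._weights(pc)
        y = w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))
        return y >= 0, y  # predict taken when the dot product is >= 0

    def update(self, pc, taken):
        predicted, y = self.predict(pc)
        t = 1 if taken else -1
        w = self._weights(pc)
        # Train only on a misprediction or a low-confidence prediction.
        if predicted != taken or abs(y) <= self.theta:
            w[0] += t
            for i, h in enumerate(self.history):
                w[i + 1] += t * h
        self.history = self.history[1:] + [t]  # shift in the outcome
```

After a few dozen outcomes of a strongly biased branch, the dot product moves past `theta` and training stops; that threshold is what keeps biased branches from overtraining their weights.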


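The decoded micro-op queue described above behaves like a small address-indexed cache in front of the x86 decoders: hit, and the decoders can be powered down. A toy model in Python; the capacity and eviction policy are made-up placeholders, since AMD did not disclose the real size:

```python
# Toy model of a decoded micro-op queue: decoded ops are stored by
# fetch address, and a hit lets the front end skip the decoders.
# Capacity (64) and FIFO eviction are placeholder assumptions.

from collections import OrderedDict

class UopQueue:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()   # fetch address -> decoded micro-ops
        self.hits = self.misses = 0

    def fetch(self, addr, decode_fn):
        if addr in self.entries:
            self.hits += 1             # decoders stay powered down
            return self.entries[addr]
        self.misses += 1               # fall back to the full decode path
        uops = decode_fn(addr)
        self.entries[addr] = uops
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the oldest entry
        return uops
```

For a small loop, every iteration after the first hits in the queue, which is exactly the "decent hit rate" case AMD is describing.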
Can it take on Haswell? No idea.

Doubling the decoders to 4-wide is wasteful.


Whether it's really 4-wide is unknown; hardly anyone besides AnandTech has mentioned 4-wide.


Quote:
Originally posted by Puff on 2012-8-30 01:53
Whether it's really 4-wide is unknown; hardly anyone besides AnandTech has mentioned 4-wide.
2-wide is definitely too narrow.
Could be 3, could be 4; going by AMD's usual approach, 4 is more likely than 3 (see K7/K8/K10).

That said, 3 is actually better than 4; AMD just isn't used to designing that way.


Only now getting a dynamically sized cache...

Intel's implementation: Pentium M
"All animals are equal, but some animals are more equal than others."


Quote:
Originally posted by BlackBird on 2012-8-30 01:56
Only now getting a dynamically sized cache...

Intel's implementation: Pentium M
Because >512KB L2 caches only arrived last year


Quote:
Originally posted by qcmadness on 2012-8-30 01:57

Because >512KB L2 caches only arrived last year
Yeah right... they didn't add it back in the K8 1MB-L2 era either...


Quote:
Originally posted by qcmadness on 2012-8-30 01:55

2-wide is definitely too narrow.
Could be 3, could be 4; going by AMD's usual approach, 4 is more likely than 3 (see K7/K8/K10).

That said, 3 is actually better than 4; AMD just isn't used to designing that way.
From a power-efficiency standpoint it would be 3-wide. And then there's the decoded micro-op queue...
But how the FPU will operate is a good question, and the FPU looks like it has shrunk, from 4 pipes down to 3. Though personally I'd guess the SIMD ALUs went from two sets to three.


Quote:
Originally posted by BlackBird on 2012-8-30 02:00

Yeah right... they didn't add it back in the K8 1MB-L2 era either...
Eh, true.

K8 was too far ahead back then, and the Pentium M hadn't been out for long at that point.


Quote:
Originally posted by Puff on 2012-8-30 02:00

From a power-efficiency standpoint it would be 3-wide. And then there's the decoded micro-op queue...
But how the FPU will operate is a good question, and the FPU looks like it has shrunk, from 4 pipes down to 3. Though personally I'd guess the SIMD ALUs went from two sets to three.

Counting everything, about six ports sounds about right.

Not pouring so many resources into MMX is a good thing; at least MMX/3DNow!/x87 can then be fully replaced by SSE2.


Quote:
Originally posted by qcmadness on 2012-8-30 02:02

Counting everything, about six ports sounds about right.

Not pouring so many resources into MMX is a good thing; at least MMX/3DNow!/x87 can then be fully replaced by SSE2.
The MMX unit is the MAL pipe, i.e. vector integer arithmetic/logical ops plus bitwise ops... cutting MMX is the same as cutting SSE integer.



speculation: P0 [FMA|IMAC|CVT|MAL] P1 [FMA|XBAR|MAL] P2 [MAL] P3 [FSTOR]
See, the register file could even drop one read port and one write port.


[ Last edited by Puff on 2012-8-30 02:06 ]


Quote:
Originally posted by qcmadness on 2012-8-30 02:01

Eh, true.

K8 was too far ahead back then, and the Pentium M hadn't been out for long at that point.
Regor was coasting on old designs too.

No wonder idle consumption drops straight down the moment you touch the voltage.


Quote:
Originally posted by Puff on 2012-8-30 02:03

The MMX unit is the MAL pipe, i.e. vector integer arithmetic/logical ops plus bitwise ops... cutting MMX is the same as cutting SSE integer.
No, the FMA unit handles all the SSEx work.

http://support.amd.com/us/Proces ... 5h_sw_opt_guide.pdf
Page 37
Quote:
• Two 128-bit FMAC units. Each FMAC supports four single precision or two double-precision ops.
• FADDs and FMULs are implemented within the FMAC’s.
• x87 FADDs and FMULs are also handled by the FMAC.
• Each FMAC contains a variable latency divide/square root machine.


Quote:
Originally posted by qcmadness on 2012-8-30 02:06

No, the FMA unit handles all the SSEx work.
Nononono.

P0 has three execution units: FMA, FCVT and IMAC. P1 has two: FMA and XBAR. P2 and P3 have only the MAL units.
FMA handles all floating-point arithmetic and compare operations, e.g. CMPPS, ADDPS, MULPS, but not bitwise operations, e.g. ORPS, ANDPS.
MAL handles all integer arithmetic and logical operations, plus the floating-point bitwise operations. Integer MUL/MAC, though, is IMAC's job.

I'm not making this up: the Optimization Guide says so, and the well-known Agner Fog says the same.
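The unit assignments described in this post can be written down as a small lookup table. The mnemonic-to-pipe mapping below follows the post's own description (and Agner Fog's tables for the Bulldozer family), but it is a simplified, far-from-exhaustive sketch:

```python
# Rough dispatch map for the Bulldozer-family FPU as described in the
# post above: FMA pipes (P0/P1) take FP arithmetic and compares, the
# MAL pipes (P2/P3) take packed-integer ALU ops and FP bitwise ops,
# and IMAC (on P0) takes packed-integer multiplies. Simplified sketch.

PIPE_MAP = {
    # FMA pipes: floating-point arithmetic and compares
    "ADDPS": ("P0", "P1"), "MULPS": ("P0", "P1"), "CMPPS": ("P0", "P1"),
    # MAL pipes: packed-integer ALU ops and FP bitwise ops
    "PADDD": ("P2", "P3"), "ANDPS": ("P2", "P3"), "ORPS": ("P2", "P3"),
    # IMAC: packed-integer multiply / multiply-accumulate
    "PMULLD": ("P0",),
}

def pipes_for(mnemonic):
    """Return the pipes a mnemonic can issue to, or () if unmapped."""
    return PIPE_MAP.get(mnemonic.upper(), ())
```

The point of the post survives the simplification: cutting the MAL pipes would take the SSE integer ops with them, independent of what the FMACs can do.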


Come on, if you're going to cite the SOG, look at the instruction latency table. It's not perfectly accurate, but the vast majority of the P***D/Q/W/SW/B integer arithmetic instructions go to MAL0 and MAL1.
The description in the very section you quoted even says: "In addition to the two FMACs, the FPU also contains two 128-bit integer units which perform arithmetic and logical operations on AVX, MMX and SSE packed integer data."

[ Last edited by Puff on 2012-8-30 02:17 ]


Quote:
Originally posted by Puff on 2012-8-30 02:13
Come on, if you're going to cite the SOG, look at the instruction latency table. It's not perfectly accurate, but the vast majority of the P***D/Q/W/SW/B integer arithmetic instructions go to MAL0 and MAL1.
The description in the very section you quoted even says: In addition ...
Reading it now; you're probably right.

But since Intel also reuses its 128-bit units for 256-bit work,
AMD could likewise be reusing the FMAC for ALU work.
