[Hardware] First Details of Steamroller

http://www.anandtech.com/show/62 ... roller-architecture

Major improvements:
Quote:


  • Duplicating the decode hardware in each module. Now each core has its own 4-wide instruction decoder, and both decoders can operate in parallel rather than alternating every other cycle. The penalties are pretty obvious: area goes up as does power consumption.

  • Steamroller inherits the perceptron branch predictor from Piledriver, but in an improved form for better performance (mostly in server workloads). The branch target buffer is also larger, which contributes to a reduction in mispredicted branches by up to 20%.

  • AMD streamlined the large, shared floating point unit in each Steamroller module. There’s no change in the execution capabilities of the FPU, but there’s a reduction in overall area. The MMX unit now shares some hardware with the 128-bit FMAC pipes. AMD wouldn’t offer too many specifics, just to say that the shared hardware only really applied for mutually exclusive MMX/FMA/FP operations and thus wouldn’t result in a performance penalty.

  • The integer and floating point register files are bigger in Steamroller, although AMD isn’t being specific about how much they’ve grown. Load operations (two operands) are also compressed so that they only take a single entry in the physical register file, which helps increase the effective size of each RF.

  • The scheduling windows also increased in size, which should enable greater utilization of existing execution resources.

  • Store to load forwarding sees an improvement. AMD is better at detecting interlocks, cancelling the load and getting data from the store in Steamroller than before.

  • The shared L1 instruction cache grew in size with Steamroller, although AMD isn’t telling us by how much. Bulldozer featured a 2-way 64KB L1 instruction cache, with each “core” using one of the ways. This approach gave Bulldozer less cache per core than previous designs, so the increase here makes a lot of sense. AMD claims the larger L1 can reduce i-cache misses by up to 30%.

  • Although AMD doesn’t like to call it a cache, Steamroller now features a decoded micro-op queue. As x86 instructions are decoded into micro-ops, the address and decoded op are both stored in this queue. Should a fetch come in for an address that appears in the queue, Steamroller’s front end will power down the decode hardware and simply service the fetch request out of the micro-op queue. This is similar in nature to Sandy Bridge’s decoded uop cache, however it is likely smaller. AMD wasn’t willing to disclose how many micro-ops could fit in the queue, other than to say that it’s big enough to get a decent hit rate.

  • The L1 to L2 interface has also been improved: some queues have grown in size, and the associated logic has been refined.

  • Finally on the caching front, Steamroller introduces a dynamically resizable L2 cache. Based on workload and hit rate in the cache, a Steamroller module can choose to resize its L2 cache (powering down the unused slices) in 1/4 intervals. AMD believes this is a huge power win for mobile client applications such as video decode (not so much for servers), where the CPU only has to wake up for short periods of time to run minor tasks that don’t have large L2 footprints. The L2 cache accounts for a large chunk of AMD’s core leakage, so shutting half or more of it down can definitely help with battery life. The resized cache is no faster (same access latency); it just consumes less power.

  • Steamroller brings no significant reduction in L2/L3 cache latencies. According to AMD, they’ve isolated the reason for the unusually high L3 latency in the Bulldozer architecture, however fixing it isn’t a top priority. Given that most consumers (read: notebooks) will only see L3-less processors (e.g. Llano, Trinity), and many server workloads are less sensitive to latency, AMD’s stance makes sense.
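The perceptron predictor mentioned above is a published technique: each branch is predicted by thresholding a dot product of learned weights with recent branch history. Here is a minimal sketch of the idea; the table size and history length are illustrative guesses, since AMD has not disclosed its actual parameters:

```python
# Toy model of a perceptron branch predictor. Table size and history
# length are made-up illustrative values, NOT AMD's real parameters.

HISTORY_LEN = 8       # global history bits (assumed)
NUM_PERCEPTRONS = 64  # perceptron table entries (assumed)
THRESHOLD = 1.93 * HISTORY_LEN + 14  # common training threshold heuristic

class PerceptronPredictor:
    def __init__(self):
        # one weight vector (bias + one weight per history bit) per entry
        self.weights = [[0] * (HISTORY_LEN + 1) for _ in range(NUM_PERCEPTRONS)]
        self.history = [1] * HISTORY_LEN  # +1 = taken, -1 = not taken

    def _score(self, pc):
        w = self.weights[pc % NUM_PERCEPTRONS]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        return self._score(pc) >= 0  # True = predict taken

    def update(self, pc, taken):
        y = self._score(pc)
        t = 1 if taken else -1
        w = self.weights[pc % NUM_PERCEPTRONS]
        # train on a mispredict, or while confidence is still low
        if (y >= 0) != taken or abs(y) <= THRESHOLD:
            w[0] += t
            for i in range(HISTORY_LEN):
                w[i + 1] += t * self.history[i]
        self.history = self.history[1:] + [t]

# A perfectly alternating branch is learned within a few iterations:
p = PerceptronPredictor()
hits = 0
for i in range(200):
    taken = (i % 2 == 0)
    if p.predict(0x400) == taken:
        hits += 1
    p.update(0x400, taken)
```

Real hardware does this in saturating fixed-point arithmetic, not Python lists; the point is only that prediction is a thresholded dot product and training nudges weights toward correlated history bits.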


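The decoded micro-op queue described above can be modeled as a small address-indexed cache sitting in front of the decoders. A rough sketch, assuming a FIFO eviction policy and an invented capacity (AMD disclosed neither):

```python
from collections import OrderedDict

QUEUE_CAPACITY = 32  # entries; the real size was not disclosed

class UopQueue:
    """Address -> decoded micro-ops, with FIFO eviction (assumed policy)."""
    def __init__(self):
        self.entries = OrderedDict()
        self.decoder_activations = 0  # how often the decoders had to power up

    def decode(self, addr):
        # stand-in for the real x86 decode hardware
        return ("uop", addr)

    def fetch(self, addr):
        if addr in self.entries:
            # hit: decoders stay powered down, serve from the queue
            return self.entries[addr]
        # miss: power up the decoder, decode, and fill the queue
        self.decoder_activations += 1
        uops = self.decode(addr)
        if len(self.entries) >= QUEUE_CAPACITY:
            self.entries.popitem(last=False)  # evict the oldest entry
        self.entries[addr] = uops
        return uops

# A hot 8-instruction loop only pays for decode on its first iteration:
q = UopQueue()
for _ in range(100):
    for pc in range(0x1000, 0x1008):
        q.fetch(pc)
```

Every fetch after the first iteration hits the queue with the decoders idle, which mirrors the power saving AMD describes.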
Enough to take on Haswell? No idea.

Doubling the 4-wide decoder is a waste.

Quote:
Originally posted by Puff on 2012-8-30 01:53
Whether it's really 4-wide, nobody actually knows; apart from AnandTech, hardly anyone has said it's 4-wide.
2-wide would definitely be too narrow.
Could be 3, could be 4; going by AMD's usual practice, 4 seems more likely than 3 (see K7/K8/K10).

That said, 3 is actually better than 4; AMD just isn't used to designing that way.

Quote:
Originally posted by BlackBird on 2012-8-30 01:56
Only now getting a dynamically sized cache...

Intel's implementation: Pentium M
That's because L2 caches larger than 512KB only started appearing last year.

Quote:
Originally posted by BlackBird on 2012-8-30 02:00

The question is... they didn't add it back in the K8 1MB-L2 era either...
Eh, true.

K8 was too far ahead back then, and the Pentium M hadn't been out for long at that point.

Quote:
Originally posted by Puff on 2012-8-30 02:00

From a power-efficiency standpoint it would be 3-wide. Plus there's the decoded micro-op queue...
But how the FPU works is still a good question, and the FPU looks like it has shrunk, going from 4 pipes down to 3. Though personally I'd guess the SIMD ALUs go from two sets to three.

...
Counting it all up, about 6 ports, much the same.

Not pouring so many resources into MMX is a good thing; at least MMX/3DNow!/x87 can then be fully replaced by SSE2.

Quote:
Originally posted by Puff on 2012-8-30 02:03

The MMX unit is the MAL pipe, i.e. vector integer arithmetic, logical ops + bitwise ops... Cutting MMX is the same as cutting SSE integer.
Not so; the FMA unit handles all the SSEx work.

http://support.amd.com/us/Proces ... 5h_sw_opt_guide.pdf
Page 37
Quote:
• Two 128-bit FMAC units. Each FMAC supports four single precision or two double-precision ops.
• FADDs and FMULs are implemented within the FMAC’s.
• x87 FADDs and FMULs are also handled by the FMAC.
• Each FMAC contains a variable latency divide/square root machine.
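Taken at face value, the figures quoted above pin down a module's peak FP throughput, using the usual convention that one fused multiply-add counts as two FLOPs:

```python
# Peak FP throughput of one Bulldozer-family module, from the SoG figures above.
fmac_units = 2     # two 128-bit FMACs per module
sp_lanes = 4       # four single-precision ops per 128-bit FMAC
dp_lanes = 2       # two double-precision ops per 128-bit FMAC
flops_per_fma = 2  # one fused multiply-add = two FLOPs by convention

sp_flops_per_cycle = fmac_units * sp_lanes * flops_per_fma  # 16
dp_flops_per_cycle = fmac_units * dp_lanes * flops_per_fma  # 8
```

So a module peaks at 16 single-precision or 8 double-precision FLOPs per cycle, shared between its two integer cores.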

Quote:
Originally posted by Puff on 2012-8-30 02:13
Oh come on, if you're going to cite the SoG, go look at the instruction latency table. Granted it's not completely accurate, but the vast majority of the P*Q/SW/B integer arithmetic instructions map to MAL0 and MAL1.
The description in the section you quoted even says "In addition ..."
Reading it now; you're probably right.

But since Intel reuses its 128-bit units for 256-bit work, AMD could likewise be reusing the FMAC for ALU work.

Quote:
Originally posted by Puff on 2012-8-30 02:20

Whether Intel reuses the integer vector unit for 256-bit work I don't know; I only know it reuses the integer datapath for operand MSB delivery.
And all the FP units are 256 bits wide.

CPUs are different from GPUs; because CPUs rely on subword paralle ...
For now, 256-bit width isn't really needed.

Back to Steamroller: if P2 and P3 do end up merged into one, it would only be because they saw software utilization was too low (the last time they made a cut like this was VLIW-5 > VLIW-4). Once the cut resources are actually needed, the penalty shouldn't be small; then again, with such a deeply pipelined FPU, a single branch mispredict is fatal enough already.

Quote:
Originally posted by Puff on 2012-8-30 02:20

Whether Intel reuses the integer vector unit for 256-bit work I don't know; I only know it reuses the integer datapath for operand MSB delivery.
And all the FP units are 256 bits wide.

CPUs are different from GPUs; because CPUs rely on subword paralle ...
Just going by the latency table, keeping P3 beats keeping P2.

Quote:
Originally posted by Puff on 2012-8-30 02:29

Honestly I don't think it's low; if anything, I'd complain it has too few integer pipes.
If x264 at IPC=2.0 with instruction latency=2 still counts as low utilization, that really would be bananas.
If utilization really were high, nobody would be proposing to cut the MMX pipeline.
Or perhaps they keep P0-P3, but with P3 reduced to just FMISC/FSTO and P2 keeping MMX.

Quote:
Originally posted by Puff on 2012-8-30 02:29

Diagrams are there to fool people; just like that diagram never told you it has a crossbar unit.

...
By that logic, P0 would also have to be labeled IMAC,
and P2 would have to be labeled FSTO/FMISC.

Quote:
Originally posted by Puff on 2012-8-30 02:35
Was it cut? Well, there is no certain answer until the pipe mapping is out.
"shares some hardware" could be sharing the issue port.
Sharing an issue port only reduces the port count; it wouldn't save much die area.

Quote:
Originally posted by Puff on 2012-8-30 02:38

Reducing the number of issue ports reduces scheduler complexity, and on top of that the register and forwarding-network complexity drops.
True, but the impact isn't as big as you'd imagine; sharing a port still means adding transistors back.

It'd be better to rework the FPU scheduler so it can accept FPU instructions from both integer cores at the same time.

Just realized: Steamroller is going to take a completely different path from NetBurst.
