
[Industry News] AMD Announces Its First ARM-Based Server SoC

Quote:
Originally posted by Puff on 2014-1-30 21:10

"High-level definition of core microarchitecture". There is another "ambidextrous" interconnect project supporting both x86 and ARM SoCs/chips. Probably ring-based. Probably.



Which means either ...
Beefing up Jaguar is an option.

It is already on par with K8/Piledriver IPC-wise.


Quote:
Originally posted by qcmadness on 2014-1-30 21:15

Beefing up Jaguar is an option. It is already on par with K8/Piledriver IPC-wise.
Literally means the same as convergence of two cores, I guess.

Anyhow, at HC24 (Hot Chips 24) AMD already demonstrated their commitment to moving high-performance core design toward the Cat cores' automated design methodology.


Quote:
Originally posted by Puff on 2014-1-30 21:18

Literally means the same as convergence of two cores.

Anyhow, at HC24 (Hot Chips 24) AMD already demonstrated their commitment to moving high-performance core design toward the Cat cores' automated design methodology.
Not convergence.

Improve on Jaguar using the experience from K10.5 and Bulldozer.


Quote:
Originally posted by qcmadness on 2014-1-30 21:19

Not convergence.

Improve on Jaguar using the experience from K10.5 and Bulldozer.
You know what I mean. A far stronger Jaguar capable of 3-3.5 GHz clocks would be nice. Give it more execution resources and larger windows, pair it with a ring interconnect and an overhauled cache hierarchy, and you will get an OHHH-FINALLY-COMPETITIVE Opteron MP chip. Private L2, please!
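For a rough sense of what a ring means at 6-8 cores, the sketch below computes the average hop count between ring stops under uniform traffic. It assumes one ring stop per core and nothing else on the ring (real rings usually also carry stops for cache slices, memory controllers and I/O); the function `ring_avg_hops` is a hypothetical helper, not any vendor's model.

```python
def ring_avg_hops(stops: int, bidirectional: bool = True) -> float:
    """Average hop count from one stop to every other stop on a ring,
    assuming uniform traffic and one stop per core (hypothetical model)."""
    if stops < 2:
        raise ValueError("a ring needs at least two stops")
    total = 0
    for distance in range(1, stops):
        if bidirectional:
            # a bidirectional ring sends each message the shorter way round
            total += min(distance, stops - distance)
        else:
            # a unidirectional ring always travels in the same direction
            total += distance
    return total / (stops - 1)


for cores in (6, 8):
    print(f"{cores} stops: {ring_avg_hops(cores):.2f} hops (bidirectional), "
          f"{ring_avg_hops(cores, bidirectional=False):.2f} hops (unidirectional)")
```

Under these assumptions 6 stops average 1.80 hops bidirectionally versus 3.00 one-way, and the average grows roughly linearly with the number of stops (approaching N/4 for a bidirectional ring).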


Quote:
Originally posted by Puff on 2014-1-30 21:23

You know what I mean. A far stronger Jaguar capable of 3-3.5 GHz clocks would be nice. Give it more execution resources and larger windows, pair it with a ring interconnect and an overhauled cache hierar ...
For example:

1. 3 AGUs with load/store capability
2. One more ALU
3. 6-8 cores at 3 GHz+


[Images not preserved: Jaguar; AMD Hammer (K8)]


A strong L/S system is preferable to more ALUs: say, a load queue with 64+ entries and a store queue with 32+ entries.
A super-fast L2 would be great, particularly a 1 MB L2 with <15-clock latency.
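To see why L2 latency matters, the standard average-memory-access-time (AMAT) formula for a two-level hierarchy can be evaluated at two L2 latencies. All parameters below (3-clock L1 hit, 5% L1 miss rate, 20% L2 local miss rate, 200-clock memory) are hypothetical placeholders, not Jaguar measurements.

```python
def amat(l1_hit: float, l1_miss_rate: float, l2_hit: float,
         l2_miss_rate: float, mem_latency: float) -> float:
    """AMAT in clocks: L1 hit time + L1 miss rate *
    (L2 hit time + L2 local miss rate * memory latency)."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_latency)


# Hypothetical parameters, for illustration only.
common = dict(l1_hit=3, l1_miss_rate=0.05, l2_miss_rate=0.20, mem_latency=200)
for l2_latency in (14, 25):
    print(f"L2 at {l2_latency} clk -> AMAT {amat(l2_hit=l2_latency, **common):.2f} clk")
```

With these made-up numbers the faster L2 cuts AMAT from 6.25 to 5.70 clocks; the gap widens as the L1 miss rate rises, which is one reason large-footprint workloads are sensitive to L2 latency.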

My dream core: imagine 4-way decode and a 32-byte front end. Perhaps 2-way SMT?





P.S. AMD may adopt coarse-grained directory coherence (falling back to snooping via a null directory).
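The post does not spell out the mechanism, so the following is only one plausible reading: a directory that tracks, per memory region rather than per line, which core clusters may hold data, and treats any region it is not tracking as a "null directory" entry that falls back to a broadcast snoop. Region size, cluster count, capacity and the `CoarseDirectory` class are all hypothetical, and a real design would also need eviction with back-invalidation, which this sketch omits.

```python
from dataclasses import dataclass, field


@dataclass
class CoarseDirectory:
    """Hypothetical coarse-grained directory: presence bits per core *cluster*,
    tracked per memory *region* instead of per cache line. Regions without an
    entry act as a 'null directory' and fall back to broadcast snooping."""
    num_clusters: int
    region_bytes: int = 4096
    capacity: int = 2  # max tracked regions (tiny, to show the fallback)
    entries: dict = field(default_factory=dict)  # region -> set of cluster ids

    def _region(self, addr: int) -> int:
        return addr // self.region_bytes

    def record_fill(self, addr: int, cluster: int) -> None:
        """Note that `cluster` cached a line from the region holding `addr`."""
        region = self._region(addr)
        if region in self.entries:
            self.entries[region].add(cluster)
        elif len(self.entries) < self.capacity:
            self.entries[region] = {cluster}
        # else: directory full, region stays untracked. This sketch never
        # evicts, so a region untracked at its first fill stays untracked
        # and is always broadcast-snooped, which keeps it conservative.

    def snoop_targets(self, addr: int, requester: int) -> set:
        """Clusters that must be probed for a request from `requester`."""
        region = self._region(addr)
        if region in self.entries:
            # directed snoop: only clusters that may hold lines of this region
            return self.entries[region] - {requester}
        # null directory: no information, so probe every other cluster
        return set(range(self.num_clusters)) - {requester}


d = CoarseDirectory(num_clusters=4)
d.record_fill(0x1000, cluster=1)
d.record_fill(0x1040, cluster=2)              # same 4 KiB region as 0x1000
print(d.snoop_targets(0x1080, requester=0))   # directed: clusters 1 and 2
print(d.snoop_targets(0x9000, requester=0))   # untracked: broadcast to 1, 2, 3
```

Coarse granularity trades precision for storage: the directory may probe a cluster that only holds other lines of the same region, but in this sketch it never skips a real sharer.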

[Last edited by Puff on 2014-1-30 21:54]


Quote:
Originally posted by Puff on 2014-1-30 21:40
A strong L/S system is preferable to more ALUs: say, a load queue with 64+ entries and a store queue with 32+ entries.
A super-fast L2 would be great, particularly ...
Too big for each core.


Quote:
Originally posted by qcmadness on 2014-1-30 21:42

Too big for each core.
I guess it would be fine for automated designs... ALUs won't occupy much space, but the L/S unit will. Server workloads rely on the performance of the latter, though.
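A rough illustration of why the L/S unit, rather than the ALUs, eats area as it grows: every load has to check the store queue for older stores to the same bytes (store-to-load forwarding). Hardware performs this as a parallel associative (CAM) search, so comparator count scales with store-queue entries times load ports. The sketch models only the search logic; `StoreEntry`, `forward` and the policy of stalling on a partial overlap are simplifying assumptions, not any particular core's behaviour.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    seq: int   # program-order sequence number (smaller = older)
    addr: int  # byte address
    size: int  # bytes written
    data: int  # value, little-endian in memory


def forward(store_queue, load_seq, load_addr, load_size):
    """Return ("forward", value), ("cache", None) or ("stall", None) for a load."""
    youngest = None
    # Hardware compares against every entry at once (CAM); software loops.
    for st in store_queue:
        older = st.seq < load_seq
        overlaps = st.addr < load_addr + load_size and load_addr < st.addr + st.size
        if older and overlaps and (youngest is None or st.seq > youngest.seq):
            youngest = st
    if youngest is None:
        return ("cache", None)  # no older overlapping store: read the L1D
    covers = (youngest.addr <= load_addr
              and load_addr + load_size <= youngest.addr + youngest.size)
    if covers:
        shift = 8 * (load_addr - youngest.addr)  # byte offset within the store
        mask = (1 << (8 * load_size)) - 1
        return ("forward", (youngest.data >> shift) & mask)
    return ("stall", None)  # partial overlap: wait until the store commits


sq = [StoreEntry(seq=1, addr=0x100, size=8, data=0x1122334455667788),
      StoreEntry(seq=3, addr=0x104, size=4, data=0xAABBCCDD)]
print(forward(sq, load_seq=2, load_addr=0x104, load_size=4))  # forwards from seq 1
print(forward(sq, load_seq=4, load_addr=0x104, load_size=4))  # forwards from seq 3
print(forward(sq, load_seq=4, load_addr=0x102, load_size=4))  # partial overlap: stall
```

Under this model a 32-entry store queue needs 32 address comparators per load port, plus age-priority logic to pick the youngest match, which is where much of the area goes.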


P.S. Broadcom Vulcan Core

[Last edited by Puff on 2014-1-30 21:44]


Quote:
Originally posted by Puff on 2014-1-30 21:43

I guess it would be fine for automated designs... ALUs won't occupy much space, but the L/S unit will. Server workloads rely on the performance of the latter, though.


P.S. Broadcom Vulcan Core
x86 is far more transistor-hungry than ARM.


Quote:
Originally posted by qcmadness on 2014-1-30 21:45

x86 is far more transistor-hungry than ARM.
Well, I doubt the difference would really be that large when you look at Intel's implementation. It just burns more transistors on decoding/microcode and on a sophisticated load-store unit, due to x86's strict memory-ordering model.
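One concrete case of the ordering guarantee in question is the message-passing litmus test: one thread writes `data` then `flag`, the other reads `flag` then `data`. The toy enumerator below compares a model that keeps each thread's accesses in program order (which is what x86-TSO guarantees for this particular test, since TSO only lets a store be delayed past a later load) with a relaxed model that may reorder independent accesses, loosely in the spirit of ARM without barriers. It is a simplified sketch, not a faithful model of either architecture, and all names are hypothetical.

```python
from itertools import permutations

# Message-passing (MP) litmus test; ops are (kind, location, value_or_register).
THREAD0 = (("st", "data", 1), ("st", "flag", 1))
THREAD1 = (("ld", "flag", "r1"), ("ld", "data", "r2"))


def thread_orders(ops, keep_program_order):
    """Orders in which one thread's accesses may take effect under the model.
    Every op here touches a different location, so the relaxed model allows
    any permutation; the ordered model allows only program order."""
    return [ops] if keep_program_order else list(permutations(ops))


def interleavings(a, b):
    """All merges of tuples a and b that preserve each tuple's own order."""
    if not a or not b:
        yield a + b
        return
    for rest in interleavings(a[1:], b):
        yield (a[0],) + rest
    for rest in interleavings(a, b[1:]):
        yield (b[0],) + rest


def outcomes(keep_program_order):
    """Set of (r1, r2) results over every permitted execution."""
    results = set()
    for order0 in thread_orders(THREAD0, keep_program_order):
        for order1 in thread_orders(THREAD1, keep_program_order):
            for trace in interleavings(order0, order1):
                memory, regs = {"data": 0, "flag": 0}, {}
                for kind, loc, arg in trace:
                    if kind == "st":
                        memory[loc] = arg
                    else:
                        regs[arg] = memory[loc]
                results.add((regs["r1"], regs["r2"]))
    return results


print("program order (TSO-like for MP):", sorted(outcomes(True)))
print("relaxed reordering:             ", sorted(outcomes(False)))
```

The outcome `(1, 0)` (flag seen, stale data) appears only under the relaxed model. An out-of-order x86 core still executes loads early, so to keep the guarantee its load queue typically has to watch for invalidations of lines that already-executed loads have read and replay those loads, which is part of the extra load-store hardware.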



[Last edited by Puff on 2014-1-30 21:52]


Quote:
Originally posted by Puff on 2014-1-30 21:50

Well, I doubt the difference would really be that large when you look at Intel's implementation. It just burns more transistors on decoding/microcode and on a sophisticated load-store unit, due to x86's strict memory orde ...
Remember that Intel has the highest-density cache in the industry.
And Intel controls its own fabrication/manufacturing.


Quote:
Originally posted by qcmadness on 2014-1-30 22:50

Remember that Intel has the highest-density cache in the industry.
And Intel controls its own fabrication/manufacturing.
No matter how it goes, overprovisioning is always needed to squeeze out ever-diminishing IPC improvements. The 3.1 mm² Jaguar has plenty of room to grow IMO, particularly when we are talking about perhaps FinFET-based designs beyond Excavator, if one takes it as the base design to work on.

[Last edited by Puff on 2014-1-30 23:26]


Quote:
Originally posted by Puff on 2014-1-30 23:11

No matter how it goes, overprovisioning is always needed to squeeze out ever-diminishing IPC improvements. The 3.1 mm² Jaguar has plenty of room to grow IMO, particularly when we are talking about perhaps FinFET-based ...
Even at 10 mm², it is still small compared with Steamroller and Haswell.
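One common way to put numbers on these diminishing returns is Pollack's rule, an empirical rule of thumb (not something AMD has stated for these cores) that single-thread performance grows roughly with the square root of core area. Applied naively to growing a 3.1 mm² core to 10 mm², and ignoring process-node differences, it predicts less than a 2x gain:

```python
import math


def pollack_speedup(old_area_mm2: float, new_area_mm2: float) -> float:
    """Pollack's rule of thumb: performance scales ~ sqrt(core area),
    assuming the same process node and design style."""
    if old_area_mm2 <= 0 or new_area_mm2 <= 0:
        raise ValueError("areas must be positive")
    return math.sqrt(new_area_mm2 / old_area_mm2)


area_ratio = 10.0 / 3.1
print(f"area x{area_ratio:.2f} -> estimated performance x{pollack_speedup(3.1, 10.0):.2f}")
```

Roughly 3.2x the area for roughly 1.8x the performance is the kind of overprovisioning trade-off being debated here; the rule is only a trend line and says nothing about where the extra area should go.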


Quote:
Originally posted by qcmadness on 2014-1-30 23:30

Even at 10 mm², it is still small compared with Steamroller and Haswell.
Plenty of options to fill that up:
- a less dense layout for higher frequency (single-core turbo up to 3+ GHz would be nice)
- 3 ALU + 3 AGU as you suggested
- a pipelined multiplier really helps... also a better divider
- 2 LD + 1 ST ports for the data cache (DC)
- a larger load-store unit... (Jaguar: 12-entry unified queue + 20-entry store queue)
- 4-way decode, dispatch & retire
- a post-decode COP queue...? uop cache?
- more scheduler entries (Jaguar: 20/12/18) & a larger instruction window (Jaguar: 64/44)
- more register file entries
- a 256-bit VFP datapath...?
- 2-way SMT?
- private L2 cache
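A first-order way to reason about the window-size items in the list above is Little's law: to sustain a target IPC while the oldest instruction waits out some latency, roughly IPC x latency instructions must be in flight. The target IPCs and latencies below are hypothetical, treating 64 as Jaguar's window is an assumption based on the first figure quoted above, and the estimate ignores dependencies and memory-level parallelism, so it is a rough estimate rather than a design rule.

```python
import math


def window_entries_needed(target_ipc: float, latency_clk: float) -> int:
    """Little's law: in-flight instructions = throughput (IPC) x latency (clocks)."""
    return math.ceil(target_ipc * latency_clk)


JAGUAR_WINDOW = 64  # assumption: first of the "64/44" figures quoted for Jaguar

# Hypothetical (IPC, latency-to-hide) pairs, for illustration only.
for ipc, latency in [(2, 14), (3, 14), (2.5, 25), (3, 40)]:
    need = window_entries_needed(ipc, latency)
    verdict = "fits" if need <= JAGUAR_WINDOW else "exceeds"
    print(f"IPC {ipc} over {latency} clk -> {need} entries ({verdict} a 64-entry window)")
```

Covering only short L2-class latencies fits in a 64-entry window, while hiding longer latencies at higher IPC quickly calls for a much larger one.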



[Last edited by Puff on 2014-1-30 23:47]


Quote:
Originally posted by Puff on 2014-1-30 23:44

Plenty of options to fill that up:
- a less dense layout for higher frequency (single-core turbo up to 3+ GHz would be nice)
- 3 ALU + 3 AGU as you suggested
- a pipelined multiplier really helps... also a better divis ...
SMT is not that useful in client processors.
And I would expect AMD to refocus on client processors rather than server processors.
