[Industry News] AMD Announces Its First ARM Based Server SoC

qcmadness

Administrator

Rank: 10

吹水部屋 (Chit-Chat Room)

#16 Posted on 2014-1-30 21:15

Quote:

Originally posted by Puff on 2014-1-30 21:10

"High-level definition of core microarchitecture". There is another "ambidextrous" interconnect project supporting both x86/ARM SOC/chips. Probably ring based. probably.

Which means either ...

Beefing up Jaguar is an option.

It is already on par with K8 / Piledriver IPC-wise.

http://bbs.hk-spot.com

Puff

水王 (Chatter King)

Rank: 4

#17 Posted on 2014-1-30 21:18

Quote:

Originally posted by qcmadness on 2014-1-30 21:15

Beefing up Jaguar is an option. It is already on par with K8 / Piledriver IPC-wise.

Literally means the same as convergence of two cores, I guess.

Anyhow, AMD already demonstrated its commitment to driving its high-performance cores towards the Cat cores' automated design methodology at HC24.

qcmadness

#18 Posted on 2014-1-30 21:19

Quote:

Originally posted by Puff on 2014-1-30 21:18

Literally means the same as convergence of two cores.

Anyhow, AMD already demonstrated its commitment to driving its high-performance cores towards the Cat cores' automated design methodology at HC24.

Not convergence.

Improving on Jaguar using the experience from K10.5 and Bulldozer.

Puff

#19 Posted on 2014-1-30 21:23

Quote:

Originally posted by qcmadness on 2014-1-30 21:19

Not convergence.

Improving on Jaguar using the experience from K10.5 and Bulldozer.

You know what I mean. A far-stronger Jaguar capable of 3-3.5GHz clocks would be nice. Give it more execution resources and larger windows, stick it with a ring interconnect and an overhauled cache hierarchy, and you will get an OHHH-FINALLY-COMPETITIVE Opteron MP chip. Private L2, please!

qcmadness

#20 Posted on 2014-1-30 21:30

Quote:

Originally posted by Puff on 2014-1-30 21:23

You know what I mean. A far-stronger Jaguar capable of 3-3.5GHz clock would be nice. Give it more execution resources and larger windows, stick it with a ring interconnect and overhauled cache hierar ...

For example:

1. 3 AGUs with L/S capability
2. One more ALU
3. 6-8 3GHz+ cores

[Image: Jaguar]

[Image: AMD Hammer (K8)]

Puff

#21 Posted on 2014-1-30 21:40

A strong L/S system is preferred over more ALUs.
Say, a load queue with 64+ entries and a store queue with 32+ entries.
A super fast L2 would be great, particularly a <15 clk 1MB L2.

My dream core. Imagine a 4-way decode & 32B front-end. Perhaps 2-way SMT?

P.S. AMD may adopt coarse-grained directory coherence (falling back to snooping via a null directory).
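
A rough way to sanity-check queue sizes like the ones above is Little's law: average entries in flight = arrival rate x average time each entry stays in the queue. Below is a minimal Python sketch of that arithmetic. The function name and the sample rates and latencies are made-up illustration values, not Jaguar or any other AMD figures.

```python
def entries_in_flight(ops_per_cycle: float, avg_cycles_resident: float) -> float:
    """Little's law: average occupancy = arrival rate * average residency."""
    return ops_per_cycle * avg_cycles_resident


# Hypothetical workload: 1 load and 0.5 stores enter their queues per cycle,
# and each entry stays for 64 cycles on average (e.g. waiting behind misses).
loads_needed = entries_in_flight(1.0, 64)
stores_needed = entries_in_flight(0.5, 64)
print(f"load queue ~{loads_needed:.0f} entries, store queue ~{stores_needed:.0f} entries")
```

With those assumed inputs the sketch lands on 64 and 32 entries. It is only meant to show how sustained rate and residency drive the sizing, not to predict a real design.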

[ Last edited by Puff on 2014-1-30 21:54 ]

qcmadness

#22 Posted on 2014-1-30 21:42

Quote:

Originally posted by Puff on 2014-1-30 21:40
A strong L/S system is preferred over more ALUs.
Say, a load queue with 64+ entries and a store queue with 32+ entries.
A super fast L2 would be great, particularly a <15 clk 1MB L2

Too big for each core

Puff

#23 Posted on 2014-1-30 21:43

Quote:

Originally posted by qcmadness on 2014-1-30 21:42

Too big for each core

I guess it would be fine for automated designs... The ALUs won't occupy too much space, but the L/S unit will. Server workloads rely on the performance of the latter, though.

P.S. Broadcom Vulcan Core

[ Last edited by Puff on 2014-1-30 21:44 ]

qcmadness

#24 Posted on 2014-1-30 21:45

Quote:

Originally posted by Puff on 2014-1-30 21:43

I guess it would be fine for automated designs... The ALUs won't occupy too much space, but the L/S unit will. Server workloads rely on the performance of the latter, though.

P.S. Broadcom Vulcan Core

x86 is far more transistor-hungry than ARM

Puff

#25 Posted on 2014-1-30 21:50

Quote:

Originally posted by qcmadness on 2014-1-30 21:45

x86 is far more transistor-hungry than ARM

Well, I doubt the difference is really that large when you look at Intel's implementation. x86 just burns more transistors on decoding/microcode and on a sophisticated load-store unit, the latter due to x86's strict memory ordering model.
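
To illustrate the memory ordering point, here is a toy Python enumeration of the classic message-passing litmus test. It is a simplified model under stated assumptions, not a real x86 or ARM simulator. The "ordered" mode keeps each thread's program order; x86's model does not reorder a store with an older store or a load with an older load, so for this particular test it behaves that way. The "relaxed" mode lets each thread's two accesses swap, roughly what a weakly ordered ISA such as ARM permits when no barriers are used. All names are made up for the example.

```python
from itertools import permutations

# Thread 0 writes data, then sets a flag. Thread 1 reads the flag, then the data.
T0 = [("store", "data", 1), ("store", "flag", 1)]
T1 = [("load", "flag", "r1"), ("load", "data", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's internal order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(trace):
    """Execute one global order of accesses and return (r1, r2)."""
    mem = {"data": 0, "flag": 0}
    regs = {}
    for kind, addr, arg in trace:
        if kind == "store":
            mem[addr] = arg
        else:
            regs[arg] = mem[addr]
    return regs["r1"], regs["r2"]


def outcomes(allow_reorder):
    """Collect all (r1, r2) results; optionally let each thread reorder its accesses."""
    orders0 = list(permutations(T0)) if allow_reorder else [tuple(T0)]
    orders1 = list(permutations(T1)) if allow_reorder else [tuple(T1)]
    results = set()
    for o0 in orders0:
        for o1 in orders1:
            for trace in interleavings(list(o0), list(o1)):
                results.add(run(trace))
    return results


print("ordered:", sorted(outcomes(False)))
print("relaxed:", sorted(outcomes(True)))
```

In the ordered mode the result (r1, r2) = (1, 0), i.e. seeing the flag but stale data, never shows up; in the relaxed mode it does. Guaranteeing the former in hardware while still executing loads out of order is part of what makes the x86 load-store unit more elaborate.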

[ Last edited by Puff on 2014-1-30 21:52 ]

qcmadness

#26 Posted on 2014-1-30 22:50

Quote:

Originally posted by Puff on 2014-1-30 21:50

well I doubt it would be really a lot when you look at Intel's implementation. It just burns more transistors on decoding/microcode and a sophisticated load-store unit due to x86's strict memory orde ...

Remember Intel has the highest-density cache in the industry.
And Intel has control over the fabrication / manufacturing.

Puff

#27 Posted on 2014-1-30 23:11

Quote:

Originally posted by qcmadness on 2014-1-30 22:50

Remember Intel has the highest-density cache in the industry.
And Intel has control over the fabrication / manufacturing.

No matter how it goes, overprovisioning is always needed to chase diminishing IPC improvements. The 3.1mm^2 Jaguar has plenty of room to grow IMO if one takes it as the base design to work on, particularly when we are talking about perhaps FinFET-based designs beyond Excavator.

[ Last edited by Puff on 2014-1-30 23:26 ]

qcmadness

#28 Posted on 2014-1-30 23:30

Quote:

Originally posted by Puff on 2014-1-30 23:11

no matter how it goes, overprovision is always needed for diminishing IPC improvements. The 3.1mm2 Jaguar has a plenty of room to grow IMO, particularly when we are talking about perhaps FinFET based ...

Even at 10mm^2, it is still small compared with Steamroller and Haswell

Puff

#29 Posted on 2014-1-30 23:44

Quote:

Originally posted by qcmadness on 2014-1-30 23:30

Even at 10mm^2, it is still small compared with Steamroller and Haswell

plenty of options to fill that up
- less dense for higher frequency (single turbo up to 3+ GHz would be nice)
- 3 ALU + 3 AGU as you suggested
- Pipelined multiplier really helps... also a better divider
- 2 LD + 1 ST port for DC
- larger load-store unit... (Jaguar: 12-entry unified queue + 20-entry store queue)
- 4-way decode, dispatch & retire
- post-decode COP queue...? uop cache?
- more scheduler entries (Jaguar: 20/12/18) & larger instruction window (Jaguar: 64/44)
- more register file entries
- 256b VFP datapath...?
- 2-way SMT?
- Private L2 cache

[ Last edited by Puff on 2014-1-30 23:47 ]

qcmadness

#30 Posted on 2014-1-30 23:46

Quote:

Originally posted by Puff on 2014-1-30 23:44

plenty of options to fill that up
- less dense for higher frequency (single turbo up to 3+ GHz would be nice)
- 3 ALU + 3 AGU as you suggested
- Pipelined Multiplier really helps... also better divis ...

SMT is not that useful in client processors.
And I would expect AMD to refocus on client processors rather than server processors.
