
[Industry News] AMD Announces Its First ARM-Based Server SoC

Ambiguous statement as usual. Not necessarily the Cat cores themselves, I would say, but the Cat design methodology (automation-driven) plus higher per-core performance, in a smaller size and with nice perf/watt. If you look at what x86 is positioned for once ARM is added into the mix, it would be APUs with low core counts and scalable MP CPUs.

[ Last edited by Puff on 2014-1-30 17:26 ]

TOP

Quote:
Originally posted by qcmadness on 2014-1-30 17:25

Intel is doing the same
Thus the major differences are the choice of heterogeneous solution and what is being used to address scaling at the bottom of the spectrum. AMD may have a mid-term competitive advantage with ARM, as ARM is currently dominant in the low-power handheld space, but in the long run Intel has a lot of capital to burn.

TOP

Quote:
Originally posted by qcmadness on 2014-1-30 17:32

But Intel's manufacturing edge is less now
Whatever; it depends on whether Intel can break into that market with x86, or drastically turn the big ship towards ARM.



But the main idea is that AMD would likely converge their x86 lines, as they no longer need two cores to scale x86 spectacularly*. The converged core, if any, should be positioned like K10+++ or Haswell.


*: In the way they projected in 2006/07. The Cats were intended to scale down below 1 watt, IIRC. But now that is the job of the ARM solutions. (late edit)

[ Last edited by Puff on 2014-1-30 17:49 ]

TOP

Quote:
Originally posted by qcmadness on 2014-1-30 17:50

No...

It seems AMD will concentrate with ARM (
I don't see a contradiction. Basically I meant what you mean here.


Perhaps also with a lower frequency ceiling compared with custom designs like Haswell or BD.

[ Last edited by Puff on 2014-1-30 17:58 ]

TOP

Quote:
Originally posted by qcmadness on 2014-1-30 17:58

5W/core in x86 space is very low indeed.
If you mean SoC TDP divided by core count, then yes. Still, 5 watts of absolute power per core in a notebook-class SoC is significant.
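
To make the two readings of "watts per core" concrete, here is a minimal sketch. Every number in it is hypothetical, and the split between core power and non-core power (GPU, uncore, I/O) is an assumption for illustration only, not a measured figure for any AMD or Intel part.

```python
def tdp_per_core(soc_tdp_w: float, cores: int) -> float:
    """Naive figure: whole-SoC TDP divided by core count."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return soc_tdp_w / cores


def core_only_power(soc_tdp_w: float, non_core_w: float, cores: int) -> float:
    """Per-core power after subtracting an assumed non-core budget."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    if not 0 <= non_core_w <= soc_tdp_w:
        raise ValueError("non-core power must lie between 0 and the SoC TDP")
    return (soc_tdp_w - non_core_w) / cores


# Hypothetical 20 W quad-core SoC in which GPU + uncore + I/O take 12 W.
naive = tdp_per_core(20.0, 4)               # 5.0 W per core
absolute = core_only_power(20.0, 12.0, 4)   # 2.0 W per core
print(naive, absolute)
```

Under these made-up numbers the "TDP divided by cores" figure is 5 W while each core actually has 2 W to spend, which is why the two readings should not be mixed up.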

TOP

The argument is that it is pointless to build two cores that aim for similar power, performance and scaling but use different ISAs. Based on the current bits of information, it is clear that AMD targets just dense server, embedded and handheld with ARMv8... So it completely overlaps with the Cat initiatives (think back to 2007-10). For now that is not a huge problem at all: AMD is just an ARM licensee, and it probably treats the IP acquisition cost as the cost of early market entry, so that it can release its first wave of ARMv8 products well before the current two-core roadmap (BD/Cat) ends and the custom core, if any, is ready.

However, when it comes to the stage of building a custom core, it becomes a problem. AMD has limited resources, doesn't it? Maintaining three microarchitectures, two of which overlap with each other, is unlikely to happen. You may argue that they wouldn't build a custom ARM core, though.

[ Last edited by Puff on 2014-1-30 18:33 ]

TOP

Quote:
Originally posted by qcmadness on 2014-1-30 20:06

AMD may not go with custom ARM designs
I got one clear message from an AMDer and three signs from LinkedIn that may point to a custom ARM microarchitecture in the pipeline. Another clear message is Bulldozer's irreversible EOL.



[ Last edited by Puff on 2014-1-30 21:02 ]

TOP

Quote:
Originally posted by qcmadness on 2014-1-30 21:05

What kind of "custom" is taking place? CPU-CPU interconnect can be.
"High-level definition of core microarchitecture". There is another "ambidextrous" interconnect project supporting both x86 and ARM SoCs/chips. Probably ring-based. Probably.
Quote:
Bulldozer's EOL is known already.
Which means one of three things: AMD gives up the top half of the PC and server space completely; AMD has yet another high-performance core to succeed it and leaves the Cats in their place; or AMD converges to a single core beyond BD's and Cat's five-year lifespan.

[ Last edited by Puff on 2014-1-30 21:16 ]

TOP

Quote:
Originally posted by qcmadness on 2014-1-30 21:15

Beefing up Jaguar is an option. It is already on par with K8 / Piledriver IPC-wise.
That literally means the same as convergence of the two cores, I guess.

Anyhow, AMD already demonstrated at HC24 its commitment to driving the high-performance core towards the Cat automated design methodology.

TOP

Quote:
Originally posted by qcmadness on 2014-1-30 21:19

Not convergence

Improve from Jaguar using K10.5 and Bulldozer experiences
You know what I mean. A far stronger Jaguar capable of 3-3.5GHz clocks would be nice. Give it more execution resources and larger windows, pair it with a ring interconnect and an overhauled cache hierarchy, and you will get an OHHH-FINALLY-COMPETITIVE Opteron MP chip. Private L2, please!

TOP

A strong L/S system is preferred over more ALUs.
Say, a Load Queue with 64+ entries and a Store Queue with 32+ entries.
A super fast L2 would be great, particularly a <15 clk 1MB L2.

My dream core. Imagine a 4-way decode & 32B front-end. Perhaps 2-way SMT?





P.S. AMD may adopt coarse-grained directory coherence (fallback to snoop via a null directory)
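
One possible reading of that P.S., as a rough sketch only: the directory tracks sharers per region instead of per cache line, and when no directory is present (the "null directory" case) every request simply falls back to a broadcast snoop. The class name, the 4 KiB region size and the node numbering are all assumptions made up for this illustration and do not describe any actual AMD design.

```python
from typing import Dict, Optional, Set

REGION_BYTES = 4096  # hypothetical tracking granularity, coarser than a 64B line


class RegionDirectory:
    """Coarse-grained directory: which nodes may cache any line of a region."""

    def __init__(self) -> None:
        self._sharers: Dict[int, Set[int]] = {}

    def note_fill(self, addr: int, node: int) -> None:
        """Record that `node` pulled a line of this region into its cache."""
        self._sharers.setdefault(addr // REGION_BYTES, set()).add(node)

    def sharers(self, addr: int) -> Set[int]:
        """Return a copy of the sharer set for the region containing `addr`."""
        return set(self._sharers.get(addr // REGION_BYTES, set()))


def snoop_targets(directory: Optional[RegionDirectory], addr: int,
                  requester: int, num_nodes: int) -> Set[int]:
    """Nodes to probe before `requester` may take ownership of `addr`."""
    if directory is None:
        # Null directory: nothing is tracked, so broadcast to everyone else.
        return set(range(num_nodes)) - {requester}
    # Directory present: probe only nodes recorded as sharers of the region.
    return directory.sharers(addr) - {requester}


d = RegionDirectory()
d.note_fill(0x1000, 2)                    # node 2 caches a line in region 1
filtered = snoop_targets(d, 0x1040, 0, 4)     # same region -> {2}
broadcast = snoop_targets(None, 0x1040, 0, 4) # no directory -> {1, 2, 3}
print(filtered, broadcast)
```

The trade-off in this sketch is that a region entry can over-report sharers (a node that touched one line is probed for all lines in the region), but the directory stays small and the snoop path is still there when the directory is absent.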

[ Last edited by Puff on 2014-1-30 21:54 ]

TOP

Quote:
Originally posted by qcmadness on 2014-1-30 21:42

Too big for each core
I guess it would be fine for automated designs... The ALUs won't occupy too much space, but the L/S unit will. Server workloads rely on the performance of the latter, though.


P.S. Broadcom Vulcan Core

[ Last edited by Puff on 2014-1-30 21:44 ]

TOP

Quote:
Originally posted by qcmadness on 2014-1-30 21:45

x86 is far more transistor-hungry than ARM
Well, I doubt the difference is really that large when you look at Intel's implementation. x86 just burns more transistors on decoding/microcode and on a sophisticated load-store unit, due to its strict memory ordering model.
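
The memory ordering point can be sketched as a small table. This is the usual textbook simplification, not a full statement of either architecture: it only covers independent accesses to different addresses, and it ignores barriers, dependencies and same-address rules. Under that simplification x86-TSO lets only a later load pass an earlier store, while ARMv8's weaker model permits all four pairs to be observed out of order. The names below are made up for this illustration.

```python
from typing import List, Tuple

# (earlier, later) pairs in program order that may be observed out of order.
# Simplified: independent accesses to different addresses, no barriers.
RELAXED_PAIRS = {
    "x86-TSO": {("store", "load")},
    "ARMv8": {
        ("load", "load"), ("load", "store"),
        ("store", "load"), ("store", "store"),
    },
}

ALL_PAIRS = {(a, b) for a in ("load", "store") for b in ("load", "store")}


def may_reorder(model: str, earlier: str, later: str) -> bool:
    """True if the model lets `later` become visible before `earlier`."""
    return (earlier, later) in RELAXED_PAIRS[model]


def orderings_to_enforce(model: str) -> List[Tuple[str, str]]:
    """Pairs the hardware must keep looking in-order to software."""
    return sorted(ALL_PAIRS - RELAXED_PAIRS[model])


print(orderings_to_enforce("x86-TSO"))
print(orderings_to_enforce("ARMv8"))
```

An out-of-order x86 core that still wants to issue loads early has to do so speculatively and verify afterwards that the three enforced orderings were not visibly violated, and that checking logic is where the extra load-store unit complexity comes from.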



[ Last edited by Puff on 2014-1-30 21:52 ]

TOP

Quote:
Originally posted by qcmadness on 2014-1-30 22:50

Remember Intel has the highest-density cache in the industry.
And Intel has control over the fabrication / manufacturing.
No matter how it goes, overprovisioning is always needed for diminishing IPC improvements. The 3.1mm^2 Jaguar has plenty of room to grow IMO, particularly when we are talking about possibly FinFET-based designs beyond Excavator, if one takes it as the base design to work on.

[ Last edited by Puff on 2014-1-30 23:26 ]

TOP

Quote:
Originally posted by qcmadness on 2014-1-30 23:30

Even at 10mm^2, it is still small compared with Steamroller and Haswell
Plenty of options to fill that up:
- less dense for higher frequency (single turbo up to 3+ GHz would be nice)
- 3 ALU + 3 AGU as you suggested
- Pipelined multiplier really helps... also a better divider
- 2 LD + 1 ST port for DC
- larger load-store unit... (Jaguar: 12-entry unified queue + 20-entry store queue)
- 4-way decode, dispatch & retire
- post-decode COP queue...? uop cache?
- more scheduler entries (Jaguar: 20/12/18) & larger instruction window (Jaguar: 64/44)
- more register file entries
- 256b VFP datapath...?
- 2-way SMT?
- Private L2 cache



[ Last edited by Puff on 2014-1-30 23:47 ]

TOP