[Hardware] Contemporary CPU Architectures Compared

qcmadness

Posted on 2007-6-16 15:13

Contemporary CPU Architectures Compared

Introduction
This post summarizes the architectural highlights of current and upcoming Intel and AMD CPUs. The following CPU architectures are compared:
1. AMD K8 / Hammer (released in 2003) - Hammer
2. Intel Core Architecture (released in 2006) - Core Arch.
3. AMD K8L (?) / K10 (?), the next generation architecture (to be released in 2007) - NGA
4. Intel Core Architecture update, the Penryn / Wolfdale family (to be released in 2007 / 2008) - Penryn

Last updated: 26th January, 2008

Special thanks to Pippero and Clue69Less for corrections.

The architectural highlights:
1. Processor manufacturing technology:
Hammer: 130nm / 90nm / 65nm SOI, 9 metal layers
Core Arch.: 65nm, 45nm in 2007 H2, 8 metal layers
NGA: 65nm SOI, 45nm SOI in mid-2008, 11 metal layers
Penryn: 45nm with high-K design in 2007 H2, unknown number of metal layers

2. Cache system
Hammer:
L1 cache: 64KB data + 64KB instruction, 2-way, latency: 3 cycles
L2 cache: 512KB, 16-way, 128-bit (32GB/s at 2GHz), latency: 12 cycles (90nm version)
L3 cache: absent
Core Arch.:
L1 cache: 32KB data + 32KB instruction, 8-way, latency: 3 cycles
L2 cache: 2-4MB shared for 2 cores, 16-way, 256-bit (64GB/s at 2GHz), latency: 12-14 cycles
L3 cache: absent
NGA:
L1 cache: 64KB data + 64KB instruction, 2-way, latency: 3 cycles
L2 cache: 512KB, 16-way, 256-bit (64GB/s at 2GHz), latency: unknown
L3 cache: 2MB shared, 32-way, unknown width and latency
Penryn:
L1 cache: 32KB data + 32KB instruction, 8-way, latency: 3 cycles (expected to be the same as Core Arch.)
L2 cache: 3-6MB shared for 2 cores, 24-way (?), 256-bit (96GB/s at 3GHz), latency: unknown
L3 cache: absent
Special feature: "Split Load Cache Enhancement"
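
The L2 bandwidth figures quoted in parentheses above follow from the bus width and the core clock: bytes moved per cycle multiplied by cycles per second. Below is a minimal Python sketch of that arithmetic, assuming one full-width transfer per core cycle. The function name `l2_bandwidth_gbs` and the `L2_CONFIGS` table are made up for illustration; the widths and clocks are the ones listed in this post.

```python
def l2_bandwidth_gbs(bus_width_bits: int, clock_ghz: float) -> float:
    """Peak L2 bandwidth in GB/s, assuming one full-width transfer per cycle."""
    bytes_per_cycle = bus_width_bits / 8
    return bytes_per_cycle * clock_ghz


# (bus width in bits, clock in GHz) as quoted in the cache list above
L2_CONFIGS = {
    "Hammer": (128, 2.0),
    "Core Arch.": (256, 2.0),
    "NGA": (256, 2.0),
    "Penryn": (256, 3.0),
}

for name, (width, clock) in L2_CONFIGS.items():
    print(f"{name}: {l2_bandwidth_gbs(width, clock):.0f} GB/s")
```

These are theoretical peaks only; sustained bandwidth also depends on the latencies listed above, which this sketch ignores.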

3. x86 decoding ability
Hammer:
x86 decoders: 3 complex
Out-of-order execution buffer: 72 general instructions, 36 FP instructions and 24 Integer instructions
Core Arch.:
x86 decoders: 3 simple + 1 complex (the complex decoder can decode 2 simple codes in a pass)
Out-of-order execution buffer: 96 instructions
NGA:
x86 decoders: 3 complex
Out-of-order execution buffer: 72 general instructions, 36 FP instructions and 24 Integer instructions
Penryn:
x86 decoders: 3 simple + 1 complex (the complex decoder can decode 2 simple codes in a pass)
Out-of-order execution buffer: 96 instructions
(expected to be the same as Core Arch.)

4. ALU, FPU and SSE units
Hammer:
ALU units: 3
SSE units: 2 units, 64-bit
SSE versions supported: SSE, SSE2 (all Hammer versions), SSE3 (for Rev. E and later)
Core Arch.:
ALU units: 3
SSE units: 3 units, 128-bit
SSE versions supported: SSE, SSE2, SSE3, SSSE3 (Supplemental SSE3, a separate extension that precedes SSE4)
NGA:
ALU units: 3
SSE units: 2 units, 128-bit
SSE versions supported: SSE, SSE2, SSE3, SSE4a (a small AMD-specific extension, not a subset of Intel's SSE4)
Penryn:
ALU units: 3
SSE units: 3 units, 128-bit
SSE versions supported: SSE, SSE2, SSE3, SSSE3, SSE4.1
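
To see what the unit counts and widths above mean for packed SSE code, the rough Python sketch below computes two numbers per architecture: how many passes a unit needs for one full 128-bit operation, and an upper bound on 32-bit single-precision lanes handled per cycle. It assumes every unit can accept any operation every cycle, which real hardware does not do (units are specialized), so treat the lane counts as a ceiling, not a measurement. The function and table names are hypothetical.

```python
import math

SSE_REGISTER_BITS = 128  # width of one XMM register


def passes_per_128bit_op(unit_width_bits: int) -> int:
    """Passes a unit needs to process one full 128-bit SSE operation."""
    return math.ceil(SSE_REGISTER_BITS / unit_width_bits)


def peak_sp_lanes_per_cycle(units: int, unit_width_bits: int) -> int:
    """Upper bound on 32-bit lanes per cycle (assumes all units always busy)."""
    return units * unit_width_bits // 32


# (number of SSE units, unit width in bits) from the list above
SSE_CONFIGS = {
    "Hammer": (2, 64),
    "Core Arch.": (3, 128),
    "NGA": (2, 128),
    "Penryn": (3, 128),
}

for name, (units, width) in SSE_CONFIGS.items():
    print(name, passes_per_128bit_op(width), peak_sp_lanes_per_cycle(units, width))
```

The main takeaway is the pass count: a 64-bit unit has to handle a 128-bit operation in two halves, while a 128-bit unit handles it in one.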

5. Pre-fetch and other tune-ups
Hammer:
Out-of-order loads: absent
Stack manager: absent
Pre-fetchers: 1 data, 1 instruction (to L2 cache)
Instruction fetch width: 16 bytes per cycle
Core Arch.:
Out-of-order loads: present
Stack manager: present
Pre-fetchers: 2 data, 1 instruction (to core), 2 pre-fetchers (to L2 cache)
Instruction fetch width: 24 bytes per cycle
NGA:
Out-of-order loads: present
Stack manager: present
Pre-fetchers: 1 data, 1 instruction (to L1 cache), 1 DRAM pre-fetcher (to dedicated buffer)
Instruction fetch width: 32 bytes per cycle
Penryn:
Out-of-order loads: present, with a shuffle engine to optimize for SSEx
Stack manager: present
Pre-fetchers: 2 data, 1 instruction (to core), 2 pre-fetchers (to L2 cache)
Instruction fetch width: 24 bytes per cycle

6. Memory controller
Hammer: 1x128-bit memory controller (1 operation per cycle)
Core Arch.: absent
NGA: 2x64-bit memory controllers with NUMA (max 2 operations per cycle), can be switched back to 1x128-bit mode
Penryn: absent
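
The difference between the two on-die controller layouts above is concurrency, not total width. The Python sketch below models both: the combined width is 128 bits either way, but the 2x64-bit layout can service two independent operations at once. The `MemoryController` class is hypothetical, and the DDR2-800 transfer rate (800 MT/s) used for the peak bandwidth figure is an assumption that does not come from this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryController:
    channels: int            # number of independent controllers
    channel_width_bits: int  # data width of each controller

    @property
    def total_width_bits(self) -> int:
        return self.channels * self.channel_width_bits

    @property
    def max_concurrent_ops(self) -> int:
        # One operation per independent controller, as stated in the list above
        return self.channels

    def peak_bandwidth_gbs(self, mega_transfers_per_s: int) -> float:
        """Theoretical peak in GB/s for a given transfer rate in MT/s."""
        return self.total_width_bits / 8 * mega_transfers_per_s / 1000


hammer = MemoryController(channels=1, channel_width_bits=128)
nga = MemoryController(channels=2, channel_width_bits=64)

DDR2_800_MTS = 800  # assumed memory speed, not taken from the post

for name, mc in (("Hammer", hammer), ("NGA", nga)):
    print(name, mc.total_width_bits, mc.max_concurrent_ops,
          mc.peak_bandwidth_gbs(DDR2_800_MTS))
```

Both layouts give the same theoretical peak under this assumption; the split layout only helps when two independent requests can be overlapped.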

7. Power management
Hammer: Cool'n'Quiet (min. x5 multiplier)
Core Arch.: EIST (min. x6 multiplier), switches off transistors when not in use
NGA: improved C'n'Q, two separate power planes for crossbar and cores, separate clocks for each core
Penryn: EIST (?), switches off transistors when not in use, C6 state, separate clocks for each core (the core frequency may exceed the rated frequency)
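
The minimum multipliers above translate into a lowest idle clock once a base clock is chosen: core clock equals multiplier times base clock. The Python sketch below shows the arithmetic. The base clocks are assumptions that are not stated in this post: 200 MHz for Hammer's reference clock and 266 MHz for a Core Arch. part on a 1066 MT/s front-side bus. The function name `min_core_clock_mhz` is made up for illustration.

```python
def min_core_clock_mhz(min_multiplier: int, base_clock_mhz: float) -> float:
    """Lowest core clock reachable by dropping to the minimum multiplier."""
    return min_multiplier * base_clock_mhz


# Assumed base clocks (not from the post): 200 MHz reference clock for Hammer,
# 266 MHz for a Core Arch. CPU on a 1066 MT/s (quad-pumped) front-side bus.
hammer_idle = min_core_clock_mhz(5, 200)
core_idle = min_core_clock_mhz(6, 266)

print(f"Hammer idle clock: {hammer_idle:.0f} MHz")
print(f"Core Arch. idle clock: {core_idle:.0f} MHz")
```

A CPU on a different bus speed would idle at a different clock with the same multiplier, which is why the base clock is a parameter here.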

References:
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2748
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2939
http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=2955
http://www.dailytech.com/article.aspx?newsid=7277
http://www.amd.com/us-en/assets/ ... tech_docs/25112.PDF
http://www.amd.com/us-en/assets/ ... tech_docs/40546.pdf
http://www.intel.com/design/processor/manuals/248966.pdf
