G4WP [ETC]

G4 Architecture White Paper


Part number:	G4WP
Manufacturer:	ETC
Description:	G4 Architecture White Paper
File:	6 pages (file size: 102K)

Freescale Semiconductor, Inc.

Semiconductor Products Sector

PowerPC™ G4 Architecture White Paper

Delivering Performance Enhancement in 60x Bus Mode

Susan Seale

You know the scenario: you’ve just released the greatest whiz-bang product to the networking marketplace, with fantastic features, excellent performance, and the right price. But right away, you have to watch out for competitors approaching from all sides. To maintain your leadership position in the market, your mission (should you choose to accept it) is to upgrade your product’s performance, and of course lower its cost, with minimal hardware and software redesign. Where do you begin?

If your system is PowerPC-based, using the MPC750 (G3) in particular, there are a variety of options to consider. Some devices offer new features. This makes the marketers happy. Most new offerings deliver higher core frequency. Now the software developers are happy. And many PowerPC upgrades are drop-in replacements because they have the same footprint as the device you’re using today. Even the hardware team can celebrate. Naturally, the right choice depends on how your software pushes the processor to its limits in your current implementation.

At this point in the analysis, most embedded developers admit to one common bottleneck in the processor subsystem: I/O bandwidth. No matter how high you crank up the processor speed, how big the on-chip caches are, or how fast the core can execute an instruction, your system’s performance is limited by how much data the processor can move in and out (with significant manipulation in between).

Performance Enhancement

This paper highlights ways that the PowerPC MPC74xx (G4) series can improve the I/O bandwidth of your G3 system with minimal engineering effort and can help you overcome the barrier to best-in-class system performance.

May we introduce the MPC7400/7410 and the MPC7440/7450 devices.

Even without discussing the benefits of the AltiVec™ processing unit available in G4 processors (let’s leave that exercise for an analysis of SIMD-intensive applications) or the enhancements offered by G4’s MPX bus mode option, there are many reasons for choosing a G4-series processor for your system. For now, let’s consider only those benefits which apply to PowerPC systems using the conventional PowerPC instruction set and the standard 60x bus mode.

Benefit 1. Higher Sustainable System Bus Bandwidth

‘Peak bandwidth,’ the maximum number of bytes that can be transferred in a single cycle, is a purely theoretical number. By contrast, ‘maximum bandwidth,’ the maximum number of bytes that can be transferred over several transactions, provides a value which takes into account the memory system latency and the limitations associated with the bus protocol, in this case the 60x bus. For example, the 60x bus requires one dead cycle between address tenures and one dead cycle between data tenures. In a real system, I/O bandwidth is further limited by particular device implementation constraints. (Refer to Benefit 2 below for more detail on one of the architectural constraints of the MPC750: the inability to pipeline cache loads.) ‘Sustainable bandwidth’ means the maximum number of bytes that can be transferred over an extended number of cycles, taking into account all of the constraints mentioned above.

For More Information On This Product,

Go to: www.freescale.com


When performing a sequence of cacheable data loads over a 100MHz bus, both the MPC750 and the MPC74xx variants have a peak bandwidth of 800Mbytes per second. With the constraints of the 60x bus protocol and the same memory system latency, both have a maximum bandwidth of 640Mbytes per second. However, in terms of sustained bandwidth, which best represents actual system performance, the MPC74xx devices outperform the MPC750 by nearly 3:1.

Comparison of MPC750 and MPC74xx Bus Bandwidth (Mbytes/sec.) at 100MHz

Device     Peak   Maximum   Sustained
MPC750     800    640       246
MPC74xx    800    640       640

Values assume a memory read latency of 10 bus cycles, counted from the cycle when address is driven and TS is asserted:

1. Processor bus to system logic
2. System logic to memory interface
3. SDRAM Activate command (assert RAS)
4. Wait for memory (activate to Read/Write = 2 cycles)
5. Read command (assert CAS)
6. Wait for memory (SDRAM Read Latency = 3 cycles)
7. Wait for memory (continued)
8. First beat on memory bus
9. Data latched into system logic (not necessarily required)
10. First beat on processor bus

Peak bandwidth (MPC750 and MPC74xx) = 8 Bytes/cycle x 100MHz = 800 MB/sec.

Maximum bandwidth (MPC750 and MPC74xx) = [(1 cache line)/5 bus cycles] x 100MHz = 32 Bytes x 100MHz / 5 cyc = 640 MB/sec.

Sustained bandwidth (MPC750) = [(1 cache line)/13 bus cycles] x 100MHz = 32 Bytes x 100MHz / 13 cyc = 246 MB/sec.

Sustained bandwidth (MPC74xx) = maximum bandwidth (MPC74xx). By pipelining transactions on the address bus, the MPC74xx does not incur any additional penalty beyond the limitations of the 60x bus protocol.

Benefit 2. More Back-to-Back Transactions on the Bus

Instructions

In the G3 architecture, once an I-cache miss occurs, no further I-cache misses are issued to the L2 or the system bus until the cache line fill updates both the L1 and L2 caches. Thanks to an additional entry in the instruction reload table, the MPC7400/MPC7410 architecture allows a second instruction fetch to start after the first fetch has updated the L1, but before it has updated the L2. Going a step further in improving instruction fetch performance, the MPC7440/MPC7450 can support up to two outstanding instruction fetches, compared to just one for the MPC750 and the MPC7400/MPC7410.

Data

As a result of the G3’s D-cache design, once a D-cache miss occurs, no further D-cache misses (triggered by program loads and stores) are propagated to the L2 or the system bus until the original missed data is returned. This means that back-to-back cacheable data reads are not pipelined on the bus. Even though the bus interface unit may be ready for more transactions, and the 60x bus protocol can accept another pipelined address phase, the blocking cache adds latency to a sequence of read accesses. In order to prevent one miss from blocking the cache for subsequent accesses, the MPC7400/MPC7410 D-cache supports ‘miss-under-miss.’ If a miss is pending, subsequent loads that miss in the D-cache will propagate to the bus, rather than stalling. In fact, the load/store unit of the MPC7400/MPC7410 can continue to issue requests until up to six misses are pending. The MPC7440/MPC7450 can support up to 16 outstanding data tenures on the bus, five of which may be data load misses. (The others may be stores, castouts, snoop pushes, or instruction fetches.)

Better pipelining of instruction fetches and support for multiple outstanding data transactions add up to better bus utilization and higher sustainable bandwidth than the MPC750 can provide.

Benefit 3. L1 Cache Access Improvements

Load Miss Folding

In the MPC750, if there are two load misses to the same cache block, the second load must wait until the entire block is returned before it can access its data. Subsequent accesses to the cache are also stalled. When two load misses to the same cache block occur in the MPC74xx, the stall does not occur. Instead, as data beats return for the first miss, results can be provided for the next miss as well. Furthermore, up to four subsequent misses to the same cache block can be ‘folded’ into a Load Fold Queue, allowing full access to the D-cache for the following instructions while the reload is in progress. Non-blocked access to the cache, combined with pipelining of back-to-back data reads on the bus, can improve the performance of a PowerPC system limited by bus bandwidth.
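The contrast between the MPC750's blocking D-cache and the miss-under-miss and load-miss-folding behavior described for the MPC7400/MPC7410 can be illustrated with a toy model. The sketch below is not cycle-accurate and is not taken from the paper: the class `MissTracker`, its fields, and the decision to treat the Load Fold Queue as one shared pool of four entries are assumptions made for illustration. Only the limits themselves (one pending miss for the blocking case, six pending misses and four folded misses for the non-blocking case) come from the text. The model also never retires a miss, so it only shows how a short burst of loads is classified.

```python
from dataclasses import dataclass, field

LINE_BYTES = 32  # cache line size used in the paper's bandwidth formulas


@dataclass
class MissTracker:
    """Toy classifier for D-cache load misses (hypothetical, not cycle-accurate)."""
    max_pending: int   # distinct cache blocks that may be outstanding at once
    max_folds: int     # same-block misses that may be folded (assumed shared pool)
    pending: set = field(default_factory=set)  # block numbers being reloaded
    folded: int = 0    # folded misses currently waiting on a reload

    def load_miss(self, addr: int) -> str:
        """Return 'issue', 'fold', or 'stall' for a load that missed."""
        block = addr // LINE_BYTES
        if block in self.pending:
            # Another miss to a block that is already being reloaded.
            if self.folded < self.max_folds:
                self.folded += 1
                return "fold"
            return "stall"
        if len(self.pending) < self.max_pending:
            self.pending.add(block)
            return "issue"  # the miss propagates toward the L2 / system bus
        return "stall"


def classify(tracker: MissTracker, addrs: list[int]) -> list[str]:
    """Classify a burst of missing loads in program order."""
    return [tracker.load_miss(a) for a in addrs]


# Two misses to the same block, then one miss to a different block.
addrs = [0x00, 0x08, 0x40]
blocking = classify(MissTracker(max_pending=1, max_folds=0), addrs)
non_blocking = classify(MissTracker(max_pending=6, max_folds=4), addrs)
print(blocking)      # ['issue', 'stall', 'stall']
print(non_blocking)  # ['issue', 'fold', 'issue']
```

In the blocking configuration every load after the first miss waits, which is why back-to-back reads are not pipelined on the bus. In the non-blocking configuration the second load is folded into the reload already in flight and the third load starts its own bus transaction.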

Store Miss Merging

If the MPC750 has two store misses to the same cache block, the second store must wait until the entire cache block is loaded before it can write its data. By contrast, the MPC74xx merges several stores to the same cache block. If enough stores merge to write all 32 bytes of the cache line, then no data needs to be loaded from the bus, and an address-only transaction is broadcast instead.

Allocate on Reload

The MPC750 has a cache line replacement policy of ‘allocate on miss.’ When a miss occurs, the MPC750 immediately identifies a victim block to be cast out. If a subsequent transaction needs to access this victim block, the block will already have been marked invalid and the transaction must reload the recently cast-out data from the bus. This thrashing generates unnecessary traffic on the bus. The MPC74xx, however, does not identify the victim block until after the requested block fill occurs. This cache line replacement policy of ‘allocate on reload’ applies to both the L1 and L2 caches. If a subsequent transaction to another block in the same set occurs during the reload, the access hits (because no block in the set has been identified as the victim block yet), and no additional bus access is necessary. When the goal is maximum I/O bandwidth, keeping accesses off the bus is just as important as reducing the latency of transactions on the bus.

Benefit 4. Larger Backside Cache with Better Throughput and Improved Reliability

The MPC750 has access to only 1MB of backside L2 cache, while the MPC7400/MPC7410 supports up to 2MB of backside L2 cache (optionally configurable as direct-mapped memory space; see Benefit 5). The MPC7450 supports 256kB of on-chip L2 as well as up to 2MB of backside L3. These additional cache resources maximize the hit rate and minimize the use of the long-latency system bus.

For superior cache performance and reliability, the MPC7450 adds DDR SRAM support and address parity on the L3 bus. The MPC750 interfaces only to synchronous burst SRAMs or late-write SRAMs on the L2 bus and does not support L2 address parity.

Benefit 5. Private Storage to Off-Load Traffic from System Bus

One enhancement introduced in the MPC755 and featured in some G4 implementations is the option to use a portion (or all) of the backside cache space as private memory storage. The MPC750 does not support this feature. When the private memory storage feature is enabled in the L2 of an MPC7410 system or the L3 of an MPC7450 system, the external cache memory can be partitioned, such that some of the memory operates normally as cache while some of the memory functions as a direct-mapped address space. The direct-mapped memory space is often used for storage of critical sections of code (such as interrupt routines) or for a data set requiring repeated manipulation. In either case, accesses to this range of addresses do not consume valuable bandwidth on the system bus.

Benefit 6. System Bus Improvements

While the MPC750 supports a maximum of 100MHz on the system bus, the MPC74xx supports up to 133MHz. Using the same assumptions described in Benefit 1, we can derive the bus bandwidth for the MPC74xx processors with a 133MHz bus and add this data to the comparison:

Comparison of Bus Bandwidths (Mbytes/sec.)

Device     Bus Frequency   Peak   Maximum   Sustained
MPC750     100MHz          800    640       246
MPC74xx    100MHz          800    640       640
MPC74xx    133MHz          1064   851       851

Note that an upgrade from the MPC750 at 100MHz to an MPC74xx at 133MHz can produce a sustained system bus bandwidth improvement of more than 3x.

Another system bus improvement added to the MPC7440/MPC7450 is support for a larger address space via a new 36-bit extended addressing mode, in addition to support for the 32-bit addressing mode of the MPC750 and MPC7400/MPC7410.
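The figures in the two bandwidth comparison tables follow directly from the formulas given under Benefit 1, so they can be reproduced in a few lines. The sketch below uses only numbers stated in the paper (8 bytes per bus cycle, a 32-byte cache line, 5 bus cycles per line for pipelined reads, 13 bus cycles per line for the MPC750). The function and constant names are hypothetical, and truncating to whole Mbytes/sec. is an assumption chosen to match the printed tables.

```python
BYTES_PER_CYCLE = 8            # peak transfer per bus cycle, from the paper
LINE_BYTES = 32                # one cache line
PIPELINED_CYCLES_PER_LINE = 5  # paper's figure for the maximum-bandwidth case
BLOCKING_CYCLES_PER_LINE = 13  # paper's figure for MPC750 sustained reads


def bandwidth_mb_per_s(bytes_moved: int, cycles: int, bus_mhz: int) -> float:
    """Bytes per bus cycle multiplied by the bus frequency in MHz gives MB/sec."""
    return bytes_moved * bus_mhz / cycles


def table_row(bus_mhz: int, sustained_cycles: int) -> tuple[int, int, int]:
    """Return (peak, maximum, sustained) in MB/sec., truncated to integers."""
    peak = bandwidth_mb_per_s(BYTES_PER_CYCLE, 1, bus_mhz)
    maximum = bandwidth_mb_per_s(LINE_BYTES, PIPELINED_CYCLES_PER_LINE, bus_mhz)
    sustained = bandwidth_mb_per_s(LINE_BYTES, sustained_cycles, bus_mhz)
    return int(peak), int(maximum), int(sustained)


rows = {
    "MPC750 @ 100MHz": table_row(100, BLOCKING_CYCLES_PER_LINE),
    # The paper states that MPC74xx sustained bandwidth equals its maximum.
    "MPC74xx @ 100MHz": table_row(100, PIPELINED_CYCLES_PER_LINE),
    "MPC74xx @ 133MHz": table_row(133, PIPELINED_CYCLES_PER_LINE),
}

for name, (peak, maximum, sustained) in rows.items():
    print(f"{name:18} {peak:5} {maximum:8} {sustained:10}")

baseline = rows["MPC750 @ 100MHz"][2]
ratio_100 = rows["MPC74xx @ 100MHz"][2] / baseline
ratio_133 = rows["MPC74xx @ 133MHz"][2] / baseline
print(f"sustained gain: {ratio_100:.2f}x at 100MHz, {ratio_133:.2f}x at 133MHz")
```

The two ratios come out at roughly 2.6 and 3.5, which matches the paper's wording of "nearly 3:1" at the same bus frequency and "more than 3x" after moving to a 133MHz bus.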


the Shared capability significantly improves performance in a symmetric multi-processing system.

Benefit 7. Dual-Ported L1 Data Cache
