PowerPC System Architecture
The computation of a single element of the result involves two loads, an FMA, and a store. Performance is therefore limited by the speed at which data can be transferred between the memory hierarchy and the register file, and it is therefore important to use quadword loads and stores. Note the critical use of the cross-loads and cross-stores on the y vector, and the use of redundant computation to compensate for the relative misalignment between x and y. More elaborate versions of this technique are used when computing the dot product of two vectors.

Given the substantial penalty for unaligned quadword accesses, the overriding issue in gaining efficiency for daxpy is that of being able to maintain quadword access to and from memory irrespective of the alignment of the x and y vectors, assuming that these are stored contiguously in memory. When both vectors are quadword-aligned, it is trivial to perform quadword accesses throughout.
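The cross-load scheme itself is not reproduced here; the sketch below is only a rough illustration of the alignment concern, assuming a simpler approach in which scalar iterations are peeled off until y reaches a quadword boundary and the remaining work is exposed as independent pairs the compiler can map onto paired loads, FMAs, and stores. The function and its unrolling are ours, not the paper's kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative daxpy: y[i] += a * x[i].
 * Peel scalar iterations until y is 16-byte (quadword) aligned, then
 * process two elements per iteration so the aligned stretch can use
 * paired (quadword) loads, FMAs, and stores. */
void daxpy_sketch(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;

    /* Scalar prologue: advance until y sits on a quadword boundary. */
    while (i < n && ((uintptr_t)&y[i] & 15) != 0) {
        y[i] += a * x[i];
        i++;
    }

    /* Main loop: two independent FMAs per iteration. */
    for (; i + 1 < n; i += 2) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
    }

    /* Epilogue for a possible leftover element. */
    for (; i < n; i++)
        y[i] += a * x[i];
}
```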
4. Square root: branchless programming

The two-way SIMD nature of the PowerPC FP2 core and its pipeline depth of five cycles mean that several independent operations must be in flight to keep the unit busy. Consider scheduling a collection of ten independent square-root (sqrt) operations on this unit. The compiler schedules instructions the same way for a user-written function declared inline as it does for a built-in routine.
It can therefore be worth expressing code which computes several candidate results and then selects amongst them, rather than choosing ahead of time which to evaluate. This approach is used in the vectorized versions of several routines, including reciprocal and square root.
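A minimal sketch of the compute-both-then-select idea, in plain C. The guarded reciprocal square root shown here is an invented example, not code from the text; the point is that the selection can be lowered by the compiler to a floating-point select rather than a branch, keeping many independent evaluations in flight in the deep pipeline.

```c
#include <math.h>

/* Compute both candidate results unconditionally, then select.
 * With no data-dependent branch in the body, ten (or more) independent
 * evaluations can be overlapped in the FP pipeline. */
double guarded_rsqrt(double x)
{
    double candidate = 1.0 / sqrt(x); /* meaningful when x > 0    */
    double fallback  = 0.0;           /* chosen when x is not > 0 */
    return (x > 0.0) ? candidate : fallback;
}
```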
5. Performance results

The prototype consists of compute nodes, each with its own memory; we limit ourselves to describing the results obtained on a single node's double floating-point unit. Unless otherwise stated, all codes are in C. Several versions of each kernel are measured; their nomenclature and descriptions are as follows: version 0 is source code that contains no architecture-specific optimizations.

We present performance results for DAXPY, DGEMV, matrix-matrix product, and vectorized square root in Figure 4. The data reveal several trends. First, Figure 4(a) demonstrates that the relative performance of small daxpy routines can be virtually insensitive to data alignment, and Figure 4(b) shows that this trend continues when the vector sizes become large. Second, the compiler gives relatively good performance when given data alignment and pointer-alias information. Third, fixed overheads dominate the small problem-size running times of the advanced versions of the codes. Finally, Figure 4(d) indicates that the architecture is capable of sustaining very close to peak performance when utilizing suitably blocked, BLAS-like algorithms. Clearly, the performance results we have obtained on these computational kernels have been important for achieving high performance on parallel applications that use these kernels.
6. Conclusions and future work

At a very high level, high-performance engineering and scientific software can be viewed as an interaction among algorithms, compilers, and hardware, and this paper has examined that interaction for the double floating-point unit. The compiler for the unit extends both the SLP algorithm for parallelism detection and the Briggs register allocator to handle register pairs. Initial results show that we are able to sustain a large fraction of the peak performance for key floating-point kernels. On the hardware front, the paired approach worked very well. On the compiler front, the approach described in Section 3 involves innovative techniques to balance memory accesses against computation. On the algorithm design front, there are many more operations for which we plan to design efficient algorithms.
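Purely as an illustration (not an example from the paper), the kind of straight-line code an SLP pass can pack into the unit's register pairs looks like the loop below: two isomorphic statements on adjacent doubles that can be fused into one two-wide load, multiply, and store.

```c
#include <stddef.h>

/* Each iteration touches an adjacent pair of doubles; a superword-level
 * parallelism (SLP) pass can pack the two statements into one paired
 * (2-way SIMD) operation on a register pair.
 * (An odd trailing element is ignored here for brevity.) */
void scale_pairs(size_t n, double s, const double *in, double *out)
{
    for (size_t i = 0; i + 1 < n; i += 2) {
        out[i]     = s * in[i];
        out[i + 1] = s * in[i + 1];
    }
}
```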
PowerPC is a RISC microprocessor architecture for personal computers; RISC designs try to keep the processor as busy as possible. Design features of PowerPC are as follows:
- Broad range of implementations
- Simple processor design
- Superscalar architecture
- Multiprocessor features
- 64-bit architecture
- Support for operation in both big-endian and little-endian modes
The three branch target addressing modes are absolute, relative, and indirect. Relative branches also use literals, but instead of using the literal directly as a target address, they treat it as a displacement relative to the address of the branch instruction itself. Indirect branches take their target address from either the link or count register.

Computational instruction operand addressing. Arithmetic and logical instructions operate on two source operands and return the result to a target register; one source operand is either the contents of a register or a literal. This three-operand format permits a programmer to preserve the source operands without first copying them to another register. Shift instructions have a third source operand to specify the shift amount.

The architecture specifies four ways of addressing memory operands: indirect (Figure 6a) and indirect-indexed (Figure 6b), both with a base-address register update option. An indirect access adds a displacement to a base address held in a GPR to form the effective logical address; after the processor computes this effective address, the address translation facilities map it onto the physical address space. The update option is useful for operations such as progressive indexing of arrays in loops. The logical address space defined for each program is a linear array of bytes as large as a pointer contained in a GPR can index; for 32-bit machines, and 64-bit machines operating in 32-bit mode, such a pointer can address 4 gigabytes of memory. The large linear space allows for simple pointer arithmetic using the normal fixed-point arithmetic and logical computational instructions.

Figure 6. Memory operand addressing: indirect (a) and indirect-indexed (b).
The architecture assumes the big-endian mode as its natural addressing model, meaning that an address points to the byte at the most significant end of multibyte scalar data elements. Some machines have little endian as their natural addressing model, meaning that an address points to the little, or least-significant, byte of multibyte scalars. In general, programs written for machines of one endian type cannot read data created by programs on machines of the other type. This is an enormous source of incompatibility between systems. Even worse, one cannot simply recompile programs written assuming one endian type for machines of the opposite persuasion and expect them to work. We did not wish to handicap PowerPC processors by limiting them to operation only in big-endian environments.
Therefore, we included a mode switch to allow PowerPC processors to run using either a big- or little-endian address model. One simple way for processors to implement little-endian capability is to exclusive-OR (XOR) a few low-order bits of the physical memory address with values that depend upon the size of the data type being accessed: the bottom three bits of the address are XORed with 7 for byte accesses, with 6 for halfword accesses, and with 4 for word accesses, while doubleword accesses are left unchanged.
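A sketch of that address modification in C; the constants follow the scheme just described (assuming a doubleword-wide big-endian memory behind the processor), and the function name is ours.

```c
#include <stdint.h>

/* Little-endian mode address munging within an aligned doubleword:
 * XOR the low-order address bits with a constant chosen by access size. */
uint64_t le_mode_address(uint64_t addr, unsigned size_in_bytes)
{
    switch (size_in_bytes) {
    case 1:  return addr ^ 7;   /* byte                  */
    case 2:  return addr ^ 6;   /* halfword              */
    case 4:  return addr ^ 4;   /* word                  */
    default: return addr;       /* doubleword: unchanged */
    }
}
```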
Figure 7 illustrates the effect of this transformation. It shows how a C data structure (similar in spirit to the sketch below) would be created and laid out in a little-endian machine, a big-endian machine, and a PowerPC running in little-endian mode hooked up to a big-endian memory. To illustrate the problem, note that an access to doubleword b as a whole returns the same data on all three systems. Accessing byte 8 in doubleword b, however, would return the value 28 in the little-endian system and 21 in the big-endian system. Now, if the structure were created on a PowerPC in the little-endian mode, the structure layout would be as shown in Figure 7c.
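The exact structure from Figure 7 is not reproduced here; the hypothetical structure below mixes scalar sizes in the same spirit and shows why a byte-granularity access into a doubleword gives different answers under the two layouts (the value of b is illustrative).

```c
#include <stdint.h>

/* Hypothetical structure in the spirit of Figure 7: mixed scalar sizes.
 * With a 4-byte 'a' at offset 0 and 8-byte alignment for 'b', the first
 * stored byte of 'b' is byte 8 of the structure. */
struct mixed {
    uint32_t a;     /* word                                 */
    uint64_t b;     /* doubleword, e.g. 0x2122232425262728  */
    char     d[7];  /* bytes                                */
    uint16_t e;     /* halfword                             */
};

/* Reads byte 8 of the structure: 0x28 under a little-endian layout,
 * 0x21 under a big-endian layout, matching the example in the text. */
unsigned char byte8(const struct mixed *m)
{
    return ((const unsigned char *)&m->b)[0];
}
```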
Figure 7 shows that the address transformation will similarly correct halfword and word accesses into the structure as long as the accesses are aligned on their natural size boundaries. Since this XOR trick works only on aligned accesses, the architecture does not support misaligned access when operating in little-endian mode, even though it does in native big-endian mode. An alternative implementation based on subtraction is slower, however, and the computation required across the doubleword boundary is even more complex; but the architecture does allow this implementation if necessary. With this, we can simply recompile, as opposed to port, programs created for little-endian machines and run them on PowerPC processors. This feature allows construction of either cross-endian or bi-endian systems.

Figure 7. Transformation effect: little-endian layout (a); big-endian layout (b); and PowerPC in little-endian mode (c).
The architecture specifies the little-endian mode switch as part of the process context.

The architecture defines a number of instructions that perform arithmetic and logical operations across all 64 bits, and the result always updates the full 64 bits of the target register.

Table 1 shows the major fixed-point instructions: add (including shifted and extended forms), subtract, multiply, divide, and rotate-and-mask variants. The Adr column indicates the operand addressing mode used for the second source operand: Im means the operand is an immediate value specified as a literal in the instruction encoding, and Rg means the operand is taken from a general register. The Status bits column indicates the options for setting status bits: CR0 is the most significant 4-bit field in the condition register, OV is the overflow bit in the XER, CA is the carry bit in the XER, and a blank entry means the status bit is not affected. That status-bit recording is optional is another improvement over the traditional CR-style architectures: in those architectures, the compiler could not freely reschedule instructions without inadvertently destroying branch conditions before they were used. The extended form of add uses the carry bit CA as an implied third source operand in the addition.
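To see why an implied carry operand is useful, consider multi-word arithmetic: a 128-bit add built from two 64-bit adds, where the second add must consume the carry produced by the first. This is the pattern the carrying and extended add forms implement in hardware; the C helper below (our own, for illustration) makes the carry explicit.

```c
#include <stdint.h>

/* 128-bit addition from 64-bit halves: the low-half add produces a carry
 * that the high-half add consumes -- exactly the role of the CA bit. */
void add128(uint64_t a_hi, uint64_t a_lo,
            uint64_t b_hi, uint64_t b_lo,
            uint64_t *r_hi, uint64_t *r_lo)
{
    uint64_t lo    = a_lo + b_lo;   /* may wrap around                */
    uint64_t carry = (lo < a_lo);   /* carry out of the low-half add  */

    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;    /* add with carry in              */
}
```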
Multiply-low instructions multiply the two source operands, returning the least-significant half of the product in a GPR. Multiply-high instructions return the most-significant half of the product; these are available in versions that treat the operands as either signed or unsigned numbers, and divide instructions are provided as well. The rotate-and-mask instructions support operations that include setting and clearing arbitrary bit fields, inserting and extracting bit fields, and justifying bit fields.
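The low/high split corresponds to the two halves of the full double-width product. The helpers below sketch the semantics for 32-bit operands using a 64-bit intermediate; the names are ours, not architecture mnemonics.

```c
#include <stdint.h>

/* Least-significant half of the 32x32-bit product (multiply-low);
 * the same bits result whether the operands are treated as signed
 * or unsigned. */
uint32_t mul_low(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}

/* Most-significant half of the product (multiply-high), in unsigned
 * and signed flavours. */
uint32_t mul_high_unsigned(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

int32_t mul_high_signed(int32_t a, int32_t b)
{
    /* Arithmetic shift of the 64-bit product keeps the sign. */
    return (int32_t)(((int64_t)a * b) >> 32);
}
```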
The processor transfers double-precision data directly from memory into the registers untouched, but expands single-precision data into double-precision format on the way in from memory. Once in a register, the original nature of the value as single- or double-precision is lost, and double-precision arithmetic instructions treat register operands uniformly. Table 2 lists the floating-point instructions.

For Linux system calls, two conventions are in use. For the sc instruction, failure is signalled by cr0.SO: when SO is set, the syscall failed and r3 is the error value that normally corresponds to errno. For the scv 0 instruction, the return value indicates failure if it is in the range -4095 to -1, i.e. a negated errno value. All floating point and vector data registers, as well as control and status registers, are nonvolatile.
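A sketch of how user code might interpret the two conventions; these decode helpers are hypothetical (C libraries perform the equivalent inside their syscall wrappers), and the SO flag is assumed to have been captured from cr0 after the sc instruction.

```c
#include <errno.h>

/* sc: failure is signalled out of band by cr0.SO; r3 then carries the
 * positive error number. */
static long decode_sc(long r3, int so_set)
{
    if (so_set) {
        errno = (int)r3;
        return -1;
    }
    return r3;
}

/* scv 0: failure is signalled in band; a return value in -4095..-1 is a
 * negated errno, as on most other Linux architectures. */
static long decode_scv(long r3)
{
    if (r3 >= -4095 && r3 <= -1) {
        errno = (int)-r3;
        return -1;
    }
    return r3;
}
```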
Syscall behavior can change if the processor is in transactional or suspended transaction state, and the syscall can affect the behavior of the transaction. If the processor is in suspended state when a syscall is made, the syscall will be performed as normal, and will return as normal.