CS25: Computer Architecture, Lab 2

Testing the effect of CPU design optimizations.

by Jeff Kaufman and Jeff Billion

In this lab we tried to see what effect modern CPU design optimizations have on performance. We tested this by writing a C program for each optimization with two components: the first was ideal for the optimization, while the second could not be optimized. Optimizations tested included pipelining, caching, and superscalar execution.

We tested several different processors in this lab. They were:

Processor        L1 Cache Size   L1 Line Size   L1 Cache Lines   L1 Cache Mapping
                 (bytes)         (bytes)
Pentium 4        12K/8K split    32             384/256          4-Way SA
Pentium III      16K/16K split   32             512              2-Way SA
Pentium II       16K/16K split   64             256              2-Way SA
UltraSparc IIi   16K/16K split   32/64          512/256          2-Way SA / direct
G5               64K/32K split   32             2K/1K            direct / 2-Way SA
G4               32K/32K split   32             1K               8-Way SA
Numbers given as A/B for split caches indicate that A refers to the instruction cache and B to the data cache.

Cache Structure

We tested the average access time for a line in cache by comparing the speed of accessing memory locations in order against the speed of accessing one location per cache line. One would expect accessing locations in order to be significantly faster, since each access brings the whole line down into the cache.
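
A sketch of the two access patterns (the sizes and names here are illustrative and the allocation and timing code are omitted; cache.c has the real test):

    #define LINE_SIZE 32                    /* assumed L1 line size in bytes */
    #define N (1 << 20)                     /* length of the test array      */

    /* Baseline I: consecutive locations, so each statement's second operand
       is usually on a line the first operand already brought into cache. */
    void access_in_order(int *a)
    {
        for (int j = 0; j < N - 1; j++)
            a[j] += a[j + 1];
    }

    /* Baseline II: one location per cache line, so every statement touches
       two different lines and forces a new line to be brought down. */
    void access_one_per_line(int *a)
    {
        int stride = LINE_SIZE / sizeof(int);
        for (int j = 0; j + stride < N; j += stride)
            a[j] += a[j + stride];
    }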

CPU Baseline I Baseline II Cache Stress 2-way SA Nice
Pentium 4 2.86509e-09 4.01461e-09 1.87189e-08 4.74211e-09
Pentium III 9.57706e-09 1.24293e-08 6.10914e-08 1.18172e-08
Pentium II 3.91907e-08 4.7684e-08 6.21025e-07 4.59229e-08
UltraSparc IIi 2.74009e-08 4.03156e-08 4.76776e-07 5.49424e-08
G5 7.13457e-09 8.33436e-09 3.37755e-08 6.65204e-09
G4 2.53112e-08 3.26449e-08 2.28909e-07 3.15661e-08
Note: smaller means faster, and these numbers are useful only for relative comparison.
Source program: cache.c
View data as graph

For the P4, PIII and UltraSparc, accessing memory locations in order took roughly 70-75% as long as accessing one location per line. The superior speed likely comes from the way every statement operates on both the jth and (j+1)th entries of an array. When these two locations are consecutive, the first request brings both down into cache. When the locations are on separate lines, both lines must be brought down, requiring two requests from main memory. It is interesting to note that the baseline II test takes less than twice as long as the baseline I test, probably because the baseline I test occasionally has to access two lines in a single statement, when the two sequential locations straddle a line boundary.

The cache stress program accessed lines sequentially in a block of data four times the size of the cache. After the first cache-sized chunk of data had been accessed, the cache had to clobber a resident line for every access. Thus one would expect the cache stress time to be significantly greater than that of the baseline II test. In both tests a new cache line must be accessed for each statement; however, the cache stress test also forces an access to the L2 cache for each statement. Thus the ratio of the two tests should indicate the relative penalty for accessing the L2 cache.
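
The stress loop looked roughly like this (the cache and line sizes are illustrative; the real constants are in cache.c):

    #include <stddef.h>

    #define CACHE_SIZE (16 * 1024)                    /* assumed L1 data cache size */
    #define LINE_SIZE  32                             /* assumed L1 line size       */
    #define STRIDE     (LINE_SIZE / sizeof(int))
    #define N          (4 * CACHE_SIZE / sizeof(int)) /* block is 4x the cache      */

    /* Walk a block four times the size of L1 one line at a time.  Once the
       first cache-sized chunk has been touched, every access evicts a
       resident line and has to go out to L2. */
    void cache_stress(int *a)
    {
        for (size_t j = 0; j + STRIDE < N; j += STRIDE)
            a[j] += a[j + STRIDE];
    }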

CPU Baseline II Cache Stress Cache Stress/Baseline II
Pentium 4 4.01461e-09 1.87189e-08 4.66269
Pentium III 1.24293e-08 6.10914e-08 4.91511
Pentium II 4.7684e-08 6.21025e-07 13.0238
UltraSparc IIi 4.03156e-08 4.76776e-07 11.8261
G5 8.33436e-09 3.37755e-08 4.05256
G4 3.26449e-08 2.28909e-07 7.01209

From the table, it is clear that the presence of the cache substantially decreases memory access penalties. For example, when the Pentium II accesses a line which is not in L1 cache, the access time is thirteen times greater than that of a cache hit. Also, the data confirms the advancements made in L2 caches: the more recent the processor, the less the penalty for a cache miss.

The 2-way set associative test is implemented by allocating a block of memory twice the size of cache. For a block of memory of this size, four lines of the block will map to any given set in the cache, which has space to hold two of those lines at any one time. For a direct mapped cache, two lines of the block will map to any given line in cache, which has space for only one. For a 2-way set associative cache, a program that alternates between the first line of the block and the line exactly one cache-size later will encounter a cache hit on every access, since both lines map to the same set and can be resident at once. If the cache is direct mapped and the program accesses those same two locations, every access will be a cache miss, and the cache will thrash. Thus one would expect any CPU with 2-way or greater set associativity to produce results similar to the baseline II test, and a CPU with direct mapping to produce results like those of the cache stress test.
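
A sketch of the access pattern (the cache size here is an assumption; the real figures come from the table above):

    #define CACHE_SIZE (16 * 1024)                /* assumed L1 data cache size */
    #define INTS_PER_CACHE (CACHE_SIZE / sizeof(int))

    /* Alternate between two locations exactly one cache-size apart.  Both map
       to the same set, so a 2-way set associative cache keeps them both
       resident (every access hits), while a direct mapped cache evicts one to
       load the other on every access (every access misses). */
    void two_way_nice(int *a, int iterations)
    {
        for (int i = 0; i < iterations; i++) {
            a[0] += 1;
            a[INTS_PER_CACHE] += 1;
        }
    }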

CPU Baseline II 2-Way SA Nice 2-Way/Baseline II
Pentium 4 4.01461e-09 4.74211e-09 1.18121
Pentium III 1.24293e-08 1.18172e-08 .950753
Pentium II 4.7684e-08 4.59229e-08 1.03835
UltraSparc IIi 4.03156e-08 5.49424e-08 1.36281
G5 8.33436e-09 6.65204e-09 .798146
G4 3.26449e-08 3.15661e-08 .966953

The surprise from this data is the UltraSparc. Since it is the only processor we tested whose data cache is direct mapped, it should stand out with a substantially larger ratio; we would expect its ratio to be roughly ten times what we measured, a deviation which suggests further testing is needed. Otherwise, all the set-associative caches yielded times in line with their respective baseline II results.

The final cache test involved operations on elements of a matrix. Row major order means that while the row is held constant, every column entry in that row is accessed before moving to the next row. Conversely, column major order means that every value in a column is accessed before moving to the next column. In code, row major order corresponds to loops where the outermost for-loop controls the first subscript, the next loop in controls the second subscript, and so on.
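
In C terms (M here is just an illustrative dimension; matrix.c sizes the matrix relative to the cache):

    #define M 512                        /* illustrative matrix dimension */

    /* Row major: the row index is the outer loop, so consecutive accesses are
       adjacent in memory and usually fall on the same cache line. */
    void row_major(double a[M][M])
    {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < M; j++)
                a[i][j] += 1.0;
    }

    /* Column major: the column index is the outer loop, so consecutive
       accesses are a whole row apart and land on different cache lines. */
    void column_major(double a[M][M])
    {
        for (int j = 0; j < M; j++)
            for (int i = 0; i < M; i++)
                a[i][j] += 1.0;
    }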

CPU M = cacheSize/2 M = cacheSize/4 M = cacheSize/8
Pentium 4 : Row Major 8.97005 4.54158 2.2378
Pentium 4 : Column Major 8.17096 3.99482 2.11423
Pentium III : Row Major 29.547 14.5658 7.1625
Pentium III : Column Major 36.5116 18.2592 8.62962
Pentium II : Row Major 129.93 64.1937 29.1795
Pentium II : Column Major 199.515 99.8411 38.4661
UltraSparc IIi : Row Major 223.703 109.923 52.6823
UltraSparc IIi : Column Major 263.774 130.898 56.2276
G5 : Row Major 8.15395 4.14279 2.04736
G5 : Column Major 9.22615 4.59545 2.19408
G4 : Row Major 32.453 16.5027 7.71374
G4 : Column Major 33.5851 16.7484 8.10241
Note: smaller means faster, and these numbers are useful only for relative comparison within a processor type as they have not been normalized by the number of operations in a given test.
Source program: matrix.c
View data as graph

The G4 and G5 discriminated the least between the two methods of accessing elements of a matrix, but still followed the general trend. Row major was the faster method for every processor except the Pentium 4, where column major came out slightly faster in each test; overall, the results indicate that accessing elements of a matrix in row major order best approximates accessing memory locations in order.

Superscalar Architecture

We tested the superscalar architecture by comparing the speed of two loops: one where the operations could be executed in any order or simultaneously, and one where each operation depended on the previous one. The goal was to let the CPU optimize as much as it could in the first case, and then use the second case to see how much that optimization had gained. We got pretty substantial speed differences, as you can see below:
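
A sketch of the two kinds of loop (variable names are illustrative; superscal.c has the real versions, including the intermediate dependency patterns from the table):

    /* Independent: the four statements have no data dependencies, so a
       superscalar CPU can issue them to separate ALUs at the same time. */
    int independent(int n)
    {
        int a = 0, b = 0, c = 0, d = 0;
        for (int i = 0; i < n; i++) {
            a += 1;
            b += 2;
            c += 3;
            d += 4;
        }
        return a + b + c + d;
    }

    /* Each with one before: every statement needs the result of the previous
       one, so the statements cannot be re-ordered or issued in parallel. */
    int each_with_one_before(int n)
    {
        int a = 0, b = 0, c = 0, d = 0;
        for (int i = 0; i < n; i++) {
            a += 1;
            b = a + 2;
            c = b + 3;
            d = c + 4;
        }
        return a + b + c + d;
    }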

CPU Independent Last with first Third with first Each with one before
Pentium 4 7.463e-09 7.57961e-09 6.85711e-09 4.24342e-08
Pentium III 1.86584e-08 1.86314e-08 2.03295e-08 4.30736e-08
Pentium II 6.98995e-08 8.2676e-08 7.85128e-08 1.63019e-07
UltraSparc IIi 8.43348e-08 8.43147e-08 9.81005e-08 1.52674e-07
G5 1.49769e-08 1.58425e-08 1.81255e-08 4.48727e-08
G4 6.41617e-08 6.72907e-08 7.16655e-08 1.11709e-07
Note: smaller means faster, and these numbers are useful only for relative comparison.
Source program: superscal.c
View data as graph

It looks in general like the CPU is slowed by a factor of 2-4 when it goes from full superscalar optimization to none. This factor probably correlates with the number of ALUs the processor has available. One interesting bit of data is the comparison of the case where the last operation is dependent on the first and where only the third is:

CPU Last with first Third with first Last/Third
Pentium 4 7.57961e-09 6.85711e-09 1.10537
Pentium III 1.86314e-08 2.03295e-08 0.91647
Pentium II 8.2676e-08 7.85128e-08 1.05303
UltraSparc IIi 8.43147e-08 9.81005e-08 0.85947
G5 1.58425e-08 1.81255e-08 0.87404
G4 6.72907e-08 7.16655e-08 0.93896

One might expect the second case to be slower than the first, because it requires more aggressive re-ordering to be efficient. But on the Pentium 4, and somewhat on the Pentium II, it is faster. We think this is probably because the result of the first operation is still available in a register for the third operation, while by the time of the last operation it has been clobbered.

Not all of this speedup, however, was due to the superscalar architecture; some was due to pipelining. When the CPU needs the result of each operation to finish the next, in addition to not being able to re-order them or execute them simultaneously, it also can't overlap them in the pipeline. It has to run one all the way through, and then the next, and this is likely another factor slowing it down.

Endianness

We tested the endianness of the CPU by treating a long int as an array of bytes and seeing the order it put them in.
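
The test itself is tiny; roughly (not exactly endian.c):

    #include <stdio.h>

    int main(void)
    {
        long x = 1;                                  /* low byte 1, rest 0 */
        unsigned char *bytes = (unsigned char *) &x;

        /* A little endian machine stores the least significant byte first;
           a big endian machine stores it last. */
        printf("%s endian\n", bytes[0] == 1 ? "little" : "big");
        return 0;
    }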

CPU Endianness
Pentium 4 little
Pentium III little
Pentium II little
UltraSparc IIi big
G5 big
G4 big
Source program: endian.c

It looks like the chips from desktop families are little endian while those from server families are big endian. Perhaps this is because servers tend to do more comparison and desktops more addition?

The Relative Speed of Numerics

We tested doubles, floats, longs, and shorts over arithmetic operations to see which were faster on various processors.
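
Each type got its own loop of simple arithmetic, along these lines (the real loops are in types.c; the timing harness is omitted):

    /* One arithmetic loop per type. */
    double sum_doubles(int n) { double s = 0.0;  for (int i = 0; i < n; i++) s += 1.5;  return s; }
    float  sum_floats(int n)  { float  s = 0.0f; for (int i = 0; i < n; i++) s += 1.5f; return s; }
    long   sum_longs(int n)   { long   s = 0;    for (int i = 0; i < n; i++) s += 3;    return s; }
    short  sum_shorts(int n)  { short  s = 0;    for (int i = 0; i < n; i++) s += 3;    return s; }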

CPU Double Float Long Short
Pentium 4 2.07216e-08 2.08809e-08 2.40424e-08 2.99891e-08
Pentium III 4.30975e-08 4.45396e-08 4.19025e-08 8.42208e-08
Pentium II 1.60737e-07 1.61894e-07 2.09417e-07 2.1729e-07
UltraSparc IIi 1.2085e-07 1.02723e-07 3.11046e-07 3.2629e-07
G5 2.45125e-08 2.38815e-08 2.53065e-08 2.57835e-08
G4 1.68076e-07 1.68076e-07 1.15052e-07 1.15558e-07
Note: smaller means faster, and these numbers are useful only for relative comparison.
Source program: types.c
View data as graph

These results were a little surprising: integer calculations are traditionally faster than floating point ones, but the P4, PII, and USIIi all prefer floating point calculations to integer ones, and the G5 barely distinguishes between the two. The G4 and PIII, though, like integer calculations better, except that the PIII for some reason holds a grudge against shorts.

Division by Zero

The floating point specification (IEEE 754) defines what division by zero produces; by default it yields an infinity or NaN rather than trapping. But in many implementations the circuitry that handles such divisions runs at a different speed from the ordinary division circuitry. So we compared the speed of a very large number of divisions by zero against the same number of ordinary divisions.
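
Roughly what the timed loops did (divzero.c has the real versions; this is meant to be compiled without optimization so the division stays inside the loop):

    /* Ordinary division: the divisor is nonzero and the result is finite. */
    double nonzero_divs(int n)
    {
        double x = 0.0, d = 3.0;
        for (int i = 0; i < n; i++)
            x = 1.0e6 / d;
        return x;
    }

    /* Division by zero: under IEEE floating point this quietly produces
       infinity rather than trapping, so it can be timed the same way. */
    double zero_divs(int n)
    {
        double x = 0.0, d = 0.0;
        for (int i = 0; i < n; i++)
            x = 1.0e6 / d;
        return x;
    }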

CPU Division by nonzero Division by zero DivByZero to Div ratio
Pentium 4 6.03714e-08 1.55008e-06 25.675
Pentium III 1.31471e-07 4.68936e-07 3.5668
Pentium II 4.82517e-07 1.7264e-06 3.5779
UltraSparc IIi 2.0756e-07 1.11747e-07 0.53838
G5 5.93922e-08 1.01657e-08 0.17116
G4 2.9045e-07 5.27916e-08 0.18176
Note: smaller means faster, and these numbers are useful only for relative comparison.
Source program: divzero.c
View data as graph

Looking at these ratios, the amount of variation in how division by zero is handled is surprising. The P4 takes far longer when it needs to divide by zero than the PIII and PII, which are close enough to each other that they probably have almost the same circuitry. The USIIi, G4 and G5, however, look like they have some circuitry to detect whether the second operand is zero and, if so, short-circuit the operation. This looks like a conscious design decision, though I'm not sure why the server family chips would be more likely to need to divide by zero quickly than the desktop chips.

Pipelining

In addition to the pipelining effects that got mixed into the superscalar testing, we decided to look at how well the various processors did with jumps in the code that they couldn't predict.
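
One way to set up such a comparison (piplining.c has the real test; here the branch direction comes from a pre-filled array of pseudo-random 0s and 1s):

    /* With jumps: the branch direction depends on unpredictable data, so the
       branch predictor guesses wrong about half the time and the pipeline is
       flushed on each mispredict. */
    int with_jumps(const int *bits, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            if (bits[i])
                sum += 1;
            else
                sum -= 1;
        }
        return sum;
    }

    /* Without jumps: the same result computed with straight-line arithmetic,
       so there is no unpredictable branch to stall the pipeline. */
    int without_jumps(const int *bits, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += 2 * bits[i] - 1;
        return sum;
    }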

CPU With Jumps Without Jumps With to Without Ratio
Pentium 4 2.7922e-08 1.67793e-08 1.6641
Pentium III 3.93413e-08 2.8354e-08 1.3875
Pentium II 1.79454e-07 1.23144e-07 1.4573
UltraSparc IIi 2.1191e-07 1.91523e-07 1.1065
G5 2.73307e-08 1.92871e-08 1.4171
G4 1.1953e-07 1.07525e-07 1.1117
Note: smaller means faster, and these numbers are useful only for relative comparison.
Source program: piplining.c
View data as graph

In all cases the processors didn't do as well once they had jumps that they couldn't predict.

Future Coding

Learning about what things are easy for CPUs and what things are hard is quite useful for efficient programming. Before this we would have considered
    for (int j = 0 ; j < 3 ; j++)
      for (int i = 0 ; i < NPOINTS ; i++)   /* NPOINTS = number of points */
        points[i][j] += i;
and

    for (int i = 0 ; i < NPOINTS ; i++)
    {
      points[i][X] += i;
      points[i][Y] += i;
      points[i][Z] += i;
    }
to be equivalent, but now, after thinking about what fraction of a loop is overhead, the second is pretty clearly the better way to go.

Similarly, if we had a large array, we might write either of the following to iterate through:

    for (int i = 0 ; i < NROWS ; i++)       /* NROWS x NCOLS array */
      for (int j = 0 ; j < NCOLS ; j++)
        /* do stuff with points[i][j] */ ;
or
    for (int j = 0 ; j < NCOLS ; j++)
      for (int i = 0 ; i < NROWS ; i++)
        /* do stuff with points[i][j] */ ;
But now that we have seen how much better the cache behaves with row major order, the difference is clear: the first version is the way to go.
Jeff Kaufman : 2005
cbr at sccs dot swarthmore dot spam edu. Remove spam.
