In this lab we set out to measure what effect modern CPU optimizations have on performance. For each optimization we wrote a C program with two components: the first was structured to be ideal for the optimization, while the second was structured so that the optimization could not help. The optimizations tested were caching, superscalar execution, and pipelining.
We tested several different processors in this lab. They were:
Processor | L1 Cache Size (instruction/data) | L1 Cache Line Size (bytes) | L1 Cache Lines | L1 Cache Mapping |
---|---|---|---|---|
Pentium 4 | 12K/8K split | 32 | 384/256 | 4-Way SA |
Pentium III | 16K/16K split | 32 | 512 | 2-Way SA |
Pentium II | 16K/16K split | 64 | 256 | 2-Way SA |
UltraSparc IIi | 16K/16K split | 32/64 | 512/256 | 2-Way SA / direct |
G5 | 64K/32K split | 32 | 2K/1K | direct / 2-Way SA |
G4 | 32K/32K split | 32 | 1K | 8-Way SA |
We tested the average access time for a cache line by comparing the speed of accessing memory locations in order with the speed of accessing one location per cache line. One would expect the average time for in-order accesses to be significantly less, since each miss brings a whole line into the cache and the following in-order accesses then hit in that line.
CPU | Baseline I (s) | Baseline II (s) | Cache Stress (s) | 2-Way SA Nice (s) |
---|---|---|---|---|
Pentium 4 | 2.86509e-09 | 4.01461e-09 | 1.87189e-08 | 4.74211e-09 |
Pentium III | 9.57706e-09 | 1.24293e-08 | 6.10914e-08 | 1.18172e-08 |
Pentium II | 3.91907e-08 | 4.7684e-08 | 6.21025e-07 | 4.59229e-08 |
UltraSparc IIi | 2.74009e-08 | 4.03156e-08 | 4.76776e-07 | 5.49424e-08 |
G5 | 7.13457e-09 | 8.33436e-09 | 3.37755e-08 | 6.65204e-09 |
G4 | 2.53112e-08 | 3.26449e-08 | 2.28909e-07 | 3.15661e-08 |
For the P4, PIII, and UltraSparc, accessing memory locations in order took roughly 70-75% of the time of accessing one location per line. The superior speed is likely from the way every statement operates on both the jth and (j+1)th entries of an array. When these two locations are consecutive, the first request brings both down into cache. When the locations are on separate lines, both lines must be brought down, requiring two requests from main memory. It is interesting to note that the baseline II test takes less than twice as long as the baseline I case, probably because the baseline I case will occasionally have to access two lines in a single statement when the two sequential locations straddle a line boundary.
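The two baseline loops were roughly of this form (a sketch, not our exact code; the 32-byte line size and the array length here are just placeholders, and the timing harness is left out):

#include <stdlib.h>

#define LINE_BYTES 32          /* assumed L1 line size (placeholder)   */
#define N          (1 << 20)   /* array length (placeholder)           */

/* Baseline I: touch consecutive entries, so most pairs share a line.  */
static void sequential(char *a)
{
    for (long j = 0; j + 1 < N; j++)
        a[j] += a[j + 1];
}

/* Baseline II: touch one entry per cache line, so every statement
   pulls fresh lines into L1.                                          */
static void one_per_line(char *a)
{
    for (long j = 0; j + LINE_BYTES < N; j += LINE_BYTES)
        a[j] += a[j + LINE_BYTES];
}

int main(void)
{
    char *a = calloc(N, 1);
    if (!a) return 1;
    sequential(a);      /* each call would be wrapped in a timer */
    one_per_line(a);
    free(a);
    return 0;
}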
The cache stress program accessed lines sequentially in a block of data four times the size of the cache. After the first cache-sized chunk of data had been accessed, the cache had to clobber a line for every access. Thus one would expect the cache stress times to be significantly greater than those of the baseline II test. In both tests a new cache line must be accessed for each statement; however, the cache stress test forces an access to the L2 cache for each statement as well. Thus the ratio of the two tests should indicate the relative penalty for accessing the L2 cache.
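The stress loop itself would look something like this (again only a sketch; the 16K cache size is hard-coded purely for illustration, and the loop would be repeated many times under a timer):

#include <stdlib.h>

#define LINE_BYTES  32            /* assumed L1 line size       */
#define CACHE_BYTES (16 * 1024)   /* assumed L1 data cache size */

/* Walk one entry per line through a block four times the cache size.
   Once the cache fills, every access evicts a line, and because the
   block is larger than the cache, nothing survives between passes.   */
static void cache_stress(char *block)
{
    for (long j = 0; j + LINE_BYTES < 4 * CACHE_BYTES; j += LINE_BYTES)
        block[j] += block[j + LINE_BYTES];
}

int main(void)
{
    char *block = calloc(4 * CACHE_BYTES, 1);
    if (!block) return 1;
    for (int rep = 0; rep < 1000; rep++)   /* the timed region */
        cache_stress(block);
    free(block);
    return 0;
}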
CPU | Baseline II (s) | Cache Stress (s) | Cache Stress/Baseline II |
---|---|---|---|
Pentium 4 | 4.01461e-09 | 1.87189e-08 | 4.66269 |
Pentium III | 1.24293e-08 | 6.10914e-08 | 4.91511 |
Pentium II | 4.7684e-08 | 6.21025e-07 | 13.0238 |
UltraSparc IIi | 4.03156e-08 | 4.76776e-07 | 11.8261 |
G5 | 8.33436e-09 | 3.37755e-08 | 4.05256 |
G4 | 3.26449e-08 | 2.28909e-07 | 7.01209 |
From the table, it is clear that the presence of the cache substantially decreases memory access penalties. For example, when the Pentium II accesses a line which is not in L1 cache, the access time is thirteen times greater than that of a cache hit. Also, the data confirms the advancements made in L2 caches: the more recent the processor, the less the penalty for a cache miss.
The 2-way set associative test is implemented by allocating a block of memory twice the size of the cache. For a block of this size, four lines of the block map to any given set in the cache, which has space to hold two of those lines at any one time. For a direct mapped cache, two lines of the block map to any given line in the cache, which has space for only one. On a 2-way set associative cache, a program that alternately accesses the first line of the block and the line exactly one cache-size beyond it will hit in the cache on every access, since both lines fit in the same set. If the cache is direct mapped, every one of those same accesses will be a cache miss, and the cache will thrash. Thus one would expect any CPU with 2-way or greater set associativity to produce results similar to the baseline II test, and a CPU with direct mapping to produce results like those of the cache stress test.
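In code, the "nice" access pattern boils down to something like the following (a sketch with an assumed 16K cache size, not our exact program):

#include <stdlib.h>

#define CACHE_BYTES (16 * 1024)   /* assumed L1 data cache size */

/* Alternate between two locations exactly one cache-size apart, so both
   map to the same set (or, in a direct-mapped cache, the same line).
   A 2-way SA cache holds both and hits every time; a direct-mapped
   cache must evict one to fetch the other and thrashes.               */
static void two_way_nice(char *block, long reps)
{
    for (long r = 0; r < reps; r++)
        block[0] += block[CACHE_BYTES];
}

int main(void)
{
    char *block = calloc(2 * CACHE_BYTES, 1);   /* block twice the cache size */
    if (!block) return 1;
    two_way_nice(block, 1 << 20);               /* the timed region */
    free(block);
    return 0;
}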
CPU | Baseline II (s) | 2-Way SA Nice (s) | 2-Way/Baseline II |
---|---|---|---|
Pentium 4 | 4.01461e-09 | 4.74211e-09 | 1.18121 |
Pentium III | 1.24293e-08 | 1.18172e-08 | 0.950753 |
Pentium II | 4.7684e-08 | 4.59229e-08 | 1.03835 |
UltraSparc IIi | 4.03156e-08 | 5.49424e-08 | 1.36281 |
G5 | 8.33436e-09 | 6.65204e-09 | 0.798146 |
G4 | 3.26449e-08 | 3.15661e-08 | .966953 |
The surprise in this data is the UltraSparc. Since it is the only processor we tested with a direct mapped (data) cache, it should stand out with a substantially larger ratio; we would expect its ratio to be roughly ten times larger, so this deviation suggests the test needs further investigation. Otherwise, all the set-associative caches yielded times in line with their respective Baseline II results.
The final cache test involved operations on elements of a matrix. Row major order means that while the row is held constant, every column entry in that row is accessed before moving to the next row. Conversely, column major order means every value in a column is accessed before moving to the next column. In code, row major order corresponds to having the outermost for-loop control the first set of brackets, the next-outermost for-loop control the next set of brackets, and so on.
CPU | M = cacheSize/2 | M = cacheSize/4 | M = cacheSize/8 |
---|---|---|---|
Pentium 4 : Row Major | 8.97005 | 4.54158 | 2.2378 |
Pentium 4 : Column Major | 8.17096 | 3.99482 | 2.11423 |
Pentium III : Row Major | 29.547 | 14.5658 | 7.1625 |
Pentium III : Column Major | 36.5116 | 18.2592 | 8.62962 |
Pentium II : Row Major | 129.93 | 64.1937 | 29.1795 |
Pentium II : Column Major | 199.515 | 99.8411 | 38.4661 |
UltraSparc IIi : Row Major | 223.703 | 109.923 | 52.6823 |
UltraSparc IIi : Column Major | 263.774 | 130.898 | 56.2276 |
G5 : Row Major | 8.15395 | 4.14279 | 2.04736 |
G5 : Column Major | 9.22615 | 4.59545 | 2.19408 |
G4 : Row Major | 32.453 | 16.5027 | 7.71374 |
G4 : Column Major | 33.5851 | 16.7484 | 8.10241 |
The G4 and G5 discriminated the least between the two methods among the processors that favored row major order, and the Pentium 4, surprisingly, ran slightly faster in column major order in all three tests. For every other processor, row major was the faster method in every test, which indicates that accessing elements of a matrix in this order best approximates accessing each location of cache in order.

To test superscalar execution, we timed a loop of several arithmetic statements while varying the data dependencies among them: all statements independent, the last statement dependent on the first, only the third dependent on the first, and each statement dependent on the one before it.
CPU | Independent (s) | Last with first (s) | Third with first (s) | Each with one before (s) |
---|---|---|---|---|
Pentium 4 | 7.463e-09 | 7.57961e-09 | 6.85711e-09 | 4.24342e-08 |
Pentium III | 1.86584e-08 | 1.86314e-08 | 2.03295e-08 | 4.30736e-08 |
Pentium II | 6.98995e-08 | 8.2676e-08 | 7.85128e-08 | 1.63019e-07 |
UltraSparc IIi | 8.43348e-08 | 8.43147e-08 | 9.81005e-08 | 1.52674e-07 |
G5 | 1.49769e-08 | 1.58425e-08 | 1.81255e-08 | 4.48727e-08 |
G4 | 6.41617e-08 | 6.72907e-08 | 7.16655e-08 | 1.11709e-07 |
It looks in general like the CPU is slowed by a factor of roughly 2 to 6 when it goes from full superscalar optimization (all statements independent) to none (each statement depending on the one before). This factor probably correlates with the number of ALUs the processor has available. One interesting bit of data is the comparison of the case where the last operation is dependent on the first and the case where only the third is:
CPU | Last with first (s) | Third with first (s) | Last/Third |
---|---|---|---|
Pentium 4 | 7.57961e-09 | 6.85711e-09 | 1.10537 |
Pentium III | 1.86314e-08 | 2.03295e-08 | 0.91647 |
Pentium II | 8.2676e-08 | 7.85128e-08 | 1.05303 |
UltraSparc IIi | 8.43147e-08 | 9.81005e-08 | 0.85947 |
G5 | 1.58425e-08 | 1.81255e-08 | 0.87404 |
G4 | 6.72907e-08 | 7.16655e-08 | 0.93896 |
One might expect the second case to be slower than the first because it requires more aggressive re-ordering to be efficient. But on the Pentium 4, and to a lesser extent on the Pentium II, it is faster. We think this is probably because the result of the first operation is still available in a register for the third operation, while by the time of the last operation it has been clobbered.
Not all of this speedup, however, was due to the superscalar architecture. Some was caused by pipelining. When the CPU needs the result of each operation to finish the next, in addition to being unable to re-order the operations or execute them simultaneously, it also cannot overlap them in the pipeline: it has to run one all the way through before starting the next, and this may be another factor slowing it down.
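For reference, the two extreme cases are sketched below (not our exact code; the four-statement loop body and the use of doubles are assumptions based on the column headings):

/* Independent: all four statements can issue in parallel on a
   superscalar core.                                               */
static void independent(double *a, long n)
{
    for (long i = 0; i < n; i++) {
        a[0] += 1.0;
        a[1] += 1.0;
        a[2] += 1.0;
        a[3] += 1.0;
    }
}

/* Each with the one before: every statement needs the previous
   result, so nothing can be issued in parallel or overlapped in
   the pipeline.                                                   */
static void chained(double *a, long n)
{
    for (long i = 0; i < n; i++) {
        a[1] += a[0];
        a[2] += a[1];
        a[3] += a[2];
        a[0] += a[3];
    }
}

int main(void)
{
    static double a[4];          /* zero-initialized */
    chained(a, 1 << 20);         /* each call timed separately */
    independent(a, 1 << 20);
    return a[3] > 0.0 ? 0 : 1;   /* use the results so they aren't optimized away */
}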
CPU | Endianness |
---|---|
Pentium 4 | little |
Pentium III | little |
Pentium II | little |
UltraSparc IIi | big |
G5 | big |
G4 | big |
It looks like the chips from desktop lineages (the x86 parts) are little endian while those descended from server lines (the SPARC and PowerPC parts) are big endian. Perhaps this is because servers tend to do more comparison and desktops more addition?
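For what it's worth, the standard way to check endianness in C (a sketch of the idea, not necessarily the exact test we ran) is to store a known value and look at which byte lands first in memory:

#include <stdio.h>

int main(void)
{
    unsigned int x = 1;
    unsigned char *p = (unsigned char *)&x;

    /* On a little-endian machine the least significant byte comes
       first in memory, so *p is 1; on a big-endian machine it is 0. */
    printf("%s endian\n", *p == 1 ? "little" : "big");
    return 0;
}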
CPU | Double (s) | Float (s) | Long (s) | Short (s) |
---|---|---|---|---|
Pentium 4 | 2.07216e-08 | 2.08809e-08 | 2.40424e-08 | 2.99891e-08 |
Pentium III | 4.30975e-08 | 4.45396e-08 | 4.19025e-08 | 8.42208e-08 |
Pentium II | 1.60737e-07 | 1.61894e-07 | 2.09417e-07 | 2.1729e-07 |
UltraSparc IIi | 1.2085e-07 | 1.02723e-07 | 3.11046e-07 | 3.2629e-07 |
G5 | 2.45125e-08 | 2.38815e-08 | 2.53065e-08 | 2.57835e-08 |
G4 | 1.68076e-07 | 1.68076e-07 | 1.15052e-07 | 1.15558e-07 |
These results were a little surprising, as integer calculations are traditionally faster than floating point ones, but the P4, PII, USIIi, and (just barely) the G5 all ran floating point calculations faster than integer ones. The G4 and PIII, though, like integer calculations better, except that the PIII for some reason holds a grudge against shorts.
CPU | Division by nonzero (s) | Division by zero (s) | DivByZero to Div ratio |
---|---|---|---|
Pentium 4 | 6.03714e-08 | 1.55008e-06 | 25.675 |
Pentium III | 1.31471e-07 | 4.68936e-07 | 3.5668 |
Pentium II | 4.82517e-07 | 1.7264e-06 | 3.5779 |
UltraSparc IIi | 2.0756e-07 | 1.11747e-07 | 0.53838 |
G5 | 5.93922e-08 | 1.01657e-08 | 0.17116 |
G4 | 2.9045e-07 | 5.27916e-08 | 0.18176 |
Looking at these ratios, the amount of variation in how division by zero is handled is surprising. The P4 takes far longer when it needs to divide by zero than the PIII and PII, which are close enough to each other that they probably have almost the same circuitry. The USIIi, G4, and G5, however, look like they have some circuitry to detect whether the second operand is zero and, if so, short-circuit the operation. This looks like a conscious design decision, though I'm not sure why the server family chips would be more likely to need to divide by zero quickly than the desktop chips.
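The division test was presumably along these lines (a sketch; the key assumption is floating point division, since with IEEE-754 doubles dividing by zero quietly yields infinity rather than trapping, which is what makes the divide-by-zero case timeable at all):

#include <stdio.h>

/* Repeatedly divide by the given value.  With IEEE-754 doubles,
   x / 0.0 produces infinity instead of trapping, so the divide-by-zero
   case can be timed just like the normal one.                          */
static double divide_loop(double divisor, long n)
{
    double x = 1.0, sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += x / divisor;
    return sum;
}

int main(void)
{
    /* each call would be wrapped in a timer in the real test */
    printf("%g\n", divide_loop(3.0, 1 << 20));
    printf("%g\n", divide_loop(0.0, 1 << 20));
    return 0;
}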
CPU | With Jumps (s) | Without Jumps (s) | With to Without Ratio |
---|---|---|---|
Pentium 4 | 2.7922e-08 | 1.67793e-08 | 1.6641 |
Pentium III | 3.93413e-08 | 2.8354e-08 | 1.3875 |
Pentium II | 1.79454e-07 | 1.23144e-07 | 1.4573 |
UltraSparc IIi | 2.1191e-07 | 1.91523e-07 | 1.1065 |
G5 | 2.73307e-08 | 1.92871e-08 | 1.4171 |
G4 | 1.1953e-07 | 1.07525e-07 | 1.1117 |
In every case the processor did worse once the code contained jumps it could not predict, with the penalty ranging from roughly 10% on the UltraSparc and G4 to about 66% on the Pentium 4.
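A sketch of the flavor of the two cases (not our exact code): one loop whose branch direction depends on random data, and a similar loop with no data-dependent branch in its body.

#include <stdlib.h>

/* "With jumps": the branch direction depends on random data, so the
   branch predictor guesses wrong roughly half the time.              */
static long with_jumps(const int *data, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++) {
        if (data[i] > RAND_MAX / 2)   /* unpredictable conditional jump */
            sum += data[i];
        else
            sum -= data[i];
    }
    return sum;
}

/* "Without jumps": a similar loop with no data-dependent branch in the
   body (the loop's own branch is easy to predict).                     */
static long without_jumps(const int *data, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

int main(void)
{
    enum { N = 1 << 20 };
    int *data = malloc(N * sizeof *data);
    if (!data) return 1;
    for (long i = 0; i < N; i++)
        data[i] = rand();
    long a = with_jumps(data, N);      /* each call timed in the real test */
    long b = without_jumps(data, N);
    free(data);
    return (int)((a + b) & 1);         /* use the results */
}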
Before this lab, we would have considered

for (int j = 0 ; j < 3 ; j++)
    for (int i = 0 ; i < points.length ; i++)
        points[i][j] += i;

and

for (int i = 0 ; i < points.length ; i++)
{
    points[i][X] += i;
    points[i][Y] += i;
    points[i][Z] += i;
}

to be equivalent, but now, after thinking about what fraction of a
loop is overhead, the second is pretty clearly the better way to go.
Similarly, if we had a large array, we might write either of the
following to iterate through it:

for (int j = 0 ; j < points[0].length ; j++)
    for (int i = 0 ; i < points.length ; i++)
        /* do stuff with points[i][j] */ ;

or

for (int i = 0 ; i < points.length ; i++)
    for (int j = 0 ; j < points[0].length ; j++)
        /* do stuff with points[i][j] */ ;

But now that we have seen how much more cache optimization can be
done with row major order, the difference is clear.