Bottlenecks in AMD Bulldozer by Agner:
The AMD Bulldozer is a major redesign of previous microarchitectures. Some of the most
important improvements are:
• Four pipelines giving a maximum throughput of four instructions per clock cycle
• Improved floating point unit with high throughput
• Better scheduling of macro-ops to the first vacant execution unit
• Some register-to-register moves are implemented by register renaming
• Branch prediction is no longer tied to the code cache, and there is no limitation on the
number of branches per code cache line
• AVX instruction set with non-destructive 3-operand instructions
• Efficient fused multiply-and-add instructions (FMA4)
Various possible bottlenecks are discussed in the following paragraphs.
The power-saving features reduce the clock frequency most of the time. This often gives inconsistent results in performance tests because the clock frequency varies. It is sometimes necessary to put a long sequence of CPU-intensive code before the code under test in order to measure the maximum performance.
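This warm-up technique can be sketched as follows. The kernel and the iteration counts are hypothetical placeholders; the point is only the ordering: a long CPU-intensive stretch first, then the timed section.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical kernel whose peak speed we want to measure. */
static uint64_t kernel(uint64_t n) {
    uint64_t s = 0;
    for (uint64_t i = 0; i < n; i++) s += i * i;
    return s;
}

/* Run a long stretch of CPU-intensive work first so that the power-saving
   logic has raised the clock to its maximum, then time the real test. */
static double measure_ns(void) {
    volatile uint64_t sink = kernel(50000000u);  /* warm-up, result discarded */
    (void)sink;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sink = kernel(10000000u);                    /* code under test */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}
```

Without the warm-up call, the first part of the timed section would run at a reduced clock frequency and the measured time would vary from run to run.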
The instruction fetch and decoding circuitry is shared between the two cores that make up a compute unit, as are the branch predictor and the floating point units. Some operating systems are not aware of this and may schedule two threads onto the same compute unit while another compute unit is idle.
The shared instruction fetch unit can fetch up to 32 bytes per clock cycle, which gives only 16 bytes per core when both cores are active. This may be a bottleneck when both cores are active or when frequent jumps produce bubbles in the pipeline.
The decode unit can handle four instructions per clock cycle. It alternates between the two threads, so that each thread gets up to four instructions every second clock cycle, or two instructions per clock cycle on average. This is a serious bottleneck in my tests because the rest of the pipeline can handle up to four instructions per clock.
The situation gets even worse for instructions that generate more than one macro-op each. The decoders cannot handle two double instructions (instructions that generate two macro-ops) in the same clock cycle. All instructions that generate more than two macro-ops are handled with microcode. The microcode sequencer blocks the decoders for several clock cycles, so that the other thread is stalled in the meantime.
The integer out-of-order scheduler has 40 entries; the shared floating point scheduler probably has somewhat more. This is a significant improvement over previous designs.
The integer execution units are poorly distributed between the four pipes. Two of the pipes have all the execution units, while the other two are used only for memory read instructions and, on some models, for simple register moves. This means that Bulldozer can execute only two integer ALU instructions per clock cycle, whereas previous models could execute three. This is a serious bottleneck for pure integer code. The single-core throughput
for integer code can actually be doubled by doing half of the instructions in vector registers, even if only one element of each vector is used.
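A minimal sketch of this splitting trick, using SSE2 intrinsics (the function name and the even/odd split are illustrative choices, not prescribed by the text): half of the additions go through the integer ALUs and half through the vector units, even though only element 0 of each vector register is used.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Sum an array, splitting the work between the integer ALU pipes and
   the vector pipes so both sets of execution units stay busy. */
static int64_t sum_split(const int64_t *a, size_t n) {
    int64_t scalar = 0;                 /* accumulates in an integer register */
    __m128i vec = _mm_setzero_si128();  /* accumulates in a vector register */
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        scalar += a[i];                                        /* integer ALU */
        vec = _mm_add_epi64(vec, _mm_cvtsi64_si128(a[i + 1])); /* vector unit, element 0 only */
    }
    for (; i < n; i++) scalar += a[i];  /* odd-length tail */
    return scalar + _mm_cvtsi128_si64(vec);
}
```

The compiler may of course vectorize such a simple loop on its own; the sketch only illustrates the principle that vector registers add integer throughput even when used as scalars.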
The floating point execution units are better distributed so that all four pipes can be used.
The most commonly used units are all doubled, including floating point addition, multiplication and division, as well as integer addition and boolean operations. All units are 128 bits wide. This gives a high throughput for 128-bit vector code, which is likely sufficient to serve two threads simultaneously in many cases. All 256-bit vector instructions are split into two 128-bit operations, so there is little or no advantage in using 256-bit vectors.
The fused multiply-and-add instructions are very efficient. They do one multiplication and one addition in the same time that it otherwise takes to do a single addition or a single multiplication. This effectively doubles the throughput of floating point code that has an equal number of additions and multiplications. The incompatibility of the FMA4 instructions with Intel's forthcoming FMA3 instructions is actually not AMD's fault, as discussed on my blog.
Mixing operations with different latencies causes fewer problems than on previous processors.
Latencies for floating point instructions and integer vector instructions are relatively long. Long dependency chains should therefore be avoided. Accessing part of a register causes a false dependence on the rest of the register.
Jumps and branches
Jumps and branches have a throughput of one taken branch every two clock cycles. The throughput is lower if there are 32-byte boundaries shortly after the jump targets. Branch prediction is reasonably good, even for indirect jumps. The branch misprediction penalty is quite high because of a long pipeline.
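Because the misprediction penalty is high, data-dependent branches that the predictor cannot learn are worth replacing with branchless arithmetic where possible. A sketch of the two forms (both function names are illustrative):

```c
#include <stdint.h>

/* Conditional accumulation with a data-dependent branch: each miss
   costs the full pipeline-length misprediction penalty. */
static int64_t sum_ge_branchy(const int32_t *a, int n, int32_t t) {
    int64_t s = 0;
    for (int i = 0; i < n; i++)
        if (a[i] >= t) s += a[i];
    return s;
}

/* Branchless form: the comparison result becomes an all-ones or
   all-zeros mask, so there is nothing for the predictor to miss on. */
static int64_t sum_ge_branchless(const int32_t *a, int n, int32_t t) {
    int64_t s = 0;
    for (int i = 0; i < n; i++) {
        int64_t mask = -(int64_t)(a[i] >= t);  /* -1 if taken, 0 if not */
        s += a[i] & mask;
    }
    return s;
}
```

The branchless form wins when the condition is unpredictable (e.g. random data); when the branch is well predicted, the plain branch is usually cheaper.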
Memory and cache access
Cache access is reasonably fast at all three cache levels, but cache bank conflicts are very frequent and often impossible to avoid. Cache bank conflicts turned out to be a serious bottleneck in some of my tests. The code cache has only two ways, which is quite low considering that it has to serve two threads.
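When a loop reads two streams in the same iteration, one common mitigation is to offset one array so the two simultaneous loads do not share the same low address bits. The pad size below is an assumption for illustration, not a documented Bulldozer parameter; the right offset depends on the actual bank layout.

```c
#include <stddef.h>
#include <stdint.h>

enum { N = 256, PAD = 8 };  /* PAD * 4 = 32 bytes: an assumed bank-sized offset */

/* b's payload is stored starting at index PAD, so a[i] and the matching
   b element land at different low address bits and, with luck, in
   different cache banks. */
static int32_t a[N], b[N + PAD];

static int64_t dot_padded(void) {
    int64_t s = 0;
    for (size_t i = 0; i < N; i++)
        s += (int64_t)a[i] * b[i + PAD];
    return s;
}
```

Whether this helps must be verified by measurement, since bank conflicts depend on the exact addresses the allocator happens to produce.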
There is no evidence that retirement can be a bottleneck.