Level 1 cache is normally split into instruction and data caches, each of which is 16KB on the Pentium III. This go 'round, Intel has decreased the data cache to 8KB and has re-implemented the instruction cache to store decoded micro-ops in the order of program execution, so that the results of program branches are integrated into the same cache line. Decode latency is largely avoided because the execution engine can retrieve decoded operations from the cache directly, rather than fetching and decoding commonly used instructions over and over again. In addition, instructions that are not executed do not get stored in the cache, making the Execution Trace Cache more efficient than previous implementations.
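To make the decode-once idea concrete, here is a toy sketch in Python. It is not a model of Intel's actual design: the `ToyTraceCache` class, the `toy_decode` function, the addresses and the two-micro-ops-per-instruction split are all hypothetical, chosen only to show that a loop executed many times needs each instruction decoded just once, and that instructions never executed never enter the cache.

```python
class ToyTraceCache:
    """Hypothetical cache of already-decoded micro-ops, keyed by address."""

    def __init__(self, decode):
        self._decode = decode   # assumed decoder: instruction -> list of micro-ops
        self._traces = {}       # address -> decoded micro-ops
        self.decodes = 0        # how many times the decoder actually ran

    def fetch(self, address, instruction):
        # Decode only on the first visit; later visits reuse the stored micro-ops.
        if address not in self._traces:
            self.decodes += 1
            self._traces[address] = self._decode(instruction)
        return self._traces[address]


def toy_decode(instruction):
    # Hypothetical decoder: pretend every instruction splits into two micro-ops.
    return [f"{instruction}.uop0", f"{instruction}.uop1"]


cache = ToyTraceCache(toy_decode)
loop_body = [(0x10, "add"), (0x14, "cmp"), (0x18, "jne")]  # made-up addresses

# Run the same three-instruction loop 1000 times.
for _ in range(1000):
    for address, instruction in loop_body:
        cache.fetch(address, instruction)

print("instructions executed:", 1000 * len(loop_body))
print("decoder invocations:  ", cache.decodes)
```

The real trace cache also stitches micro-ops together across taken branches, which this sketch does not attempt to model.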
The second key to minimizing the branch mis-predict penalty lies with Intel's Dynamic Execution Engine, which keeps the Arithmetic Logic Units busy with instructions to execute. Whereas the Pentium III provided a window of only 42 instructions from which the execution units could choose, the Pentium 4 offers 126, increasing the probability that useful work will be available immediately after a cache miss rather than leaving the execution units idle while data is fetched from memory. As processor frequency ramps upwards, this becomes increasingly important since system memory speed does not scale with the processor.
In addition to providing a greater window of instructions for the execution engine to choose from, enhanced branch prediction has also been provided to further reduce the number of mis-predictions. Intel estimates that the Pentium 4 mis-predicts about 33% less often than the P6, thanks to an enhanced prediction algorithm and a 4K-entry branch target buffer that stores details on the history of past branches.
In order to further compensate for the lower IPC of the NetBurst Architecture, Intel has clocked the Arithmetic Logic Units at twice the frequency of the processor core. So, on a 1.7GHz Pentium 4, the ALUs are screaming at 3.4GHz with latency that is half the duration of the core clock.
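The double-clocked ALU arithmetic can be checked with a few lines of Python. This is only a sketch of the numbers quoted above: the function names are ours, and the 2x multiplier is the one stated in the text.

```python
def alu_clock_ghz(core_ghz, multiplier=2):
    """ALU frequency for a given core clock; the 2x factor comes from the text."""
    return core_ghz * multiplier


def cycle_time_ns(freq_ghz):
    """Duration of one clock cycle in nanoseconds (1 / frequency in GHz)."""
    return 1.0 / freq_ghz


core = 1.7                  # GHz, the example used in the text
alu = alu_clock_ghz(core)   # GHz

print(f"core clock {core} GHz, ALU clock {alu} GHz")
print(f"core cycle {cycle_time_ns(core):.3f} ns, ALU cycle {cycle_time_ns(alu):.3f} ns")
```

Because the ALU cycle is half as long as the core cycle, a simple integer operation that completes in one ALU cycle takes half a core clock.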
We estimate that as processor speeds increase, the integer performance of the Pentium 4 will improve, since the speed of the ALUs (which impacts integer performance) escalates twice as fast.
One of the most dramatic additions to the NetBurst architecture is a quad-pumped 100MHz system bus, delivering the equivalent of 3.2GB/s of bandwidth. The idea behind the accelerated 64-bit bus is to match the bandwidth of the dual RDRAM channels, which also provide 3.2GB/s of theoretical bandwidth.
Of course the signaling scheme put in place by Intel could not be 100% efficient, so there is also a buffer to help facilitate sustained 400MHz data transfers. With such a high-speed bus in place, the Pentium 4 is able to push roughly three times as much data as the Pentium III (which is limited to 1.06GB/s on a 133MHz bus). For the sake of comparison, AMD's 760 chipset armed with PC2100 memory is able to push a theoretical 2.1GB/s - something we do not expect to see changed in the near future since AMD's current roadmap shows the 266MHz bus in place for at least another 18 months or so.
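The bandwidth figures quoted here all fall out of the same formula: bus width in bytes, times base clock, times transfers per clock. The Python sketch below reproduces them under a few assumptions that are ours rather than the article's: a 64-bit bus for the Pentium III and the AMD platform, the 266MHz AMD bus treated as 133MHz double-pumped, and each RDRAM channel taken as 16 bits wide at 400MHz double-pumped (PC800).

```python
def peak_bandwidth_gb_s(bus_width_bits, base_clock_mhz, transfers_per_clock=1):
    """Theoretical peak bandwidth in GB/s, with 1 GB taken as 10**9 bytes."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = base_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# Pentium 4: 64-bit bus, 100MHz base clock, quad-pumped (figures from the text).
p4_bus = peak_bandwidth_gb_s(64, 100, transfers_per_clock=4)

# Pentium III: 133MHz bus from the text; 64-bit width is an assumption.
p3_bus = peak_bandwidth_gb_s(64, 133)

# AMD 760: "266MHz" bus modeled as 133MHz double-pumped, 64-bit (assumption).
amd_bus = peak_bandwidth_gb_s(64, 133, transfers_per_clock=2)

# Dual RDRAM channels: 16-bit, 400MHz double-pumped each (assumption).
rdram = 2 * peak_bandwidth_gb_s(16, 400, transfers_per_clock=2)

print(f"Pentium 4 bus:   {p4_bus:.2f} GB/s")
print(f"Dual RDRAM:      {rdram:.2f} GB/s")
print(f"Pentium III bus: {p3_bus:.2f} GB/s")
print(f"AMD 760 bus:     {amd_bus:.2f} GB/s")
print(f"P4 / PIII ratio: {p4_bus / p3_bus:.2f}x")
```

These are theoretical peaks only; sustained throughput will be lower, since no signaling scheme is 100% efficient.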