Everything that occurs inside a PC happens to the beat of a clock signal, including all processor operations, bus transmissions and memory operations. The clock that the memory is 'listening' to is not the same one that the processor is listening to. As an example, we might have a 500MHz processor on a system with a 100MHz memory bus. Essentially this means that for every clock 'tick' of the memory bus, the processor clock ticks 5 times.
For a standard SDRAM, the latency is the sum of three values: Trcd (RAS to CAS delay, the time needed to access the row and get ready to receive the column address), Tcac (CAS Latency, also called Column Access time, the time from the column address until data is on the output register) and Tac (Access Time, the time from the output register until the data appears on the bus). A 2-2-2 PC100 SDRAM will therefore have a latency of 20ns + 20ns + 6ns, or 46ns. Rounded up to the next clock cycle, this is a 50ns latency, or 5 clock cycles. As mentioned previously, this latency is increased by a few cycles due to chipset overhead.
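The arithmetic above can be sketched in a few lines of Python. This is only an illustration, assuming a 100MHz bus (10ns per cycle) as used by PC100 memory; the function name sdram_latency_cycles and its parameters are hypothetical and not part of any specification.

```python
import math

def sdram_latency_cycles(trcd_ns, tcac_ns, tac_ns, bus_mhz):
    """Return (total latency in ns, latency rounded up to whole bus cycles)."""
    cycle_ns = 1000.0 / bus_mhz           # length of one bus clock in ns
    total_ns = trcd_ns + tcac_ns + tac_ns # Trcd + Tcac + Tac
    cycles = math.ceil(total_ns / cycle_ns)
    return total_ns, cycles

# 2-2-2 PC100 part: 20ns + 20ns + 6ns on a 100MHz bus
total_ns, cycles = sdram_latency_cycles(20, 20, 6, 100)
print(total_ns, "ns ->", cycles, "bus cycles")  # 46 ns -> 5 bus cycles
```

Chipset overhead is deliberately left out of this sketch, since the text gives it only as "a few cycles".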
From the standpoint of processor utilization, if the memory latency is 8 bus cycles, a processor running at five times the bus clock will wait 40 processor cycles before it can expect the first piece of data, and it won't get each following piece of data for another 5 processor cycles. Because of this, the processor will be vastly underutilized, resulting in very poor performance. This is the problem that cache is intended to solve, or at least minimize.
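The conversion from bus cycles to processor cycles can be sketched as follows, assuming the 500MHz processor and 100MHz memory bus from the earlier example and a clock ratio that is a whole number. The function name processor_wait_cycles is hypothetical.

```python
def processor_wait_cycles(bus_cycles, cpu_mhz=500, bus_mhz=100):
    """Convert a wait measured in bus cycles into processor cycles."""
    ratio = cpu_mhz // bus_mhz   # processor ticks per bus tick (5 here)
    return bus_cycles * ratio

first_data = processor_wait_cycles(8)  # wait for the first piece of data
next_data = processor_wait_cycles(1)   # wait for each following piece
print(first_data, next_data)  # 40 5
```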
The first cache implemented was integrated into the processor. However, because of the additional circuitry and larger die size, these caches were kept very small so the cost of the processor could stay low. By running this cache at full processor speed, and increasing the bus width to the processor to 128 bits, about 80% to 90% of the demand requests from the CPU could be satisfied from the processor cache with very low latency (typically one or two processor cycles). The processor cache (called Level 1, or L1) has grown in size to as much as 64K in some x86 processors (and is even larger in higher-end processors).
In order to increase the cache size, system designers decided to add a larger amount of commodity SRAM cache, which they called Level 2, or L2. Because designers can use standard SRAM chips, the cost is relatively low compared to L1 cache, but it is still at least 4 times as expensive as DRAM. Today, L2 caches in PCs may be from 512K to 2MB.
A typical SRAM cache chip will have a 3-cycle latency because the time to access the column is eliminated. If the request from the CPU finds the data in L2 cache, a four-cycle burst would take only 6 bus cycles rather than 10 or 11 cycles, which is about 40% to 45% less time for the same amount of data. This would correspondingly reduce the number of processor cycles from 50 or more down to about 30.
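A rough sketch of the burst arithmetic follows. It assumes that each transfer after the first takes one bus cycle, that the DRAM first access takes 7 or 8 bus cycles (which is what yields the 10 or 11 cycles quoted above), and that the processor runs at five times the bus clock as in the earlier example. The function names are hypothetical.

```python
def burst_cycles(first_access, burst_length=4, following=1):
    """Bus cycles for a burst: first access plus one cycle per later transfer."""
    return first_access + following * (burst_length - 1)

def time_saved(fast_cycles, slow_cycles):
    """Fraction of bus cycles saved by the faster path."""
    return 1.0 - fast_cycles / slow_cycles

CPU_PER_BUS = 5              # assumed 500MHz processor on a 100MHz bus
l2 = burst_cycles(3)         # 3-1-1-1 burst from L2 SRAM
for dram_first in (7, 8):    # assumed DRAM first-access latencies
    dram = burst_cycles(dram_first)
    print(f"L2 {l2} vs DRAM {dram} bus cycles: "
          f"{time_saved(l2, dram):.0%} fewer cycles, "
          f"{l2 * CPU_PER_BUS} vs {dram * CPU_PER_BUS} processor cycles")
```

Note that the 40% to 45% figure describes the reduction in cycles for the burst; expressed as data moved per cycle, the same numbers correspond to a larger relative gain.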
Studies have shown that as many as 98% of all requests from the CPU will be satisfied from either L1 or L2 cache. Though L2 cache is able to considerably improve the memory subsystem performance, processor speeds have continued to outpace memory, forcing designers to look at new methods to get data to the processor quickly.
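To get a feel for what these hit rates mean, the sketch below computes a weighted average cost per request. It is a simplified model under loudly stated assumptions, not a measurement: it takes 90% as the L1 hit rate (the upper end of the range given earlier), 98% as the combined L1 and L2 hit rate, 2 processor cycles for L1, and reuses the 30 and 50 processor-cycle burst figures as the cost of an L2 hit and a main memory access. The function name average_cost is hypothetical.

```python
def average_cost(l1_hit, l1_or_l2_hit, l1_cycles, l2_cycles, mem_cycles):
    """Weighted average processor cycles per request (simplified model)."""
    l2_share = l1_or_l2_hit - l1_hit   # requests that miss L1 but hit L2
    mem_share = 1.0 - l1_or_l2_hit     # requests that go to main memory
    return (l1_hit * l1_cycles
            + l2_share * l2_cycles
            + mem_share * mem_cycles)

# Assumed figures: 90% L1 hits, 98% combined, 2 / 30 / 50 processor cycles
with_l2 = average_cost(0.90, 0.98, 2, 30, 50)
# Same system without L2: every L1 miss goes to main memory
without_l2 = average_cost(0.90, 0.90, 2, 30, 50)
print(f"with L2: {with_l2:.1f} cycles, without L2: {without_l2:.1f} cycles")
```

Even in this crude model, the small share of requests that reach main memory accounts for a noticeable part of the average, which is why the gap between processor and memory speed remains a concern.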