A common technique for improving performance in computer systems (both hardware and software) is to utilize caching for frequently accessed information. This lowers the average cost of accessing the information, providing greater performance for the overall system. This applies in processor design, and in the Intel Pentium 4 Processor architecture, caching is a critical component of the system's performance. In this article we will describe the basics of the cache coherency mechanisms.
The Pentium 4 Processor Architecture includes multiple types and levels of caching:
Level 3 Cache - this type of caching is only available on some versions of the Pentium 4 Processor (notably the Pentium 4 Xeon processors). This provides a large on-processor tertiary memory storage area that the processor uses for keeping information nearby. Thus, the contents of the Level 3 cache are faster to access than main memory, but slower than other types of cached information.
Level 2 Cache - this type of cache is available in all versions of the Pentium 4 Processor. It is normally smaller than the Level 3 cache (if present) and is used for caching both data and code that is being used by the processor.
Level 1 Cache - this type of cache is used only for caching data. It is smaller than the Level 2 Cache and generally is used for the most frequently accessed information for the processor.
Trace Cache - this type of cache is used only for caching decoded instructions. Specifically, the processor has already broken down the normal processor instructions into micro operations and it is these "micro ops" that are cached by the P4 in the Trace Cache.
Translation Lookaside Buffer (TLB) - this type of cache is used for storing virtual-to-physical memory translation information. It is an associative cache and consists of an instruction TLB and data TLB.
Store Buffer - this type of cache is used for taking arbitrary write operations and caching them so they may be written back to memory without blocking the current processor operations. This decreases contention between the processor and other parts of the system that are accessing main memory. There are 24 entries in the Pentium 4.
Write Combining Buffer - this is similar to the Store Buffer, except that it is specifically optimized for burst write operations to a memory region. Thus, multiple write operations can be combined into a single write back operation. There are 6 entries in the Pentium 4.
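The benefit of a hierarchy like the one listed above can be estimated with a little arithmetic. The following is a minimal sketch, not a model of the Pentium 4: the structure, the function name, and any hit rates or latencies fed to it are illustrative assumptions. Each level's latency is paid only by the fraction of accesses that missed in every faster level.

```c
#include <stddef.h>

/* One level of a hypothetical memory hierarchy (illustrative only). */
struct level {
    double hit_rate;  /* fraction of accesses reaching this level that hit */
    double latency;   /* cost, in cycles, of accessing this level */
};

/*
 * Average cost per access. Levels are ordered fastest first; the last
 * level stands for main memory and should have a hit_rate of 1.0.
 */
double average_access_cycles(const struct level *levels, size_t count)
{
    double reach = 1.0;  /* fraction of accesses that get this far */
    double total = 0.0;

    for (size_t i = 0; i < count; i++) {
        total += reach * levels[i].latency;
        reach *= 1.0 - levels[i].hit_rate;
    }
    return total;
}
```

With made-up figures of a 2-cycle first level and an 8-cycle second level, each hitting half the time, in front of 100-cycle memory, the average is 2 + 0.5 * 8 + 0.25 * 100 = 31 cycles, compared with 100 cycles when every access goes to memory.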
The disadvantage of caching is that when the original copy is modified, the cached information becomes incorrect (or "stale"). A significant amount of the work done within the processor goes toward ensuring the consistency of the caches, both for physical memory and for the TLBs.
In the Pentium 4, physical memory caching remains coherent because the processor uses the MESI protocol. MESI defines the state of each unique cached piece of memory, called a cache line. In the Pentium 4, a cache line is 64 bytes. Thus, with the MESI protocol, each cache line is in one of four states:
Modified - the cache line is owned by this processor and there are modifications to that cache line stored within the processor cache. No other part of the system may access the main memory for that cache line as this will obtain stale information.
Exclusive - the cache line is owned by this processor and the cached copy matches main memory. No other processor holds a copy of the cache line, so this processor may modify it (moving it to the Modified state) without first notifying the rest of the system.
Shared - the cache line is held by this processor, but not exclusively. Other parts of the system may acquire shared access to the cache line and may read that particular cache line. None of the shared holders may modify the cache line without first invalidating the other copies, which ensures that all cached copies of the data remain valid.
Invalid - this processor holds no usable copy of the cache line. Other parts of the system may own this cache line, or it is possible that no other part of the system owns the cache line. This processor must fetch the line again before it can access that memory through its cache.
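The four states can be read as a small state machine for a single cache line. The sketch below is a simplified, textbook-style model and not a description of the Pentium 4's actual bus transactions; the event names, the others_have_copy parameter, and the writeback flag are all assumptions introduced for illustration.

```c
#include <stdbool.h>

enum mesi { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED };

/* Hypothetical events that can affect one cache line. */
enum bus_event {
    LOCAL_READ,    /* this processor reads the line */
    LOCAL_WRITE,   /* this processor writes the line */
    REMOTE_READ,   /* another agent reads the line */
    REMOTE_WRITE   /* another agent writes or takes ownership of the line */
};

/*
 * Return the next state of the line in this processor's cache.
 * *writeback is set when modified data must first be written to memory.
 */
enum mesi mesi_next(enum mesi state, enum bus_event ev,
                    bool others_have_copy, bool *writeback)
{
    *writeback = false;

    switch (ev) {
    case LOCAL_READ:
        /* A miss loads the line; it is Exclusive only if nobody shares it. */
        if (state == MESI_INVALID)
            return others_have_copy ? MESI_SHARED : MESI_EXCLUSIVE;
        return state;
    case LOCAL_WRITE:
        /* Writing requires sole ownership; other copies become Invalid. */
        return MESI_MODIFIED;
    case REMOTE_READ:
        if (state == MESI_MODIFIED)
            *writeback = true;
        return state == MESI_INVALID ? MESI_INVALID : MESI_SHARED;
    case REMOTE_WRITE:
        if (state == MESI_MODIFIED)
            *writeback = true;
        return MESI_INVALID;
    }
    return state;
}
```

In this model a line read by only one processor becomes Exclusive, a later local write silently promotes it to Modified, and a read by another agent forces a write-back and demotes it to Shared.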
So long as all parts of the system obey the MESI protocol, memory remains coherent and works properly. Note that we've been careful not to say "multi-processor" here because the MESI protocol is important in all environments, not only in a multi-processor environment. For example, a DMA controller on a PCI device must be able to ensure that the changes it makes to memory are visible to the processor. This requires explicit support either from the hardware (cache coherency) or from the device driver and operating system. Note that Windows does not assume that memory is cache coherent in this fashion. Thus the device driver uses HAL functions to ensure the coherency of memory in the presence of DMA. Such functions are essentially no-ops when the underlying hardware supports memory cache coherency.
In addition to the MESI protocol, each region in memory can be defined as having specific caching characteristics. The processor provides a number of different mechanisms for controlling cache policy:
Cache Disable (CD) - this is bit 30 in the CR0 register of the processor. If this bit is set, caching is not allowed for any memory.
Not Write-through (NW) - this is bit 29 in the CR0 register of the processor. It works together with the CD bit to select the processor-wide caching mode; in normal operation both bits are clear.
Page Cache Disable (PCD) - this bit is present in the CR3 register of the processor, as well as in each Page Directory Entry and Page Table Entry in the system. It controls caching of the structure that the register or entry points to: the page directory, a page table, or the page itself.
Page Write Through (PWT) - this bit is present in the CR3 register of the processor, as well as in each Page Directory Entry and Page Table Entry in the system. It controls the write-through policy for the page directory, page table, or page that the register or entry points to.
Global (G) - this bit is present in the Page Directory Entry and Page Table Entry and determines whether the cached translation for the entry remains valid when the contents of CR3 change. Thus, if this bit is set, the page is "global" and its TLB entry is kept even when the processor switches to a different set of page tables.
Page Global Enable (PGE) - this bit is present in the CR4 register of the processor and enables the interpretation of the G bit in the individual page directory and page table entries.
Memory Type Range Registers (MTRRs) - each MTRR describes a region of addressable memory on the system and the specific caching characteristics of the memory. There are up to 96 memory regions. Normally this is set up by the BIOS.
Page Attribute Table (PAT) - allows control of memory on a page-by-page basis. Each Page Directory and Page Table Entry contains a single PAT bit and when interpreted with the PCD and PWT bits, chooses a specific entry in the PAT table.
Third Level Cache Disable - this is bit 6 of the IA32_MISC_ENABLE model-specific register (MSR). It exists only on processors with a third level cache and can be used to disable that cache.
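Several of the controls above are single bits that software tests with simple masks. The sketch below shows how such checks might look. The CR0 bit numbers come from the list above; the page table entry and CR4 bit positions are not given in this article and are filled in here from the IA-32 manuals for 32-bit entries mapping 4 KB pages, so treat them as assumptions to verify. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Control register bits. */
#define CR0_NW   (1u << 29)
#define CR0_CD   (1u << 30)
#define CR4_PGE  (1u << 7)   /* assumed position, see lead-in */

/* Page table entry bits for a 4 KB page (assumed positions). */
#define PTE_PWT  (1u << 3)
#define PTE_PCD  (1u << 4)
#define PTE_PAT  (1u << 7)
#define PTE_G    (1u << 8)

/* True when CR0.CD disables caching for all memory. */
bool caching_globally_disabled(uint32_t cr0)
{
    return (cr0 & CR0_CD) != 0;
}

/* The G bit in an entry only has meaning when CR4.PGE is set. */
bool pte_is_global(uint32_t pte, uint32_t cr4)
{
    return (cr4 & CR4_PGE) != 0 && (pte & PTE_G) != 0;
}

/* Combine the PAT, PCD and PWT bits into an index (0..7) into the PAT. */
unsigned pat_index(uint32_t pte)
{
    unsigned pat = (pte & PTE_PAT) ? 1u : 0u;
    unsigned pcd = (pte & PTE_PCD) ? 1u : 0u;
    unsigned pwt = (pte & PTE_PWT) ? 1u : 0u;

    return (pat << 2) | (pcd << 1) | pwt;
}
```

Reading CR0, CR3, CR4, or an MSR requires privileged instructions, so code like this would run inside the operating system or a driver; the helpers here only interpret values that have already been read.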
Thus, there is a broad range of options to consider when determining the caching policy of a given page. It is possible (e.g., through the PAT settings of different virtual pages) to set up multiple inconsistent caching policies for a single page of physical memory. In such cases the processor may function in unpredictable ways. In earlier versions of Windows, this was actually a problem. In Windows XP the caching of a single page must always be consistent, even when mapped by multiple page table entries. This requirement is now enforced by the operating system.
These various options allow control of the specific type of caching for all memory in the system. When talking about the caching characteristics of specific pieces of memory, we use some additional terms:
Memory is said to be cacheable when the processor may store the data from that memory region within one of the processor caches. Most memory is cacheable, but some memory (e.g., memory used to communicate with devices) is non-cacheable.
Memory is said to allow write-back when the processor may store modifications in its cache rather than store the changes immediately back to memory. This allows (for example) write combining operations and may cause read and write operations to appear "out of order" when viewed from outside the processor.
Memory is said to allow speculative reads when the processor may read memory that it might need, even if (for example, depending upon the ultimate result of conditional operations) the data turns out not to be needed. Most memory falls into this category, but memory where a read operation changes the state of the memory (e.g., device memory) does not allow speculative reads.
Write-combining is the case when multiple write operations are combined into a single write back to memory. This is a bit different than write-back because write combining does not require that read and write operations appear out of order. For example, if the processor allows write-combining but not write-back then a series of write operations may be combined into a single write; but then a read operation will block, the write will occur and then the read will proceed. This preserves the ordering of the read and write.
The ordering of read and write operations is particularly important in device memory, because it is essential to ensure that writes to control registers be sent to the device before the response is read from the device memory.
Normal computer memory allows caching, write-back, write-combining, and speculative reads. The memory type range registers describe to the processor the caching behavior for each region of physical memory on the system. There are a number of different memory types:
Write Protected - this type of memory can be cached in the processor, does not allow any write operations, allows speculative reads and does not require any ordering of the read operations.
Write Back - this type of memory can be cached in the processor, allows write-back and write-combining, allows speculative reads and does not require any ordering of the read operations.
Write Through - this type of memory can be cached in the processor, does not allow write-back, allows write-combining and speculative reads, and does not require any ordering.
Write Combining - this type of memory cannot be cached in the processor, does not allow write-back, allows write-combining and speculative reads, and imposes weak ordering (that is, reads are ordered with respect to write operations).
Uncacheable - this type of memory cannot be cached in the processor, does not allow write-back, speculative reads, or write-combining (by default). However, write-combining may be allowed by explicitly enabling it in the MTRR.
Strong Uncacheable - this type of memory does not allow any type of caching, write-back, write-combining, or speculative reads. All access is strongly ordered (that is, reads occur in order, writes occur in order).
Of course, these types of memory caching are realized by combining the various control mechanisms to yield the final type of cacheable memory. When determining the cache policy of a given region of memory, the CD bit is first considered; if it is set, caching is disabled, and no other attributes are considered. If the CD bit is clear, then the MTRR and page level cache controls are considered, with the most restrictive policy being enforced. If the write-back and write through policies conflict, then the write-through policy takes precedence. Write-combining policy takes precedence over write-through or write-back policy as well. Write-combining can only be established via the MTRR or PAT mechanisms in any case.
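The precedence rules in the paragraph above can be sketched as a small function. This is a deliberately simplified model of the rules as stated here, not the full combination tables from Intel's manuals: it omits the Write Protected type, treats both uncacheable variants as one, and orders the remaining types by restrictiveness so that the stricter of the MTRR and page-level settings wins. The enum and function names are hypothetical.

```c
#include <stdbool.h>

/* Memory types ordered from least to most restrictive (simplified). */
enum mem_type {
    MT_WRITE_BACK      = 0,
    MT_WRITE_THROUGH   = 1,
    MT_WRITE_COMBINING = 2,
    MT_UNCACHEABLE     = 3
};

/*
 * Effective type for a region, given CR0.CD, the type from the MTRRs,
 * and the type selected by the page-level controls (PCD/PWT/PAT).
 */
enum mem_type effective_type(bool cr0_cd, enum mem_type mtrr,
                             enum mem_type page)
{
    /* CD is considered first: if set, nothing else matters. */
    if (cr0_cd)
        return MT_UNCACHEABLE;

    /* Otherwise the more restrictive of the two settings is enforced. */
    return mtrr > page ? mtrr : page;
}
```

Under this ordering, write-through wins over write-back and write-combining wins over both, matching the rules in the text.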
In addition to managing the cache policy on individual pages, there are also specific instructions in the Pentium 4 that can be used to perform operations that directly affect the cache itself. These include the cache invalidation operations (INVD and WBINVD), the prefetch hint operations (PREFETCHh), cache flush (CLFLUSH), and non-cache polluting move operations (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD).
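Most of these instructions are reachable from C through compiler intrinsics (INVD and WBINVD are privileged and are not shown). The following is a sketch that assumes a compiler providing the SSE2 intrinsics header emmintrin.h and a 64-byte cache line; the function names are hypothetical, and on other targets the code falls back to ordinary loads and stores.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2 intrinsics; also brings in the SSE ones */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

#define CACHE_LINE 64   /* assumed line size, as on the Pentium 4 */

/* Fill a buffer with non-temporal stores (MOVNTI) to avoid cache pollution. */
void fill_streaming(int32_t *dst, size_t count, int32_t value)
{
#if HAVE_SSE2
    for (size_t i = 0; i < count; i++)
        _mm_stream_si32((int *)&dst[i], value);
    _mm_sfence();  /* order the streamed stores before later stores */
#else
    for (size_t i = 0; i < count; i++)
        dst[i] = value;
#endif
}

/* Sum a buffer, issuing a prefetch hint (PREFETCHNTA) once per line. */
int64_t sum_with_prefetch(const int32_t *src, size_t count)
{
    int64_t total = 0;

    for (size_t i = 0; i < count; i++) {
#if HAVE_SSE2
        if (i % 16 == 0 && i + 64 < count)
            _mm_prefetch((const char *)&src[i + 64], _MM_HINT_NTA);
#endif
        total += src[i];
    }
    return total;
}

/* Flush every cache line that overlaps the given range (CLFLUSH). */
void flush_range(const void *p, size_t bytes)
{
#if HAVE_SSE2
    uintptr_t addr = (uintptr_t)p & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)p + bytes;

    for (; addr < end; addr += CACHE_LINE)
        _mm_clflush((const void *)addr);
    _mm_mfence();  /* wait for the flushes to complete */
#else
    (void)p;
    (void)bytes;
#endif
}
```

The prefetch distance of 64 elements is an arbitrary choice for illustration; a useful distance depends on the workload and should be measured rather than assumed.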
Even with all of this processor support for cache coherency, there are a number of cases when it is the responsibility of the developer (generally the operating system) to handle some cache coherency issues:
Self-modifying code: because the trace cache consists of decoded micro ops, if the original code is changed, the trace cache must be invalidated by using a serializing operation. There are several, but the documentation for the Pentium 4 clearly favors CPUID.
Dual-ported memory: changes to dual-ported memory can occur outside the ability of the cache coherency mechanism to detect. Thus, the operating system must explicitly perform cache invalidation (using an appropriate instruction).
Page Table Changes: because the Pentium 4 includes speculative execution, a change to a page table entry must be followed by a TLB invalidation to ensure that stale TLB information is not used. In other words, changing the page table does not automatically invalidate the TLB entries that cache information retrieved from it.
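The page table case is easy to demonstrate with a toy software model. The sketch below is not how the hardware is built: it uses a tiny direct-mapped TLB for brevity (the real TLBs are associative), and every name in it is hypothetical. It shows that a translation cached before a page table update keeps being returned until the entry is explicitly invalidated, which is the job the INVLPG instruction performs for a single page on real hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PAGES 8
#define TLB_SLOTS 4

struct toy_mmu {
    uint32_t page_table[NUM_PAGES];  /* virtual page -> physical frame */
    struct {
        bool valid;
        uint32_t vpage;
        uint32_t frame;
    } tlb[TLB_SLOTS];
};

/* Translate a virtual page (must be < NUM_PAGES), filling the TLB on a miss. */
uint32_t toy_translate(struct toy_mmu *m, uint32_t vpage)
{
    unsigned slot = vpage % TLB_SLOTS;

    if (m->tlb[slot].valid && m->tlb[slot].vpage == vpage)
        return m->tlb[slot].frame;   /* hit: the page table is not consulted */

    m->tlb[slot].valid = true;
    m->tlb[slot].vpage = vpage;
    m->tlb[slot].frame = m->page_table[vpage];
    return m->tlb[slot].frame;
}

/* Drop the cached translation for one virtual page, if present. */
void toy_invalidate(struct toy_mmu *m, uint32_t vpage)
{
    unsigned slot = vpage % TLB_SLOTS;

    if (m->tlb[slot].valid && m->tlb[slot].vpage == vpage)
        m->tlb[slot].valid = false;
}

/* The safe way to change a mapping: update the table, then invalidate. */
void toy_remap(struct toy_mmu *m, uint32_t vpage, uint32_t new_frame)
{
    m->page_table[vpage] = new_frame;
    toy_invalidate(m, vpage);
}
```

Writing to page_table directly and then calling toy_translate returns the old frame, while going through toy_remap returns the new one. On a multi-processor system the same invalidation has to happen on every processor that may have cached the translation.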
With these various caching mechanisms in place, the overall performance of the Pentium 4 is greatly improved. By including strong cache control within the processor itself, the operating system is only minimally burdened. The architecture of the Pentium 4 also works quite well in the presence of multiple processors. However, the operating system must deal with coherency of the page table information, and ensure that cache policy is uniform across processors (e.g., the PAT must be identical on each processor).
While complicated, caching greatly improves the overall system performance. Of course, a faster processor is just the first part of the performance puzzle!