The NT Insider

Windows NT Virtual Memory (Part II)
(By: The NT Insider, Vol 6, Issue 1, Jan-Feb 1999 | Published: 15-Feb-99 | Modified: 16-Aug-02)


In our earlier article (The NT Insider, Volume 5, Number 2) we promised to continue revealing the secrets of Windows NT's virtual memory system.  In this second part of the article we'll describe how Windows NT manages the key virtual memory data structures: the page tables, the virtual address descriptors, the working set, and the page frame database.

Page Tables

Typically, both the operating system and the hardware manage the page tables.  In Windows NT, each process has its own private page table (all threads within the process share the same page table).  Thus, as the CPU switches from one process to another, it must also switch from one page table to another.


Some hardware does not directly access these page tables.  For example, the MIPS processor always relied upon a translation lookaside buffer (TLB) and never accessed the page table.  When the CPU had to decode a virtual address it would look in the TLB.  If there was a corresponding physical entry in the TLB it would use it.  If there was no such entry, it would generate a page fault, invoking the operating system to resolve the virtual-to-physical translation.  While the MIPS and PPC platforms are no longer supported by Windows NT, it is useful to discuss them because the memory management architecture of these systems is still reflected in the design of the Windows NT VM system.


As it turns out, using a TLB is a common technique and is used by the Intel and Alpha platforms as well.  Unlike the MIPS platform, however, these other CPU platforms will traverse the page tables to find the correct virtual-to-physical translation and then store that information in the TLB for subsequent lookups.


The use of a TLB is motivated by the requirement for performance.  The virtual-to-physical translation of addresses is done frequently within the CPU; in fact, a translation is required for every single instruction, since the instruction pointer itself contains a virtual address that must be translated.  Thus, ensuring that this process is very fast is critical to ensuring the CPU operates at top speed.


For both the Intel and Alpha CPUs, the structure of the page tables is defined by the CPU hardware (when the entries are valid) and by Windows NT (when the entries are not valid).  Thus, the details of how these structures are laid out actually vary depending upon the state of the system.  Further complicating matters, these hardware platforms both support multiple configurations for the page tables.  In Windows NT 4.0, the operating system always chooses a two-level page table scheme (although a future version of Windows NT will support larger page tables and 64-bit virtual addresses).  In the two-level page table scheme we consider each virtual address as consisting of three distinct components, as in Figure 1.



Figure 1


For the Intel platform, each entry in the page table is 32 bits (4 bytes) in size.  Thus, a single 4KB page (the page size on the Intel platform) contains 1024 entries. 


The initial page table (at the top level) is referred to as the page directory.  This page directory contains 1024 entries (called, logically enough, page directory entries).  Each page directory entry is the address of a page table.  In turn, each page table also contains 1024 entries, with each entry representing the address of a particular physical page.  Conveniently enough, the page tables themselves can be paged in and out of memory, just like any other virtual memory.

Thus, we can think of a single virtual address as consisting of three sets of information: the high-order ten bits that represent the offset into the page directory, the next ten bits that represent the offset into the page table, and the last 12 bits that represent the byte offset onto the actual page.  The choice of a 4KB page size is thus fortuitous: each page is 4KB in size, which requires 12 bits (for byte addressing), and each page table contains 1024 entries, which requires 10 bits (for entry addressing).  Thus, a single page table represents 4MB (4KB x 1024) of memory.  The page directory also consists of 1024 entries, so the full page directory represents 4GB (4MB x 1024).  Thus, the 4GB virtual address space can be represented using one page directory plus 1024 page tables.
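The decomposition above can be sketched in C; the masks and shifts follow directly from the 10/10/12 split just described, though the helper names are ours, not anything Windows NT defines:

```c
#include <stdint.h>

/* Split a 32-bit x86 virtual address into the three components described
 * above: a 10-bit page directory index, a 10-bit page table index, and a
 * 12-bit byte offset within the 4KB page. */
static uint32_t pdi(uint32_t va) { return va >> 22; }            /* bits 31..22 */
static uint32_t pti(uint32_t va) { return (va >> 12) & 0x3FFu; } /* bits 21..12 */
static uint32_t off(uint32_t va) { return va & 0xFFFu; }         /* bits 11..0  */
```

For example, the (arbitrary) address 0x80123ABC decodes to page directory entry 512, page table entry 0x123, and byte offset 0xABC.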

Of course, on the Alpha system the page size is 8KB, so only the first 25% of the page directory is needed to represent the entire 4GB virtual address space; alternatively, the entire page directory could be used to represent a 32GB virtual address space.  Interestingly enough, this is the size of the virtual address space provided with VLM support in Windows 2000.


One important trick in managing the page tables is that the entire 32 bit entry need not be used for describing the correct physical page.  On the Intel platform, where pages are 4KB, only 20 bits need to be used to describe the physical page.  The remaining 12 bits are used by the hardware and the operating system to maintain control information about the virtual page represented by this particular entry.


One of the most important bits is the valid bit.  This bit indicates whether or not the physical page represented by the entry in the page table is valid.  If this bit is on, the physical page reference is valid and the format of this entry is defined by the hardware (for hardware that supports direct access to the page tables, of course).  If this bit is off, however, the remaining 31 bits of the entry belong to the operating system, and Windows NT takes full advantage of them.
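For illustration, here is what testing the valid bit might look like in C.  On the Intel platform the hardware does define bit 0 of a PTE as the valid ("present") bit; treating the remaining 31 bits as a single opaque software field is a simplification of ours, not NT's actual definition:

```c
#include <stdint.h>

/* Bit 0 of an x86 page table entry is the hardware-defined valid bit. */
#define PTE_VALID 0x00000001u

static int pte_is_valid(uint32_t pte)
{
    return (pte & PTE_VALID) != 0;
}

/* When the valid bit is clear, the remaining 31 bits belong entirely to
 * the operating system; here we just expose them as one field. */
static uint32_t pte_software_bits(uint32_t pte)
{
    return pte >> 1;
}
```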


In Figure 2, we demonstrate one possible layout for the page table entry when the entry is not valid (and is defined by the operating system).  This layout is used when the contents of the virtual page are presently stored in the paging file (we discuss this later in this article).




Figure 2


In general, Windows NT assumes very little about the particular hardware support for page table entries.  This is essential to ensure that it is portable across a broad range of hardware platforms.  Thus, even readers who have a detailed understanding of how their favorite hardware platform works should not assume that this is how Windows NT uses the underlying hardware resources.


In addition to the valid bit, Windows NT may be using a number of other bits, as described in Figure 3 (not all bits are present in all page table entries; it often depends upon the details of the type of entry).


Bit Name                         Meaning for this virtual page

recently accessed                In reclaiming physical pages for reuse by the operating system

recently modified                In determining if data must be written to permanent storage (such as disk)

can be written                   In determining allowed access to the page

user access                      User access to the page is allowed

copy on write                    Page can be written only after a copy is made; the Memory Manager will copy the page before allowing a modification, and the copy is private to the current process

prototype entry                  Reference is to another page table entry, not a real page

page is in transitional state    Used when moving a page from active state to inactive state; the page can be reclaimed as needed


Figure 3


As it turns out, these few bits are sufficient to implement all of the complex functionality present within Windows NT, including protecting system memory from application programs, sharing DLLs between separate processes, even running programs themselves!


In addition to these basic bits, some formats for the page table entry may include other bits unique to the particular "type" of entry.  Of course, these entries are defined by the OS and are only used when the page table entry itself is invalid.  In the figure earlier, some of the bits were used to denote which paging file was used to store the data contents of the page.  Thus, when that virtual page is needed again, the Memory Manager must fetch the actual contents of the physical page from the paging file.  Given that there are 4 bits, the basis of Windows NT's limitation that there can be no more than 16 paging files is clear!
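As a sketch of that arithmetic, the following C fragment assumes a hypothetical invalid-PTE layout in which four bits select the paging file.  The specific bit positions are our invention; only the "four bits, hence at most 16 paging files" reasoning comes from the article:

```c
#include <stdint.h>

/* Hypothetical invalid-PTE layout for a page stored in a paging file:
 * bit 0 is the (clear) valid bit, bits 1..4 select the paging file, and
 * the high bits locate the page within that file. */
static uint32_t pagefile_number(uint32_t pte) { return (pte >> 1) & 0xFu; }
static uint32_t pagefile_offset(uint32_t pte) { return pte >> 12; }
```

Four bits allow 1 << 4 = 16 distinct paging files, which is exactly the documented limit.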

Virtual Address Descriptors

While the page tables are useful for determining the current virtual-to-physical mapping for a given process, when a page table entry indicates the page is invalid the Memory Manager must be able to determine where the data contents of the page are located.  We showed one possible layout, where the data resides in the paging file, but the Memory Manager must be able to determine the format of the PTE (when it is "invalid" to the hardware and owned by the OS).  Thus, a virtual page might be "valid" in the sense that it is an acceptable address, even though its data contents must be fetched from disk and placed in a physical page.


When a virtual page cannot be translated to its corresponding physical page, the hardware generates a page fault.  This transfers control to the operating system (as we mentioned in the previous article, on the Intel platform it transfers control to a function called KiTrap0E).  Once the hardware specific logic within the kernel has performed initial processing of the page fault, it transfers control to the Memory Manager function MmAccessFault (indeed, those who develop file systems see that particular function on the stack on a regular basis!).


At this point, MmAccessFault must determine what actually caused the fault.  To do this, it must first look at the page tables to analyze the contents of the page table entry.  It is possible that between the time the page fault occurred and the Memory Manager was called, some other process or CPU has already constructed the necessary PTE.  In that case, the Memory Manager can simply dismiss the page fault and restart the faulting instruction.


Of course, most of the time the Memory Manager finds that the page table entry is not valid.  In this case, it must actually figure out where to get the information, so it can allocate a new physical page and copy the correct data into that page.  Then it can fix the page table and restart the operation.  The Memory Manager figures out what should be at a particular address by consulting the virtual address descriptor (VAD) that covers the faulting address.


A single process will normally contain many VADs.  Each VAD will describe a range of virtual pages and tell the Memory Manager what those virtual pages actually represent.  For example, a typical process will consist of an executable image (the "program") and a set of dynamic link libraries (DLLs) that are used within that process, plus data that is unique to the program.  Each of these separate pieces exists somewhere within the address space of the program.  When each component is first loaded into the address space the Memory Manager creates a new VAD entry for each such range of addresses.


These VAD entries are in turn linked together in a special type of binary tree (a "splay" tree) that optimizes access to the most recently accessed VAD.  This representation has several key advantages: it is easy to describe a sparse address space using a tree of VADs, it is fast to find entries within the VAD tree, and it is easy to reorganize VAD entries as necessary.
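A minimal sketch of the range lookup follows.  Windows NT uses a splay tree; for brevity this sketch uses a plain (non-splaying) binary search tree keyed on address ranges, and all names and fields are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* A hypothetical VAD node describing one range of virtual addresses. */
typedef struct Vad {
    uint32_t start, end;        /* inclusive range covered by this VAD */
    struct Vad *left, *right;
} Vad;

/* Walk the tree looking for the descriptor covering the faulting
 * address; a NULL result means no VAD covers it (an access violation). */
static const Vad *vad_find(const Vad *root, uint32_t addr)
{
    while (root) {
        if (addr < root->start)
            root = root->left;
        else if (addr > root->end)
            root = root->right;
        else
            return root;        /* addr falls inside this VAD's range */
    }
    return NULL;
}
```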

Page Frame Database

Up to this point, we have described the data structures the Memory Manager uses to control the virtual address space.  In addition to managing the virtual-to-physical mappings, the Memory Manager is also responsible for managing the allocation and use of physical memory.  It does this by maintaining the page frame database.


Given that all of physical memory is managed in units called pages (recall that on the Intel platform the page size is 4KB and on the Alpha platform it is 8KB), the Memory Manager sets aside a chunk of physical memory that is used to track the state of the physical pages.  The Memory Manager refers to the individual entries within the page frame database by the page number of the physical page.  Because of this, the page frame database is frequently referred to as the page frame number (PFN) database.


Each physical page within the system is always in one of seven possible states:


Active - the page is currently active and mapped into at least one address space

Free - the page is currently available for use

Zero - the page is currently available for use and zero filled

Modified - the page is currently in transition, dirty, and scheduled to be written

ModifiedNoWrite - the page is currently in transition and dirty, but not scheduled to be written

Bad - the page has an error and is not being used

Standby - the page is currently inactive but may be reclaimed
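These states can be captured as a simple C enumeration (the identifiers are illustrative, not the kernel's own), together with a helper reflecting that only Free and Zero pages can be handed out without further work:

```c
/* The seven page states tracked in the PFN database. */
typedef enum {
    PfnActive,
    PfnFree,
    PfnZero,
    PfnModified,
    PfnModifiedNoWrite,
    PfnBad,
    PfnStandby
} PfnState;

/* Free and Zero pages are immediately allocatable; Standby and Modified
 * pages must first be unlinked (and dirty data written) before reuse. */
static int pfn_immediately_allocatable(PfnState s)
{
    return s == PfnFree || s == PfnZero;
}
```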


For each of these states, the individual page frames are stored on a corresponding list.  In addition, pages that are in use by a specific process are also tracked as part of that process's working set, which is described later in this article.


Whenever the Memory Manager must allocate a new physical page, it will remove that page from one of its internal lists (such as the free list) and allocate it to the given process.  In some cases, the physical page is not tracked with the process but rather via an internal memory manager control structure.  This is the case, for example, with any physical memory that is being shared between processes.  We describe this special type of memory later in this article (when we discuss ?prototype page table entries?).

Working Set

The working set is how the Memory Manager tracks which physical pages belong to a given process.  Each time a new physical page is allocated to a given process, the page is added to its working set.  Thus, at any given time the Memory Manager knows how much physical memory each distinct process within the system is consuming.


When a process is first created, it is assigned two physical memory-related quotas: a minimum working set size and a maximum working set size.  The exact values of these two quotas vary over time and can even be changed programmatically (for example, the Win32 function SetProcessWorkingSetSize demonstrates how Win32 applications can change their own working set quotas, assuming they have the necessary privileges).


Because physical memory must be shared between all processes within the system, the Memory Manager must be able to reclaim physical pages from some processes to make those physical pages available to other processes.  Thus, each time a process adds a new physical page to its working set, the Memory Manager must ensure that the process is not consuming more memory than its "fair share".  It does this by examining the minimum and maximum working set quotas.


If the process is currently using less than its minimum working set quota, it can add new pages.  In general, the Memory Manager always tries to allow a process to maintain at least its minimum working set quota in physical pages.  Once a process is using more than its minimum quota, however, it is subject to "trimming", the process by which the Memory Manager removes pages from the working set and returns them to the free list.


The maximum working set quota cannot be exceeded.  When a process tries to add more pages than allowed by its maximum working set quota, the system will trim out old pages to make room for the new pages.  Thus, no single process is allowed to consume all available physical memory.
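The admission rule described above can be sketched in C.  The types and names here are hypothetical, and real quota management is considerably more involved, but the core logic is simple:

```c
#include <stdint.h>

/* A hypothetical per-process working set record. */
typedef struct {
    uint32_t size;  /* pages currently resident in the working set */
    uint32_t min;   /* minimum working set quota, in pages */
    uint32_t max;   /* maximum working set quota, in pages */
} WorkingSet;

/* Below the minimum quota, the process is not even eligible for trimming. */
static int ws_eligible_for_trimming(const WorkingSet *ws)
{
    return ws->size > ws->min;
}

/* Number of pages that must be trimmed before one more page may be
 * added: zero while the set is below its maximum, otherwise just enough
 * to keep the set at or under the maximum after the addition. */
static uint32_t ws_pages_to_trim(const WorkingSet *ws)
{
    return ws->size >= ws->max ? ws->size - ws->max + 1 : 0;
}
```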


Of course, as we mentioned before, these values are dynamic.  The Memory Manager monitors the behavior of the system and of the specific process and adjusts these quotas as necessary.  For example, a process that is generating an excessive number of page faults will be allowed to grow, in hopes of decreasing the number of page faults it is generating.

Balance Set Manager

The limits on the working set size of a process (or the Cache Manager, which has its own specific working set) are important to ensure that no single process steals too much memory (and thus makes all other processes suffer as a result).  However, by itself, such limits are not enough to guarantee that physical memory is shared fairly between processes.


To further facilitate this process, the kernel actually has a background worker thread known as the Balance Set Manager that is responsible for performing a variety of resource pruning tasks.  One of these is trimming the working sets of the processes that might be consuming physical memory.  The Balance Set Manager runs periodically as needed.  Thus, when the system isn't even using all of the available physical memory, the Balance Set Manager will be quiescent.  However, once the demands for physical memory exceed the available pool, the Balance Set Manager will be called upon to "trim" the working sets in order to improve overall system performance (although this might come at the cost of the performance of a single application).


The Balance Set Manager analyzes each process and, as necessary, trims the working set of that particular process.  As we discussed earlier in this article, each process has a minimum working set quota, and while the Balance Set Manager may reclaim physical pages of memory from the process, it will not normally decrease the amount of physical memory below that minimum working set quota.


In fact, the reclamation of pages is done using a two-step process that attempts to ensure that only physical pages that are relatively unimportant are actually reclaimed from a process.  As the Balance Set Manager scans through the working set of the given process, each page table entry is examined, and the Balance Set Manager looks for pages that have the "accessed" bit cleared.  Such pages can be reclaimed by the Balance Set Manager by making the page transitional (that is, no longer "valid" but not quite free).  If the page has the "accessed" bit set, the Balance Set Manager clears the bit but does not remove the page from the working set (the page remains valid).  This technique approximates an LRU algorithm without the burdensome overhead of implementing an actual LRU scheme.
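The accessed-bit sweep can be modeled in C as follows; the bit position and the representation of a "transitional" page are assumptions made purely for illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define PTE_ACCESSED 0x20u      /* illustrative bit position */

/* One trimming pass over a working set's PTEs: a set accessed bit is
 * cleared (the page gets a second chance on the next pass), while a
 * clear bit marks the page for reclamation (modeled here by zeroing the
 * entry).  Returns the number of pages reclaimed. */
static size_t trim_pass(uint32_t *ptes, size_t n)
{
    size_t reclaimed = 0;
    for (size_t i = 0; i < n; i++) {
        if (ptes[i] & PTE_ACCESSED) {
            ptes[i] &= ~PTE_ACCESSED;   /* recently used: survives this pass */
        } else {
            ptes[i] = 0;                /* made transitional / reclaimed */
            reclaimed++;
        }
    }
    return reclaimed;
}
```

A page touched between two passes keeps getting its accessed bit set by the hardware and so is never reclaimed; an untouched page survives at most one pass, which is what approximates LRU so cheaply.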


Under certain circumstances, the Balance Set Manager will reclaim all the physical memory for a process.  This occurs when the process has not been active for a period of time.  In these circumstances, the Balance Set Manager will reclaim all of the physical pages of the process.  If that process is then later rescheduled, the Memory Manager must re-create the data for those physical pages and read it in from disk.  For example, the Winlogon program (where you type your name and password, and choose a domain to log into the system) is typically paged out of memory since it is not normally needed.

Paging Files

When a physical page is reclaimed by the operating system, something must be done with the data on that particular page.  For some pages, the data is already stored in a file on the disk.  For example, the executable code for a particular program is stored in a file located on a disk somewhere (either locally or across the network).  However, application programs also create dynamic information that must be stored in a safe location whenever the physical page containing the data is reclaimed.  Clearly, this cannot be stored in the original executable file because it is shared with all running instances of the same program.  Thus, the Memory Manager uses paging files to store such data.  A paging file is nothing more than a disk-based file under the exclusive control of the Memory Manager.


While the file systems would allow the paging file to have any name, the Memory Manager uses the name pagefile.sys.  You can see this file using Explorer, although you won't be able to open or copy it while it is in use, because the file is controlled by the Memory Manager.  Indeed, the Memory Manager can even change the size of the paging file, if needed, to increase the amount of disk space (and hence the amount of "virtual memory") available to store such data.


As we discussed in the first part of this article (The NT Insider, Volume 5, Number 2) the page table entry for a given process may actually describe where the data is located within the paging file.  Thus, if the application program accesses that page, the page itself will be marked as invalid but the Memory Manager will notice that this page must be stored in the paging file.  In that instance, the Memory Manager can then interpret the remaining 31 bits of the page table entry to tell it where the data is located within the paging file.  This is graphically demonstrated in Figure 2.


Figure 2


Indeed, the existing Memory Manager implementation allows Windows NT to use more than a single paging file, although the file systems do limit support to one paging file per file system volume.  As shown in Figure 2, there are four bits reserved for describing which paging file is being used for a given entry, thus restricting the system to no more than 16 such paging files (although 16 paging files should be more than almost anyone will ever need!).

Memory Mapped Files

As we mentioned earlier, some data is restored directly from the original file from which it came.  All such files are accessed using a memory mapping technique, where the file appears as if it were entirely in memory, even though it is paged in from disk as needed.  A memory mapped file could be read-only, as it is for executable programs, or modifiable as it is for data files. For example, the Win32 function MapViewOfFile is used by Win32 applications that wish to access their file as if it were loaded into their address space, rather than via the normal read/write interface.


There are actually a number of ways that a memory mapped file can be accessed (read-only, read-write, or for execution), but the Memory Manager groups these into two distinct types of access.  One type of access (used for executables) uses copy-on-write technology to ensure the original file is not modified (although the application can change its own data) and the other type of access (used for data files) simply writes changes back to the original file.


While changing the code in an executable program would seem to be uncommon, in fact it occurs regularly and thus it is necessary for Windows NT to support it.  This ensures that applications relying upon the ability to modify code work properly.  One example of such an application is a debugger.  Debuggers typically create breakpoints within the code by modifying the existing application code (loaded into memory).  When the breakpoint occurs, the debugger substitutes the correct code for the breakpoint and restarts execution.  Of course, it would be a terrible idea to write the breakpoint back to the original program file; otherwise, anyone else who ran the program would encounter the breakpoint.  Worse yet, they wouldn't have the correct code to execute, even if they were running under a different debugger!


Another example of code modification, indeed probably the most common case, is the import table associated with a DLL.  This table must be overwritten with the correct addresses of the functions (as they were loaded into memory, not as they were in the original DLL file located on disk).  Sometimes no modification is needed, because the DLL was loaded into memory at the same location specified when it was linked.  Indeed, this is so much faster that Microsoft actually recommends (in the SDK documentation) that application programmers choose distinct preferred load addresses for their DLLs to minimize the "fix up" work required.


The first time an executable page is modified, it generates a page fault (because the page is marked as "read-only" in its PTE).  The Memory Manager then:


- Allocates a new physical page

- Copies the current page's data to that new page

- Updates the PTE to point to the new physical page (and to allow writing onto that page)

- Updates other internal data structures describing the address space of the process (such as the VAD tree)
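The steps above can be modeled in user-mode C, with a plain pointer standing in for the page table entry; this is a sketch of the copy-on-write idea, not the Memory Manager's actual code:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096u

/* Handle a write fault on a copy-on-write page: allocate a fresh page,
 * copy the shared page's contents, and repoint the "PTE" (here just a
 * pointer) at the private copy.  Returns the private page, or NULL on
 * allocation failure.  Step 4 (updating bookkeeping such as the VAD
 * tree) is left to the caller in this model. */
static uint8_t *cow_fault(uint8_t **pte, const uint8_t *shared_page)
{
    uint8_t *private_page = malloc(PAGE_SIZE);      /* 1. allocate */
    if (!private_page)
        return NULL;
    memcpy(private_page, shared_page, PAGE_SIZE);   /* 2. copy */
    *pte = private_page;                            /* 3. update "PTE" */
    return private_page;
}
```

After the fault, writes through the updated mapping land on the private copy while the shared original remains untouched.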


Unlike executable images, however, changes to memory mapped data files must be written back to disk eventually.  Thus, such pages are initialized so that they may be modified by the application using the memory mapping. As changes are made to the page, the hardware (or possibly the operating system) updates the page table entry to note that the physical page has been modified.  Thus, when the Memory Manager or Balance Set Manager attempts to reclaim the page, it will be written back to disk before it can be reclaimed.  Note that if the system has enough physical memory it might be quite a long time before the Memory Manager attempts to reclaim that page.  Thus, applications must ensure that the data is flushed to disk by using the appropriate APIs.  This is different than when applications use the read/write interface, because in that case the Cache Manager actually writes the data back to disk a few seconds after it is stored into the cache.

Prototype Page Table Entries

One interesting problem that occurs with memory mapped files is that a single file could be in use by more than one process at any point in time.  For example, two programs might be using the same DLL.  In that case it would be a waste of resources to maintain two separate copies of the DLL in memory.


A more serious case would be if two applications (or two instances of the same application) opened a data file and each memory mapped it.  If there are two copies of that file in memory, there is absolutely no guarantee as to the order in which updates to the file will be written back to disk, a situation where the "last writer wins".  Unfortunately, since individual pages could be written in different orders, there is no guarantee that the contents of such a file would even represent what either application thought was written to disk.



Figure 4


Of course, Windows NT has a clever solution to this problem, guaranteeing that memory mapped files are consistent.  This is because there can only be one copy of the data.  Indeed, the Memory Manager only maintains one copy of the data structure it uses to manage memory mapped files (the Section Object).  In turn, there is a set of special page tables associated with a given Section Object.  These page tables are not actually used by any particular process but are used when needed to build the real page table entries.  Because of this, these are referred to as the prototype page table entries.  Unlike ordinary page table entries that are typically used by the hardware, the Memory Manager is the only user of these prototype page table entries.  In fact, the Memory Manager uses prototype PTEs to represent any piece of shared memory because it provides a single point of control for the Memory Manager, rather than distributing the information between multiple page tables in separate address spaces.


In Figure 4 we show the situation where two page tables reference a common (shared) page via a prototype page table entry.  This is the case when the actual PTE entries for each of the two processes are invalid.  If one of the applications accesses its virtual address space, the hardware will generate a page fault. The Memory Manager will resolve this page fault by using the data in the prototype PTE.  In Figure 4, we show a prototype PTE where the data is stored in the paging file.  This might be the case for shared memory between two processes, where the paging file is used to back the shared memory region.




Figure 5


In Figure 5 we see the state of the page tables after the page has been referenced (and hence faulted into memory) by a single process.  The page table entry in that process's address space is now marked as valid and references the actual physical page.  In addition, the prototype PTE also references the same physical page, while the PTE in the second address space continues to reference the prototype PTE.  This ensures that if the second application references that invalid page, the Memory Manager can fix it up to point to the correct physical page.
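A toy model of this resolution in C: the first process to fault pays the paging file I/O and fills in the prototype PTE, while the second process's fault is satisfied directly from the prototype, so both end up mapping the same physical page.  All structures and names here are hypothetical:

```c
/* A hypothetical PTE: either valid (points at a physical page number)
 * or invalid (defers to the shared prototype PTE). */
typedef struct { int valid; int pfn; } Pte;

/* One prototype PTE per shared page, owned by the Section Object. */
typedef struct { Pte proto; } Section;

static int io_count = 0;

/* Stand-in for reading the shared page's data in from the paging file. */
static int read_from_pagefile(void)
{
    io_count++;
    return 42;      /* the physical page number the data landed in */
}

/* Resolve a fault in one process against the shared prototype PTE. */
static int fault_in(Pte *pte, Section *s)
{
    if (!s->proto.valid) {          /* first faulting process pays the I/O */
        s->proto.pfn = read_from_pagefile();
        s->proto.valid = 1;
    }
    *pte = s->proto;                /* every process maps the same page */
    return pte->pfn;
}
```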


While this might seem complicated, it allows the Memory Manager to guarantee that shared memory is truly shared and not maintained in separate copies.  Of course, sometimes the Memory Manager deliberately breaks this sharing.  This was the case with the executable images with breakpoints or fix-ups we described earlier.  In that case, new pages were backed by the paging file and discarded when the application program stopped running.


Few parts of the Windows NT Operating System can compare in complexity with the virtual memory system.  At the same time, its operation is incredibly important to the entire system because all application programs (and even the operating system itself) consume virtual memory.


For those writing device drivers, it is important to understand what is happening when using functions like MmProbeAndLockPages or MmGetSystemAddressForMdl, because while they seem simple, they work in confusing and unexpected ways.  Understanding how Windows NT Virtual Memory works will ease debugging and ultimately improve the quality of the device drivers written.


For those developing file systems, a thorough understanding of the Windows NT VM system is critical to writing a properly functioning file system (or even file system filter driver).  This is because Windows NT file systems are tightly integrated with the Memory Manager.  While this article did not discuss the degree of this integration, suffice it to say that the integration is there.  Indeed, it is integration with the VM system that causes a substantial number of the complex issues surrounding file systems development.  Of course, the complexities of the VM system on Windows NT will continue to be an active topic of discussion in The NT Insider for many years to come.


This article was printed from OSR Online

Copyright 2017 OSR Open Systems Resources, Inc.