>How can it possibly do that? If the memory region is marked non-cached, then it’s not
allowed to cache it, end of story.
Well, THEORETICALLY it may do so by means of some form of read-ahead into a specially-designated internal register the size of a cache line. This is not a "regular" cache, so modifications to the target memory are not allowed while the data sits there, and you are not subject to cache coherence protocols either. If you think about it carefully, you are (hopefully) going to realize that such an approach would not violate "uncached memory" semantics in any possible way - at the end of the day, it does not matter whether you read N dwords first and then write them back one by one, or write every dword immediately after having read it. As long as you are not required to follow cache coherence protocols, your operation is, from all other bus agents' perspective, still uncached. Another point to consider is whether such an approach may lead to any improvement over the "dword read - dword written" one in the first place, but that is already another story.
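To make the "it does not matter" point concrete, here is a user-mode C sketch of the two orderings. Everything here is my own illustration (the function names and the 64-byte line size are assumptions, not anything from the OP's driver): as long as no other agent touches the buffers in between, both orderings leave the destination identical.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache-line size in bytes */

/* Ordering 1: read a whole cache-line-sized chunk into an internal buffer
 * first, then write it back dword by dword (the "read-ahead register" idea). */
static void copy_buffered(uint32_t *dst, const uint32_t *src, size_t ndwords)
{
    uint32_t line[CACHE_LINE / sizeof(uint32_t)];
    const size_t per_line = CACHE_LINE / sizeof(uint32_t);

    for (size_t i = 0; i < ndwords; i += per_line) {
        size_t n = (ndwords - i < per_line) ? ndwords - i : per_line;
        for (size_t j = 0; j < n; j++)
            line[j] = src[i + j];      /* read N dwords first... */
        for (size_t j = 0; j < n; j++)
            dst[i + j] = line[j];      /* ...then write them back one by one */
    }
}

/* Ordering 2: write every dword immediately after having read it. */
static void copy_dword_by_dword(uint32_t *dst, const uint32_t *src,
                                size_t ndwords)
{
    for (size_t i = 0; i < ndwords; i++)
        dst[i] = src[i];
}
```

With no concurrent writer, the two functions are observably equivalent; the whole question in this thread is what happens once another bus agent does interfere mid-copy.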
The most interesting thing is that, after having mentioned cache coherence protocols, I started realising why the OP gets better results with MmAllocateNonCachedMemory().
Imagine the following scenario. CPU A reads a cache line from the source RAM location and
starts writing it to the destination one. Meanwhile, CPU B modifies the source memory. What is CPU A required to do under these circumstances? As long as it follows the MESI protocol, it has to
invalidate the cache line and reload it from RAM. Every write that does not cross cache-line boundaries is guaranteed to be atomic, which means that, after having reloaded the cache line from the source location, CPU A has to copy it to the destination one. If the destination memory is uncached, CPU A simply has to write it to RAM in its entirety, effectively overwriting the modifications that it made before, i.e. perform a few more bus cycles. If it is cached, then, in addition to the above, CPU A has to make all other CPUs invalidate their corresponding cache lines and reload them, effectively increasing the bus traffic. In other words, following cache coherence protocols becomes a sort of liability under these circumstances.
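The timeline above can be mocked up in single-threaded C. To be clear, this is a simulation of the interleaving, not real SMP code: the `line` array stands in for CPU A's cached copy, and "CPU B" is just a sequential store; all names are my own.

```c
#include <stdint.h>
#include <string.h>

#define LINE_DWORDS 16 /* assumed 64-byte cache line = 16 dwords */

/* Mock-up of the scenario: CPU A starts copying a cached source line,
 * "CPU B" modifies the source mid-copy, and coherence forces CPU A to
 * reload the line and rewrite the destination in full.
 * Returns 1 if the destination ends up matching the modified source. */
static int simulate_interrupted_copy(void)
{
    uint32_t src[LINE_DWORDS], dst[LINE_DWORDS], line[LINE_DWORDS];

    for (int i = 0; i < LINE_DWORDS; i++) {
        src[i] = (uint32_t)i;
        dst[i] = 0;
    }

    /* CPU A: loads the source line into its cache (modelled by 'line')
     * and writes the first half to the destination. */
    memcpy(line, src, sizeof line);
    memcpy(dst, line, (LINE_DWORDS / 2) * sizeof(uint32_t));

    /* CPU B: modifies the source while the copy is in flight; under MESI
     * this invalidates CPU A's copy of the line. */
    src[0] = 0xDEADBEEFu;

    /* CPU A: reloads the line from RAM and rewrites the destination in
     * its entirety - spending extra bus cycles on data it already wrote. */
    memcpy(line, src, sizeof line);
    memcpy(dst, line, sizeof dst);

    return memcmp(dst, src, sizeof src) == 0;
}
```

The point of the exercise is the wasted work: the first half-line write is thrown away and redone, and in the cached-destination case every one of those rewrites also generates invalidation traffic to the other CPUs.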
As you can see, making the target ranges cacheable does not seem to speed up copying, does it? In fact, it seems to have exactly the opposite effect…
Anton Bassov