Contiguous Memory Performance

So, I’m developing a PCIe-based WDF driver for one of our RF transceivers. In this design I’m allocating a fairly large contiguous buffer (at least 128 MiB) that is accessed (as bus master) by the FPGA on the transceiver. Logically this works as expected, but I’ve noticed that when accessing the data in the buffer from the driver (after the transceiver has filled it), there is an approximately 10x increase in copy time using memcpy() (i.e. 67 ms versus 6 ms) with the contiguous buffer versus a non-contiguous buffer.

As for the APIs I’m using: MmAllocateContiguousMemorySpecifyCache() for allocating the contiguous buffer and MmAllocateNonCachedMemory() for the non-contiguous buffer. Any information on this would be greatly appreciated.
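Roughly, the allocation side looks like the sketch below (simplified, not the actual driver code; the names and the 64-bit address range are illustrative, and the cache type is left as a placeholder since that is the open question):

#include <ntddk.h>

#define RF_BUFFER_SIZE (128ULL * 1024 * 1024)   /* at least 128 MiB */

PVOID ContiguousBuf;     /* the FPGA bus-masters into this one */
PVOID NonContiguousBuf;  /* the comparison buffer */

VOID AllocateRfBuffers(VOID)
{
    PHYSICAL_ADDRESS low, high, boundary;
    MEMORY_CACHING_TYPE cacheType = MmNonCached;  /* or MmCached -- see the
                                                     caching discussion below */

    low.QuadPart      = 0;
    high.QuadPart     = -1;   /* no upper limit; the device addresses 64 bits */
    boundary.QuadPart = 0;    /* no boundary restriction */

    /* The large contiguous buffer the FPGA writes into. */
    ContiguousBuf = MmAllocateContiguousMemorySpecifyCache(
        RF_BUFFER_SIZE, low, high, boundary, cacheType);

    /* The non-contiguous (and non-cached) buffer used for the comparison copy. */
    NonContiguousBuf = MmAllocateNonCachedMemory(RF_BUFFER_SIZE);
}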

-Alfred

Interesting issue. I’ve heard reports of things like this before.

A little more information would be useful, please.

Is the timing difference you describe JUST the time of the memcpy? Does “mS” mean “milliseconds”? Cuz, you know, “that’s a lot”…

When you’re calling MmAllocateContiguousMemorySpecifyCache are you asking for cached or uncached memory?

How are you getting the address that you give to your device? Please don’t tell me you’re calling MmGetPhysicalAddress, because if you are that’s an architecture problem.

Why are you allocating physically contiguous memory at all? Is your device not capable of scatter/gather (chained DMA operations)?

Looking forward to learning more,

Peter
OSR
@OSRDrivers

Like Peter, I strongly suspect the caching type. In virtually every instance on Windows-capable machines, you want MmCached. MmNonCached would give you the performance difference you describe.

But if it is a WDF driver, why are you using MmAllocateContiguousMemorySpecifyCache instead of WdfCommonBufferCreate?
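For illustration, the WDF route is roughly the sketch below (it assumes a WDFDMAENABLER already exists for the device; everything apart from the WDF calls is made up):

#include <ntddk.h>
#include <wdf.h>

/* Rough sketch, not production code: allocate the big DMA buffer through WDF.
   Assumes DmaEnabler was created earlier (e.g. in EvtDevicePrepareHardware). */
NTSTATUS CreateRfCommonBuffer(
    WDFDMAENABLER     DmaEnabler,
    WDFCOMMONBUFFER  *CommonBuffer,
    PVOID            *Va,          /* CPU-visible address, for memcpy() etc. */
    PHYSICAL_ADDRESS *DeviceAddr)  /* bus-logical address for the FPGA */
{
    NTSTATUS status;

    status = WdfCommonBufferCreate(DmaEnabler,
                                   128ULL * 1024 * 1024,   /* 128 MiB */
                                   WDF_NO_OBJECT_ATTRIBUTES,
                                   CommonBuffer);
    if (!NT_SUCCESS(status)) {
        return status;
    }

    /* The CPU mapping of a common buffer is normally ordinary cached memory. */
    *Va = WdfCommonBufferGetAlignedVirtualAddress(*CommonBuffer);

    /* Program THIS into the device, rather than MmGetPhysicalAddress(). */
    *DeviceAddr = WdfCommonBufferGetAlignedLogicalAddress(*CommonBuffer);

    return STATUS_SUCCESS;
}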

>Like Peter, I strongly suspect the caching type

Judging from what you said below, you expect cached memory to perform better, right.

Such a suggestion would be justified if we were speaking about frequent access to the same memory location(s). In this case one may, indeed, have a chance to experimentally establish the performance difference between CPU’s on-chip cache memory and external RAM. However, we are speaking about copying the LARGE buffer. With this access pattern, every memory location is accessed only once, so that the CPU has to go to off-chip RAM upon every data access anyway…

> In virtually every instance on Windows-capable machines, you want MmCached.
> MmNonCached would give you the performance difference you describe.

Actually, I think the opposite could be the case. If you use cached memory for such an operation
your existing data cache lines get trashed - they get invalidated and overwritten with data multiple times for no reason whatsoever. As a result, when the operation is over the CPU will have to go to off-chip RAM upon every data access, because all its cache lines are going to be filled, at this particular moment, with the data that had been copied from the buffer…

BTW, the OP made it clear that he calls MmAllocateNonCachedMemory() when he uses the non-contiguous buffer (i.e. in the situation where he gets better performance), which somehow indicates that my theory could be correct…

Anton Bassov

xxxxx@hotmail.com wrote:

> Like Peter, I strongly suspect the caching type
> Judging from what you said below, you expect cached memory to perform better, right.

Without question.  I’ve tested it, as part of a patent investigation
some years ago.

> Such a suggestion would be justified if we were speaking about frequent access to the same memory location(s). In this case one may, indeed, have a chance to experimentally establish the performance difference between CPU’s on-chip cache memory and external RAM. However, we are speaking about copying the LARGE buffer. With this access pattern, every memory location is accessed only once, so that the CPU has to go to off-chip RAM upon every data access anyway…

Nope, you don’t understand the issue.  With caching, when you access
that first word, the caching gear automatically pre-fetches the rest of
the cache line while the CPU goes on to do other things.  It’s almost
like a mini-DMA.  As the CPU bumps incrementally to the next words, they
are already in the cache, so it’s a 1-cycle access.  With caching turned
off, each dword access means blocking for another round-trip to the RAM
chips.

The benefit is significant.

> In virtually every instance on Windows-capable machines, you want MmCached.
> MmNonCached would give you the performance difference you describe.
> Actually, I think the opposite could be the case.

Nope, in this case, you are wrong.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Thanks for the responses. I will experiment with the caching.

> But if it is a WDF driver, why are you using
> MmAllocateContiguousMemorySpecifyCache instead of WdfCommonBufferCreate?

More detail on the hardware architecture: the FPGA only exposes two 64-bit window registers that the driver populates with the physical addresses of contiguous, 16 MiB aligned buffers. The FPGA transfers data (as bus master) to the two addresses in a ping-pong configuration. When the driver receives MSIs from the FPGA, it updates the window register addresses in an interleaving fashion, stepping by 16 MiB offsets within the large contiguous buffer. Since the hardware does not have a DMA controller interface, I wasn’t exactly sure how to set up the DMA enabler object associated with WdfCommonBufferCreate().
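Very roughly, the window update on each MSI looks like the sketch below (heavily simplified; the register layout, names, and update policy are illustrative, not the real code):

#include <ntddk.h>

#define WINDOW_SIZE   (16ULL * 1024 * 1024)   /* 16 MiB per window */
#define WINDOW_COUNT  8                       /* 128 MiB buffer / 16 MiB windows */

typedef struct _DEVICE_CONTEXT {
    volatile ULONG64 *WindowReg[2];     /* the two mapped FPGA window registers */
    ULONG64           BufferDeviceBase; /* device-visible base address of the buffer */
    ULONG             NextSlot;         /* next 16 MiB slot to hand out */
} DEVICE_CONTEXT, *PDEVICE_CONTEXT;

/* Called from the DPC for the MSI that says window 'Which' (0 or 1) is full. */
static VOID ReloadWindow(PDEVICE_CONTEXT Ctx, ULONG Which)
{
    ULONG64 next = Ctx->BufferDeviceBase + (ULONG64)Ctx->NextSlot * WINDOW_SIZE;

    WRITE_REGISTER_ULONG64(Ctx->WindowReg[Which], next);
    Ctx->NextSlot = (Ctx->NextSlot + 1) % WINDOW_COUNT;
}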

-Alfred

Hi

I have a related question on this.

> Please don’t tell me you’re calling MmGetPhysicalAddress, because if you are that’s an
> architecture problem.

My device is a 64-bit addressable, bus-master, common-buffer device (pushing for true SG support).

  1. Ring_buffer is allocated using WdfCommonBufferCreate()
  2. But each Ring_buffer_ENTRY_descriptor holds a buffer address (for the device to DMA write/read to/from) obtained through MmAllocateContiguousMemory()/MmGetPhysicalAddress().

Is this case the same as above?

http://www.osronline.com/article.cfm?article=539
“DMA is never performed directly using physical memory addresses. Not ever. … [It] is a serious design error to determine the physical address of a user data buffer (such as by calling MmGetPhysicalAddress) and program your DMA device with that address.”

MSDN - https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/nc-wdm-pallocate_common_buffer
This memory **appears contiguous** to the device… Use this address rather than calling MmGetPhysicalAddress because the system can take into account any **platform-specific memory restrictions (…)**.

Is the above from MSDN referring to the exact same issue? I.e. (due to chipset/device addressing intricacies, in my words) the memory appears contiguous to the device but not to the host, so use the device bus-logical address instead of the physical one?

Does this also imply that the following *ALWAYS* holds (numerically), everywhere:
MmGetPhysicalAddress() != WdfCommonBufferGetAlignedLogicalAddress()

>Is this case the same as above?

It’s the case that you’re not supposed to do that.

Why not allocate the location to be DMA’ed using WdfCommonBufferCreate, *instead* of MmAllocateContiguousMemory* and MmGetPhysicalAddress?

That way you’ll be architecturally correct.

Having said that, as you’ve already noticed… doing it the way you’re doing it now DOES work.
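For illustration, the architecturally-correct ring-entry setup might look something like the sketch below (the descriptor layout and every name here are hypothetical; the WDF calls are the point):

#include <ntddk.h>
#include <wdf.h>

/* Sketch: each ring entry's data buffer comes from WdfCommonBufferCreate, and
   the descriptor is programmed with the bus-logical address. */
typedef struct _RING_ENTRY_DESCRIPTOR {
    ULONG64 BufferDeviceAddress;   /* address the device DMAs to/from */
    ULONG   BufferLength;
    ULONG   Flags;
} RING_ENTRY_DESCRIPTOR;

NTSTATUS InitRingEntry(
    WDFDMAENABLER          DmaEnabler,
    size_t                 Length,
    RING_ENTRY_DESCRIPTOR *Desc,
    WDFCOMMONBUFFER       *EntryBuffer)
{
    NTSTATUS         status;
    PHYSICAL_ADDRESS deviceAddr;

    status = WdfCommonBufferCreate(DmaEnabler, Length,
                                   WDF_NO_OBJECT_ATTRIBUTES, EntryBuffer);
    if (!NT_SUCCESS(status)) {
        return status;
    }

    /* Bus-logical address, not MmGetPhysicalAddress(). On many PCs the two
       happen to be numerically equal, but only the logical address is
       guaranteed to be what the device must use (IOMMU, remapping, etc.). */
    deviceAddr = WdfCommonBufferGetAlignedLogicalAddress(*EntryBuffer);

    Desc->BufferDeviceAddress = (ULONG64)deviceAddr.QuadPart;
    Desc->BufferLength        = (ULONG)Length;
    Desc->Flags               = 0;

    return STATUS_SUCCESS;
}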

Peter
OSR
@OSRDrivers

xxxxx@clearbit.org wrote:

> More detail on the hardware architecture: the FPGA only exposes two 64-bit window registers that the driver populates with the physical addresses of contiguous, 16 MiB aligned buffers. The FPGA transfers data (as bus master) to the two addresses in a ping-pong configuration. When the driver receives MSIs from the FPGA, it updates the window register addresses in an interleaving fashion, stepping by 16 MiB offsets within the large contiguous buffer. Since the hardware does not have a DMA controller interface, I wasn’t exactly sure how to set up the DMA enabler object associated with WdfCommonBufferCreate().

I’m not sure what you were expecting, but this: “The FPGA transfers
data( bus master ) to the two addresses…” is exactly a “DMA
controller”.  No more, no less.  You have a 64-bit packet-based DMA
controller, so you’d pass WdfDmaProfilePacket64.
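In other words, something like the sketch below (names other than the WDF calls are illustrative):

#include <ntddk.h>
#include <wdf.h>

/* Sketch only: describe the FPGA's bus-master capability to WDF as a 64-bit
   packet-mode DMA controller with no scatter/gather. */
NTSTATUS CreateFpgaDmaEnabler(WDFDEVICE Device, WDFDMAENABLER *DmaEnabler)
{
    WDF_DMA_ENABLER_CONFIG dmaConfig;

    /* If the windows must be 16 MiB aligned, the device alignment requirement
       can be raised before creating the enabler (the value is a mask). */
    WdfDeviceSetAlignmentRequirement(Device, (16 * 1024 * 1024) - 1);

    WDF_DMA_ENABLER_CONFIG_INIT(&dmaConfig,
                                WdfDmaProfilePacket64,   /* 64-bit, no S/G */
                                16 * 1024 * 1024);       /* 16 MiB per window */

    return WdfDmaEnablerCreate(Device, &dmaConfig,
                               WDF_NO_OBJECT_ATTRIBUTES, DmaEnabler);
}

The resulting enabler is what WdfCommonBufferCreate needs, and the common buffer then hands back both a (cached) virtual address and the bus-logical address to program into the window registers.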


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

I know what you are trying to say, but we are speaking about a memory-copy (i.e. REP MOVx) instruction. If we were speaking about copying data to/from CPU registers, I would agree with your statement. However, here we are speaking about a basically RAM-to-RAM operation. I think the CPU probably caches this operation internally anyway.

> Nope, in this case, you are wrong.

Although that may be the case, we still have a “puzzling part” - according to the OP, he gets better results when he uses the non-contiguous memory that he gets with MmAllocateNonCachedMemory().
Have you got any explanation for this part?

Anton Bassov

On Mar 16, 2018, at 5:24 PM, xxxxx@hotmail.com wrote:
>
> I know what you are trying to say, but we are speaking about a memory-copy (i.e. REP MOVx) instruction. If we were speaking about copying data to/from CPU registers, I would agree with your statement. However, here we are speaking about a basically RAM-to-RAM operation. I think the CPU probably caches this operation internally anyway.

How can it possibly do that? If the memory region is marked non-cached, then it’s not allowed to cache it, end of story. Every dword in the copy is going to require a memory round-trip.

Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> How can it possibly do that? If the memory region is marked non-cached, then it’s not allowed to cache it, end of story.

Well, THEORETICALLY it may do so by some form of read-ahead into some specially-designated internal register that has the size of a cache line. This is not a “regular” cache, so modifications to the target memory are not allowed, and you are not subject to cache coherence protocols either. If you think about it carefully you are (hopefully) going to realize that such an approach would not violate “uncached memory” semantics in any possible way - at the end of the day, it does not matter whether you read N dwords first and then write them back one by one, or write every dword immediately after having read it. As long as you are not required to follow cache coherence protocols, your operation is, from all other bus agents’ perspective, still uncached. Another point to consider is whether such an approach may lead to any improvement, compared to the “dword read, dword written” one, in the first place, but that is already another story.

The most interesting thing is that, after having mentioned cache coherence protocols, I started realising why the OP gets better results with MmAllocateNonCachedMemory().

Imagine the following scenario. CPU A reads a cache line from the source RAM location and starts writing it to the destination one. Meanwhile, CPU B modifies the target memory. What is CPU A required to do under these circumstances? As long as it follows MESI protocols, it has to invalidate the cache line and reload it from RAM. Every write that does not cross cache line boundaries is guaranteed to be atomic, which means that, after having reloaded the cache line from the source location, CPU A has to copy it to the destination one. If the destination memory is uncached, CPU A simply has to write it to RAM entirely, effectively overwriting the modifications that it did before, i.e. do a few more bus cycles. If it is cached, then in addition to the above, CPU A has to make all other CPUs invalidate their corresponding cache lines and reload them, effectively increasing the bus traffic. In other words, following cache coherence protocols becomes a sort of liability under these circumstances.

As you can see, making the target ranges cacheable does not seem to speed up copying, does it. In fact, it seems to have exactly the opposite effect…

Anton Bassov

On Mar 18, 2018, at 2:08 PM, xxxxx@hotmail.com wrote:
>
> Well, THEORETICALLY it may do so by some form of read-ahead into some specially-designated internal register that has the size of a cache line.

Why are you pushing this strawman argument? X86 processors don’t do this, so the theoretical possibilities are not really relevant.

> Imagine the following scenario. CPU A reads a cache line from the source RAM location and
> starts writing it to the destination one. Meanwhile, CPU B modifies the target memory. What is CPU A required to do under these circumstances?

I don’t much care, since this does not relate to the original poster’s situation.

> As you can see, making the target ranges cacheable does not seem to speed up copying, does it. In fact, it seems to have exactly the opposite effect…

What I believe is that you could construct a fictional scenario with a fictional CPU and a fictional memory architecture where your statement is true. However, that fable is unrelated to reality. The reality is that, with an x86-based architecture, caching increases performance, usually by a dramatic margin. End of story.

Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> Why are you pushing this strawman argument? X86 processors don’t do this, so the theoretical possibilities are not really relevant.

Are you saying that you know all the internal implementation details of all x86-based processors in existence? Let’s face it - in order to be in a position to make sweeping statements like that so emphatically, one needs to have this info.

> I don’t much care, since this does not relate to the original poster’s situation.

Well, as long as the OP runs his tests on a multi-core system where his target range may be accessed by multiple bus agents, this is VERY closely related to his situation…

Well, we are not speaking about the fictional CPUs, are we. In any case, since you are so sure about everything, please provide your explanation of why the OP gets better performance with memory that he allocates with MmAllocateNonCachedMemory(), and we are done with it.

Anton Bassov

> Well, THEORETICALLY it may do so by some form of read-ahead
> into some specially-designated internal register that has the
> size of a cache line.

Reads from device registers may have side effects (e.g. reading from a status
register clears pending interrupts), so the processor can’t read ahead in an
uncached region.

> Reads from device registers may have side effects (e.g. reading from a status register clears pending interrupts),

True, but normally you don’t access memory-mapped device registers in REP MOVx fashion, do you…

My (probably flawed) theory is that REP MOVx semantics may be a bit different from “regular” reads and writes.

The most interesting part here is that, if you think about it carefully, you will (hopefully) realise that read-ahead in the context of REP MOVx would not change anything even if we are speaking about read-once registers. This also explains why accessing device registers in REP MOVx fashion would be pretty unwise…

Anton Bassov