Size of Data Copy from Memory Mapped Device to Host in Storport Miniport Driver

Hi All,

I am currently working on a Storport miniport driver and have mapped my device BAR through StorPortGetDeviceBase. I now have to do a 4KB data transfer from the mapped device address to a host address. I use StorPortCopyMemory to do the transfer, but while doing that I can see multiple MemRd32 requests, each with a length of 2 DWORDs, going to the device. So a single 4KB memcpy turns into 4KB / 8 bytes = 512 PCIe transactions, which is a costly operation. Is there any way to increase the transfer size beyond 2 DWORDs? That would reduce the number of transactions.

Regards,
Chirag

On Sep 28, 2017, at 11:30 AM, xxxxx@gmail.com wrote:
>
> I am currently working on a Storport miniport driver and have mapped my device BAR through StorPortGetDeviceBase. I now have to do a 4KB data transfer from the mapped device address to a host address. I use StorPortCopyMemory to do the transfer, but while doing that I can see multiple MemRd32 requests, each with a length of 2 DWORDs, going to the device. So a single 4KB memcpy turns into 4KB / 8 bytes = 512 PCIe transactions, which is a costly operation. Is there any way to increase the transfer size beyond 2 DWORDs? That would reduce the number of transactions.

Not from the host. Think about it from a hardware level: the CPU is just reading memory. It doesn’t have any idea that you’re talking to a bus device, so it can’t force any PCIe magic, and the PCIe root complex has no way to know that you’ll be issuing a whole set of reads, so it can’t combine them.
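To make that concrete, here is a minimal sketch in portable C of what a memcpy-style routine amounts to when the source is uncached device memory. It is not the StorPortCopyMemory implementation; the name `copy_from_mmio64` and the choice of 64-bit loads are assumptions, picked to match the 2-DWORD reads seen on the analyzer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a StorPort API: copies `bytes` bytes from a
 * mapped device region using one 64-bit load per iteration and returns
 * how many loads were issued. On uncached MMIO each load would become
 * its own 2-DWORD read request; nothing tells the root complex that the
 * next load is adjacent, so the reads cannot be merged. */
static size_t copy_from_mmio64(volatile const uint64_t *src,
                               uint64_t *dst, size_t bytes)
{
    size_t reads = 0;

    assert(bytes % sizeof(uint64_t) == 0); /* sketch handles whole qwords only */

    for (size_t i = 0; i < bytes / sizeof(uint64_t); i++) {
        dst[i] = src[i];   /* one CPU load, one bus read */
        reads++;
    }
    return reads;
}
```

Called with a 4096-byte region, the loop issues 512 separate loads, which matches the transaction count reported in the question.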

The only way to get larger PCIe transfers is to do bus mastered DMA from the device. Then, you can get up to the maximum packet size for the bus, usually 128 bytes.

However, I strongly suspect you are guilty of premature optimization. What makes you think this will make a measurable difference?

Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

How about modifying PCI Express capability -> Device Control register -> “Max_Payload_Size” of a TLP? Will it help?

> How about modifying PCI Express capability -> Device Control register -> “Max_Payload_Size” of a TLP? Will it help?

This affects only DMA transfers. Not helpful for PIO accesses.

– pa

On Oct 1, 2017, at 7:55 PM, xxxxx@gmail.com wrote:
>
> How about modifying PCI Express capability -> Device Control register -> “Max_Payload_Size” of a TLP? Will it help?

Besides Pavel’s answer, which is correct, you are not allowed to modify that register. The max payload size is negotiated at enumeration time. The root complex has an upper limit, and the device has an upper limit. The register is set to the smaller of the two values, and the connection won’t work if you force it larger.
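The "smaller of the two values" rule can be sketched as follows. The 3-bit encoding (0 means 128 bytes, each step doubles, up to 4096 at 5) is the one the PCIe specification uses for Max_Payload_Size; the helper names are hypothetical, and real enumeration code also has to consider every switch between the root complex and the device.

```c
#include <assert.h>
#include <stdint.h>

/* The 3-bit Max_Payload_Size encoding used in the PCIe Device
 * Capabilities and Device Control registers:
 * 0 -> 128 bytes, 1 -> 256, ... 5 -> 4096. */
static uint32_t mps_code_to_bytes(uint8_t code)
{
    assert(code <= 5);           /* codes 6 and 7 are reserved */
    return 128u << code;
}

/* Hypothetical helper: the value programmed into Device Control must
 * not exceed what either end supports, so take the smaller code. */
static uint8_t usable_mps_code(uint8_t root_supported,
                               uint8_t device_supported)
{
    return root_supported < device_supported ? root_supported
                                             : device_supported;
}
```

For example, a root complex that supports 256 bytes (code 1) paired with a device that supports 512 bytes (code 2) ends up at code 1, and writing a larger code into the device is what breaks the link.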

I tried to explain why this can’t work from a host. At the bus transaction level, no one knows that this is going to be a long transfer, so no one can make a bigger transfer. If you need this, you will have to do DMA from the device.

Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Can I do a system mastered DMA to the device to read the data from the device into a host buffer?

On Oct 4, 2017, at 2:03 AM, xxxxx@gmail.com wrote:
>
> Can I do a system mastered DMA to the device to read the data from the device into a host buffer?

In general, no. The DMA controller on most motherboards today is a legacy from the old ISA bus, useless for any general-purpose activities.

I understand that one of the newer Intel chipsets includes a system DMA controller, but I can’t find the reference now, and I don’t know if there is operating system support for it.

Do you have any real evidence that your performance is not good enough as is? Or are you just guessing?

I am not a disk driver guy, but the Microsoft documentation seems to say that Storport drivers are required to support DMA-based I/O. Does your hardware really not do DMA?

Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

The reason we don’t want to do DMA from the device is that we are analyzing some cases in which the host does the transfer and not the device.

This is the actual problem we are facing:
The data copy is done in the DPC. During the copy from device to host using StorPortMoveMemory, the DPC runs longer than its allowed maximum of 100 microseconds, which causes the OS to hang. For this reason we wanted to do DMA from the host, to avoid stalling the processor.

Approach 1 :

  1. Using StorportAllocateMDL to create MDL for Device buffer.
  2. Using StorPortBuildScatterGatherList to do DMA.

Approach 2:

  1. Using IoGetDmaAdapter to get DmaAdapter from PDO
  2. Using InitializeDmaTransferContext to initialize the DMA Transfer
  3. Using AllocateAdapterChannelEx to allocate resources for DMA Transfer
  4. Using MapTransferEx to do the DMA

Is either approach correct? If yes, is either approach feasible in a StorPort miniport driver? If not, are there any other DMA APIs that can be used to do the same?

Thanks!
Chirag

xxxxx@gmail.com wrote:

> The reason we don’t want to do DMA from the device is that we are analyzing some cases in which the host does the transfer and not the device.

Well, I think your analysis is complete.  When the host does the
transfer, the TLP size will never be larger than 4.

> This is the actual problem we are facing:
> The data copy is done in the DPC. During the copy from device to host using StorPortMoveMemory, the DPC runs longer than its allowed maximum of 100 microseconds, which causes the OS to hang. For this reason we wanted to do DMA from the host, to avoid stalling the processor.

That has very little to do with TLP size and more to do with raw
bandwidth.  Regardless of TLP size, 100us is only enough time to copy
100kB over a PCIe 3.0 x1 device.  If you need to copy more, you either
need to defer it to a lower-priority thread, or do it with device DMA.
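The 100kB figure can be checked with a little arithmetic, sketched below. `pcie3_max_bytes` uses the published PCIe 3.0 numbers (8 GT/s per lane, 128b/130b encoding) and ignores TLP header overhead, so it is an upper bound. `pio_read_bytes` is a hypothetical model of a CPU read loop in which each read must complete before the next is issued; the round-trip time is an input you would have to measure, not a known constant.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on the bytes a PCIe 3.0 link can carry in `usec`
 * microseconds: 8 GT/s per lane, 128b/130b encoding, no allowance
 * for TLP headers or flow control. */
static uint64_t pcie3_max_bytes(uint32_t lanes, uint64_t usec)
{
    const uint64_t raw_bits_per_sec  = 8000000000ull * lanes;
    const uint64_t data_bits_per_sec = raw_bits_per_sec * 128 / 130;
    return data_bits_per_sec / 8 * usec / 1000000;
}

/* Hypothetical model: bytes moved in `usec` microseconds by a read loop
 * where every read of `bytes_per_read` bytes takes `round_trip_ns` to
 * complete before the next one starts. */
static uint64_t pio_read_bytes(uint32_t bytes_per_read,
                               uint64_t round_trip_ns, uint64_t usec)
{
    assert(round_trip_ns > 0);
    return usec * 1000 / round_trip_ns * bytes_per_read;
}
```

`pcie3_max_bytes(1, 100)` comes out just under 100kB. If, purely as an illustration, each 8-byte read took 1000 ns, `pio_read_bytes(8, 1000, 100)` would give only 800 bytes in the same window, which is why a 4KB host-driven copy can overrun the DPC budget.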

> Approach 1:
>
>   1. Using StorportAllocateMDL to create MDL for Device buffer.
>   2. Using StorPortBuildScatterGatherList to do DMA.

I assume you are just abbreviating here, and that you realize
StorPortBuildScatterGatherList doesn’t actually do DMA.  It just gives
you the page numbers you need to create DMA descriptors for your
hardware’s DMA engine.  You still have to set up and fire that operation
yourself.

Assuming the device is doing the DMA, you don’t need an MDL for the
device buffer.  The device knows its own destination, and device memory
is always physically contiguous.  The scatter/gather list is for the
system memory buffer.

> Approach 2:
>
>   1. Using IoGetDmaAdapter to get DmaAdapter from PDO
>   2. Using InitializeDmaTransferContext to initialize the DMA Transfer
>   3. Using AllocateAdapterChannelEx to allocate resources for DMA Transfer
>   4. Using MapTransferEx to do the DMA

My same “abbreviation” comment applies to MapTransferEx.  What you’ve
described here is essentially what StorPortBuildScatterGatherList does,
but without using kernel-specific APIs.

> Is either approach correct? If yes, is either approach feasible in a StorPort miniport driver? If not, are there any other DMA APIs that can be used to do the same?

StorPortBuildScatterGatherList should be fine.  You just need to build
your descriptors and fire the DMA.
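As a rough illustration of "build your descriptors", the sketch below walks a scatter/gather list for the system memory buffer and fills one descriptor per element. Everything here is an assumption made for the example: `sg_element` is a simplified stand-in for the elements StorPort hands back, and `dma_descriptor` is a made-up layout, since every DMA engine defines its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one scatter/gather element describing a
 * physically contiguous piece of the system memory buffer. */
typedef struct {
    uint64_t physical_address;
    uint32_t length;
} sg_element;

/* Hypothetical descriptor layout; a real device defines its own. */
typedef struct {
    uint64_t host_address;   /* where the device writes in system memory */
    uint32_t device_offset;  /* offset into the device's own buffer */
    uint32_t length;
    uint32_t last;           /* 1 on the final descriptor of the chain */
} dma_descriptor;

/* Fills one descriptor per element. Returns the number written, or 0
 * if the list is empty or the descriptor ring is too small. */
static size_t build_descriptors(const sg_element *sg, size_t count,
                                dma_descriptor *ring, size_t ring_size)
{
    uint32_t offset = 0;

    assert(sg != NULL && ring != NULL);
    if (count == 0 || count > ring_size)
        return 0;

    for (size_t i = 0; i < count; i++) {
        ring[i].host_address  = sg[i].physical_address;
        ring[i].device_offset = offset;  /* device memory is contiguous */
        ring[i].length        = sg[i].length;
        ring[i].last          = (i == count - 1);
        offset += sg[i].length;
    }
    return count;
}
```

Handing the ring to the hardware and starting the engine is device-specific and is not shown.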


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.