NDIS lightweight filter driver problems and questions

Hi,
We are building an NDIS lightweight filter driver to let user-mode applications communicate with miniports in the most direct and efficient way we can (both for sending and receiving).

We have already used other available drivers for the same purpose, but we have faced performance problems or limitations that are difficult to deal with without a deep understanding of the inner workings of those drivers.

The aim of our driver is to share a buffer for reception and another for transmission (one for each direction) between the kernel and the application, mapped only once into kernel and application virtual address spaces, and then use them in the most efficient way in terms of context switches and throughput, so that we can capture and send Ethernet frames at line rate on 1G and 10G network interfaces (at a first attempt, over Windows Teaming).

We have two problems at this stage:

– We don’t know how to avoid, in reception, the copy from the miniport receive buffer to our (“lightweight filter and application” shared) buffer. Specifically, we would like to avoid the call to NdisGetDataBuffer that copies bytes from the miniport buffer (inaccessible to us) to ours.

– We have a problem in transmission: if the operating system is Windows Server 2012 or 2012 R2 (it doesn’t happen in Windows 10), send operations randomly fail when we send a frame located past the first 100 KB of the shared buffer. We have a single MDL that describes the whole buffer, and use NET_BUFFER and NET_BUFFER_LIST structures to build a view of a portion of the buffer. That adapted view is what we pass to NDIS to be sent. We know of the problem because we receive a status other than NDIS_STATUS_SUCCESS (usually NDIS_STATUS_FAILURE) inside the NET_BUFFER_LIST passed to the FILTER_SEND_NET_BUFFER_LISTS_COMPLETE callback routine. The error code is unspecific.

What follows is a more detailed description of the sending process:

  1. The application sends an IOCTL to the driver requesting a Tx channel.

  2. The driver reserves some non-paged memory pages with MmAllocatePagesForMdlEx, maps them to virtual addresses in both user and kernel space and returns the information to the application in response to the IOCTL. We've also tried (with similar results) reserving memory from the non-paged pool using NdisAllocateMemoryWithTagPriority and then using IoAllocateMdl and MmBuildMdlForNonPagedPool to create an MDL describing the memory.

  3. The buffer is large enough to hold many messages (each of them with its preamble) so we can send batches of messages instead of individual ones. A number of NetBufferLists are pre-allocated for this purpose, all of them pointing to the same MDL.

  4. When the application needs to send a message, it is copied to the shared buffer including a self-defined preamble with the message length and a few control flags.

  5. An IOCTL is sent to the driver to indicate the presence of data in the shared buffer.

  6. The driver starts reading messages from the buffer and mapping them to the pre-allocated NetBufferLists (taking care that the Offset, Length and Next properties of the NetBufferLists and NetBuffers contain correct values; see the sketch after this list).

  7. The complete NetBufferList chain is passed to NDIS for sending.

  8. When the SendNetBufferListCompleteHandler is invoked by NDIS, the NetBufferLists are returned to the pool and the preambles of the messages in the buffer are flagged as success/error depending on the status reported by the operation.
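
In outline, steps 3, 6 and 7 look roughly like this (a trimmed-down sketch, not our literal code: it assumes SharedMdl is the single MDL from step 2, FilterHandle comes from FilterAttach, msgOffset/msgLength come from our preamble, and pool parameters and error handling are reduced to the essentials):

    // One-time setup: a pool whose NBLs each carry one NET_BUFFER (step 3).
    NET_BUFFER_LIST_POOL_PARAMETERS poolParams = {0};
    poolParams.Header.Type        = NDIS_OBJECT_TYPE_DEFAULT;
    poolParams.Header.Revision    = NET_BUFFER_LIST_POOL_PARAMETERS_REVISION_1;
    poolParams.Header.Size        = sizeof(poolParams);
    poolParams.fAllocateNetBuffer = TRUE;   // each NBL gets one NET_BUFFER
    poolParams.DataSize           = 0;      // we supply our own MDL and offsets
    poolParams.PoolTag            = 'TxFL';
    NDIS_HANDLE nblPool = NdisAllocateNetBufferListPool(FilterHandle, &poolParams);

    // Pre-allocate NetBufferLists that all reference the shared MDL.
    PNET_BUFFER_LIST nbl = NdisAllocateNetBufferAndNetBufferList(nblPool, 0, 0,
                                                                 SharedMdl, 0, 0);

    // Per message (step 6): point the NET_BUFFER at the message's slice.
    PNET_BUFFER nb = NET_BUFFER_LIST_FIRST_NB(nbl);
    NET_BUFFER_FIRST_MDL(nb)          = SharedMdl;
    NET_BUFFER_CURRENT_MDL(nb)        = SharedMdl;
    NET_BUFFER_DATA_OFFSET(nb)        = msgOffset;   // byte offset of the frame in the shared buffer
    NET_BUFFER_CURRENT_MDL_OFFSET(nb) = msgOffset;
    NET_BUFFER_DATA_LENGTH(nb)        = msgLength;   // frame length taken from the preamble

    // Chain the batch and hand it to NDIS (step 7).
    NET_BUFFER_LIST_NEXT_NBL(nbl) = nextNblInBatch;
    NdisFSendNetBufferLists(FilterHandle, firstNblInBatch, NDIS_DEFAULT_PORT_NUMBER, 0);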

This system seems to be working fine in Windows 10. In Windows Server 2012, everything works fine for buffer sizes up to 100 KB, but if we try to use larger buffers, there appears to be a problem sending messages stored beyond that point. Upon completion, the status of these NetBufferLists just indicates a failure status (0xC0000001).

Basically, for a buffer of 131072 bytes it works, for a buffer of 262144 it doesn't.

We understand how strange this sounds, and we've tried experimenting by removing the interaction with the application layer (just in case); the results were the same. We've also tried reserving a huge memory buffer and then mapping only a 100 KB section of that buffer for send operations, and in this experiment everything works fine.

We also understand that the buffer used to send a message via NDIS only needs to be large enough to hold the message being sent, which for an Ethernet frame would be around 1500 bytes. Our buffer is larger because we want the application to be able to keep producing frames in free portions of the shared buffer while previous frames are being sent.

For the first problem, our questions are: Can we avoid extra copies in NDIS lightweight filter driver reception? For example, interacting directly with the DMA process (setting the DMA buffer from the filter driver or something like that, and then sharing that buffer with the application)? We are trying to build a zero copy solution.
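
To illustrate, the copy we are trying to eliminate looks roughly like this in the receive handler (a simplified sketch, not our literal code; ReserveSpaceInSharedRxRing and the MS_FILTER context type are placeholder names for our own bookkeeping, not NDIS constructs):

    VOID
    FilterReceiveNetBufferLists(
        NDIS_HANDLE      FilterModuleContext,
        PNET_BUFFER_LIST NetBufferLists,
        NDIS_PORT_NUMBER PortNumber,
        ULONG            NumberOfNetBufferLists,
        ULONG            ReceiveFlags)
    {
        PMS_FILTER pFilter = (PMS_FILTER)FilterModuleContext;  // holds the handle from FilterAttach

        for (PNET_BUFFER_LIST nbl = NetBufferLists; nbl != NULL; nbl = NET_BUFFER_LIST_NEXT_NBL(nbl))
        {
            for (PNET_BUFFER nb = NET_BUFFER_LIST_FIRST_NB(nbl); nb != NULL; nb = NET_BUFFER_NEXT_NB(nb))
            {
                ULONG  length = NET_BUFFER_DATA_LENGTH(nb);
                PUCHAR dest   = ReserveSpaceInSharedRxRing(length);   // placeholder helper
                if (dest != NULL)
                {
                    // NdisGetDataBuffer either copies the frame into 'dest' (and returns
                    // 'dest'), or returns a pointer to contiguous data in the
                    // miniport-owned buffer, in which case we copy it ourselves.
                    // Either way, every received byte gets touched once here.
                    PUCHAR src = (PUCHAR)NdisGetDataBuffer(nb, length, dest, 1, 0);
                    if (src != NULL && src != dest)
                    {
                        NdisMoveMemory(dest, src, length);
                    }
                }
            }
        }

        // Pass the original NBLs up the stack unchanged.
        NdisFIndicateReceiveNetBufferLists(pFilter->FilterHandle, NetBufferLists,
                                           PortNumber, NumberOfNetBufferLists, ReceiveFlags);
    }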

For the second problem, our questions are: Is it possible that NDIS has problems handling messages inside NetBufferLists that reference large MDLs? Has anyone run into anything like this before, or have any clues as to where the problem might be?

Thank you very much.
J.M.

As I’m currently working on a 25/50 Gbps NIC driver, let me speak from the NIC driver viewpoint.

You realize there is already a user mode API for doing much of what you describe, Registered IO (RIO). See https://technet.microsoft.com/en-us/library/hh997032(v=ws.11).aspx

Your shared buffer architecture has some issues.

First, the buffers the NIC hardware DMAs received data into are allocated and owned by the NIC. When it indicates received data, it temporarily lends the buffers to the upper layers, but expects the upper layers to give them back “soon”. “Soon” is often defined in terms of the current receive speed, but a really rough rule of thumb might be: if you (the transport) are holding on to a thousand buffers, you’re holding on to them too long. The NIC can also indicate packets to the transport and demand that the buffers be returned immediately, for example when it thinks it is running low on buffers.
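
In NDIS terms, that "give them back immediately" case shows up as the low-resources flag on the indication; a filter typically handles it along these lines (a rough sketch; CopyWhatYouNeedNow is a made-up helper):

    // When the miniport indicates with the resources flag set, the data is only
    // valid for the duration of this call, so anything the filter wants to keep
    // must be copied before the handler returns.
    if (NDIS_TEST_RECEIVE_CANNOT_PEND(ReceiveFlags))
    {
        CopyWhatYouNeedNow(NetBufferLists);   // copy now; do not hold on to these NBLs
    }
    // Otherwise the NBLs may be held briefly, but they still have to be returned
    // promptly with NdisFReturnNetBufferLists().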

There is absolutely no way you can permanently map the NIC receive buffers to user mode, unless you are writing the NIC driver too. If you were writing the NIC driver, you could have an OID/IOCTL that told it to allocate and map the receive buffers to some shared place. A transport could take indicated packets and temporarily map them to a user mode process, but my guess is that would perform terribly. If you generically want to process received buffers without a copy, you will have to do it from kernel mode. This is one of the reasons SMB/NFS file servers are kernel mode components and not just user mode apps.

Modern servers can copy memory at 20 GBytes/sec or so. Copying at 1 Gbps or even 10 Gbps is not unrealistic. Copying becomes a bigger problem at 40/100 Gbps. If you really want zero-copy receive you will need to use the RDMA interface, or process received data as fragments in kernel mode during the DISPATCH_LEVEL packet indication. A kernel sockets component doing TCP will get a callback with the fragmented, uncopied receive buffers.

On the send side, if you really wanted, you could have a big shared buffer and a driver that builds NET_BUFFERs pointing into some slice of that buffer. I’d recommend against passing giant MDLs to the NIC send side. For example, I could imagine a NIC driver that takes a send packet MDL, feeds the whole thing through DMA mapping to physical addresses, and only then applies the starting offset. In that case the physical scatter/gather lists would be large and many cycles would be wasted. The NIC driver might also preallocate the scatter/gather list, which means it would have a limit on how many fragments it could handle. I’m not surprised you see issues sending packets with 256K MDL mappings. It sounds like your shared send buffer is already a single copy. The Windows NDIS architecture is very much oriented toward sending data with zero copies, from arbitrary fragments of memory. Perhaps in your particular case the application is computing the data to send, and can do so directly in the shared buffer.

If your packets are going to be virtually contiguous in this shared buffer, it seems like a simple matter of having a bunch of overlapping MDLs that map 12K slices of the buffer every 4K, so you just pick the correct MDL and the offset will always be < 4K. I picked 12K slices assuming you are talking about a medium that only supports 9K packets, like typical Ethernet.
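
Picking the right slice is then just integer arithmetic (a sketch with made-up names; the slice MDLs would be built once over the shared buffer at setup, e.g. with the IoAllocateMdl/MmBuildMdlForNonPagedPool calls you already use, and slices near the end of the buffer would be correspondingly shorter):

    #define SLICE_STRIDE (4 * 1024)    // a new slice starts every 4K
    #define SLICE_LENGTH (12 * 1024)   // each slice covers 12K, so a packet of up to
                                       // ~9K starting anywhere fits inside one slice

    // Per packet: choose the overlapping MDL and the small residual offset.
    ULONG sliceIndex    = packetOffset / SLICE_STRIDE;   // which slice MDL to use
    ULONG offsetInSlice = packetOffset % SLICE_STRIDE;   // always < 4K

    NET_BUFFER_FIRST_MDL(nb)          = Slice[sliceIndex];
    NET_BUFFER_CURRENT_MDL(nb)        = Slice[sliceIndex];
    NET_BUFFER_DATA_OFFSET(nb)        = offsetInSlice;
    NET_BUFFER_CURRENT_MDL_OFFSET(nb) = offsetInSlice;
    NET_BUFFER_DATA_LENGTH(nb)        = packetLength;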

If you’re looking at high speed packet processing, you might also look at the poll mode PacketDirect interface, which is a new feature in Windows 10/Server 2016.

Jan

Thanks for your quick response, Jan.

Unfortunately we can't use Registered IO because AFAIK it works over sockets, which is a part of the Windows architecture that we need to bypass, including raw sockets. It has to do with how Windows selects the network adapter to use for sending when there are several available (routing table) and also with how some special ports/protocols are filtered out.

In the end we've gone with your proposal of using several smaller MDLs, each about the maximum size of a simple Ethernet frame (1522 bytes), to describe the Tx buffer instead of a single large one. So far it's working fine.

We're also considering using even smaller MDLs of a configurable size (on the assumption that NDIS or the miniport could be copying to the device the whole buffer described by the MDL before applying the offset and length). This way, if the size is appropriately chosen, most messages would fit in a single MDL; but if a larger message must be sent, several MDLs can be chained inside the same NET_BUFFER.
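
Roughly, a message that spills past one slot would then be described like this (a sketch with our own names; SLOT_SIZE and SlotMdl[] stand for the fixed-size slot MDLs built once over the shared buffer, and the chain has to be undone on send completion before the slots are reused):

    // Message of msgLength bytes starting at the beginning of slot s and
    // spilling over into slot s + 1.
    PMDL first  = SlotMdl[s];
    PMDL second = SlotMdl[s + 1];
    first->Next  = second;              // chain the two fixed-size slot MDLs
    second->Next = NULL;

    PNET_BUFFER nb = NET_BUFFER_LIST_FIRST_NB(nbl);
    NET_BUFFER_FIRST_MDL(nb)          = first;
    NET_BUFFER_CURRENT_MDL(nb)        = first;
    NET_BUFFER_DATA_OFFSET(nb)        = 0;           // message starts at the slot boundary
    NET_BUFFER_CURRENT_MDL_OFFSET(nb) = 0;
    NET_BUFFER_DATA_LENGTH(nb)        = msgLength;   // may exceed SLOT_SIZE; the rest comes from 'second'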

It's a bit disappointing that such a limitation on large MDL usage, and details about the data exchange between NDIS and the device, are not well described or easily found in the NDIS documentation. We are now protecting ourselves against an unpleasant experimental result, without an authoritative statement that such usage is forbidden.

Thanks a lot for your help!

J.M.