multiple MDL's referring to the same memory...

I’m looking for ways to improve the performance of my Xen network
drivers, particularly on the RX path where there is quite a lot of
copying going on.

The way it works is as follows:
. Allocate pages of memory and put them on a ring buffer, 1 page per ring slot.
. When an interrupt occurs, packets will be waiting on the ring.
. A packet may be a normal 1500 byte packet (possibly with the header and
  subsequent data on separate pages), in which case I can give it straight to
  Windows, unless the header is too small for Windows - Windows and Linux have
  differing requirements on buffer layout.
. Alternatively, the packet may have originated from another DomU on the same
  physical machine which may be using Large Send Offload. Linux can accept
  ‘large’ packets just fine, but I need to break these up in Windows for this
  to work, which involves copying.

It’s the last point that I want to try and optimise. Assuming an MSS of
1460 bytes, the ‘broken up’ packets might look like this:

1: 54 bytes of new header + bytes 54-1513 of the original packet
2: 54 bytes of new header + bytes 1514-2973 of the original packet
3: 54 bytes of new header + bytes 2974-4433 of the original packet
And so on, up to around 60k of data.
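In C terms the split is just the following arithmetic (a throwaway sketch;
totalLen is made up for illustration, the 54 and 1460 are the figures above):

ULONG headerLen = 54, mss = 1460, totalLen = 60 * 1024;
ULONG offset, n;

for (offset = headerLen, n = 1; offset < totalLen; offset += mss, n++) {
    ULONG chunk = (totalLen - offset < mss) ? (totalLen - offset) : mss;
    /* packet n = 54 bytes of new header
     *          + bytes [offset, offset + chunk - 1] of the original packet */
}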

What I have been doing is just allocating new buffers and copying the
data from the large packet into the new packet before giving the
new packet to Windows, but obviously that involves an allocation and a
copy operation.

So what about if I created some new NDIS buffers (e.g. MDLs) that map
different parts of the original buffer? I can’t see any reason why that
wouldn’t work but maybe it’s documented as a no-no somewhere?

Any other comments on this idea appreciated too.

Thanks

James


On second thought, there doesn’t appear to be a mechanism to allow an
NDIS buffer to have a start address that is not aligned to PAGE_SIZE,
which makes my optimisation impossible (or at least unsupported)…
d’oh.

James

> So what about if I created some new NDIS buffers (eg MDL’s) that map
> different parts of the original buffer?

Why not? You have IoBuildPartialMdl for this.
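Something like this, roughly (an untested sketch - the helper name is made up,
and masterMdl is assumed to be the MDL that already describes the receive page):

PMDL BuildSubRangeMdl(PMDL masterMdl, ULONG offset, ULONG length)
{
    /* virtual address of the sub-range within the master MDL's buffer */
    PUCHAR base = (PUCHAR)MmGetMdlVirtualAddress(masterMdl) + offset;
    PMDL partial = IoAllocateMdl(base, length, FALSE, FALSE, NULL);

    if (partial == NULL)
        return NULL;

    /* copies the relevant page entries from the master - no data is copied,
       and the master must stay alive for as long as the partial does */
    IoBuildPartialMdl(masterMdl, partial, base, length);
    return partial;    /* free with IoFreeMdl when done */
}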

BTW - are you really sure Windows requires splitting?


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> On second thought, there doesn’t appear to be a mechanism to allow an
> NDIS buffer to have a start address that is not aligned to PAGE_SIZE,

NDIS_BUFFER is MDL. Plain and simple. Surely ->ByteOffset can be used.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

As Max has pointed out, if you have an MDL (NDIS_BUFFER) describing the
memory, you can use NdisCopyBuffer() to allocate a new NDIS_BUFFER that
describes a range in the original NDIS_BUFFER (MDL). So if your ring buffer
pages are allocated as a slab or as multiple slabs, it should not be an
issue.
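In other words, something along these lines (an untested sketch - bufferPool,
rxBuf, offset and len are stand-ins for whatever the driver actually has):

NDIS_STATUS status;
PNDIS_BUFFER pktBuf;

/* describe bytes [offset, offset + len) of the memory already described by
   rxBuf; no data is copied, only a new descriptor is allocated */
NdisCopyBuffer(&status, &pktBuf, bufferPool, rxBuf, offset, len);
if (status != NDIS_STATUS_SUCCESS) {
    /* out of buffer descriptors - fall back to copying, or drop */
}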

As for NDIS not accepting large packets - well, that is not true per se.
You have likely reported the buffer sizes and maximum transfer unit sizes to
NDIS via responses to OID_GEN_xxx that have *told* bound protocols above how
large a packet you can handle.

The devil in the details is that if your virtual switch connects to an
external link (or links) then the external link MTU will result in an upper
bound MTU that the other (virtual) links must match. The whole ‘selective
jumbo packet’ thingy is a bit tricky. I don’t think the TCP/IP
implementations are well adapted to dealing with that much choice at the
datalink layer and this really is a different thing than large-send offload.

Good Luck,
Dave Cattley
Consulting Engineer
Systems Software Development


> As Max has pointed out, if you have an MDL (NDIS_BUFFER) describing the
> memory, you can use NdisCopyBuffer()

Yes.

BTW - NdisCopyBuffer is IoAllocateMdl + IoBuildPartialMdl. It is interesting to reverse-engineer this tiny function, since it gives a clear sample of how to call the very useful IoBuildPartialMdl.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> As Max has pointed out, if you have an MDL (NDIS_BUFFER) describing the
> memory, you can use NdisCopyBuffer() to allocate a new NDIS_BUFFER that
> describes a range in the original NDIS_BUFFER (MDL). So if your ring
> buffer pages are allocated as a slab or as multiple slabs, it should not
> be an issue.

Wow. I had glanced at the ‘NdisCopyBuffer’ function in the help so many
times and just assumed that it simply memcpy’d memory around. Thank you
for your enlightenment - it turns out that it’s exactly what I want! :-)

Now that I have multiple Buffers/MDLs pointing to the same area of
memory, does NDIS take care of the fact that multiple NdisFreeBuffer
calls will take place for different MDLs but the same underlying page
of memory?

> As for NDIS not accepting large packets - well, that is not true per se.
> You have likely reported the buffer sizes and maximum transfer unit sizes
> to NDIS via responses to OID_GEN_xxx that have *told* bound protocols
> above how large a packet you can handle.
>
> The devil in the details is that if your virtual switch connects to an
> external link (or links) then the external link MTU will result in an
> upper bound MTU that the other (virtual) links must match. The whole
> ‘selective jumbo packet’ thingy is a bit tricky. I don’t think the TCP/IP
> implementations are well adapted to dealing with that much choice at the
> datalink layer and this really is a different thing than large-send
> offload.

Exactly. The problem is getting Windows to receive a 60KB packet without
then trying to send one (except of course that it can send one if it’s
TCP and it specifies the MSS value).

Now that I know what NdisCopyBuffer is all about, I should be able to
get rid of 2 memory copies, which only leaves me with the (in theory
unnecessary) checksum operation - see Annie Li’s post about this.

James

When you call NdisFreeBuffer() you are only freeing the MDL, not the memory.
NdisFreeBuffer() should only free an NDIS_BUFFER allocated by
NdisAllocateBuffer() or NdisCopyBuffer(), but really, it is just IoFreeMdl() -
as a good NDIS citizen, you are supposed to pretend you don’t know this :-)

If you are referring to the issue of ensuring that the lifetime of the
allocated virtual memory is managed such that it ‘exceeds’ the lifetime
of the NDIS_BUFFER, well, yeah, you should ensure that. I’m not sure if
any “bad things” happen when you free NP pool described by an MDL built by
NdisAllocateBuffer()/NdisCopyBuffer(), but I try to ensure I don’t find out.

Oh, and just for complete disclosure - even though the Docs have some
hopeful mumbo-jumbo about buffer chains and copying them which reads rather
confusingly, the function really does only copy (page descriptions from) the
*one* NDIS_BUFFER (MDL) that it is passed. It does not act on the entire
chain, aka, it is not NdisCopyBufferChain() (which is a rather simple and
useful routine to write).

And as much as it sucks that fragmentation and defragmentation processing
must occur, getting rid of the copies is a good start.

Good Luck,
-Dave


Thanks a lot for the reply David. I’m a whole lot wiser now :-)

James


> Now that I have multiple Buffers/MDLs pointing to the same area of
> memory, does NDIS take care of the fact that multiple NdisFreeBuffer
> calls will take place for different MDLs but the same underlying page
> of memory?

There is no need of taking care of this - NdisFreeBuffer is just IoFreeMdl. The underlying memory is not involved in this.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> > Now that I have multiple Buffers/MDLs pointing to the same area of
> > memory, does NDIS take care of the fact that multiple NdisFreeBuffer
> > calls will take place for different MDLs but the same underlying page
> > of memory?
>
> There is no need of taking care of this - NdisFreeBuffer is just
> IoFreeMdl. The underlying memory is not involved in this.

I must be missing something… what if my code goes like this:

1. Set buf = NdisAllocateBuffer(...)
2. Put buf on the ring
   … an interrupt occurs …
3. Get buf off the ring - it contains two packets worth of data via the
   Large Send Offload in Linux, which I have to break up before indicating
   to NDIS
4. Set pktbuf1 = NdisCopyBuffer(buf, packet 1 offset & len)
5. Make a packet out of pktbuf1
6. Set pktbuf2 = NdisCopyBuffer(buf, packet 2 offset & len)
7. Make a packet out of pktbuf2
8. Indicate the packets
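Fleshed out a bit, steps 4 to 8 look something like this (untested sketch -
the pool handles and offsets are placeholders, and error handling is elided):

static VOID
IndicateTwoPacketsFromPage(
    NDIS_HANDLE adapterHandle,
    NDIS_HANDLE packetPool,
    NDIS_HANDLE bufferPool,
    PNDIS_BUFFER buf,        /* master buffer describing the ring page */
    ULONG offsets[2],
    ULONG lengths[2])
{
    PNDIS_PACKET packets[2];
    NDIS_STATUS status;
    int i;

    for (i = 0; i < 2; i++) {
        PNDIS_BUFFER pktBuf;

        /* new descriptor over part of the page - no data copy */
        NdisCopyBuffer(&status, &pktBuf, bufferPool, buf, offsets[i], lengths[i]);
        NdisAllocatePacket(&status, &packets[i], packetPool);
        NdisChainBufferAtBack(packets[i], pktBuf);
        NDIS_SET_PACKET_STATUS(packets[i], NDIS_STATUS_SUCCESS);
    }

    NdisMIndicateReceivePacket(adapterHandle, packets, 2);
    /* pktbuf1/pktbuf2 come back via MiniportReturnPacket later */
}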

Obviously I’m skipping a few steps there, but at the end of that I have
1 memory page allocated from the pool, and 3 NDIS_BUFFERs that describe
it or parts of it. pktbuf1 and pktbuf2 will be freed as part of the
return packets call, but when can I free buf? I can think of the
following scenarios:

a. When I free buf, the underlying memory area is available to allocate
to a following call to NdisAllocateBuffer, so obviously I can’t free it
until pktbuf1 and pktbuf2 have been freed.

b. When I free buf, NDIS knows that pktbuf1 and pktbuf2 are both still
using it (e.g. by reference counting), so I can free buf between steps 7
and 8 above, and NDIS will ensure that the page won’t be reused until I
subsequently free pktbuf1 and pktbuf2.

c. I’m missing something stupidly obvious and should be hit over the
head with a cardboard tube :-)

Thanks

James

Ok, so here is the cardboard tube (duck!)

Well, no. NDIS does not keep track of any such thing. In fact, NDIS has
its head in the sand about what you are doing in having NDIS_BUFFERs
describe the same virtual (and physical) memory. So given that, you need
to realize that only your code is moderating the ‘use’ and ‘reuse’
semantics you are implying.

The thing you might be missing is that the NDIS_BUFFERs (MDLs) are *not* the
resource (or are not the only resource). These only *describe* the memory
to the OS for purposes of tracking virtual to physical page mapping in an
opaque way. When you allocate (or copy) an NDIS_BUFFER you are *not*
allocating the memory backing the buffer, just the memory backing the OS
*descriptor* datastructure for the buffer. Somewhere in the life of your
system something must have allocated the actual memory you are getting
packet data into or out of.

So let’s say that this occurred at the same time your ‘buf’ structures were
allocated. It would be something like this:

NdisAllocateMemoryWithTag(&bufData, MY_BUF_SIZE, MY_BUF_TAG);           // allocate the actual NP-pool slab
NdisAllocateBuffer(&status, &buf, g_BufferPool, bufData, MY_BUF_SIZE);  // describe it with an NDIS_BUFFER (MDL)

… etc.

You have a slab of virtual memory pointed at by bufData and now ‘described’
by the NDIS_BUFFER buf. (Ok, I can hear all of you out there desperately
wanting to find the pun about ‘in the buf’ - forget about it).

The *resource* you are placing into your pool (ring buffer of buffers in
this case) is still the virtual memory! It is just described by the
NDIS_BUFFER and you are (apparently) using the NDIS_BUFFER to provide the
additional descriptor features of Virtual Address, Size, and a Next pointer.

What I think you are missing from your system is a purpose-built
datastructure to describe the true (complete) state of your logical buffer
slab (the thingy containing zero or more packets). Basically, you need a
reference count so that when all of the packets that have been derived from
a single buffer have been returned, you can detect this and put the ‘buf’
back into the pool.

For that you will need your own datastructure and management routines. NDIS
does not do this behind the scenes for you, nor does the OS through the MDL
mechanisms.
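For instance, the shape of it might be something like this (completely
made-up names, just to make the idea concrete - RecyclePage stands for
whatever routine puts the page back on the ring):

/* one of these per ring page, allocated up front alongside the page itself */
typedef struct _PAGE_DESC {
    PVOID        Data;       /* the NP-pool page (bufData above)             */
    PNDIS_BUFFER MasterBuf;  /* the NDIS_BUFFER describing the whole page    */
    LONG         RefCount;   /* one ref per packet currently derived from it */
} PAGE_DESC, *PPAGE_DESC;

/* called once per returned packet that referenced this page */
VOID PageDescRelease(PPAGE_DESC desc)
{
    if (InterlockedDecrement(&desc->RefCount) == 0) {
        /* no outstanding partial MDLs left - only now is it safe to
           recycle or free MasterBuf and Data */
        RecyclePage(desc);
    }
}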

Good Luck,
-dave


> Obviously I’m skipping a few steps there, but at the end of that I have
> 1 memory page allocated from the pool, and 3 NDIS_BUFFERs that describe
> it or parts of it. pktbuf1 and pktbuf2 will be freed as part of the
> return packets call,

All is correct.

> but when can I free buf?

At any time after both pktbuf1 and pktbuf2 have been freed. So, probably you will need to refcount “buf”.

The partial MDL (NDIS copied buffer) cannot outlive its master MDL - it must not still be alive when the master is destroyed. Doing so quickly results in a PFN_LIST_CORRUPT BSOD.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> > Obviously I’m skipping a few steps there, but at the end of that I have
> > 1 memory page allocated from the pool, and 3 NDIS_BUFFERs that describe
> > it or parts of it. pktbuf1 and pktbuf2 will be freed as part of the
> > return packets call,
>
> All is correct.
>
> > but when can I free buf?
>
> At any time after both pktbuf1 and pktbuf2 have been freed. So, probably
> you will need to refcount “buf”.
>
> The partial MDL (NDIS copied buffer) cannot outlive its master MDL - it
> must not still be alive when the master is destroyed. Doing so quickly
> results in a PFN_LIST_CORRUPT BSOD.

Thanks for that, and thanks also to David for his explanation.

My drivers maintain a list of buffers that already have the necessary
config to allow the page to be passed to Dom0 in Xen (the grant ref,
which is tacked onto some extra space allocated to the end of the MDL),
so I may be able to use the same sort of logic to also include a
reference count.

When I allocate the partial MDL, I will need some extra space tacked
onto the end of the MDL itself (for a pointer to the original MDL). I
guess I’ll have to use IoBuildPartialMdl instead of the NDIS functions,
which don’t support that facility (you can’t specify the size of the
buffer/MDL). Microsoft isn’t going to like it from a WHQL point of view,
but the impression I get is that you can break WHQL rules (to some
extent) if sticking to the WHQL rules for their own sake would produce
an inferior driver…

Anyway, I’ll go with the IoBuildPartialMdl method for now.

Thanks

James

> When I allocate the partial MDL, I will need some extra space tacked
> onto the end of the MDL itself (for a pointer to the original MDL).

Allocate a second structure which will associate the partial MDL with the master one.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

James,

Tacking stuff to the end of an MDL seems a bit dicey. How many of these
buffers are we talking about? 100? 1000?

Allocate your own data structure that contains a Reference Count, pointer to
the NDIS_BUFFER, etc. and get it all working reliably. Use a lookaside list
if you are worried about alloc/free cycle time. Let the system level the
lookaside depth based on usage. You can’t possibly be talking about so
many of these things that doing so would be a problem (for memory
consumption). You are already committed to allocating an MDL for them.
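For example, roughly (sketch only - RX_BUF_CONTEXT and the pool tag are
invented names for your refcount/PMDL structure):

NPAGED_LOOKASIDE_LIST rxContextLookaside;   /* lives for the adapter's lifetime */

/* at initialisation */
NdisInitializeNPagedLookasideList(&rxContextLookaside,
                                  NULL, NULL, 0,
                                  sizeof(RX_BUF_CONTEXT),
                                  'xCxR',               /* made-up pool tag */
                                  0);

/* per receive */
PRX_BUF_CONTEXT ctx =
    (PRX_BUF_CONTEXT)NdisAllocateFromNPagedLookasideList(&rxContextLookaside);

/* when the last reference goes away */
NdisFreeToNPagedLookasideList(&rxContextLookaside, ctx);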

In my unsolicited opinion, trying to be so clever is inviting complexity to
overwhelm usefulness. The MDL is a system datastructure that does not
have ‘extensibility’ designed in. Or at least that is true of the more
opaque NDIS_BUFFER. What happens when the next Verifier starts checking
MDLs for validity by looking at the allocated size in the heap or some such?
Boom. Your driver breaks.

If you determine (from poolmon) that your little datastructure is eating
away terribly at NP pool, you can look to optimize it. Slab allocating a
‘block’ of smaller objects and sub-allocating is a well understood solution
to dealing with heap header size swamping the size of a dynamically
allocated datastructure. But again, why build it until you measure you need
it?

Good Luck,
-Dave


> James,
>
> Tacking stuff to the end of an MDL seems a bit dicey. How many of these
> buffers are we talking about? 100? 1000?

The ring can hold 256 pages, needing one MDL each. Each page could hold
3 packets, needing another 3 MDLs, and on top of that, when I refill the
ring I may not have received all the MDLs back from NDIS (via
ReturnPacket), so potentially a thousand under extreme load.

> Allocate your own data structure that contains a Reference Count, pointer
> to the NDIS_BUFFER, etc. and get it all working reliably. Use a lookaside
> list if you are worried about alloc/free cycle time. Let the system level
> the lookaside depth based on usage. You can’t possibly be talking about
> so many of these things that doing so would be a problem (for memory
> consumption). You are already committed to allocating an MDL for them.

It is using nonpaged pool so I am supposed to be frugal with memory
usage, but high throughput equates to high memory usage. If memory runs
low then throughput will suffer, but that’s just the way it is.

That is probably the direction I’m looking towards. The NDIS_PACKET
structure contains 3 pointers that I can use however I want, so I just
need one of these to point to my structure with the list of original
MDLs and the reference count. The pointer is NULL when I don’t use
partial MDLs.

I was hoping I could do it per MDL rather than per Packet. Worst case, I
will get a 60K TCP packet (15 pages) from Linux that gets broken down
into 40 NDIS packets, each with two or three MDLs (one MDL for the
header, one for the data, and one more if the data spans two pages), so
I could return the physical memory when all the MDLs are freed. This
way I can only return all the memory at once, and only when all the NDIS
packets I created have been returned.

> In my unsolicited opinion, trying to be so clever is inviting complexity
> to overwhelm usefulness. The MDL is a system datastructure that does not
> have ‘extensibility’ designed in. Or at least that is true of the more
> opaque NDIS_BUFFER. What happens when the next Verifier starts checking
> MDLs for validity by looking at the allocated size in the heap or some
> such? Boom. Your driver breaks.

Well… only when using the verifier :-)

> If you determine (from poolmon) that your little datastructure is eating
> away terribly at NP pool, you can look to optimize it. Slab allocating a
> ‘block’ of smaller objects and sub-allocating is a well understood
> solution to dealing with heap header size swamping the size of a
> dynamically allocated datastructure. But again, why build it until you
> measure you need it?

Your unsolicited opinion and ideas are greatly appreciated. Thanks

James

>will get a 60K TCP packet (15 pages) from Linux

Can you tune Linux to ban jumbo frames on this virtual network instead of playing tricks with packet splitting on the Windows side?


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> > will get a 60K TCP packet (15 pages) from Linux
>
> Can you tune Linux to ban jumbo frames on this virtual network instead
> of playing tricks with packet splitting on the Windows side?

No, unless I wanted to disable sending of large frames too, which I
don’t :-)

There are still efficiencies to be had memory/ring wise anyway - a 60K
TCP packet is about 15 ‘slots’ (a slot maps to one memory page) on the
ring, but if it arrived as 40 packets it could be 80 slots (if the
header and body were in separate pages, as Windows tends to do). There
are only 256 slots on the ring, so you can see why this matters. I think
I can manage it much more efficiently by dealing with it on the Windows
side.

Thanks again for all your input, I think the end result of all this will
be a much nicer driver.

James

James,

You may have put too many multipliers into the factoring of how many
management structures you need.

The NDIS_PACKET or NET_BUFFER is perfectly capable of keeping track of the
NDIS_BUFFER or MDLs you create (as copies or partial MDLs) - one private set
for each indication.

The only ‘resource’ you have that is shared between packets and thus needs
to be reference counted is the ‘page’ object in your ring buffer.

For those, you have (apparently) a fixed number of 256. I would just
allocate a slab of 256 simple structures that contain the control fields you
need (refCnt, PMDL, etc.)

You do not need to control ‘sharing’ of the partial MDLs because they are
only created and ‘owned’ by a single packet.

The packet (net-buffer, whatever) needs only a single ‘reference’ (pointer)
back to the ‘page’ that contains its packet data. Actually, it may need two
because you mention a single packet might span two pages. The general case
may be that it needs ((MAX_PACKET_SIZE / PAGE_SIZE) + 1) page references.
Since you get to allocate the NDIS_PACKETs yourself, you can always size them
in the ProtocolReserved area as large as you want. Anything *after*
ProtocolReserved[MIN_PROTOCOL_RESERVED] is actually the packet creator’s to
play with and can be thought of as (in your case) an arbitrarily sized
MiniportReserved2 area.
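Something like this, say (sketch only - MIN_PROTOCOL_RESERVED stands for
whatever the bound protocols actually need, rounded up to pointer alignment,
and MAX_PACKET_SIZE, PAGE_DESC and NUM_RX_PACKETS are placeholders):

/* per-packet context carried in the ProtocolReserved area, after the
   portion the bound protocols get to use */
typedef struct _RX_PKT_CONTEXT {
    PPAGE_DESC Pages[(MAX_PACKET_SIZE / PAGE_SIZE) + 1];
    ULONG      PageCount;
} RX_PKT_CONTEXT, *PRX_PKT_CONTEXT;

/* size the pool so every packet descriptor has room for it */
NdisAllocatePacketPool(&status, &rxPacketPool, NUM_RX_PACKETS,
                       MIN_PROTOCOL_RESERVED + sizeof(RX_PKT_CONTEXT));

/* getting at it later, e.g. in MiniportReturnPacket */
PRX_PKT_CONTEXT ctx =
    (PRX_PKT_CONTEXT)(packet->ProtocolReserved + MIN_PROTOCOL_RESERVED);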

Alternatively you could ‘chain’ the page structures together and manage the
reference counting with a ‘cascade’ so that when you process the returned
packet you release a page reference for every page in the chain that the
packet describes a portion of. You have the MDLs which point into the pages
and you have a reference to the ‘first’ page affected by the packet. More
‘work’ but less ‘memory’.

But I really don’t see the calculus that supports the fan-up of the
object count you outlined. That could well be because I don’t
understand the guts of your driver, but it seems to me you need to manage
the sharing of 256 objects and that’s about it.

Good Luck,
Dave
