Comments
> Hi everyone !
>
> I am working on a virtual audio driver to send audio data from kernel
> to a user space application in charge of streaming received audio data
> over a network.
>
> I am actually using a socket to send audio data from kernel to user space.
>
> Because some of you told me I should use IOCTL instead of WINSOCK, I
> would like to have your opinion about IOCTL and be sure it will be a
> better solution than using WINSOCK.
>
> So, here is what I do :
> * I am developing a virtual audio driver, playback only, and WaveCyclic-based
> * I want to send audio data as soon as possible from kernel space to user space
> * I am actually copying audio data and scheduling a DPC each time the
> IDMAChannel::CopyTo is called
> * I am sending audio data using a kernel socket each time my custom DPC
> is executed
> * the communication is essentially from kernel to user space
>
> What is the best strategy, in my case, to send audio data as soon as
> possible using IOCTLs? Do I still need to use a DPC? Is sending data
> from kernel to user space synchronous?
>
> Is there any other alternative ?
>
> Thanks in advance.
>
> Matt
I found the following interesting article about IOCTL and the Inverted
Call Model.

Premature optimization, such as the circular buffer suggested below, is the
biggest cause of unreliable drivers. Start with the IOCTL approach, and
only after you have measured performance, and if it is unacceptable,
profiled the driver to determine that the IOCTL mechanism is the problem,
should you consider other mechanisms.
Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of [email protected]
Sent: Tuesday, August 30, 2016 10:59 AM
To: Windows System Software Devs Interest List <[email protected]>
Subject: RE: [ntdev] IOCTL as an alternative to WINSOCK for the most direct data path to user space?
If you want to continue with IOCTLs, then the simplest solution is to issue an
IOCTL from a user application and either block the IRP in the driver or return
STATUS_PENDING (do not forget IoMarkIrpPending). In the DPC you fill the IRP's
buffer and call IoCompleteRequest for the pending/blocked IRP.

An evolution of this approach, to reduce latency and increase throughput, is
issuing multiple asynchronous IOCTLs and waiting for their completion
(overlapped I/O). The driver puts the IRPs in a list and returns
STATUS_PENDING. The DPC removes an IRP from the list, fills in the data, and
completes the IRP.

There is another solution, which probably has the lowest latency and highest
throughput: a circular buffer. A user application allocates a buffer and an
event and sends them to the driver via IOCTL. The driver locks the buffer
(MmProbeAndLockPages) so it can be used in a DPC, and gets a pointer to the
event object (ObReferenceObjectByHandle). Both the driver and the user
application implement a circular buffer: the driver writes into it and the
application reads from it. A single-reader/single-writer circular buffer can
be implemented without any lock as long as the pointer arithmetic is atomic,
which is the case on the IA-32 and AMD64 architectures for aligned pointers,
i.e. when (address mod sizeof(void*)) == 0, or address % sizeof(void*) == 0.
The DPC routine writes into the circular buffer and sets the event to the
signaled state. The user application waits on the event; when
WaitForSingleObject returns, it reads data from the buffer until it becomes
empty, then returns to waiting on the event.
---
NTDEV is sponsored by OSR
Visit the list online at: <http://www.osronline.com/showlists.cfm?list=ntdev>
MONTHLY seminars on crash dump analysis, WDF, Windows internals and software drivers!
Details at <http://www.osr.com/seminars>
To unsubscribe, visit the List Server section of OSR Online at <http://www.osronline.com/page.cfm?name=ListServer>
Hi!

Multiple solutions to try. I'll probably start with the first two, which
seem simpler.

Thanks for your answer!
> Premature optimization, such as the circular buffer suggested below, is the
> biggest cause of unreliable drivers. Start with the IOCTL approach, and
> only after you have measured performance, and if it is unacceptable,
> profiled the driver to determine that the IOCTL mechanism is the problem,
> should you consider other mechanisms.
>
> Don Burn
> Windows Driver Consulting
> Website: http://www.windrvr.com
Hi!

By the IOCTL approach, do you refer to the first two solutions suggested by slavaim?

I have not yet read the article about the Inverted Call Model; what's
your opinion about it?

Thanks for your help!
Yes, the IOCTL approach is what should be used, and is basically the
inverted call if you use multiple IOCTLs and STATUS_PENDING.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com
> Yes, the IOCTL approach is what should be used, and is basically the
> inverted call if you use multiple IOCTLs and STATUS_PENDING.
Ok
> Some words on circular buffer implementation. The solution with
> IRP_MJ_CLEANUP is not perfect because of handle duplication (if
> somebody decides to attack the system). The safer solution is holding
> an IRP (i.e., an IOCTL) with a cancel routine and detecting abnormal
> process termination in it. But it looks more elaborate.
Ok
> Isn't a device driver supposed to transfer data from its ... EvtIoRead callback routine?
No. A device driver transfers its data at whatever point it actually
has data. If you happen to have data already queued up, then you can
certainly transfer it in EvtIoRead or EvtIoDeviceControl, but most
drivers aren't that lucky. They have to tuck those requests into a
queue somewhere. Later on, when the driver actually receives data,
whether from a device, or a bus, or another driver, it can pop the next
waiting request and complete it.
--
Tim Roberts, [email protected]
Providenza & Boekelheide, Inc.
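The queue-and-complete flow Tim describes can be modeled in a few lines. This is a hedged user-mode sketch, assuming hypothetical `request_t` and queue helpers rather than the real WDFREQUEST/WDFQUEUE APIs: requests that arrive before any data are parked in FIFO order, and each data arrival pops the oldest waiting request and completes it.

```c
/* User-mode model of the "tuck requests into a queue, complete them when
 * data arrives" pattern.  The names here are illustrative, not a WDF API. */
#include <stddef.h>
#include <stdbool.h>

#define MAX_PENDING 8

typedef struct {
    int id;               /* stand-in for the IRP / WDFREQUEST handle      */
    int completed_with;   /* data the request was completed with, -1 while pending */
} request_t;

typedef struct {
    request_t *pending[MAX_PENDING];
    size_t count;
} request_queue_t;

/* EvtIoRead analogue: no data is available yet, so park the request. */
static bool queue_request(request_queue_t *q, request_t *req)
{
    if (q->count == MAX_PENDING)
        return false;
    req->completed_with = -1;
    q->pending[q->count++] = req;
    return true;
}

/* Data-arrival analogue (a DPC or a lower driver's completion): pop the
 * oldest waiting request and complete it with the new data. */
static request_t *complete_next(request_queue_t *q, int data)
{
    if (q->count == 0)
        return NULL;      /* nobody waiting: buffer or drop the data */
    request_t *req = q->pending[0];
    for (size_t i = 1; i < q->count; i++)
        q->pending[i - 1] = q->pending[i];
    q->count--;
    req->completed_with = data;
    return req;
}
```

In a real driver, the parking and cancellation bookkeeping is exactly what a WDF queue does for you; the sketch only shows the ordering logic.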
The system does all of the work for you. You don't need to worry about tracking when it is safe to use the buffer, or when it should be mapped/unmapped, or handling all of those corner cases and race conditions. The engineers at Microsoft have done this for you, and what use would it be to re-implement their work, even assuming you could without spending years working on it?

In general, IMHO, shared buffer schemes that are implemented correctly are generally not any more efficient than simply using IRPs, as the overhead of doing it right is exactly what the Microsoft engineers have coded; do you think they purposely make these calls slow? Having said that, the long-lived IRP can be an exception to that rule and is used safely in specific cases, but as others have said, this should not be your first design regardless of what performance you think you need.
> their work, even assuming you could without spending years working on it.

As it usually happens, anything that gets blown out of proportion starts looking and sounding ridiculous, and the above statement is no exception. Surely sharing a buffer gives you a few more things to worry about, and, as other posters have already pointed out, in most cases this extra pain is simply unnecessary. However, "spending years working on it" is just a gross exaggeration that sounds more like propaganda from our "Professor Joe Flounder".

> In general, IMHO, shared buffer schemes that are implemented correctly are generally
> not any more efficient than simply using IRPs as the overhead of doing it right is exactly
> what the Microsoft engineers have coded

Yet another piece of nonsense. It is a well-known fact that sharing memory between an app and a driver may offer a dramatic performance enhancement compared to a file I/O operations interface (including ioctl()). This is what mmap() was designed for.

> do you think they purposely make these calls slow?

Well, they just had no option other than following the instructions of someone who is known to think of UNIX (as well as of the GUI) as a life-long foe. Do you really think they would be allowed to implement a single and uniform interface for both disk file and driver I/O operations under these circumstances? As a result, now you have to invent various mechanisms (like sharing events) to work around the shortcomings and limitations arising from the lack of an mmap() system call. As someone said, "Those who don't understand UNIX are doomed to re-invent it. Poorly"...
Anton Bassov
<QUOTE>
In general, IMHO, shared buffer schemes that are implemented correctly are generally not any more efficient than simply using IRPs as the overhead of doing it right is exactly what the Microsoft engineers have coded
</QUOTE>

Shared buffer has nearly zero overhead compared to an IOCTL implementation: the only calls are a KeSetEvent from the DPC, which results in the scheduler being invoked, and a WaitForSingleObject by the application. The shared buffer is mapped in both system and process space and its physical pages are locked, so no page faults are generated. Because the buffer pages are locked, the data is still held in the CPU cache (at least L2) when the user process is woken up on the same CPU, and there is no cache thrashing as in the buffer-copying case, where the CPU cache has to evict twice as much data to accommodate the copy.

An IOCTL implementation, in addition to the same KeSetEvent on IRP completion, requires entering the kernel for file I/O, which is not cheaper than entering the kernel to wait on an event (WaitForSingleObject), as it encompasses it. IOCTL requires multiple memory allocations (IRP plus buffer), memory releasing, and copying buffers between kernel and user buffers, or else the Memory Manager's involvement to lock and map a user buffer and then unlock and unmap it. An IOCTL implementation involves a couple of orders of magnitude more code being executed to transfer data from a DPC to a user application, on top of the overhead of a shared buffer implementation. IOCTL also places a burden on the CPU caches (both code and data) and on PTE and TLB management.
> It is well known fact that sharing memory between an app and driver may
> offer dramatic performance enhancement, compared to file IO operations
> interface (including ioctl()).

Note the "MAY OFFER" in this statement; unfortunately, plenty of
implementations don't offer performance improvements. In fact, Anton, please
give some references for your "well known fact"; the few I know of have a lot
of qualifiers that make that statement closer to urban legend than fact.

> Well, they just had no option other than following the instructions of
> someone who is known to think of UNIX (as well as of GUI) as of a
> life-long foe

A number of the senior folks who worked on the I/O subsystem over the years
came from a UNIX background. For example, Nar Ganapathy, who was one of the
architects of KMDF, was previously an architect at Sun. Anton, your
conspiracy theories are becoming ridiculous.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com
<QUOTE>
Shared buffer has nearly zero overhead compared to IOCTL implementation,
the only calls are a DPC call to KeSetEvent that results in a scheduler
being called and WaitForSingleObject by an application.
</QUOTE>

Have you measured the cost of the KeSetEvent/WaitForSingleObject pair versus
a DeviceIoControl for all situations and approaches? You may be surprised to
find that the overhead is similar, and with some of the optimizations
provided for user-space applications for I/O calls, DeviceIoControl can
actually be made faster.

Of course, this whole discussion is based on the assumption that the OP needs
an extremely fast data path; in almost all cases, when you really pin people
down, what they think is fast for a modern OS is actually pretty much
average.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com
<QUOTE>
Have you measured the cost of the KeSetEvent/WaitForSingleObject pair versus a DeviceIoControl for all situations and approaches?
</QUOTE>

It is amazing that I should remind you that the DeviceIoControl solution includes exactly the same calls to KeSetEvent and WaitForSingleObject/KeWaitForSingleObject.

<QUOTE>
Of course this whole discussion is based on the assumption that the OP needs an extremely fast data path
</QUOTE>

Actually, the OP needs a real-time data path for audio processing. Contrary to common belief, RT is not about being fast but about being predictable, and nothing beats a shared buffer in this.
<QUOTE>
It is amazing that I should remind you that the DeviceIoControl solution
includes exactly the same calls to KeSetEvent and
WaitForSingleObject/KeWaitForSingleObject.
</QUOTE>

So you have read the Windows source and seen this? And of course calls like
SetFileCompletionNotificationModes or the use of completion ports do not
impact this mechanism at all? You obviously have limited knowledge of the
capabilities of Windows in the I/O path.

<QUOTE>
Actually, the OP needs a real-time data path for audio processing. Contrary
to common belief, RT is not about being fast but about being predictable,
and nothing beats a shared buffer in this.
</QUOTE>

Actually, shared buffers have little or nothing to do with predictability.
This is a scheduling issue, which depends on a number of things, but the way
that data is copied from kernel to user space is not one of them. Don't
apply the biases of another OS to Windows; each OS is different, and assuming
they will react the same without testing the hypothesis just shows ignorance.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com
<QUOTE>
So you have read the Windows source and seen this?
</QUOTE>

Yes, I did.

<QUOTE>
You obviously have limited knowledge of the capabilities of Windows in the I/O path.
</QUOTE>

You are obviously trying to start personal attacks (not the first time) to cover your limited knowledge of OS design in general. When you are going to wait for data, you have to wait. Waiting means either polling or releasing the CPU; the latter means a thread is removed from the run queue and inserted into some event's wait queue.

<QUOTE>
Actually, shared buffers have little or nothing to do with predictability.
</QUOTE>

Actually, they do. I am not going to educate you on RT design here.
There are mechanisms for getting the event out of the I/O path, and other
optimizations. In fact, I have implemented a number of high-speed I/O
models on Windows for clients, including shared memory models, and none of
these use a user-space event directly, since it was not efficient.

As far as my OS knowledge goes, I've worked on OSes and system software for
over 40 years. In that time I have been involved with five commercial OSes,
and for four of them I was part of the original team.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com
In Windows, the blocking event mechanism is implemented by objects containing a DISPATCHER_HEADER, so it doesn't matter whether you are talking about an event or something else like a process, thread, semaphore, mutex, or timer: you are actually talking about DISPATCHER_HEADER, and all blocking synchronization ends up manipulating a DISPATCHER_HEADER. I believe it is not a secret to you that KEVENT is just a wrapper around DISPATCHER_HEADER.

<QUOTE>
Sorry you did not read the source well,
</QUOTE>

You will always correct me if I am wrong.

<QUOTE>
In fact I have implemented a number of high speed I/O models on Windows for clients including shared memory models, and none of these use a user space event directly
</QUOTE>

A good man knows his limitations.
<QUOTE>
Actually, shared buffers have little or nothing to do with predictability.
</QUOTE>

They do. Because the memory is locked down, they prevent hard page faults.
DeviceIoControl calls incur user/kernel switches, which in turn might lead to
the scheduler running another thread on the CPU. They may be good for general
throughput, but not if latency is a concern.

//Daniel
I have seen implementations of shared buffers that used pageable memory, and
I have seen IOCTL mechanisms where the user-space program, through one of
several mechanisms, handed the kernel non-paged buffers.

There are ways to limit the scheduler impact in the IOCTL path (FastIO,
using SetFileCompletionNotificationModes and completion ports, etc.).
Overall there are a lot of tools; assuming shared buffers are going to be the
fastest is a poor design decision, and that is what led to this long thread.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of
[email protected]
Sent: Wednesday, August 31, 2016 9:27 AM
To: Windows System Software Devs Interest List <[email protected]>
Subject: Re:[ntdev] IOCTL as an alternative to WINSOCK for the most direct
data path to user space ?
<QUOTE>
Shared buffers do not equal locked down memory, they are independent items.
</QUOTE>
We are talking here about locked buffers, in case you didn't notice the word DPC.
<QUOTE>
I have seen implementations of shared buffers that used pageable memory
</QUOTE>
We have all seen this. It is called the file system cache. Or do you want to tell us that you have seen a driver that mapped a user-space address range backed by a pagefile/file into system space while the pages were not locked? Think twice before answering.
<QUOTE>
I have seen IOCTL mechanisms where the user space program through one of
several mechanisms handed the kernel non-paged buffers.
</QUOTE>
We have all seen this. It is called buffered I/O. Or do you want to tell us that a driver used a locked user-space address range in arbitrary context (we are talking about DPC here)?
<QUOTE>
and that is what led to this long thread.
</QUOTE>
Your personal attacks lead to this long thread.
> A shared buffer has nearly zero overhead compared to an IOCTL implementation; the only calls are a DPC call to KeSetEvent, which results in the scheduler being invoked, and WaitForSingleObject by the application.
Yes, but how is that any different than a long-term IRP? You have the
exact same transactions: set an event and wake the user-mode process.
The shared buffer FEELS like it ought to be more efficient, but the
overhead is essentially identical.
--
Tim Roberts, [email protected]
Providenza & Boekelheide, Inc.
<quote>
As far as my OS knowledge goes, I've worked on OSes and system software for over 40 years. In that time I have been involved with five commercial OSes, four of which I was part of the original team for.
</quote>
Whenever you see a statement like that from Mr. Burn, you can be at least 500% sure his previous statements on a given topic are, softly speaking, "not-so-sound" from the technical standpoint, which had been pointed out to him. Whenever he puts a foot in his mouth, he starts referring to his 50+ years of experience (as well as throwing personal attacks and ad hominem arguments). The best option in such a situation is simply to ignore this part, and to request that he back up his statements from the technical standpoint. At this point it becomes plainly obvious that he is just full of shit. We've been here on quite a few occasions, Don, don't you think? (If a backup of my statements is needed, links to at least a dozen threads on NTDEV are available upon request.) As we are going to see shortly, this particular thread is not an exception to the above-mentioned general rule.
<quote>
Shared buffers do not equal locked down memory; they are independent items. I have seen implementations of shared buffers that used pageable memory, and I have seen IOCTL mechanisms where the user-space program, through one of several mechanisms, handed the kernel non-paged buffers.
</quote>
You must have implemented them yourself then....
Look, we are speaking about writing to a memory buffer in the context of a DPC. If a driver does not lock the shared memory buffer's pages in advance (MmProbeAndLockPages on the MDL that describes it), its very first page fault is going to result in a BSOD. What are you arguing about here???
In general, "I have seen XYZ" is a pretty weak argument in itself. For example, if you speak to any ambulance or emergency-department doctor, you will hear multiple stories about things that were _intentionally_ used in "very non-conventional" ways by the patients, from peas in the nose to light bulbs up the arse. However, it does not necessarily imply that peas and bulbs are inherently dangerous items per se, don't you think? The situation here is exactly the same: if XYZ has been misused by someone, it does not necessarily mean that XYZ itself is bad.
<quote>
... the "MAY OFFER" in this statement; unfortunately, plenty of implementations don't offer performance improvements. In fact, Anton, please give some references for your "well known fact"; the few I know of have a lot of qualifiers that make that statement closer to urban legend than fact.
</quote>
Well, your statement may make sense only if we assume that it was you who had provided these implementations, effectively washing away all the potential performance benefits of shared memory. If you need a more precise explanation/proof of the above-mentioned benefits, I would suggest reading post #16 on this thread carefully; I simply don't want to replicate the work that has already been done on this thread.
<quote>
A number of the senior folks who worked on the I/O subsystem over the years came from a UNIX background. For example, Nar Ganapathy, who was one of the architects of KMDF, was previously an architect at Sun.
</quote>
The "only" problem here is that the whole NT I/O subsystem was _architecturally_ defined in the very first version of NT, i.e. more than a decade before these folks turned up, and it has not changed since. The only thing these folks could do was improve certain implementation details, which, however, cannot change the original design choices. For example, AFAIK, FastIO was introduced exactly for the purpose of fixing the deficiencies and shortcomings of the IRP-based I/O model that became obvious at the very first stages of NT's life (IIRC, it was mentioned in Rajeev Nagar's book).
> Anton, your conspiracy theories are becoming ridiculous.
Sorry, Don, but the only ridiculous thing on this thread so far is your participation in it; unfortunately, practically all the statements that you have made here are nonsensical. You seem to be arguing for the very sake of arguing about something....
Anton Bassov
<QUOTE>
Yes, but how is that any different than a long-term IRP? You have the
exact same transactions: set an event and wake the user-mode process.
The shared buffer FEELS like it ought to be more efficient, but the
overhead is essentially identical.
</QUOTE>
The overhead is much less if you consider the CPU caches and the PTE and TLB management done by the CPU and the system. I already outlined this above.
A shared buffer is much better when a burst happens: consecutive data packets can be processed without entering kernel mode at all.
If you consider the latency of a relatively slow communication channel, it can be the same simply because most of the time is spent waiting, so the overhead is amortized by the long wait for data. But the overhead is never the same: the number of CPU instructions executed is orders of magnitude greater in the IRP case. This takes its toll on the CPU caches, the TLB, and the CPU's branch prediction table.
> If you consider the latency of a relatively slow communication channel, it can be the same simply because most of the time is spent waiting, so the overhead is amortized by the long wait for data. But the overhead is never the same: the number of CPU instructions executed is orders of magnitude greater in the IRP case.
All of which are MANY orders of magnitude less than any I/O operation
that was involved. The difference is insignificant.
> This takes its toll on the CPU caches, the TLB, and the CPU's branch prediction table.
Insignificant. CPU cycles are not the only metric of efficiency in the
world today. If it costs me additional design, programming, debugging,
and maintenance time to get a result that micro-optimizes the branch
prediction tables, then my work was a fool's errand.
--
Tim Roberts, [email protected]
Providenza & Boekelheide, Inc.
<QUOTE>
All of which are MANY orders of magnitude less than any I/O operation that was involved.
</QUOTE>
Not so. If this were considered a generally acceptable idea, Windows NT would have been implemented as a microkernel, as it was originally conceived.
<QUOTE>
Insignificant. CPU cycles are not the only metric of efficiency in the world today.
</QUOTE>
When we are talking about code-execution overhead, this is one of the most important metrics. If it were not significant, we would not have superscalar CPUs with more than 100 instructions in flight, huge register files, huge reorder buffers, out-of-order execution, branch prediction tables, and huge caches.
You can come up with other metrics depending on the goal, for example man-hours spent on design and development, or the number of pizzas eaten by the team.
<QUOTE>
If it costs me additional design, programming, debugging,
and maintenance time to get a result that micro-optimizes the branch
prediction tables, then my work was a fool's errand.
</QUOTE>
Actually, it doesn't cost much more than an IOCTL implementation. All you need is to implement a single-reader/single-writer lock-free circular buffer over an allocated and locked user buffer. Such buffer implementations are available free of charge from many sources.