Communication between minifilter and userland app too slow!

Hi!

I would like some advice on how to fix my problem, because I'm kind of lost…

I developed a minifilter driver to trace the execution of a binary (something like Process Monitor, but more advanced), and I'm using the filter communication port (FltSendMessage, FilterGetMessage…) to send the logs to the userland application.

It works, in the sense that the userland app receives the logs, but I'm losing a huge number of messages!

status = FltSendMessage(filter, &clientPort, buf, sizeBuf, NULL, 0, NULL);

When I use that function without a timeout, it sometimes freezes, and when I use a timeout of 0.5 s:

status = FltSendMessage(filter, &clientPort, buf, sizeBuf, NULL, 0, &timeout);
if (status == STATUS_TIMEOUT)
    DbgPrint("STATUS_TIMEOUT !!\n");

I can get ~500 "STATUS_TIMEOUT" lines displayed…

The userland app is multithreaded (64 threads), and the whole thing is running in a virtual machine.

Thanks!

xxxxx@gmail.com wrote:

> I developed a minifilter driver to trace the execution of a binary
> (something like Process Monitor, but more advanced), and I'm using the
> filter communication port (FltSendMessage, FilterGetMessage…) to send
> the logs to the userland application.
>
> It works, in the sense that the userland app receives the logs, but
> I'm losing a huge number of messages!
> status = FltSendMessage(filter, &clientPort, buf, sizeBuf, NULL, 0, NULL);
> When I use that function without a timeout, it sometimes freezes, and
> when I use a timeout of 0.5 s

Well, how many messages are you sending? This is a synchronous API –
it blocks until the application responds. If your application is busy
and doesn’t get back to its message loop, then you’re going to wait. If
you send thousands of messages in a second, it’s going to take time.

> The userland app is multithreaded (64 threads), and the whole thing is
> running in a virtual machine.

There is little point to an application with 64 threads, unless you have
a virtual machine with 64 processors. Are they all waiting for
something? If they’re all active, then you’re going to get thrashing.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Unfortunately, the OP doesn’t say anything about what those 64 threads are
doing. If they are in the critical path of returning control to the
kernel, and do anything that has potentially long blocking, the design is
just plain wrong. A single thread that accepts the message from the
filter driver, enqueues it in a local queue, and returns immediately
without blocking, is all that is necessary. I am not sure there is any
gain by using multiple threads, even if the filter driver is executing
concurrently on multiple cores.

I tend to use IOCPs for interthread messaging, but if you "roll your own"
queue in app space, use a CRITICAL_SECTION for synchronization instead of
a mutex. A CRITICAL_SECTION is quite fast, since it is implemented as a
spin-lock-with-timeout; it only calls into the kernel if it spins too long.

One of my articles shows performance graphs using OpenMP, where I measure
performance of compute-bound threads. It should come as no surprise to
most readers that performance peaks when the number of threads equals the
number of cores, and begins to fall off as the number of threads
increases. For threads which can block, however, having more than N
threads, where N is the number of (logical) cores, can improve performance.
When using IOCPs, the max concurrency count can be set to N, and you can
have more than N threads waiting (2*N is frequently a good “first guess”).
A thread that has been released from its IOCP wait is charged against the
concurrency count N, but if the thread blocks, it is no longer charged
against the concurrency count of the IOCP and an additional thread can be
released. I leave instrumenting to determine the optimum number of
threads to use as An Exercise For The Reader.

64 threads might not be so bad on even a 16-core system if the threads can
block during the processing of a notification from the filter driver. But
the thread that receives the notification should not do any significant
blocking, especially kernel calls on synchronization objects. That is why
I recommend CRITICAL_SECTIONs or IOCPs; "worker threads" with message
loops (known as "UI threads", a confusing name that gets programmers into
All Kinds Of Trouble) are another option, but the finite size of
PostMessage queues generally discourages their use.

Perhaps the folks at OSR could answer the question of how big a boost (if
there is any boost) a thread waiting for a filter message gets when it is
released. Ideally, you would like a really good boost, to prevent serious
priority inversion. If this becomes a problem, consider the Multimedia
Class Scheduler Service (MMCSS). It lets a properly registered user
process run fast-turnaround threads at priority up to 26; the MMCSS
service itself runs at priority 27, so a boosted thread cannot lock it
out.

Building high-performance apps that interact between kernel space and app
space is a lot more complicated than using threads like pixie dust
(sprinkle enough on, and you can fly). And, as indicated, you need
NUMBERS: how many messages per second you are generating, the round-trip
time (the length of time FltSendMessage blocks), and so on. That
round-trip time has to be held as low as possible, meaning that if all
you are doing is logging, you do NO logging inside the round trip: you
just queue up a request for your app to do the logging, and return. To
support analysis, you may need to include the core number and a 64-bit
timestamp in each record. Note that the only ordering you can depend on
is between timestamps from the same core, unless you use a global time
source that is immune to RDTSC clock skew between cores.
Joe



NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev
