Hello everyone,
I’ll start off with a quick rundown of the situation: We have a USB device from which we read data over a bulk endpoint. The data is generated by a timed process on the device itself. If the endpoint is not serviced with IN tokens for an extended period, data accumulates in the device’s onboard buffer and eventually overflows it. An overflow is an error and is reported to the user immediately; in other words, we cannot tolerate any missing data.
Allow me first to pre-empt what some of you may be thinking: Yes, we are aware that host controllers make no guarantee about servicing bulk endpoints. If we want deterministic service, we should use interrupt or isochronous endpoints, etc. etc. etc. For various reasons, we decided to go down the bulk road many years ago, and it has served (and continues to serve) us well. It’s only on a recent crop of machines that we are beginning to notice some issues.
On certain machines, mostly those running Vista, USB analysis shows long periods (several milliseconds or more) during which the host controller sends no IN tokens. This happens despite the fact that we always have several URBs queued to the USB driver stack, each with a 32 KB transfer buffer for data from the device.
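For concreteness, our submit path looks roughly like the sketch below. It is a simplified version of the standard WDM pattern for keeping bulk IN transfers in flight, not our actual code; BULK_READ_CONTEXT, SubmitBulkRead, and BulkReadComplete are illustrative names.

/*
 * Simplified sketch (illustrative names, error handling elided):
 * build a 32 KB bulk IN URB in a preallocated, reusable IRP/URB pair
 * and send it to the USB stack. We keep several of these in flight
 * at all times.
 */
#include <wdm.h>
#include <usbdi.h>
#include <usbdlib.h>

#define BULK_BUFFER_SIZE (32 * 1024)

typedef struct _BULK_READ_CONTEXT {
    PDEVICE_OBJECT   LowerDeviceObject;  /* next device in the USB stack */
    USBD_PIPE_HANDLE PipeHandle;         /* bulk IN pipe from SELECT_CONFIGURATION */
    PURB             Urb;                /* preallocated, reused URB */
    PVOID            Buffer;             /* 32 KB transfer buffer */
} BULK_READ_CONTEXT, *PBULK_READ_CONTEXT;

IO_COMPLETION_ROUTINE BulkReadComplete;  /* shown in the next sketch */

NTSTATUS
SubmitBulkRead(PBULK_READ_CONTEXT Ctx, PIRP Irp)
{
    PIO_STACK_LOCATION nextStack;

    /* (Re)build the bulk IN transfer request in the reusable URB. */
    UsbBuildInterruptOrBulkTransferRequest(
        Ctx->Urb,
        sizeof(struct _URB_BULK_OR_INTERRUPT_TRANSFER),
        Ctx->PipeHandle,
        Ctx->Buffer,
        NULL,                            /* no MDL; plain buffer */
        BULK_BUFFER_SIZE,
        USBD_TRANSFER_DIRECTION_IN | USBD_SHORT_TRANSFER_OK,
        NULL);

    /* Attach the URB to the IRP and send it down the stack. */
    nextStack = IoGetNextIrpStackLocation(Irp);
    nextStack->MajorFunction = IRP_MJ_INTERNAL_DEVICE_CONTROL;
    nextStack->Parameters.DeviceIoControl.IoControlCode =
        IOCTL_INTERNAL_USB_SUBMIT_URB;
    nextStack->Parameters.Others.Argument1 = Ctx->Urb;

    IoSetCompletionRoutine(Irp, BulkReadComplete, Ctx,
                           TRUE, TRUE, TRUE);

    return IoCallDriver(Ctx->LowerDeviceObject, Irp);
}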
From the software side of things, our URB completion DPC copies the data out of the buffer associated with the URB into the buffer the user ultimately reads from, does some accounting, then resubmits the IRP for reuse. We instrumented our DPC, which normally runs once every (32 KB / device_data_rate_in_bytes_per_second) seconds. In the failure case, that is, during the long periods when no IN tokens issue from the host controller, we see our DPCs delayed by up to 3 or 4 milliseconds from when they normally run; the rest of the time, they complete in a timely fashion.
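Concretely, the completion routine does something along these lines. Again a simplified sketch using the context struct from the previous one; CopyToUserRingBuffer is a placeholder for our real copy-and-accounting logic.

/*
 * Sketch of the completion routine. It is called at DISPATCH_LEVEL
 * when the host controller driver completes our bulk IN URB; this is
 * the "DPC" referred to above.
 */
VOID CopyToUserRingBuffer(PVOID Data, ULONG Length);  /* placeholder */

NTSTATUS
BulkReadComplete(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
{
    PBULK_READ_CONTEXT ctx = (PBULK_READ_CONTEXT)Context;

    UNREFERENCED_PARAMETER(DeviceObject);

    if (NT_SUCCESS(Irp->IoStatus.Status)) {
        /* TransferBufferLength is updated by the bus driver to the
           number of bytes actually received. */
        CopyToUserRingBuffer(
            ctx->Buffer,
            ctx->Urb->UrbBulkOrInterruptTransfer.TransferBufferLength);
    }

    /* Recycle the IRP and put the URB straight back in flight so the
       endpoint keeps getting serviced. */
    IoReuseIrp(Irp, STATUS_SUCCESS);
    SubmitBulkRead(ctx, Irp);

    /* We own this IRP; stop completion processing here. */
    return STATUS_MORE_PROCESSING_REQUIRED;
}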
A colleague pointed me to the DPC Latency Checker (http://www.thesycon.de/deu/latency_check.shtml), which sure enough shows a gigantic spike in system DPC latency every 7 or 8 seconds. By ‘gigantic’, I mean over two orders of magnitude. Baseline DPC latency averages about 40 us with periodic spikes as high as 9 ms. Not surprisingly, our software failures correlate very strongly with the spikes.
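My understanding of how such tools work (and roughly how we instrumented our own DPCs) is this: schedule a periodic timer DPC and compare when it actually runs against when it was due. A minimal sketch, with the usual caveat that the effective period is subject to the system clock resolution:

/*
 * Sketch: measure DPC dispatch latency with a periodic kernel timer.
 * The timer DPC is due every PERIOD_MS; any extra delay between when
 * it was due and when it actually runs is latency introduced by
 * whatever else is holding the CPU at or above DISPATCH_LEVEL.
 * (A 1 ms period assumes the timer resolution has been raised,
 * e.g. via ExSetTimerResolution.)
 */
#include <wdm.h>

#define PERIOD_MS 1

static KTIMER        g_Timer;
static KDPC          g_Dpc;
static LARGE_INTEGER g_LastRun;      /* QPC ticks at last DPC entry */
static LARGE_INTEGER g_QpcFreq;      /* QPC ticks per second */
static LONGLONG      g_MaxLatencyUs; /* worst-case observed, in us */

static VOID
LatencyDpc(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
{
    LARGE_INTEGER now = KeQueryPerformanceCounter(NULL);
    LONGLONG elapsedUs =
        (now.QuadPart - g_LastRun.QuadPart) * 1000000 / g_QpcFreq.QuadPart;
    LONGLONG latencyUs = elapsedUs - (PERIOD_MS * 1000);

    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(Context);
    UNREFERENCED_PARAMETER(Arg1);
    UNREFERENCED_PARAMETER(Arg2);

    if (latencyUs > g_MaxLatencyUs) {
        g_MaxLatencyUs = latencyUs;  /* in our failure case, ~9000 us */
    }
    g_LastRun = now;
}

VOID
StartLatencyMonitor(VOID)
{
    LARGE_INTEGER due;

    KeQueryPerformanceCounter(&g_QpcFreq);
    g_LastRun = KeQueryPerformanceCounter(NULL);

    KeInitializeDpc(&g_Dpc, LatencyDpc, NULL);
    KeInitializeTimerEx(&g_Timer, NotificationTimer);

    due.QuadPart = -10000LL * PERIOD_MS;  /* relative, 100 ns units */
    KeSetTimerEx(&g_Timer, due, PERIOD_MS, &g_Dpc);
}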
I thought I might find the offending actor in the system by using tracemon/tracerpt, but according to the histograms, the system doesn’t appear to be spending an inordinate amount of time in ISRs or DPCs.
So if my DPCs are not being delayed by ISRs or by other DPCs taking a long time, what would be causing them to run so much later than they should? Could it be a thread of execution in the kernel raising its IRQL and spinning on something, or performing some other very long operation at that elevated IRQL? (A sketch of the sort of pattern I’m imagining follows below.) I have no idea how I might narrow down the offending driver on this system. I’ve already tried disabling as many drivers as I can, but I can’t get the spike to stop occurring, and I’m just about at the end of my toolbox trying to determine what is ultimately causing the periodic long DPC latencies. (By the way, there are a number of online reports of DPC latency spikes on recent machines, with no consensus as to what is causing them.)
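To illustrate the kind of thing I have in mind, here is a hypothetical anti-pattern (emphatically not our code): a driver routine that raises to DISPATCH_LEVEL and busy-waits on hardware. While it runs, the DPC queue on that processor cannot drain, yet no individual ISR or DPC would look slow in a trace.

/*
 * Hypothetical anti-pattern: ordinary thread context executing at
 * raised IRQL. Every DPC on this CPU (including our URB completion)
 * is delayed for as long as the hardware takes to become ready, but
 * the delay is invisible to per-ISR/per-DPC histograms.
 */
#include <wdm.h>

VOID
BadPollAtDispatchLevel(volatile ULONG *HardwareReadyFlag)
{
    KIRQL oldIrql;

    KeRaiseIrql(DISPATCH_LEVEL, &oldIrql);

    /* Spins for however long the hardware takes; if that is
       milliseconds, DPC latency on this CPU is milliseconds. */
    while (*HardwareReadyFlag == 0) {
        KeStallExecutionProcessor(10);  /* 10 us busy-wait */
    }

    KeLowerIrql(oldIrql);
}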
Any advice or guidance would be appreciated.
Thanks,
Curtis Petersen