USB Bulk transfer delay

A few years ago I created a UMDF driver that’s used with a CCD camera. When an application issues a ReadFile, the driver creates a new bulk request so that the read is a multiple of the packet size. It pends the original request for later completion and then forwards the newly created request to the input pipe.

In the completion callback, the original request is located in the manual queue (via the pContext) and completed.
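For reference, a minimal sketch of this pend-and-forward pattern, written in KMDF/UMDF 2 style (the actual driver may use the older COM-based UMDF 1 interfaces); GetDeviceContext, the DEVICE_CONTEXT fields and ReadCompletion are illustrative names, not the real driver code:

// EvtIoRead: pend the app's request, then send a padded read of our
// own to the bulk-in pipe. Error handling is mostly elided.
VOID EvtIoRead(WDFQUEUE Queue, WDFREQUEST Request, size_t Length)
{
    DEVICE_CONTEXT *ctx = GetDeviceContext(WdfIoQueueGetDevice(Queue));
    WDFIOTARGET target = WdfUsbTargetPipeGetIoTarget(ctx->BulkInPipe);

    // Round the app's read length up to a multiple of the packet size.
    size_t padded = (Length + ctx->MaxPacketSize - 1)
                    / ctx->MaxPacketSize * ctx->MaxPacketSize;

    // Park the original request in a manual queue for later completion.
    NTSTATUS status = WdfRequestForwardToIoQueue(Request, ctx->ManualQueue);
    if (!NT_SUCCESS(status)) {
        WdfRequestComplete(Request, status);
        return;
    }

    // Build a driver-owned read of the padded size and forward it to
    // the input pipe; ReadCompletion later locates the parked request
    // via the context and completes it with the received data.
    WDFREQUEST newReq;
    WDFMEMORY buf;
    WdfRequestCreate(WDF_NO_OBJECT_ATTRIBUTES, target, &newReq);
    WdfMemoryCreate(WDF_NO_OBJECT_ATTRIBUTES,
                    NonPagedPoolNx,      // pool type is ignored in UMDF
                    0, padded, &buf, NULL);
    WdfUsbTargetPipeFormatRequestForRead(ctx->BulkInPipe, newReq, buf, NULL);
    WdfRequestSetCompletionRoutine(newReq, ReadCompletion, ctx);
    if (!WdfRequestSend(newReq, target, WDF_NO_SEND_OPTIONS)) {
        // Send failed synchronously; fail the parked request too
        // (retrieval from the manual queue not shown).
        WdfObjectDelete(newReq);
    }
}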

This has been working fine; however, with a recent upgrade to a larger-format CCD chip, artifacts have started to show up in the read data. The artifacts (brighter pixels) occur at regular intervals in the received data, every 4MB.

The bulk read itself is for several MB of data that represent a single full-frame readout of the CCD pixels. The hardware has limited buffering available (4KB) and reads data out of the chip based on the clocks the USB chip on the device generates. The current feeling is that after 4MB of transfer, the USB chip’s readout clock is halted for a small amount of time before the transfer resumes, which causes artifacts in the next few pixels read.

After a bit of searching, I found the following link on the MS website

http://msdn.microsoft.com/en-us/library/windows/hardware/ff538112(v=vs.85).aspx

which notes that transfer sizes for high-speed devices on Windows 7 are limited to 4MB. It may be a coincidence, but this ties in with the amount of data transferred before each cycle of artifacts occurs. Perhaps it’s due to the latency involved in setting up another transfer?

Assuming this is indeed the issue (suggestions welcome if you think it may be something else), can anyone suggest a way to work around this? I considered a switch to Isochronous, but the loss of guaranteed delivery makes this unsuitable.

If something below the UMDF code is splitting the 12+MB transfer into 3+ transfers, is there any way to allow the next transfer to begin before the previous transfer has been ACK’d?

The driver and firmware can be modified as needed; the only thing that’s not really an option is a hardware change, such as adding RAM to pre-buffer the entire CCD frame prior to sending, so that USB delays wouldn’t impact the data readout.

Regards,

Gary

Are you only pending one read at a time? Try pending a few to make sure there is always a read outstanding. That way the underlying implementation of how a transfer is split up doesn’t matter; there is always another transfer ready to go onto the schedule.

d


The 4K buffer may be too small. If two requests have to be used, they may be separated by a frame or two, which is a millisecond or two worth of data. If you cannot pause the data, you need a buffer big enough to hold it.

Doron: From the UMDF level of the driver there’s only one read request incoming for the full 12MB of data. Are you suggesting that instead of forwarding that as a single 12MB request lower down the stack, I split it into several smaller transfer requests and issue all of those reads at once?

I assumed that whatever it is lower down the stack that’s currently splitting the transfer up at 4MB boundaries would already be pending multiple reads when it splits the transfer, but it’s worth a try in case I assumed wrong.

Alex: The small buffer plus a delay of a few ms is what I believe the issue is, although that’s not yet confirmed.

What I am suggesting is that your UMDF driver use a continuous reader and send multiple reads to the device on its own, not in response to the app sending a read to the driver. This means you can read data before the app is ready, so you would need to put the read data into a ring buffer and complete the app’s read request with the enqueued data.
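A minimal sketch of that continuous-reader setup against the WDF API; TRANSFER_SIZE, the DEVICE_CONTEXT fields and the ring-buffer helpers are assumptions for illustration:

// Configure the pipe to keep several reads outstanding at all times;
// WDF owns the buffers and calls EvtReadComplete as data arrives.
#define TRANSFER_SIZE (64 * 1024)   // per-request read size; tune as needed

EVT_WDF_USB_READER_COMPLETION_ROUTINE EvtReadComplete;

NTSTATUS ConfigureReader(DEVICE_CONTEXT *ctx)
{
    WDF_USB_CONTINUOUS_READER_CONFIG config;
    WDF_USB_CONTINUOUS_READER_CONFIG_INIT(&config, EvtReadComplete,
                                          ctx, TRANSFER_SIZE);
    config.NumPendingReads = 3;     // keep several reads on the schedule

    NTSTATUS status =
        WdfUsbTargetPipeConfigContinuousReader(ctx->BulkInPipe, &config);
    if (!NT_SUCCESS(status)) {
        return status;
    }
    // Start the reader (typically done in EvtDeviceD0Entry).
    return WdfIoTargetStart(WdfUsbTargetPipeGetIoTarget(ctx->BulkInPipe));
}

VOID EvtReadComplete(WDFUSBPIPE Pipe, WDFMEMORY Buffer,
                     size_t NumBytesTransferred, WDFCONTEXT Context)
{
    DEVICE_CONTEXT *ctx = (DEVICE_CONTEXT *)Context;
    // Stash the data in a ring buffer, then complete any parked app
    // read from the enqueued data (helpers not shown).
    RingBufferAppend(ctx->Ring,
                     WdfMemoryGetBuffer(Buffer, NULL), NumBytesTransferred);
    SatisfyPendedReads(ctx);
}

With this in place, the framework re-pends reads on the driver’s behalf, which is the “always another transfer ready to go” behaviour described above.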

d


> What I am suggesting is that your UMDF driver use a continuous reader and send multiple reads to the device on its own, not in response to the app sending a read to the driver. This means you can read data before the app is ready, so you would need to put the read data into a ring buffer and complete the app’s read request with the enqueued data.

Wouldn’t that only reduce the latency between the app requesting and receiving data?

Unless I’ve misunderstood, wouldn’t the delay from the ack phase between each 4MB transfer with a continuous-reader setup be the same as the delay between the 2nd and subsequent transfers if a single app->driver read request were split by the driver into multiple 4MB requests?

If the artifacts are (I may be wrongly assuming) due to the few-ms delay Alex mentioned, wouldn’t a continuous-reader approach (and a multi-read approach, for that matter) still suffer from this?

Not sure this changes anything, but to provide a little more context: the app requests an exposure, which could be anything from a few milliseconds to hours, then reads back the 12MB of data. Latency for the setup of the first 4MB transfer request isn’t a concern, but it is for the second and third.

What advantage would a continuous-reader approach have over splitting a single app->driver request into several simultaneously submitted requests? Do you have any idea whether the code currently doing the 12MB split into 3 x 4MB transfers is pending multiple reads, or is setting up a new request only after the prior request is ACK’d, which would add further delay?

Hopefully I’ve not confused matters further. Thanks for your help,

Gary

Forgot to add: I understand what you’re saying about pending multiple requests to take the underlying implementation of transfer splitting out of play. What I wasn’t sure about is what advantage a continuous reader would have over submitting several read requests when a single read comes in from the app, beyond reducing the initial app->driver read latency.

xxxxx@figmentgames.com wrote:

> I assumed that whatever it is lower down the stack that’s currently splitting the transfer up at 4MB boundaries would already be pending multiple reads when it splits the transfer, but it’s worth a try in case I assumed wrong.

No one is splitting the transfer up into 4MB chunks. If you submit a 12MB read, it will go down to the host controller driver as a 12MB read. The host controller driver will then chop it up into 512-byte packets for transfer on the wire.

So, just chop your 12MB request into, for example, 12 requests of 1MB each. Submit three at a time. As each completes, go submit the next one, so you always maintain three pending at a time. That will get you the maximum throughput.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

xxxxx@figmentgames.com wrote:

> Wouldn’t that only reduce the latency between the app requesting and receiving data?

How could it possibly do that? Your app can’t continue until the last byte of the transfer arrives in the driver. Anything that happens in between is irrelevant to the application.

> Unless I’ve misunderstood, wouldn’t the delay from the ack phase between each 4MB transfer with a continuous-reader setup be the same as the delay between the 2nd and subsequent transfers if a single app->driver read request were split by the driver into multiple 4MB requests?

You can’t think about an “ack phase” in these terms. On the wire, the packets are all 512 bytes, and each packet is entirely separate. They are acked and retried separately. Your hardware doesn’t have any idea that the packet is part of a larger “transfer”. That is strictly a software concept.

> I considered a switch to Isochronous, but the loss of guaranteed delivery makes this unsuitable.

OK, allow me to point out just how silly your position is.

When they say that “bulk pipes have guaranteed delivery”, they don’t mean every packet is delivered immediately and without error. What they mean is that, if there IS a transmission error, the two ends of the pipe will keep retrying that packet until it transfers.

What you’re saying is that your hardware does not have enough buffering to survive even a relatively short delay in processing. If a retry happens, the stream is going to pause. No new data can go until the packet is retransmitted successfully. In the meantime, your FIFOs are going to fill up and overflow, and you’ll get data errors.

So, in actual fact, your device is the PERFECT candidate for isochronous processing. You have a device that cannot tolerate delays. It is time-sensitive. That’s what “isochronous” means. If you get a packet error, rather than getting retries that cause a delay of unknown length, your driver will get a notification that “packet X failed”. You can then take some intelligent action based on that knowledge, instead of hoping for the best.

Now, in reality, packet errors don’t really happen. I’ve run isochronous streams for a week at a time and never seen a packet dropped.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> How could it possibly do that? Your app can’t continue until the last byte of the transfer arrives in the driver. Anything that happens in between is irrelevant to the application.

I may have the wrong mental model for this, but my train of thought went something like this:

App requests data, driver prepares/forwards the request and data transfer begins.

compared to,

driver pends a few read requests and data transfer starts, app requests data, driver starts transferring data to the app.

Whilst the time to actually transfer the total data should be the same (packet loss notwithstanding), there should be a reduction from the app’s point of view (however small) due to the removal of whatever delay there is between the app issuing a read and the driver stack preparing and sending the read request. That said, whether that’s true or not, it has little bearing on the issue we’re facing.

> No one is splitting the transfer up into 4MB chunks. If you submit a 12MB read, it will go down to the host controller driver as a 12MB read. The host controller driver will then chop it up into 512-byte packets for transfer on the wire.

Perhaps it’s just a coincidence that we’re seeing artifacts every 4MB, and the MSDN link mentioning a 4MB software limit on transfer size is a red herring? This is sadly an area where there’s a gap in my knowledge that I’m trying to plug, as I’m sure these posts show :-)

Taking the link at face value, it suggested to me that there’s some kind of software limitation at the 4MB boundary, and without knowing anything about the lower layers it seemed like a plausible source of the artifacts.

However, if, as you suggest, nothing is splitting at 4MB and the software stack treats the first 4MB worth of 512-byte packets the same as the second 4MB worth, then that would suggest that pending multiple reads or switching to a continuous-reader approach would not solve the problem.

> OK, allow me to point out just how silly your position is.
>
> When they say that “bulk pipes have guaranteed delivery”, they don’t mean every packet is delivered immediately and without error. What they mean is that, if there IS a transmission error, the two ends of the pipe will keep retrying that packet until it transfers.

Yes, I understand retransmission and how guaranteed/non-guaranteed delivery works in general.

My reasoning for not wanting to use non-guaranteed delivery is that the loss of a full packet of pixel data may be sufficient to render the entire frame unusable for its intended purpose, whereas losing a few pixels to artifacts caused by resend delays, while still an issue, is the lesser of the two.

That said, you’ve convinced me to re-investigate isochronous as a possible option. I’ll need to check whether the loss of any data is acceptable. It may turn out that knowing some data is bad is preferable to believing it’s all good while having some unknown bad data due to artifacts.

If packet loss is quite low, this may end up a better option than I’d anticipated.

But… you will lose it one way or the other! From your description, it’s unavoidable, so it MUST be acceptable. If your device will overflow in the case of a Bulk data transfer retry, it will lose data. Given that I don’t know how your device works, I don’t know how MUCH data, sure… but you said you WILL lose data in this case.

As Mr. Roberts said, your device sounds like a good candidate for use of an Isochronous endpoint. AND, in the “real world”, unless you’re using a poor-quality set of long USB cables that are subject to electromagnetic interference, errors on USB connections are exceedingly rare.

As far as other matters… if you *are* going to use bulk transfers, it sounds to ME like using a continuous reader is exactly what you want. You can have as many read requests simultaneously outstanding on the host controller as you want, process the completions, and buffer them internally (or return them to your user) as you see fit.

Peter
OSR
@OSRDrivers

> But… you will lose it one way or the other! From your description, it’s unavoidable, so it MUST be acceptable. If your device will overflow in the case of a Bulk data transfer retry, it will lose data. Given that I don’t know how your device works, I don’t know how MUCH data, sure… but you said you WILL lose data in this case.

Whilst it’s unavoidable with the current driver, both cases potentially mean the entire frame has to be discarded, so neither was acceptable. However, my reason for stating that artefacts were preferable was that I had not realised at the time how rare packet loss on USB is. I was contrasting trying to solve a small 3-4 pixel loss every 4MB against randomly losing the chunk of pixels that make up a lost 512-byte packet. Not knowing enough about USB transfers, I’d assumed packet loss was far worse than it apparently is.

Given that false assumption about packet loss, I felt it better to try to find a solution to the artefacts whilst continuing to use the guaranteed delivery of bulk transfers, such as switching to a continuous-reader approach.

Based on what you and Tim have said about the rarity of packet loss, though, and assuming a continuous-reader approach doesn’t help, isochronous would now be a viable option. Having to throw away the odd frame now and then would be preferable to artefacts continuing to occur on every frame.

I’ll likely give Doron’s suggestion a try first, as it requires the smallest change to the driver. Failing that, I’ll take a look into the isochronous approach.

Thanks everyone for your help/suggestions and Tim for making me reconsider the use of isochronous transfers.

Moving to isoch endpoints will force you out of UMDF as well.

d

Bent from my phone



On Dec 20, 2014, at 4:05 PM, xxxxx@figmentgames.com wrote:

> Whilst it’s unavoidable with the current driver, both cases potentially mean the entire frame has to be discarded, so neither was acceptable. However, my reason for stating that artefacts were preferable was that I had not realised at the time how rare packet loss on USB is. I was contrasting trying to solve a small 3-4 pixel loss every 4MB against randomly losing the chunk of pixels that make up a lost 512-byte packet.

OK, let’s say your FIFO overflows and you lose some data. How will you know how many pixels were lost? Do you have markers at the end of each scan line? If not, then the entire remainder of the frame is ruined. If you lose 10 pixels, then the rest of the image will be wrapped 10 pixels to the left.

It’s almost impossible to come up with strategies to recover from lost data in streaming scenarios. That’s why most people choose isochronous, where you at least KNOW how much data was lost.

Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Tim: I’m not aware of the specifics of what is going on in the hardware, but pixels are not lost in the sense that the data is gone/missing; they’re reading out “hotter” due to delays in the readout. They’re lost in the sense that the pixels exist but the data is unusable.

This occurs in a cycle every 4MB during readout. When I said “3-4 pixel loss” I meant in terms of usable data, as opposed to that data never being received by the PC. It’s a different kind of loss to overflows or packet loss; apologies if I confused matters. I’d tried to avoid saying “loss” in my posts and used “artefact” instead, but one slipped in anyway :-)

Doron: Yes, it would mean switching from user mode. Another reason to try the continuous reader first, just in case. I’ll give this a go in the new year and, failing that, investigate further and look into an isochronous driver.

It looks like the hardware pauses the CCD readout clock when the 4K buffer is full, and this causes the CCD or ADC to misread the charge.

I suggest you size your partial requests so that they break on a scan-line boundary (possibly minus 4K). That way, any corrupt pixels will be at the edge of the frame.
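To make that concrete, an illustrative sizing helper; the pixel geometry and packet size here are example values, not the actual camera’s:

// Pick a sub-request length that is a whole number of scan lines and
// also a multiple of the 512-byte bulk packet size, so a stall at a
// request boundary only corrupts pixels at the edge of the frame.
#define PACKET_SIZE     512
#define BYTES_PER_PIXEL 2
#define PIXELS_PER_LINE 3000
#define LINE_BYTES      (PIXELS_PER_LINE * BYTES_PER_PIXEL)   /* 6000 */

size_t ScanAlignedChunk(size_t targetChunk)
{
    // Smallest run of scan lines whose byte count is packet-aligned
    // (32 lines = 192000 bytes with the example numbers above).
    size_t lines = 1;
    while ((lines * LINE_BYTES) % PACKET_SIZE != 0) {
        lines++;
    }
    size_t unit = lines * LINE_BYTES;

    // Largest whole multiple of that unit not exceeding the target.
    size_t chunk = (targetChunk / unit) * unit;
    return (chunk != 0) ? chunk : unit;
}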