FAT32 small-file write speed on a removable drive

I am trying to write many files to a FAT32-formatted USB flash drive.
I am not an expert in Windows internals, so I am asking for your experience and advice.
While the write speed for big files is limited by USB and the USB device, the write speed for small files is ridiculously slow.
Using DiskMon I can see which sectors Win7 writes to the USB drive for each file.
Some facts:
Writes to the directory are done in blocks of 8 x 512-byte sectors (4KB), aligned on a 4KB boundary.
Writes to the FAT are done in blocks of 1 to 8 x 512-byte sectors (up to 4KB) that stay within a 4KB boundary.
The number of FAT sectors written depends on the file size, the cluster size, the sector size, and whether a FAT sector boundary is crossed.
Some equations:
filetotalclusters = (filesize + clustersize - 1) / clustersize
clusterspersector = sectorsize / 4 (a FAT32 entry is 4 bytes, one entry per cluster)
fileFATsectors = filetotalclusters / clusterspersector rounded up, plus 1 sector when a FAT sector boundary is crossed.
So for a 32KB cluster and a 512-byte sector, 1 FAT sector covers 4MB of file size, and 8 sectors (4KB) cover a 32MB file.
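
As a sanity check on the equations above, here is a minimal Python sketch. The function names and the `first_cluster` parameter are my own (hypothetical), and it assumes the file's clusters are allocated contiguously, which a fragmented volume would not guarantee.

```python
FAT32_ENTRY_BYTES = 4  # each FAT32 entry occupies 4 bytes


def file_total_clusters(file_size, cluster_size):
    """Clusters needed for a file, rounding up to whole clusters."""
    return (file_size + cluster_size - 1) // cluster_size


def clusters_per_sector(sector_size):
    """FAT entries (one per cluster) that fit in one FAT sector."""
    return sector_size // FAT32_ENTRY_BYTES


def file_fat_sectors(file_size, cluster_size, sector_size, first_cluster=2):
    """FAT sectors touched by a contiguous chain starting at first_cluster.

    Assumption: the chain is contiguous; first_cluster is a hypothetical
    parameter used to show the effect of crossing a FAT sector boundary.
    """
    n = file_total_clusters(file_size, cluster_size)
    if n == 0:
        return 0  # an empty file has no cluster chain
    per_sector = clusters_per_sector(sector_size)
    first_sector = first_cluster // per_sector
    last_sector = (first_cluster + n - 1) // per_sector
    return last_sector - first_sector + 1


if __name__ == "__main__":
    KB, MB = 1024, 1024 * 1024
    # File size covered by one 512-byte FAT sector with 32KB clusters.
    print(clusters_per_sector(512) * 32 * KB // MB, "MB per FAT sector")
    # A 4MB file whose chain starts on a FAT sector boundary touches 1 sector...
    print(file_fat_sectors(4 * MB, 32 * KB, 512, first_cluster=128))
    # ...but the same file starting at cluster 2 spills into a second sector.
    print(file_fat_sectors(4 * MB, 32 * KB, 512, first_cluster=2))
```

The second and third calls differ only in where the chain starts, which is the "+1 sector" case from the equations.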

Small file, single-block write (<32MB and the FAT chain doesn't cross a sector boundary)
create(), write(), close(): 6 writes per file

  1. create() write dir 4KB (name of file, default attributes and current times)
  2. write() write file data
  3. write dir 4KB (first cluster of chain)
  4. write FAT1 up to 4KB (chain of clusters)
  5. write FAT2 up to 4KB (chain of clusters)
  6. close() write dir 4KB (I guess it writes the new file size)

Bigger file, or a FAT sector boundary is crossed, single-block write
create(), write(), close(): 4 + 2 x ceiling(filesize/32MB) writes per file

  1. create() write dir 4KB (name of file, default attributes and current times)
  2. write() write file data
  3. write dir 4KB (first cluster of chain)
  4. write FAT1 up to 4KB (chain of clusters)
  5. write FAT2 up to 4KB (chain of clusters)
  6. write FAT1 up to 4KB (continue chain of clusters)
  7. write FAT2 up to 4KB (continue chain of clusters)
  8. close() write dir 4KB (I guess it writes the new file size)
    So for long files the FAT accounts for 2 x ceiling(filesize/32MB) of the writes.
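
To put numbers on the single-block cases above, here is a rough Python sketch of the count I observe. The function names are hypothetical, and it assumes the cluster chain starts at the beginning of a 4KB FAT block, so a boundary crossing never adds an extra FAT write.

```python
def fat_block_coverage(cluster_size, fat_block_bytes=4096):
    """File bytes described by one 4KB FAT write (4-byte FAT32 entries)."""
    return (fat_block_bytes // 4) * cluster_size


def single_block_writes(file_size, cluster_size=32 * 1024):
    """Estimated disk writes for create(), write(), close().

    Assumption: the cluster chain starts at the beginning of a 4KB FAT
    block, so no extra block is touched by a boundary crossing.
    """
    coverage = fat_block_coverage(cluster_size)
    fat_blocks = max(1, -(-file_size // coverage))  # ceiling division
    # 3 dir writes + 1 data write, plus FAT1 and FAT2 per 4KB FAT block.
    return 4 + 2 * fat_blocks


if __name__ == "__main__":
    MB = 1024 * 1024
    for size in (1024, 32 * MB, 40 * MB):
        print(size, single_block_writes(size))
```

With 32KB clusters one 4KB FAT block covers 32MB, so anything up to 32MB gives 6 writes and a 40MB file gives 8, matching the two lists above.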

With file times set
create(), setfiletime(), write(), close(): 7 writes per file

  1. create() write dir 4KB (name of file, default attributes and current times)
  2. setfiletime() write dir 4KB (new file times)
  3. write() write file data
  4. write dir 4KB (first cluster of chain)
  5. write FAT1 up to 4KB (chain of clusters)
  6. write FAT2 up to 4KB (chain of clusters)
  7. close() write dir 4KB (I guess it writes the new file size)

With file length set before the write, to avoid fragmentation
create(), setfiletime(), setlength(), write(), close(): 7 writes per file

  1. create() write dir 4KB (name of file, default attributes and current times)
  2. setfiletime() write dir 4KB (new file times: creation,lastaccess,lastwrite)
  3. setlength() write dir 4KB (first cluster of chain)
  4. write FAT1 up to 4KB (chain of clusters)
  5. write FAT2 up to 4KB (chain of clusters)
  6. write() write file data
  7. close() write dir 4KB (I don't know what it writes, as the file times and file length are already set)

Multiple-block writes: 13 writes for a 2-block write (<32MB and no FAT sector boundary crossed), and more for more blocks.
create(), setfiletime(), setlength(), write1(), write2(), writeX(), close()

  1. create() write dir 4KB (name of file, default attributes and current times)
  2. setfiletime() write dir 4KB (new file times: creation, lastaccess, lastwrite)
  3. setlength() write dir 4KB (first cluster of chain)
  4. write FAT1 up to 4KB (chain of clusters)
  5. write FAT2 up to 4KB (chain of clusters)
  6. write1() write file data block1
  7. write2() write file data block2 (and it triggers a write of the buffered FAT and Dir sectors)
  8-9. write multiple FAT1 and FAT2 depending on file size, which should have been written already in steps 4 and 5
  10. write dir 4KB, which should have been written already in step 3
  11. writeX() write file data blockX (and it triggers a write of the buffered FAT and Dir sectors)
  12. write dir 4KB, which should have been written already in step 3
  13-14. write multiple FAT1 and FAT2 depending on file size, which should have been written already in steps 4 and 5
  15. close() write dir 4KB (I don't know what it writes, as the file times and file length are already set)
  16-17. write multiple FAT1 and FAT2 depending on file size, which should have been written already in steps 4 and 5

With only 2 blocks, steps 11-14 do not occur, which gives the 13 writes.
I guess writing a file in multiple blocks somehow marks the Dir and FAT sectors dirty in the Windows cache,
so it rewrites them again and again even though their contents didn't change.
How can I avoid this behavior?
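
The multi-block pattern I see in DiskMon can be summarized in a small Python sketch. The function name is hypothetical, and the constants are only what I observed for files under 32MB (a single 4KB FAT block), not documented Windows behavior.

```python
def multi_block_writes(blocks):
    """Estimated disk writes for create(), setfiletime(), setlength(),
    `blocks` data writes and close(), as observed with DiskMon.

    Assumptions: the file is < 32MB (one 4KB FAT block) and blocks >= 2;
    the single-block case behaves differently (7 writes).
    """
    if blocks < 2:
        raise ValueError("pattern applies to 2 or more data blocks")
    setup = 5            # create, setfiletime, setlength dir + FAT1 + FAT2
    first_block = 1      # write1(): data only
    per_extra_block = 4  # data + FAT1 + FAT2 + dir written again
    on_close = 3         # dir + FAT1 + FAT2 written again
    return setup + first_block + per_extra_block * (blocks - 1) + on_close


if __name__ == "__main__":
    for n in (2, 3, 8):
        print(n, "blocks:", multi_block_writes(n))
```

Every extra data block costs 4 writes, of which only 1 is actual file data.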

So to reduce the number of writes per file, it seems I need to use:

  1. a cluster size as big as possible, 32KB or 64KB, although small files will then waste a lot of space.
  2. a single data block write per file, or as few blocks as possible, to reduce triggering of the excessive FAT and Dir writes.
    But it is hard to allocate a big contiguous block of memory, so 64MB might be the upper limit.
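
For illustration, point 2 could look like the following minimal Python sketch. It assumes that `truncate()` on an unbuffered file plays the role of setlength() and that the unbuffered `write()` hands the whole buffer to the OS in one call; the function name and the demo path are hypothetical, and I have not verified with DiskMon that this produces the 7-write pattern.

```python
import os
import tempfile


def write_file_single_block(path, data):
    """Write `data` to `path` with one preallocation and one write call.

    Assumptions: truncate() stands in for setlength(), and an unbuffered
    file object passes the whole buffer to the OS in a single call.
    """
    with open(path, "wb", buffering=0) as f:
        f.truncate(len(data))    # reserve the final length up front
        written = f.write(data)  # one write call for the whole file
    if written != len(data):
        raise OSError("short write: %d of %d bytes" % (written, len(data)))
    return written


if __name__ == "__main__":
    # Hypothetical target; point it at the USB drive to watch it in DiskMon.
    target = os.path.join(tempfile.gettempdir(), "single_block_demo.bin")
    print(write_file_single_block(target, b"x" * 1024))
    os.remove(target)
```

The whole file has to be held in memory as one buffer, which is where the practical size limit comes from.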

If I write files from multiple threads, I see half as many writes to the FAT and Dir.
It seems the Windows cache allows the FAT and Dir buffers to be modified twice before it rushes to flush them to disk.
Only two modifications, regardless of the number of parallel threads.
I don't know whether it is possible to increase that flush delay for a USB removable drive.
So the slowness of writing many small files comes from the many duplicate writes to the FAT and Dir.
If files are 1KB in size, one 4KB FAT block is shared by 1024 files and one 4KB Dir block is shared by 128 files (32-byte entries, short 8.3 names).
And if the OS writes the Dir and FAT multiple times per file, you can see how it wastes resources: time and flash life.
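
To show the scale of the waste, here is a back-of-the-envelope Python sketch. The function name is hypothetical. It assumes every file fits in one cluster, uses a single 32-byte short-name directory entry, and that entries and clusters are allocated back to back from the start of a 4KB block. The defaults (3 Dir and 2 FAT writes per file) come from the plain create(), write(), close() case.

```python
def metadata_writes(num_files, dir_writes_per_file=3, fat_writes_per_file=2):
    """Compare observed metadata writes with the distinct 4KB blocks touched.

    Assumptions: one cluster and one 32-byte directory entry per file,
    allocated back to back from the start of a 4KB block.
    """
    observed = num_files * (dir_writes_per_file + fat_writes_per_file)
    dir_blocks = -(-num_files // 128)      # 128 directory entries per 4KB
    fat_blocks = -(-num_files // 1024)     # 1024 FAT entries per 4KB
    minimum = dir_blocks + 2 * fat_blocks  # FAT1 and FAT2 copies
    return observed, minimum


if __name__ == "__main__":
    observed, minimum = metadata_writes(1024)
    print("observed:", observed, "distinct blocks:", minimum)
```

For 1024 files of 1KB, that is 5120 metadata writes against only 10 distinct 4KB blocks that actually change.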

I can force the write cache on for the USB removable drive. It really speeds up the write of the first files, but the total time is even worse.
I don't know the reason yet. Maybe multiple writes also “poison” the cached FAT and Dir sectors, so they get written again and again, mixed in with the data.