Optimized Memory Dumps

Kern_Mode · November 19, 2017, 9:48am

Hello OSR community,

I am trying to implement an optimized memory dump mechanism.
The basic idea is that I want to dump READ/WRITE memory from a user-space process (e.g. firefox.exe) on specific events. The events that trigger the dump
are specific packets within the WSPSend/WSPRecv calls from mswsock.dll.

I have implemented a DLL which intercepts the send and recv calls inside the target process and performs the dumping when the dumping condition is triggered.

I use VirtualQueryEx to filter the memory regions. As mentioned before, I am interested in
READ/WRITE memory. I also skip the thread stacks because I don’t need them, even though they are READ/WRITE. The problem with this approach is that every dump is about 90 MB (firefox.exe).
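
The filtering step described above can be modelled roughly as follows. This is a platform-neutral Python sketch, not real Win32 code: the `Region` records stand in for the `MEMORY_BASIC_INFORMATION` entries that `VirtualQueryEx` returns, `Region` and `dumpable_regions` are hypothetical names, and recognising a stack as "the region that contains some thread's stack pointer" is an assumption about how the stacks are identified.

```python
from dataclasses import dataclass

# Values of the Win32 constants used in MEMORY_BASIC_INFORMATION.
MEM_COMMIT = 0x1000
PAGE_READWRITE = 0x04
PAGE_GUARD = 0x100


@dataclass
class Region:
    """Hypothetical stand-in for one MEMORY_BASIC_INFORMATION record."""
    base: int
    size: int
    state: int
    protect: int


def dumpable_regions(regions, stack_pointers):
    """Keep committed READ/WRITE regions that do not hold a thread stack."""
    keep = []
    for r in regions:
        if r.state != MEM_COMMIT:
            continue  # reserved or free address space has no contents
        if r.protect & PAGE_GUARD or (r.protect & 0xFF) != PAGE_READWRITE:
            continue  # not plain READ/WRITE memory
        # Assumption: a region is a stack if a thread's stack pointer lies in it.
        if any(r.base <= sp < r.base + r.size for sp in stack_pointers):
            continue
        keep.append(r)
    return keep
```

With a filter like this, the dump size is simply the sum of `size` over the returned regions, which is where the roughly 90 MB figure would come from.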

To optimize the dump, the idea would be to track the memory which was written since the last dump. This should lead to smaller dumps on subsequent triggers.
The idea now would be to implement a kernel driver and utilize the additional information from kernel mode.
The previously mentioned DLL would then send a dump request to the driver (on the same trigger events), and the driver would perform the dumping.

What I have done so far is traverse the process VAD tree and retrieve the PTE for each page, looping from the region start to the region end in steps of 0x1000. I encountered a few problems. First, I have to suspend all threads of the user process (firefox.exe) except the running one; otherwise the VAD traversal results in a BSOD.
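
The page walk described above can be modelled like this. It is an illustrative Python sketch, not driver code: `PTE_BASE` is assumed to be the fixed PTE self-map base used by 64-bit Windows 7 (newer Windows versions randomise it), the flag positions are the x86-64 hardware PTE bits, and the function names are hypothetical.

```python
PAGE_SIZE = 0x1000

# Assumption: fixed PTE self-map base of 64-bit Windows 7 (randomised on newer builds).
PTE_BASE = 0xFFFFF68000000000

# x86-64 hardware PTE bits (only meaningful while the valid bit is set).
PTE_VALID = 1 << 0
PTE_WRITE = 1 << 1
PTE_DIRTY = 1 << 6


def iter_pages(start, end):
    """Yield each page-aligned address in [start, end), stepping by 0x1000."""
    addr = start & ~(PAGE_SIZE - 1)
    while addr < end:
        yield addr
        addr += PAGE_SIZE


def pte_address(va):
    """Virtual address of the PTE that maps va (8 bytes per 4 KiB page)."""
    return PTE_BASE + ((va >> 9) & 0x7FFFFFFFF8)


def is_dirty(pte):
    """True if the PTE is valid and the hardware dirty bit is set."""
    return bool(pte & PTE_VALID) and bool(pte & PTE_DIRTY)
```

Under that assumed base, `pte_address(0x7FF00000)` evaluates to `0xFFFFF680003FF800`. A PTE whose valid bit is clear uses a software-defined layout, which is why `is_dirty` refuses to interpret it.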

How can I implement proper synchronization for the VAD traversal, so that the structure does not change during the traversal? I have tried to use eprocess->AddressCreationLock with the LOCK_ADDRESS_SPACE macro, which passes the lock to KeAcquireGuardedMutex, but the types do not match and the result is a BSOD. The operating system I am using for testing is Windows 7 64-bit SP1.
LOCK_ADDRESS_SPACE calls KeAcquireGuardedMutex, which requires a PKGUARDED_MUTEX parameter, but AddressCreationLock is an EX_PUSH_LOCK on Windows 7 (dumped with WinDbg).

I would like to get the pages modified since the last dump and only dump the changed ones, such that the dump size is minimal.
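
One way to approximate "dump only what changed" without relying on page-table state is to compare per-page digests between two reads of the same region. This is a swapped-in technique, not the PTE-based approach described in this post, and the helper names are hypothetical. A minimal Python sketch:

```python
import hashlib

PAGE_SIZE = 0x1000


def page_digests(snapshot):
    """Return one digest per 4 KiB page of a bytes-like snapshot."""
    return [
        hashlib.sha1(snapshot[off:off + PAGE_SIZE]).digest()
        for off in range(0, len(snapshot), PAGE_SIZE)
    ]


def changed_pages(previous, current):
    """Indices of pages in `current` whose digest differs from `previous`."""
    changed = []
    for index, digest in enumerate(current):
        # Pages beyond the end of the previous snapshot count as changed.
        if index >= len(previous) or previous[index] != digest:
            changed.append(index)
    return changed
```

Only the pages returned by `changed_pages` would need to be written out. The trade-off is that the memory still has to be read in full on every trigger, so this shrinks the dump but not the cost of reading it.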

Does anybody know if this can be done in another way and if my approach may have any wrong assumptions?

Thanks in advance!

Tim_Roberts · November 19, 2017, 1:48pm

On Nov 19, 2017, at 6:47 AM, xxxxx@gmail.com wrote:
>
> I use VirtualQueryEx to filter the memory regions. As mentioned before I am interested in
> READ/WRITE memory. I also skip the thread stacks because I don’t need them even if they are READ/WRITE. The problem with this approach is that every dump is about 90 mb (firefox.exe).

90 MB is not very big. What do you expect to DO with these memory dumps? If you are going through them by hand, then the task is hopeless. You couldn’t go through 90 MB even once. If you are going to do some automated processing to search for stuff, then 90 MB is a trivial amount.

> To optimize the dump, the idea would be to track the memory which was written since the last dump. This should lead to smaller dumps on subsequent triggers.
> The idea now would be to implement a kernel driver and utilize the additional information from kernel mode.

What information do you think you can get from kernel mode that isn’t available in user mode?

Have you investigated the debugger APIs? You can exercise relatively complete control over another process using them.
--
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

anton_bassov · November 19, 2017, 2:01pm

> How can I implement a proper synchronization solution for the VAD traversal, such that the structure is not changed during the traversal.

You can’t, it’s as simple as that. You simply cannot synchronise access to resources that you don’t own. Full stop. The only thing left to do is to corral the CPUs if you run in KM, or, from userland, to suspend all the threads of the target process (or change their contexts and make them block on a synch resource that you DO own). These tricks hardly qualify as a “proper” solution: for example, consider what happens if one of the target threads already owns the resource that you are trying to grab behind the scenes.

Some approaches may look like a proper solution (and in fact be one in the vast majority of cases, so that the subtle “Heisenbug” they present may go undetected for quite a while). For example, in your particular scenario, suspending all threads of the target process from userland may work just fine, at least at first glance. The actual suspension may take place only when the target thread is about to return to userland, and at that point it should no longer own any kernel resources, for understandable reasons. Seems to be a perfect solution, doesn’t it?

Unfortunately not: it does not resolve the potential conflict between your KM code and any other thread in the system that makes an MM-related call touching your target structures.
Once all the threads in the target process are suspended, they cannot make MM-related calls, so you seem to be safe. However, consider what happens if some other process calls, say,
VirtualQueryEx() or ReadProcessMemory() on your target process. As you can see, the possibility of a conflict is still there.

Therefore, no matter what you do, you just cannot develop a solution that is absolutely safe.

In any case, I don’t see any reason why doing things from a driver may be beneficial in your case.

Anton Bassov

Slava_Imameev · November 21, 2017, 6:35am

Just out of curiosity: how are you going to prevent the Memory Manager (MM) and the Working Set Manager (a part of the MM) from toggling the modified flag in a PTE between two snapshots?

It looks like your design is based on the wrong assumption that you can use the PTE modified flag as an indicator.

Kern_Mode · November 30, 2017, 2:07pm

Thank you for the reply,

The 90 MB dumps are generated per network connection (session),
and Firefox generates many sessions, which results in
multiple dumps of around 90 MB each. The dumped data is later processed automatically
to extract specific data. Minimizing the dump should reduce the number of
false positives in the extracted data.

The reason for the kernel-mode driver is that I wanted to be able to determine
which memory pages were written (dirty bit set) after the trigger mechanism fired.
This way I would only dump the dirty pages, which would result in a smaller dump.

I’m not sure whether this could also be implemented from user mode.
I would guess that it should be possible by using vectored exception handling and changing the protection of the R/W pages
to “no access”.
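
The protection-based idea in the last paragraph can be modelled without any Win32 calls. The following is a toy Python simulation, not an implementation: in a real process the `arm` step would correspond to changing page protections (for example with `VirtualProtect`) and `_on_fault` to a handler registered with `AddVectoredExceptionHandler`. The class and method names are hypothetical.

```python
PAGE_SIZE = 0x1000


class WriteTracker:
    """Toy model of write tracking through page protection faults."""

    def __init__(self, size):
        self.memory = bytearray(size)
        self.protected = set()   # pages currently write-protected
        self.dirty = set()       # pages written since the last arm()

    def arm(self):
        """Protect every page and forget earlier writes (done after a dump)."""
        pages = (len(self.memory) + PAGE_SIZE - 1) // PAGE_SIZE
        self.protected = set(range(pages))
        self.dirty = set()

    def _on_fault(self, page):
        """Stand-in for the exception handler: record the page, lift protection."""
        self.dirty.add(page)
        self.protected.discard(page)

    def write(self, addr, data):
        """Write bytes; the first write to a protected page 'faults' once."""
        if addr < 0 or addr + len(data) > len(self.memory):
            raise IndexError("write outside tracked memory")
        first = addr // PAGE_SIZE
        last = (addr + len(data) - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            if page in self.protected:
                self._on_fault(page)
        self.memory[addr:addr + len(data)] = data
```

Each page faults at most once between two dumps, so the overhead in this model is one fault per touched page rather than one per write. After a dump, `arm()` would be called again and only the pages in `dirty` would have been written out.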