Multicore / HT / multi-CPU scheduling

How are threads assigned to cores? Is this a static assignment? Is there a notion of core migration?

Is it a push-based model (the per-core scheduler realizes that it is overloaded, queries other cores, and migrates work), a pull-based model (a single work queue; a core pulls work when it has nothing else to do), or perhaps something far more innovative?

banks

The Windows scheduler was developed at a time when there was no notion of
“core”, only of “CPU”.

It has two main functions - KiFindReadyThread and KiReadyThread.

KiFindReadyThread is called when the current thread on the current CPU
cannot run anymore - due to termination, or to waiting on an event or
similar object.
This function chooses the new thread for this CPU, and the CPU then
switches to it.
KiReadyThread is called when a new ready thread appears in the system -
due to thread creation, or due to wait satisfaction (an event is signaled).
This function decides the fate of the new thread: whether it will be chosen
for execution on the current CPU, on some other CPU, or not chosen for
execution at all and instead put on the “ready lists”.

Also, there is a quantum-end path. It is invoked when a thread has been
running on the CPU for too long. It picks a new thread for execution on this
CPU from the ready lists and does not let the current thread continue
running. It is similar to KiFindReadyThread, but differs a bit.

KiFindReadyThread logic is something like:

  • scan the ready lists, find the threads of the highest priority whose
    affinity does not prohibit this CPU, then choose the “best” of these
    threads (a sketch combining this scan with the “best” rules appears below).

KiReadyThread logic is something like:

  • compare the thread with the current thread of every CPU allowed by the
    thread’s affinity
  • if it is better than the current thread of the current CPU - then switch
    the current CPU to it.
  • if it is better than the current thread of another CPU - then set it as
    “standby” for that CPU and send an IPI to it to switch threads.
  • otherwise, if the thread is worse than every current thread - put it on
    the ready list for its priority (the ready lists are per-priority).
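
In toy C, and ignoring locking entirely, that decision might be sketched like
this (every name here - ready_thread, send_reschedule_ipi, the cpu and thread
fields - is invented for illustration; this is not the actual NT code):

```c
/* Toy model of the decision only - all names are invented, not NT internals.
 * Locking is ignored, and every CPU is assumed to always have a current
 * thread (think of the idle thread sitting at the lowest priority). */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NCPU   4
#define NPRIO  32

struct thread {
    int             priority;   /* 0..31, higher runs first          */
    uint64_t        affinity;   /* bit n set => may run on CPU n     */
    struct thread  *next;       /* ready-list link                   */
};

struct cpu {
    int             number;
    struct thread  *current;    /* thread currently running          */
    struct thread  *standby;    /* thread parked for an IPI switch   */
};

static struct cpu     cpus[NCPU];
static struct thread *ready_list[NPRIO];      /* per-priority ready lists */

static void send_reschedule_ipi(struct cpu *c)    /* stand-in for a real IPI */
{
    printf("IPI -> CPU %d\n", c->number);
}

/* Roughly the KiReadyThread decision described above. */
void ready_thread(struct thread *t, struct cpu *this_cpu)
{
    struct cpu *victim = NULL;

    for (int i = 0; i < NCPU; i++) {
        struct cpu *c = &cpus[i];
        if (!(t->affinity & (1ULL << c->number)))
            continue;                   /* affinity forbids this CPU        */
        if (t->priority > c->current->priority &&
            (victim == NULL ||
             c->current->priority < victim->current->priority))
            victim = c;                 /* CPU running the weakest thread   */
    }

    if (victim == this_cpu) {
        this_cpu->current = t;          /* preempt the current CPU directly */
    } else if (victim != NULL) {
        victim->standby = t;            /* park it as "standby" ...         */
        send_reschedule_ipi(victim);    /* ... and poke that CPU            */
    } else {
        t->next = ready_list[t->priority];   /* worse than everything running: */
        ready_list[t->priority] = t;         /* queue on its priority's list   */
    }
}
```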

The notion of “best” is:

  • if one thread has an ideal processor set and it is == the CPU in question,
    and another has no ideal processor set or it is != the CPU in question -
    then the first thread is better
  • if one thread was last executed on the CPU in question and another was
    not - then the first thread is better.

The second rule is to avoid cache thrashing.
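
Putting the KiFindReadyThread scan together with these “best” rules, a toy
sketch (invented names such as find_ready_thread, is_better, ideal_cpu and
last_cpu; no locking; just the shape of the logic described above) could look
like:

```c
/* Toy sketch only - field and function names are invented for illustration. */
#include <stddef.h>
#include <stdint.h>

#define NPRIO 32

struct thread {
    int             priority;
    uint64_t        affinity;    /* bit n set => may run on CPU n           */
    int             ideal_cpu;   /* preferred CPU, or -1 if none set        */
    int             last_cpu;    /* CPU this thread last executed on        */
    struct thread  *next;
};

static struct thread *ready_list[NPRIO];      /* per-priority ready lists   */

/* The "best" rules: prefer a thread whose ideal processor is this CPU,
 * then one that last ran here (to keep its cache warm). */
static int is_better(const struct thread *a, const struct thread *b, int cpu)
{
    if ((a->ideal_cpu == cpu) != (b->ideal_cpu == cpu))
        return a->ideal_cpu == cpu;
    if ((a->last_cpu == cpu) != (b->last_cpu == cpu))
        return a->last_cpu == cpu;
    return 0;                     /* otherwise keep the one found first     */
}

/* Roughly the KiFindReadyThread scan described above. */
struct thread *find_ready_thread(int cpu)
{
    for (int prio = NPRIO - 1; prio >= 0; prio--) {  /* highest priority first */
        struct thread *best = NULL, **best_link = NULL;

        for (struct thread **link = &ready_list[prio]; *link != NULL;
             link = &(*link)->next) {
            struct thread *t = *link;
            if (!(t->affinity & (1ULL << cpu)))
                continue;                    /* affinity forbids this CPU     */
            if (best == NULL || is_better(t, best, cpu)) {
                best = t;
                best_link = link;
            }
        }
        if (best != NULL) {
            *best_link = best->next;         /* unlink from the ready list    */
            best->next = NULL;
            return best;
        }
    }
    return NULL;                             /* nothing runnable: go idle     */
}
```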

All of this is NT4-timeframe logic described IIRC by Solomon/Russinovich.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

Banj_KS,

Maxim’s explanation is good. There are some more complications here since NT4 (actually since 2k3). The ready lists are kept per-core and tasks are migrated from one to the other when a core goes idle while there are tasks waiting to run. This migration is accomplished through a new state called ‘Deferred Ready’ in which a thread is on a per-processor queue and that processor chooses where to place it based on some of the criteria Maxim specified (along with things such as hyperthreading set and NUMA nodes). Idle cores also try to pull tasks as they’re transitioning to and from idle.
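
Very roughly, and leaving out the locking, NUMA and hyperthreading details
entirely, the per-core lists plus idle pulling look something like this toy
sketch (all names below are made up for illustration, not the real dispatcher
code):

```c
/* Conceptual sketch only - invented names, no locking, no NUMA/HT awareness. */
#include <stddef.h>

#define NCPU 4

struct thread {
    struct thread *next;
    int            priority;
};

struct cpu {
    struct thread *ready;   /* per-core ready list (simplified to one list)  */
};

static struct cpu cpus[NCPU];

/* Called for each thread taken off a processor's (not shown) deferred-ready
 * queue: prefer a core that looks idle, otherwise keep it local. */
void place_deferred(int me, struct thread *t)
{
    for (int i = 0; i < NCPU; i++) {
        if (i != me && cpus[i].ready == NULL) {   /* crude "is it idle?" test */
            t->next = cpus[i].ready;
            cpus[i].ready = t;
            return;
        }
    }
    t->next = cpus[me].ready;
    cpus[me].ready = t;
}

/* A core going idle tries to pull work from some other core's ready list. */
struct thread *idle_pull(int me)
{
    for (int i = 0; i < NCPU; i++) {
        if (i == me || cpus[i].ready == NULL)
            continue;
        struct thread *t = cpus[i].ready;    /* steal the head of that list  */
        cpus[i].ready = t->next;
        t->next = NULL;
        return t;
    }
    return NULL;                             /* nothing to steal: stay idle  */
}
```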

The scheduler is protected by a few locks and it has become a lot more complicated (and scalable) than NT4 - WinXP. A global lock does protect some state transitions (involving most wait objects), but this is subject to change. The scheduler has a designed-in lock ordering, so trying to lock the scheduler from the outside (such as Anton’s DPC trick) will occasionally deadlock the system if you (or your customers) happen to get unlucky.

To answer your question about one core locking the other: yes… If you lock up one core, you could halt the thread scheduling of another core. This isn’t unique to the scheduler though; the same thing can happen if you halt a core while holding any non-scheduler spinlock above dispatch-level while someone else is trying to acquire it.
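
As a contrived illustration of that non-scheduler case (assume
KeInitializeSpinLock(&g_Lock) was called in DriverEntry and the two DPCs are
targeted at different processors; you would obviously never write this on
purpose): one DPC wedges a core while holding an ordinary spin lock, and any
other core that then tries to take the same lock at DISPATCH_LEVEL is stuck
as well.

```c
/* Deliberately broken WDM-style sketch of the hazard - never do this. */
#include <ntddk.h>

static KSPIN_LOCK g_Lock;

/* Runs at DISPATCH_LEVEL on, say, core 0: takes the lock and never lets go. */
VOID HogDpc(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
{
    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(Context);
    UNREFERENCED_PARAMETER(Arg1);
    UNREFERENCED_PARAMETER(Arg2);

    KeAcquireSpinLockAtDpcLevel(&g_Lock);
    for (;;) {
        /* core 0 is now effectively halted while owning the lock */
    }
}

/* Runs on another core: spins at DISPATCH_LEVEL waiting for the same lock,
 * so that core is wedged too - one stuck core has taken a second with it. */
VOID VictimDpc(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
{
    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(Context);
    UNREFERENCED_PARAMETER(Arg1);
    UNREFERENCED_PARAMETER(Arg2);

    KeAcquireSpinLockAtDpcLevel(&g_Lock);    /* never returns    */
    KeReleaseSpinLockFromDpcLevel(&g_Lock);  /* (never reached)  */
}
```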

Neeraj Singh
Kernel Services Test


Thanks for that explanation Maxim/Neeraj.

> This migration is accomplished through a new state called ‘Deferred
> Ready’ in which a thread is on a per-processor queue and that processor
> chooses where to place it based on some of the criteria Maxim specified
> (along with >> things such as hyperthreading set and NUMA nodes). Idle
> cores also try to pull tasks as they’re transitioning to and from idle.

So without getting too much into the implementation details, it does appear
that when one core has 40 ready threads and another core is idle, the idle
core will pull from those 40 ready threads. That is nice.

Anyway, irrespective of who does the migration and when, if a thread is
being migrated from core 0 to core 3 (in a quad core), it does appear that
core 2’s KiFindReadyThread (from Maxim’s description) *might* block on a
lock even though the ready queues are per-core, and that this lock exists
because of internal kernel data structures. Is that a correct summary?

How many separate scheduler code variants are there? Is there any reason
to believe that, between XP / Vista / Server 2003 / Server 2008, one will
provide better core utilization than the others?

I asked similar questions of some of the Linux kernel developers and was
pointed to 2.6.x.x code, which seemed to indicate that scheduling on one
core is lockless and will not impact another core’s scheduling. I intend
to read through their load-balancer code to see how far that is true. I
also need to check with the Solaris folks. I am trying to determine which
OS has the best load distribution / load balancer and the ability to scale
as the number of cores increases. Much appreciate your responses.

Regards
Banks


>… asked similar questions to some of the Linux kernel developers and was pointed to 2.6.x.x

What is "2.6.x.x:?? Please note that, starting from the kernel version 2.6.23.x, Linux abandoned O(1) scheduling algorithm that it relied upon in earlier versions , so that description of scheduling that you can find in “Understanding Linux 2.6 kernel” is already outdated. Starting from the kernel version 2.6.23.x, Linux relies upon CFS (Completely Fair Scheduler) algorithm -AFAIK, in addition to replacing run-queues with “red-black trees” and eliminating heuristics in determining interactivity of a process, it took steps to avoid a single dispatcher spinlock…

Anton Bassov