I pity you for being assigned to teach a C++ course; I avoid it like the
plague because in my experience the things those short-cuts hide are exactly
the things that cause the most problems. The last time I let one of my guys
convince me that it would be ok if he wrote it in C++ I spent more time
reviewing and fixing it than he did writing it.
Notwithstanding that rant, I wholeheartedly agree that a sound high-level
design is essential to a good implementation; my point was rather that
whereas you say sync should be avoided as a principle, I support the maximum
orthogonality of multi-threaded tasks as a precept. Perhaps the difference
is semantic, but I have found that if I start talking about too many sync
points, then they simply stop using sync objects (even when needed) rather
than understanding the point that the design is wrong. When put in terms of set
theory it seems easier to understand that thread A assigning a task to
thread B and then waiting for completion is a useless design for increased
parallelism in an application.
As I am sure you will recall, when the application becomes too generic,
deadlock managers become essential because sometimes there is no way to
control lock acquisition order (à la SQL servers, etc.). And while this
extremely high-level pattern doesn’t suffer from deadlocks per se, it can’t
be said that the performance on a specific dataset will be better than a
dedicated algorithm with either an exclusive lock, shared-reader lock, or
refcount (aka rundown) access pattern. The usual trade-offs between perf
and flexibility apply - but I digress …
“Joseph M. Newcomer” wrote in message news:xxxxx@ntdev…
The kind of code that 30 years ago would have involved several mutexes and
lots of synchronization. It took me years to learn that synchronization
should be avoided by adopting higher-level patterns, and since I started
doing this, I don’t have deadlocks, performance problems, lock contention,
or any of the other problems that plague people who believe what they were
taught about synchronization.
Low-level locks on shared queues are, of course, essential. The key idea is
not that synchronization should be avoided, but that *reasoning* about
synchronization should be avoided, because this is where all the errors are
made. Any “convergence” involves asynchronous notifications; for example,
never block on a set of threads waiting for them to finish; instead, keep
track of the threads you have launched (as a “reference count”) and as each
thread terminates, it sends a notification that it has terminated, at which
point you decrement the reference count; when the reference count reaches
zero, you know all the threads have finished. These kinds of patterns tend
to also be robust-under-maintenance, always keeping in mind that maintenance
is done by unskilled programmers (the new hire, or yourself six months
later).
In OO, as in many contexts, contention on the storage allocator is a
problem. This is rarely addressed by the OO languages, which tend not to
allow multiple allocators; this also leads to architectural problems
when multiple storage pools are used in multiple threads. I teach courses on
this problem. It is interesting that C++0x does simplify some of these
issues, and I am currently revising my course to take C++0x (VS2010) into
account.
joe
-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of m
Sent: Monday, March 14, 2011 7:27 PM
To: Windows System Software Devs Interest List
Subject: Re:[ntdev] Overlapped and APC
I’m not sure what kind of programs you have been writing these 15 years, but
while I agree that unnecessary synchronization should be avoided at all
costs, IMHO any kind of systems programming, in KM or UM, requires thread
sync. Good designs will allow threads to ‘run free’ and operate on as
orthogonal a datum as possible with as thread local and NUMA local a dataset
as possible, but eventually, whether through a routine in the application, a
system service, or simply a memory allocation, a shared resource will need to
be accessed and a sync point generated. This is especially true in OO
languages where object creation and destruction are frequent events and
access to the heap is often a limiting factor on an otherwise independent
set of computations.
“Joseph M. Newcomer” wrote in message news:xxxxx@ntdev…
-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Jake Oshins
Sent: Saturday, March 12, 2011 6:01 PM
To: Windows System Software Devs Interest List
Subject: Re:[ntdev] Overlapped and APC
Overlapped structs pretty much require that you have a thread for every
pending I/O, or at the very least, you handle completing I/O on the same
thread that initiated it. Remember that when this stuff was designed, very
few operating systems were multi-threaded. I think that the original
designers thought that threads would end up being so fundamentally light
that creating lots and lots of them would be useful and common. Much of the
Win32 user-mode programming model has this assumption.
*****
Actually, this is not true. You do not need one thread per pending I/O, and
there is no requirement to handle completion in the initiating thread.
When this was designed, Windows was already multithreaded, and the threads
were known to not be “light”, nor was it expected that such usage would be
common. I have never seen any assumption that this was so, or assumed so at
user level. We knew it wasn’t true.
Note that async callback I/O was not created until end users demanded it,
and to make it viable, the concept of “alertable wait state” had to be used,
because the async callbacks of VAX VMS were a total flaming disaster! I
believe the phrase used by Dave Cutler was “Asynchronous callback I/O will
go into Windows over my dead body!” and I point out that (a) Windows has
async callback I/O and (b) Dave Cutler is alive. What happened to satisfy
him that async callback I/O was viable? Alertable wait state! (I tried to
use async callback I/O under VMS, and there was no possible way to make it
work right, because it Just Happened, which could result in recursive calls
to the allocator when the allocator had been preempted by a previous async
call.)
It is quite common to have a small number of threads (perhaps one)
initiating I/O, and a small number of threads (usually equal to the number
of cores) to handle completion. Using callback I/O is one of the worst I/O
models in existence, because it is (a) hard to use, (b) requires handling
the callback in the initiating thread, thus reducing throughput, and (c)
requires polling for completion (otherwise known as “entering alertable wait
state”, which must be done frequently enough to maximize throughput but not
so often as to overload the scheduler with unnecessary AWS activations). It
should be avoided. The best model is the I/O Completion Port model, which
allows multiple threads to respond to I/O completion.
******
The last twenty years have taught us that you really don’t want hundreds of
threads, each dedicated to some task. The interaction between them is
overwhelmingly complex and error prone, not to mention inefficient. What we
now know is that you really don’t want very many more threads than there are
processors (virtual, physical or logical) to run your code. It makes more
sense to treat each completing I/O as something to be handled by a pool of
threads.
*****
Threads have complex interactions only if programs are badly written.
Synchronization, which all introductory courses teach, should be avoided
like the plague; synchronization represents where threads “rub together”,
and like any physical system, all this does is cause friction which
generates heat and wastes energy. Only in the kernel is synchronization
critical, because we work so close to the hardware, and often outside the
fundamental OS concepts like “scheduler”. In application space, I consider
that the instant you add a mutex or CRITICAL_SECTION to code, your design has
failed. There are better models, such as the agent pattern, that handle
multithreading much more cleanly than explicit programmer synchronization.
The problem with the way we teach it is that we spend so much time teaching
it, with lots of exercises, and lots of emphasis on its importance, that the
students end up thinking it is the *only* way to handle multithreading, and
it is not only *not* the only way, it is usually the *worst possible* way.
I typically use an IOCP without handle binding to do interthread message
queueing, or just use the standard PostMessage queue, and thus never have
issues about synchronization (synchronization *is* needed, of course, but I,
as a programmer, should never have to THINK about it as an issue). I have
not used a mutex or CRITICAL_SECTION in multithreaded applications in at
least 15 years, once I stopped using MS-DOS (otherwise known as Windows 9x,
which didn’t have IOCPs).
One of the problems with “interthread communication” is “how do I
communicate what should happen when this I/O operation completes”. The
simplest way is to embed the OVERLAPPED structure in a structure of my own
that carries all the necessary context, such as pointers to the buffers,
information about what to do, etc. When I get the pointer to the OVERLAPPED
structure in my IOCP handler, I just cast it to the larger structure and I
have everything I could possibly need (and if I don’t, I built my structure
wrong). No global variables are needed, for example. Generally, one of the
worst disasters is the old habit of using global variables to hold state,
which means that all the threads have to interact in complex ways. If you
simply ignore the concept of global variables, life gets a lot simpler.
typedef struct {
    OVERLAPPED ovl;        /* must be the first member */
    /* …my context data here */
} MY_OVERLAPPED;
*****
This is the idea behind the completion port. (And it’s the idea behind most
of the kernel-mode constructs in NT from the beginning.)
Jake Oshins
Hyper-V I/O Architect
Windows Kernel Group
This post implies no warranties and confers no rights.
wrote in message news:xxxxx@ntdev…
I know that completion ports fixed all the problems that are related to
APCs. But for APCs (I mean completion routines), what problems caused by
overlapped I/O were fixed by the invention of completion ports?
In overlapped I/O, the main thread calls GetOverlappedResult;
GetOverlappedResult calls WaitForXxx and starts to wait.
With completion ports, the main thread also calls (in fact must call)
WaitForXxxEx and starts to wait.
The completion routine has a problem with load balancing, so they developed
completion ports. What was the problem with overlapped I/O, such that the
completion routine was invented?
Thanks
KeWaitForSingleBill(…)
NTDEV is sponsored by OSR
For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars
To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer
–
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.