From: MaximE
Date: 16.01.05 10:00
Rating: 19 (3)
http://groups-beta.google.com/group/comp.std.c/msg/ae49dc7a96c625f5
There are (depending on how one breaks these things down) at least three
independent issues here:
1. Atomic access granularity
2. Read/write ordering
3. Memory coherency
This discussion has been trying to address the three as if they were the
same. They're not even connected. (Or, at best, only very loosely connected.)
Most modern machines (yes, I should probably use quotes again) are designed
for fast and efficient multiprocessing. The memory interconnect is the big
bottleneck, and a lot of work has gone into streamlining the memory access
and cache protocols. Memory accesses are usually made in "chunks" not
necessarily related to the size of the data type. In general, a machine has a
few "memory access types", very likely a smaller set than the set of operand
types in the instruction set. Usually, only "memory access types" are atomic.
On an Alpha EV4 and EV5, for example, the memory access types are 32-bit
"longwords" and 64-bit "quadwords". Smaller accesses require loading and
masking a longword or quadword. While some machines might choose to hide the
distinction between "memory types" and "instruction types", Alpha doesn't. To
read a byte, you must load the enclosing longword/quadword (I'll use "word"
for convenience from now on), and use special instructions to mask/shift it
into the desired position.
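
To make that concrete, here is a minimal C sketch (mine, not from the post) of
the read-modify-write sequence such a machine must perform for a byte store; a
little-endian byte layout is assumed, and the names are illustrative.

    #include <stdint.h>

    /* Storing one byte on a machine whose only memory access types are
       32-bit loads and stores: the enclosing 32-bit cell must be loaded,
       masked, updated, and stored back. */
    void store_byte(uint32_t *cell, unsigned byte_index, uint8_t value)
    {
        uint32_t word  = *cell;                 /* load the enclosing cell */
        unsigned shift = byte_index * 8;        /* byte position (little-endian) */
        word &= ~((uint32_t)0xFF << shift);     /* clear the old byte */
        word |= (uint32_t)value << shift;       /* insert the new byte */
        *cell = word;                           /* store the whole cell back */
    }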
That's all fine, except when you get into concurrently executing entities
(threads, or processes using shared memory), and one tries to STORE into the
non-atomic field. To store such a field, you fetch the current word, shift &
mask (field insert) your new data, and then write the word back. If another
entity is simultaneously setting a different field in the same word, only ONE
of the two will supply a new value for the entire word, including the other
thread's field. Worse, a non-atomic field (e.g., a 16-bit short rather than a
byte) might be split between two words. If it is, you can get into trouble
even reading, because you have to read both words and combine them to create
the short value you want. Some other thread might have changed one of the
words between your two reads. That's "word tearing", and it means you've
gotten the wrong data. Of course word tearing can happen on writes, too, in
addition to the normal field access problems. Messy!
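
A hedged sketch of that lost-update scenario (the struct layout and thread
bodies are illustrative; on hardware with atomic byte stores the race below
does not occur):

    #include <pthread.h>
    #include <stdio.h>

    /* Two one-byte fields that share a single 32-bit memory cell. */
    struct shared { char a; char b; } s = { 0, 0 };

    /* On a machine without atomic byte stores, each assignment below is
       really load-word / insert-byte / store-word, so one thread's store
       can silently overwrite the other thread's field. */
    void *writer_a(void *arg) { (void)arg; s.a = 1; return NULL; }
    void *writer_b(void *arg) { (void)arg; s.b = 2; return NULL; }

    int main(void)
    {
        pthread_t ta, tb;
        pthread_create(&ta, NULL, writer_a, NULL);
        pthread_create(&tb, NULL, writer_b, NULL);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        /* Possible outcome on such hardware: a=1 b=0 -- b's update lost,
           although each thread touched only "its own" field. */
        printf("a=%d b=%d\n", s.a, s.b);
        return 0;
    }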
Attempting to share information between concurrently executing instruction
streams (that may be on separate processors) also requires read/write
ordering. That is, if you set a flag to signal some change in state (e.g.,
adding an item to a queue), you must be able to know that seeing the change
in the flag means you can see the new queue item. Modern SMP memory systems
frequently allow reordering of operations in the CPU to memory controller
pipeline, for all sorts of reasons (including cache synchronization issues,
speculative execution, etc.) So you may queue an item and then set a flag (or
fill in the fields of a structure and then queue it), but have the data
become visible to another processor in a different order. Unless you're
communicating between concurrently executing entities, reordering doesn't
affect you -- so it's a great performance tradeoff. But it means that when
you require ordering, you need to do something extra. One common way to force
ordering is a "memory barrier" instruction. On Alpha, for example, MB
prevents memory operation reordering across the instruction. (One could
consider a stream of memory requests in a pipe between the CPU and memory,
which can be arbitrarily reordered for implementation convenience; but the
reordering agent can't move anything past an "MB token".)
And then we've got memory coherency. A "write-through" cache may invalidate
other caches, and update main memory. But a "write-back" cache may not write
into main memory for some time. Even if other caches are invalidated, the
processors won't see the new value until it's written. That's OK, though, as
long as both processors make proper use of memory barriers. The writer puts a
memory barrier between writing the data and writing the flag (or pointer),
and the reader puts a memory barrier between reading the flag/pointer and
reading the data. Now, whenever the flag/pointer appears in memory, you know
that the data to which it points is valid -- because you can't have read it
before the flag, and the writer can't have written it after the flag.
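
The post predates standardized atomics, but the barrier placement described
above can be sketched in C11 <stdatomic.h> terms (the identifiers are mine):

    #include <stdatomic.h>

    int data;                /* the payload being published */
    atomic_int flag = 0;     /* signals "data is ready" */

    void writer(void)
    {
        data = 42;                                     /* write the data...    */
        atomic_thread_fence(memory_order_release);     /* ...memory barrier... */
        atomic_store_explicit(&flag, 1,
                              memory_order_relaxed);   /* ...then set the flag */
    }

    void reader(void)
    {
        if (atomic_load_explicit(&flag,
                                 memory_order_relaxed)) {  /* read the flag...     */
            atomic_thread_fence(memory_order_acquire);     /* ...memory barrier... */
            int d = data;         /* ...then the data: guaranteed to be 42 */
            (void)d;
        }
    }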
For more information without getting too deep into processor implementation
details, see the section "Memory visibility between threads" in my book
(Programming with POSIX Threads, web link in my .sig). Curt Schimmel's
UNIX Systems for Modern Architectures (Addison-Wesley) has a section called
"Other Memory Models" that describes the SPARC architecture's "Partial Store
Ordering" (loose read/write ordering with memory barriers), though it doesn't
address word tearing.
...
The whole idea of RISC is *exactly* to make software more complex. That is, by
simplifying the hardware, hardware designers can produce more stable designs,
bring them to market more quickly with more advanced technology, and end up with faster
hardware. The cost of this is more complicated software. Most of the complexity is
hidden by the compiler -- but you can't necessarily hide everything. Remember that
POSIX took advantage of some loopholes in the ANSI C specification around external
calls to proclaim that you can do threaded programming in C without requiring
expensive and awkward hacks like "volatile". Still, the interpretation of ANSI C
semantics is stretched to the limit. The situation would be far better if a future
version of ANSI C (and C++) *did* explicitly recognize the requirements of threaded
programming.
...
> However, I really believe that dataA and dataB should both be declared as
> "volatile" to prevent the compiler from being too aggressive on it's
> optimization. The mutex still doesn't guarantee that the compiler hasn't
> cached the data in an internal register across a function call. My memory
> isn't perfect, but I do think this bit me on IRIX.
The existence of the mutex doesn't require this, but the semantics of POSIX and
ANSI C do require it. Remember that you lock a mutex by calling a function, passing
an address. While an extraordinarily aggressive C compiler with a global analyzer
might be able to determine reliably that there's no way that call could access the
data you're trying to protect, such a compiler is unlikely -- and, if it existed,
it would simply violate POSIX 1003.1-1996, failing to support threads.
You do NOT need volatile for threaded programming. You do need it when you share
data between "main code" and signal handlers, or when sharing hardware registers
with a device. In certain restricted situations, it MIGHT help when sharing
unsynchronized data between threads (but don't count on it -- the semantics of
"volatile" are too fuzzy). If you need volatile to share data, protected by POSIX
synchronization objects, between threads, then your implementation is busted.
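
A minimal sketch of that point (the names are mine): plain, non-volatile data
shared correctly under a POSIX mutex.

    #include <pthread.h>

    /* No volatile needed: pthread_mutex_lock/unlock are opaque calls
       taking the mutex's address, so a conforming compiler cannot cache
       shared_counter in a register across them, and a conforming POSIX
       implementation supplies the required memory coherence. */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int shared_counter = 0;   /* plain int, not volatile */

    void increment(void)
    {
        pthread_mutex_lock(&lock);
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }

    int read_counter(void)
    {
        pthread_mutex_lock(&lock);
        int v = shared_counter;
        pthread_mutex_unlock(&lock);
        return v;
    }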
...
> If you stick to a "natural machine word" that is declared as "volatile",
> you do not absolutely need a mutex (in fact I've done it). Of course, there are
> only certain cases where this works, and it shouldn't be done unless you really know
> your hardware architecture and what you're doing! If you have a machine with a
> lot of processors, unnecessarily locking mutexes can really kill parallelism.
> I'll give one example where this might be used:
> volatile int stop_flag = 0; /* assuming an int is atomic */
> thread_1
> {
> /* bunch of code */
> if some condition exists such that we wish to stop thread_2
> stop_flag = 1;
> /* more code -- or not */
> }
> thread_2
> {
> while(1)
> {
> /* check if thread should stop */
> if (stop_flag)
> break;
> /* do whatever is going on in this loop */
> }
> }
> Of course, this assumes the hardware has some sort of cache coherency
> mechanism. But I don't believe POSIX mutexes or memory barriers (as
> defined for the DEC alpha) have any impact on cache coherency.
If a machine has a cache, and has no mechanism for cache coherency, then it can't
work as a multiprocessor.
> The example is simplistic, but it should work on a vast majority of
> systems. In fact the stop_flag could just as easily be a counter
> of some sort as long as only one thread is modifying the counter...
In some cases, yes, you can do this. But, especially with your "stop_flag",
remember that, if you fail to use a mutex (or other POSIX-guaranteed memory
coherence operation), a thread seeing stop_flag set CANNOT assume anything about
other program state. Nor can you ensure that any thread will see the changed value
of stop_flag in any particular bounded time -- because you've done nothing to
ensure memory ordering, or coherency.
And remember very carefully that bit about "as long as only one thread is
modifying". You cannot assume that "volatile" will ever help you if two threads
might modify the counter at the same time. On a RISC machine, "modify" still means
load, modify, and store, and that's not atomic. You need special instructions to
protect atomicity across that sequence (e.g., load-lock/store-conditional, or
compare-and-swap).
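
For illustration, roughly what those special instructions look like when
exposed through C11 atomics (a facility that postdates this post; the
identifiers are mine):

    #include <stdatomic.h>

    atomic_int counter = 0;

    /* Safe even with several modifying threads: the compiler emits
       load-locked/store-conditional or compare-and-swap as appropriate. */
    void bump(void)
    {
        atomic_fetch_add(&counter, 1);
    }

    /* The same increment as an explicit compare-and-swap loop, which is
       what the fetch_add amounts to on LL/SC machines. */
    void bump_cas(void)
    {
        int old = atomic_load(&counter);
        while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
            ;   /* 'old' is refreshed on failure; retry until the swap wins */
    }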
Am I trying to scare you? Yeah, sure, why not? If you really feel the need to do
something like this, do yourself (and your project) the courtesy of being EXTREMELY
frightened about it. Document it in extreme and deadly detail, and write that
documentation as if you were competing with Stephen King for "best horror story of
the year". I mean to the point that if someone takes over the project from you, and
doesn't COMPLETELY understand the implications, they'll be so terrified of the risk
that they'll rip out your optimizations and use real synchronization. Because this
is just too dangerous to use without full understanding.
There are ways to ensure memory ordering and coherency without using any POSIX
synchronization mechanisms, on any machine that's capable of supporting POSIX
semantics. It's just that you need to be really, really careful, and you need to be
aware that you're writing extremely machine-specific (and therefore inherently
non-portable) code. Some of this is "more portable" than others, but even the
"fairly portable" variants (like your stop_flag) are subject to a wide range of
risks. You need to be aware of them, and willing to accept them. Those who aren't
willing to accept those risks, or don't feel inclined to study and fully understand
the implications of each new platform to which they might wish to port, should
stick with mutexes.
http://groups-beta.google.com/group/comp.lang.c++/browse_frm/thread/82ea6195d505250c/ba4fc712e25a904b?q=volatile+nor+sufficient&_done=%2Fgroups%3Fie%3Dutf-8%26oe%3Dutf-8%26q%3Dvolatile+nor+sufficient%26qt_s%3DSearch+Groups%26&_doneTitle=Back+to+Search&&d#ba4fc712e25a904b
In fact, though you've said you weren't intending to advocate direct access to
any "volatile" variable without applying synchronization and casting away
"volatile", your first Gadget::Wait example does precisely that, and is wildly
incorrect and dangerously misleading. Compiler volatile semantics are not
sufficient when sharing flag_ between threads, because the hardware, as well as
the compiler, may reorder memory accesses arbitrarily, even with volatile. (Nor
would a compiler implementation that issued memory barriers at each sequence
point for volatile variables be sufficient, unless ALL data was volatile, which
is impractical and unreasonably expensive.)
Memory barriers must be applied where necessary on many architectures, and
there is no standard or portable way to generate them. There is no excuse for a
compiler to require both volatile AND memory barriers, because there's no
excuse for a compiler to reorder memory access around its own memory barrier
construct. (Usually either a compiler builtin such as Compaq C/C++ "__MB()" or
an asm("mb") "pseudocall".) The standard and portable way to ensure memory
consistency is to rely on the POSIX memory model, which is based solely on
specific POSIX API calls rather than expensive and inappropriately defined
language keywords or nonportable hardware instructions. A system or compiler
that does not provide the proper memory model (without volatile) with proper
use of the portable POSIX API calls does not conform to POSIX, and cannot be
considered for serious threading work. Volatile is irrelevant.
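
To illustrate such a construct (my example, not one from the post): a
full-barrier wrapper sketched with the GCC builtin __sync_synchronize(),
which emits the hardware barrier (e.g. "mb" on Alpha) and is itself a
compiler barrier, so no volatile qualification is needed around it.

    /* A full memory barrier the compiler may not reorder memory
       accesses around; the builtin also emits the hardware barrier. */
    static inline void memory_barrier(void)
    {
        __sync_synchronize();
    }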
Entirely aside from the language issues, my point was simply that "volatile",
and especially its association with threaded programming, has been an extremely
confusing issue for many people. Simply using them together is going to cause
even more confusion. The illusory promise of volatile will lead novices into
trouble.
In contradiction to your absurd statement that "writing multithread programs
becomes impossible" without volatile, the intended C and C++ semantics
associated with volatile are neither useful nor sufficient for threaded code.
And it is WITH volatile, not without, that "the compiler wastes vast
optimization opportunities", especially as the expense of meeting the volatile
"contract" is of no benefit to threaded code.