OS Concepts : Race Conditions

A race condition or race hazard

is a type of flaw in an electronic or software system where
the output is dependent on the sequence or timing of other uncontrollable events.
The term originates with the idea of two signals racing each other to influence the output first.
Race conditions can occur in electronics systems, especially logic circuits, and in computer software, especially multithreaded or distributed programs.
Race conditions arise in software when separate processes or threads of execution depend on some shared state.

Mutually exclusive operations are those that cannot be interrupted while accessing some resource such as a memory location. I

Operations upon shared states are critical sections that must be mutually exclusive in order to avoid harmful collision between processes or threads that share those states.

As a simple example let us assume that two tasks each want to increment the value of a global integer by one. Ideally, the following sequence of operations would take place:

Task 1	Task 2	New Value of "i"
		0
read value of i		0
increase value		0
write value to i		1
	read value of i	1
	increase value	1
	write value to i	2

In the case shown above, the final value of i is 2, as expected. However, if the two threads run simultaneously without locking or synchronization, the outcome of the operation could be wrong. The alternative sequence of operations below demonstrates this scenario:

Task 1	Task 2	New Value of "i"
		0
read value of i		0
	read value of i	0
increase value		0
	increase value	0
write value to i		1
	write value to i	1

The final value of i is 1 instead of the expected result of 2. This occurs because the increment operations of the second case are not mutually exclusive. Mutually exclusive operations are those that cannot be interrupted while accessing some resource such as a memory location. In the first case, T1 was not interrupted while accessing the variable i, so its operation was mutually exclusive.

For another example, consider the following two tasks, in pseudocode:

global int A = 0
 
// increments the value of A and print "RX"
// activated whenever an interrupt is received from the serial controller
task Received()
    A = A + 1
    print "RX"
end task
 
// prints out only the even numbers
// is activated every second
task Timeout()
    if (A is evenly divisible by 2)
        print A
    end if
end task

Output would look something like:

0
0
0
RX
RX
2
RX
RX
4
4

Now consider this chain of events, which might occur next:

timeout occurs, activating task Timeout
task Timeout evaluates A and finds it is divisible by 2, so elects to execute the "print A" next.
data is received on the serial port, causing an interrupt and a switch to task Received
task Received runs to completion, incrementing A and printing "RX"
control returns to task Timeout
task Timeout executes print A, using the current value of A, which is 5.

Mutexes are used to address this problem in concurrent programming.

File systems

In file systems, two or more programs may "collide" in their attempts to modify or access a file, which could result in data corruption.

File locking provides a commonly used solution.

A more cumbersome remedy involves organizing the system in such a way that one unique process (running a daemon or the like) has exclusive access to the file, and all other processes that need to access the data in that file do so only via interprocess communication with that one process (which of course requires synchronization at the process level).

A different form of race condition exists in file systems where unrelated programs may affect each other by suddenly using up available resources such as disk space (or memory, or processor cycles).

Software not carefully designed to anticipate and handle this race situation may then become quite fragile and unpredictable. Such a risk may be overlooked for a long time in a system that seems very reliable. But eventually enough data may accumulate or enough other software may be added to critically destabilize many parts of a system. Probably the best known example of this occurred with the near-loss of the Mars Rover "Spirit" not long after landing, but this is a commonly overlooked hazard in many computer systems. A solution is for software to request and reserve all the resources it will need before beginning a task; if this request fails then the task is postponed, avoiding the many points where failure could have occurred. (Alternatively, each of those points can be equipped with error handling, or the success of the entire task can be verified afterwards, before continuing.) A more common but incorrect approach is to simply verify that enough disk space (for example) is available before starting a task; this is not adequate because in complex systems the actions of other running programs can be unpredictable.

Tech Guide - Interview , Programming & Others

Saturday, September 29, 2012