Charles Anthony
Posted on February 17, 2024
The Multics implementation has a subtle but problematic assumption: that all of the CPUs run at about the same rate.
Because the DPS8/M simulator runs each CPU in its own thread, this assumption can be violated. For example, during boot the bootload CPU starts each additional configured CPU something like this:
Signal all other running CPUs to idle
Set up fault vectors to direct interrupted CPU to start-up code
Clear "I am running flag"
Interrupt new CPU
Count to 5000
Check "I am running flag", if not set, the new CPU failed to start
...
and
interrupt_handler:
Check CPU configuration; if ok, set "I am running flag"
Continue with initialization code...
Because the CPU threads run at different rates, and because the newly started thread has extra initial overhead (thread creation time, cache misses), it may not reach "set flag" before the bootload processor has counted to 5000. The result is chaos: Multics believes that the added processor is not running, yet that processor is in the scheduler code looking for work to do.
The correct solution is to load balance the CPU threads so that they work at the same pace, but that is a hard problem.
The short-term solution was to add "stall points": the simulator CPU threads watch for specified instructions in the Multics executable and suspend themselves for a short time before executing the instruction, allowing other CPU threads time to catch up. For the above case, a stall point is set at the start of the "Count to 5000" code; this causes the boot CPU thread to suspend, giving the new CPU thread time to reach the "set the flag" state before the "count to 5000" loop runs.
This problem is believed to exist only in the Multics CPU start-up code; the rest of Multics follows much more rigorous synchronization practices. The CPU start-up code was reviewed, potential race conditions were identified, and stall points were added to the simulator configuration to forestall problems.
This has fixed the problem in nearly all cases, but a few corner cases remain. Heavily loaded systems may experience wait times exceeding the specified stall times, and very low-end systems (fewer host CPUs than simulator CPU threads, or insufficient cache, for example) have been known to encounter intermittent problems.
However, a reliably failing scenario was recently characterized, and tweaking stall-point settings has proven ineffective, prompting a deeper examination of the issue.