Details
- Type: Technical task
- Resolution: Fixed
- Priority: Critical
- Affects Version/s: 3.1.3, 3.1.4, 4.0.0, 4.1.0, 4.1.1, 4.5.0
- Security Level: Public
Description
This has been observed when a high-priority task is busy: tasks of lower priority never get a chance to run and have been observed waiting for many hours.
The example seen in the field was the following scenario:
A new node entering the cluster was hit by 10 concurrent DCP streams, resulting in 10 DCP consumer Processor tasks all competing for the 3 NONIO threads.
A related MB (MB-18452) also meant that the Processor tasks ran for long periods without yielding. However, the logs show that the tasks do eventually yield, so the long wait times aren't due to tasks effectively looping forever. This can be seen because there are many hits in the runtimes histogram, showing that there was opportunity for other waiting NONIO tasks to run...
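For context, the shape of that yield behaviour is roughly as in the following minimal sketch (hypothetical names and time budget; not the actual ep-engine GlobalTask API). Each run() call processes a bounded slice of work and then returns; every return is one entry in the runtimes histogram and, in principle, a point where another NONIO task could be picked.

```cpp
// Minimal sketch of a cooperatively yielding task (hypothetical names,
// not the actual ep-engine GlobalTask API).
#include <chrono>

class ProcessorLikeTask {
public:
    // Returns true if there is more work and the task should be re-queued.
    bool run() {
        using namespace std::chrono;
        const auto sliceStart = steady_clock::now();
        const auto maxSlice = milliseconds(25); // assumed per-run budget

        while (hasBufferedItems()) {
            processOneItem();
            if (steady_clock::now() - sliceStart > maxSlice) {
                return true; // yield: re-queue rather than drain the buffer
            }
        }
        return false; // buffer drained; wake again when more data arrives
    }

private:
    // Stubs standing in for the DCP consumer's buffered items.
    bool hasBufferedItems() const { return false; }
    void processOneItem() {}
};
```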
However, during the 40 minutes of uptime a number of checkpoint stats were requested by ns_server. These tasks are added to the NONIO queue, and during the observed period these checkpoint tasks were never scheduled.
There is evidence in one set of log files that a checkpoint task was waiting to run for 10 hours.
Clearly this should not happen; it looks like the scheduler only considered task priority, not wait time, when scheduling during this heavy workload. Processor has priority 0, and all other NONIO tasks have lower priority (higher numbers).
There is code which appears to also try to schedule tasks by the longest waiter, but the observed behaviour is that this didn't trigger. The sketch below illustrates the difference between the two policies.
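To make the scheduling decision concrete, here is a small, self-contained sketch (illustrative only; the types, field names, and the 30-second bound are assumptions, not the ep-engine scheduler). It contrasts a strictly priority-ordered pick, which matches the observed behaviour, with a pick that also honours the longest waiter, the kind of anti-starvation rule the paragraph above suggests exists but did not trigger.

```cpp
// Sketch of the scheduling decision at the heart of this MB (illustrative
// only; not the ep-engine scheduler). Ten priority-0 Processor tasks can
// monopolise the 3 NONIO threads under a priority-only pick; factoring in
// wait time lets a long-waiting checkpoint task (priority > 0) eventually win.
#include <algorithm>
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

struct ReadyTask {
    std::string name;
    int priority;                                   // 0 == highest
    std::chrono::steady_clock::time_point enqueued; // when it became ready
};

// Priority-only: the observed behaviour - checkpoint tasks are never picked
// while any Processor task is ready.
const ReadyTask& pickByPriority(const std::vector<ReadyTask>& q) {
    return *std::min_element(q.begin(), q.end(),
        [](const ReadyTask& a, const ReadyTask& b) {
            return a.priority < b.priority;
        });
}

// Wait-aware: if the longest waiter has been queued beyond some bound,
// schedule it regardless of priority (a simple aging rule).
const ReadyTask& pickWithAging(const std::vector<ReadyTask>& q,
                               std::chrono::steady_clock::time_point now,
                               std::chrono::seconds maxWait) {
    const auto longest = std::min_element(q.begin(), q.end(),
        [](const ReadyTask& a, const ReadyTask& b) {
            return a.enqueued < b.enqueued;
        });
    if (now - longest->enqueued > maxWait) {
        return *longest;
    }
    return pickByPriority(q);
}

int main() {
    const auto now = std::chrono::steady_clock::now();
    std::vector<ReadyTask> q = {
        {"Processor", 0, now - std::chrono::seconds(1)},
        {"ChkptStats", 2, now - std::chrono::hours(10)}, // the 10-hour waiter
    };
    std::cout << "priority-only picks: " << pickByPriority(q).name << "\n";
    std::cout << "wait-aware picks:    "
              << pickWithAging(q, now, std::chrono::seconds(30)).name << "\n";
}
```

In the first case the checkpoint stats task loses every time a Processor task is runnable; in the second it wins once its wait exceeds the bound.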
Issue Links
- blocks: MB-19612 4.5.1 Minor Release (Closed)