load average calculation imperfections
Mark Geisert
mark@maxrnd.com
Tue May 17 05:39:45 GMT 2022
Jon Turney wrote:
> On 16/05/2022 06:25, Mark Geisert wrote:
>> Corinna Vinschen wrote:
>>> On May 13 13:04, Corinna Vinschen wrote:
>>>> On May 13 11:34, Jon Turney wrote:
>>>>> On 12/05/2022 10:48, Corinna Vinschen wrote:
>>>>>> On May 11 16:40, Mark Geisert wrote:
>>>>>>>
>>>>>>> The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA, but no
>>>>>>> errors on subsequent counter reads. This sounds like it now matches what
>>>>>>> Corinna reported for W11. I wonder if she's running build 1706 already.
>>>>>>
>>>>>> Erm... looks like I didn't read your mail thoroughly enough.
>>>>>>
>>>>>> This behaviour, the first call returning with PDH_INVALID_DATA and only
>>>>>> subsequent calls returning valid(?) values, is what breaks the
>>>>>> getloadavg function and, consequently, /proc/loadavg. So maybe xload
>>>>>> now works, but Cygwin is still broken.
>>>>>
>>>>> The first attempt to read '% Processor Time' is expected to fail with
>>>>> PDH_INVALID_DATA, since it doesn't have a value at a particular instant, but
>>>>> one averaged over a period of time.
>>>>>
>>>>> This is what the following comment is meant to record:
>>>>>
>>>>> "Note that PDH will only return data for '% Processor Time' after the second
>>>>> call to PdhCollectQueryData(), as it's computed over an interval, so the
>>>>> first attempt to estimate load will fail and 0.0 will be returned."
>>>>
>>>> But.
>>>>
>>>> Every invocation of getloadavg() returns 0. Even under load. Calling
>>>> `cat /proc/loadavg' is an exercise in futility.
>>>>
>>>> The only way to make getloadavg() work is to call it in a loop from the
>>>> same process with a 1 sec pause between invocations. In that case, even
>>>> a parallel `cat /proc/loadavg' shows the same load values.
>>>>
>>>> However, as soon as I stop the looping process, the /proc/loadavg values
>>>> are frozen in the last state they had when stopping that process.
>>>
>>> Oh, and, stopping and restarting all Cygwin processes in the session will
>>> reset the loadavg to 0.
>>>
>>>> Any suggestions how to fix this?
>>
>> I'm getting somewhat better behavior from repeated 'cat /proc/loadavg' with the
>> following update to Cygwin's loadavg.cc:
>>
>> diff --git a/winsup/cygwin/loadavg.cc b/winsup/cygwin/loadavg.cc
>> index 127591a2e..cceb3e9fe 100644
>> --- a/winsup/cygwin/loadavg.cc
>> +++ b/winsup/cygwin/loadavg.cc
>> @@ -87,6 +87,9 @@ static bool load_init (void)
>> }
>>
>> initialized = true;
>> +
>> + /* prime the data pump, hopefully */
>> + (void) PdhCollectQueryData (query);
>> }
>
> Yeah, something like this might be a good idea, as at the moment we report load
> averages of 0 for the 5 seconds after the first time someone asks for it.
>
> It's not ideal, because with this change, we go on to call PdhCollectQueryData()
> again very shortly afterwards, so the first value for '% Processor Time' is
> measured over a very short interval, and so may be very inaccurate.
Perhaps add a short delay, say 100ms, after that first PdhCollectQueryData()?
Enough for anything compute-bound to be measurable but not enough to be
human-noticeable? Something even shorter?
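Concretely, the priming-plus-delay idea could look something like the sketch below; `query` is assumed to be the PDH_HQUERY already opened in load_init(), and 100 ms is just the figure floated above, not a measured value:

```c
#include <windows.h>
#include <pdh.h>

/* Sketch only: prime the '% Processor Time' counter so the first real
   read has a non-trivial interval to average over. */
static void
prime_query (PDH_HQUERY query)
{
  /* The first collection only establishes a baseline; PDH_INVALID_DATA
     from a subsequent PdhGetFormattedCounterValue() is expected until a
     second collection happens, so the return value is ignored here. */
  (void) PdhCollectQueryData (query);

  /* Give anything compute-bound a measurable interval before the next
     collection, without a human-noticeable stall. */
  Sleep (100);
}
```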
[...]
>> Every other Cygwin app I know of uses getloadavg() under the hood. When it
>> calculates a new set of 1,5,15 minute load averages, it uses total %processor
>> time and total processor queue length. It has a decay behavior that I think has
>> been around since early Unix. What I haven't noticed before is an "inverse"
>> decay behavior that seems wrong to me, but maybe Linux has this. That is, if
>> you have just one compute-bound process the load average won't reach 1.0 until
>> that process has been running for a full minute. You don't see instantaneous load.
>
> In fact it asymptotically approaches 1, so it wouldn't reach it until you've had a
> load of 1 for a long time compared to the time you are averaging over.
>
> Starting from idle, a unit load after 1 minute would result in a 1-minute load
> average of (1 - (1/e)) = ~0.62. See
> https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html for some
> discussion of that.
>
> That's just how it works, as a measure of demand, not load.
Thanks for that link; it was interesting to read. OK, so that's just how it works;
the ramp is even more drawn out over time than I was thinking.
[...]
>> Ideally, the shared data should have the most recently calculated 1,5,15 minute
>> load averages and a timestamp of when they were calculated. And then any
>> process that calls getloadavg() should independently decide whether it's time to
>> calculate an updated set of values for machine-wide use. But can the decay
>> calculations get messed up due to multiple updaters? I want to say no, but I
>> can't quite convince myself. Each updater has its own idea of the 1,5,15
>> timespans, doesn't it, because updates can occur at random, rather than at a set
>> period like a kernel would do?
>
> I think not, because last_time is part of the shared loadavginfo state, which is
> the unix epoch time that the last update was computed, and updating that is
> guarded by a mutex.
>
> That's not to say that this code might not be wrong in some other way :)
Alright, I see the problem with how I was visualizing multiple updaters. I was
thinking of the "real" load average over time as a superposition (sum, I guess) of
the decaying exponential curves of all the updaters' calculations. But no, each
updater replaces the current curve with a new one based on its own new data. What
I was envisioning would be much more complex and require more state memory. Oof.
I can submit a patch for the added PdhCollectQueryData() plus a short Sleep() if it
would make sense to try it for a while on Cygwin head. Other suggestions welcome.
..mark
More information about the Cygwin-developers mailing list