load average calculation imperfections

Mark Geisert mark@maxrnd.com
Tue May 17 05:39:45 GMT 2022


Jon Turney wrote:
> On 16/05/2022 06:25, Mark Geisert wrote:
>> Corinna Vinschen wrote:
>>> On May 13 13:04, Corinna Vinschen wrote:
>>>> On May 13 11:34, Jon Turney wrote:
>>>>> On 12/05/2022 10:48, Corinna Vinschen wrote:
>>>>>> On May 11 16:40, Mark Geisert wrote:
>>>>>>>
>>>>>>> The first counter read now gets error 0xC0000BC6 == PDH_INVALID_DATA, but no
>>>>>>> errors on subsequent counter reads.  This sounds like it now matches what
>>>>>>> Corinna reported for W11.  I wonder if she's running build 1706 already.
>>>>>>
>>>>>> Erm... looks like I didn't read your mail thoroughly enough.
>>>>>>
>>>>>> This behaviour, the first call returning with PDH_INVALID_DATA and only
>>>>>> subsequent calls returning valid(?) values, is what breaks the
>>>>>> getloadavg function and, consequently, /proc/loadavg.  So maybe xload
>>>>>> now works, but Cygwin is still broken.
>>>>>
>>>>> The first attempt to read '% Processor Time' is expected to fail with
>>>>> PDH_INVALID_DATA, since it doesn't have a value at a particular instant, but
>>>>> one averaged over a period of time.
>>>>>
>>>>> This is what the following comment is meant to record:
>>>>>
>>>>> "Note that PDH will only return data for '% Processor Time' after the second
>>>>> call to PdhCollectQueryData(), as it's computed over an interval, so the
>>>>> first attempt to estimate load will fail and 0.0 will be returned."
>>>>
>>>> But.
>>>>
>>>> Every invocation of getloadavg() returns 0.  Even under load.  Calling
>>>> `cat /proc/loadavg' is an exercise in futility.
>>>>
>>>> The only way to make getloadavg() work is to call it in a loop from the
>>>> same process with a 1 sec pause between invocations.  In that case, even
>>>> a parallel `cat /proc/loadavg' shows the same load values.
>>>>
>>>> However, as soon as I stop the looping process, the /proc/loadavg values
>>>> stay frozen at the last state they had when that process stopped.
>>>
>>> Oh, and, stopping and restarting all Cygwin processes in the session will
>>> reset the loadavg to 0.
>>>
>>>> Any suggestions how to fix this?
>>
>> I'm getting somewhat better behavior from repeated 'cat /proc/loadavg' with the 
>> following update to Cygwin's loadavg.cc:
>>
>> diff --git a/winsup/cygwin/loadavg.cc b/winsup/cygwin/loadavg.cc
>> index 127591a2e..cceb3e9fe 100644
>> --- a/winsup/cygwin/loadavg.cc
>> +++ b/winsup/cygwin/loadavg.cc
>> @@ -87,6 +87,9 @@ static bool load_init (void)
>>       }
>>
>>       initialized = true;
>> +
>> +    /* prime the data pump, hopefully */
>> +    (void) PdhCollectQueryData (query);
>>     }
> 
> Yeah, something like this might be a good idea, as at the moment we report load 
> averages of 0 for the 5 seconds after the first time someone asks for it.
> 
> It's not ideal, because with this change, we go on to call PdhCollectQueryData() 
> again very shortly afterwards, so the first value for '% Processor Time' is 
> measured over a very short interval, and so may be very inaccurate.

Perhaps add a short delay, say 100ms, after that first PdhCollectQueryData()? 
Enough for anything compute-bound to be measurable but not enough to be 
human-noticeable?  Something even shorter?
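
Roughly what I have in mind, as an untested sketch of the tail end of load_init()
(the 100 ms figure is just a guess, and Sleep() here is the ordinary Win32 call):

      initialized = true;

      /* prime the data pump, then give '% Processor Time' a short but
         non-trivial interval to average over before the first real read */
      (void) PdhCollectQueryData (query);
      Sleep (100);      /* milliseconds; value open to tuning */
    }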

[...]
>> Every other Cygwin app I know of uses getloadavg() under the hood.  When
>> getloadavg() calculates a new set of 1,5,15 minute load averages, it uses total
>> %processor time and total processor queue length.  It has a decay behavior that I
>> think has been around since early Unix.  What I hadn't noticed before is an
>> "inverse" decay behavior that seems wrong to me, but maybe Linux has this.  That
>> is, if you have just one compute-bound process, the load average won't reach 1.0
>> until that process has been running for a full minute.  You don't see
>> instantaneous load.
> 
> In fact it asymptotically approaches 1, so it wouldn't reach it until you've had a 
> load of 1 for a long time compared to the time you are averaging over.
> 
> Starting from idle, a unit load after 1 minute would result in a 1-minute load 
> average of (1 - (1/e)) = ~0.62.   See 
> https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html for some 
> discussion of that.
> 
> That's just how it works, as a measure of demand, not load.

Thanks for that link; that was interesting to read.  OK, so that's how it is; the 
ramp is even more drawn out over time than I was thinking.
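
Just to make sure I have the arithmetic straight, here's how I now picture the
damping.  This is a standalone sketch of the usual exponentially-damped average,
with made-up names, not the actual loadavg.cc code:

  #include <math.h>
  #include <stdio.h>

  /* exponentially-damped average: after `delta` seconds with `active`
     runnable tasks, the old average decays toward `active` */
  static double
  damp (double old_avg, double active, double delta, double period)
  {
    double decay = exp (-delta / period);
    return old_avg * decay + active * (1.0 - decay);
  }

  int
  main (void)
  {
    /* starting from idle, one compute-bound process for a full minute
       yields a 1-minute average of 1 - 1/e, the figure quoted above */
    printf ("%.2f\n", damp (0.0, 1.0, 60.0, 60.0));
    return 0;
  }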

[...]
>> Ideally, the shared data should have the most recently calculated 1,5,15 minute 
>> load averages and a timestamp of when they were calculated.  And then any 
>> process that calls getloadavg() should independently decide whether it's time to 
>> calculate an updated set of values for machine-wide use.  But can the decay 
>> calculations get messed up due to multiple updaters?  I want to say no, but I 
>> can't quite convince myself.  Each updater has its own idea of the 1,5,15 
>> timespans, doesn't it, because updates can occur at random times rather than at a 
>> set period like a kernel would use?
> 
> I think not, because last_time, the Unix epoch time at which the last update was 
> computed, is part of the shared loadavginfo state, and updating it is guarded by a 
> mutex.
> 
> That's not to say that this code might not be wrong in some other way :)

Alright, I see the problem with how I was visualizing multiple updaters.  I was 
thinking of the "real" load average over time as a superposition (sum, I guess) of 
the decaying exponential curves of all the updaters' calculations.  But no, each 
updater replaces the current curve with a new one based on its own new data.  What 
I was envisioning would be much more complex and require more state memory.  Oof.
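
In other words, the picture I now have is roughly the following (invented struct
and function names; the real loadavginfo layout and mutex handling in loadavg.cc
differ):

  #include <math.h>
  #include <time.h>

  struct shared_state
  {
    double loadavg[3];   /* most recent 1, 5, 15 minute averages */
    time_t last_time;    /* Unix epoch time of the last update */
  };

  /* Whichever process performs the next update replaces the whole curve,
     scaling the decay by the time elapsed since last_time; there is no
     per-updater state to superpose.  The mutex Jon mentions would wrap
     this whole function in the real code. */
  void
  update_loadavg (struct shared_state *s, double active, time_t now)
  {
    static const double period[3] = { 60.0, 300.0, 900.0 };
    double delta = now - s->last_time;

    for (int i = 0; i < 3; i++)
      {
        double decay = exp (-delta / period[i]);
        s->loadavg[i] = s->loadavg[i] * decay + active * (1.0 - decay);
      }
    s->last_time = now;
  }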

I can submit a patch for the added PdhCollectQueryData() plus short Sleep() if it 
would make sense to try it for a while on Cygwin head.  Other suggestions welcome.
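
For testing I'd just run something like this throwaway loop (not part of the patch
itself) in one terminal while watching `cat /proc/loadavg' in another; with the fix
the values should keep tracking load rather than freezing:

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  int
  main (void)
  {
    for (;;)
      {
        double la[3];

        /* getloadavg() returns the number of samples it filled in */
        if (getloadavg (la, 3) == 3)
          printf ("%.2f %.2f %.2f\n", la[0], la[1], la[2]);
        sleep (1);
      }
  }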

..mark

