Is this correct behaviour for 'rev'?

Brian Inglis Brian.Inglis@SystematicSW.ab.ca
Thu Oct 24 13:54:07 GMT 2024


On 2024-10-23 23:01, Mark Geisert via Cygwin wrote:
> On 10/22/2024 10:33 PM, Mark Geisert via Cygwin wrote:
>> On 10/22/2024 8:00 PM, Backwoods BC via Cygwin wrote:
>>> It appears that 'rev' is choking on any character \x80 or higher, but
>>> is OK with those \x1f or smaller. It doesn't give an error or ignore
>>> it, it just stops.
>>>
>>> I don't have access to a Linux box so I can't see if this happens
>>> there and nothing in the documentation suggests that this is the
>>> correct functionality.
>>>
>>> Test case:
>>> printf 'no non-ASCII characters\nhex 01 >\x01< here\nhex 80 >\x80< here\nLine 4\n'|rev|rev
>>>
>>> This is for "rev from util-linux 2.33.1"
>>>
>>> I don't have the current version of 'rev' on my system due to not
>>> having updated in a while. I accidentally screwed up my installation
>>> and have been reluctant to wipe it and start over.
>>>
>>> So, is this the expected behaviour for the current version of 'rev'
>>> under Cygwin and/or Linux?
>>
>> The current Cygwin util-linux 2.39.3-2 rev behaves in the same, broken way.  
>> It looks like line-ending char(s) are not being handled correctly.   Don't 
>> know yet if it's rev itself or fgetws() being used by rev that's busted.  I'll 
>> investigate further.  Thanks for the report!
> 
> This is a locale issue.  In the default Cygwin locale, rev mishandles the \x80 
> byte and instead of stopping with an error message it enters an infinite loop.  
> I'll probably report this upstream instead of working out a local fix.
> 
> There is a work-around: change to the "C" locale just to run rev.
>      LC_ALL=C rev zzz
> where zzz is a file containing your four lines.  You can also run your original 
> testcase with "rev" replaced by "LC_ALL=C rev" in both places.

I run with a UTF-8 locale and have not noticed any issues, since my files are UTF-8.
The man page for rev(1) says it works on wide characters, and `cygcheck rev` 
shows it is built with gettext-devel libintl/libiconv.

I could see an issue if the shell locale and the file encoding mismatch, or possibly
if the file contains SMP (non-BMP) characters as UTF-16 surrogates.

The correct approach should be to match the execution locale to the file encoding,
for example, `LC_ALL=...UTF-8 rev ...` which should produce the expected results.
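
As a rough sketch of that idea (not something rev or Cygwin provides), the
helpers below check whether a file decodes cleanly as UTF-8 and choose the
locale for rev accordingly. The function names and the UTF8_LOCALE variable are
hypothetical, and the default en_US.UTF-8 is an assumption; check `locale -a`
for what is actually installed. Falling back to "C" follows Mark's work-around.
A file in some other 8-bit encoding would ideally get its own matching locale.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from util-linux or Cygwin.

# Succeed (exit 0) only if the file decodes cleanly as UTF-8.
# iconv exits non-zero when it meets an invalid input sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Print the locale to run rev under for this file.
# UTF8_LOCALE is an assumed override; the default may not exist on your system.
pick_locale() {
    if is_valid_utf8 "$1"; then
        printf '%s\n' "${UTF8_LOCALE:-en_US.UTF-8}"
    else
        # Not valid UTF-8 (e.g. a lone \x80 byte): fall back to the "C" locale.
        printf 'C\n'
    fi
}

# Run rev on the file with the locale chosen above.
rev_matching() {
    LC_ALL=$(pick_locale "$1") rev "$1"
}
```

With these defined, `rev_matching zzz` would run rev under "C" for the test
file from this thread, because the lone \x80 byte (octal \200) is not a valid
UTF-8 sequence.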

-- 
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                 -- Antoine de Saint-Exupéry

