Unix Domain Socket Limitation?
Norton Allen
allen@huarp.harvard.edu
Fri Dec 4 13:51:02 GMT 2020
On 12/3/2020 8:11 PM, Ken Brown wrote:
> On 12/2/2020 12:30 PM, Norton Allen wrote:
>> On 11/30/2020 9:22 PM, Norton Allen wrote:
>>> Yeah, so now the example no longer blocks for me. Unfortunately
>>> these bugs are not present in my application, so I will need to keep
>>> working on this.
>>>
>>
>> After paring the main application down and building it back up, I
>> finally narrowed down the condition that was causing this blocking
>> behavior. The issue arises when a client connect()s twice to the same
>> server with non-blocking unix-domain sockets before calling select().
>>
>> There are a few pieces to this. With the client configured to
>> connect() just once, I can see that the server's select() returns as
>> soon as the client calls connect(), but then the server's accept()
>> blocks until the client calls select(). That is not proper
>> non-blocking behavior, but it appears that Cygwin's implementation
>> requires the client and server to communicate synchronously to
>> complete the connect() operation.
>>
>> I tried running this under Ubuntu 16.04 and found that connect()
>> succeeded immediately, so no subsequent select() is required, and
>> there does not appear to be any possibility of this collision. This
>> holds even if the server is not waiting in select() to process the
>> connect() with accept().
>>
>> A workaround for this issue may be to keep the socket blocking until
>> after connect().
>>
>> I have pushed the new minimal example program, 'rapid_connects' to
>> https://github.com/nthallen/cygwin_unix
>>
>> The server is run like before as:
>>
>> $ ./rapid_connects server
>>
>> The client can be run in two different modes. To connect with just
>> one socket:
>>
>> $ ./rapid_connects client1
>>
>> To connect with two:
>>
>> $ ./rapid_connects client2
>>
>> My immediate strategy will be to develop a workaround for my project.
>> Having spent a day inside cygwin1.dll, I can see that I face a steep
>> learning curve before I could contribute much there.
>
> I'm traveling at the moment and unable to do any testing, but I wonder
> if you're bumping into an issue that was just discussed on the
> cygwin-developers list:
>
> https://cygwin.com/pipermail/cygwin-developers/2020-December/012015.html
>
> A different workaround is described there.
>
> If it's the same issue, then I don't think it will happen with the new
> AF_UNIX implementation. More in a few days.
>
It does seem related.

A workaround that is working for me is to do a blocking connect() and
switch the socket to non-blocking mode once it completes. In my
application, the connect() generally occurs once at the beginning of a
run, so blocking for a few milliseconds does not hurt responsiveness.
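For reference, here is a minimal sketch of that workaround in C. It
assumes a SOCK_STREAM unix-domain socket and a server already listening
at `path`; the helper name connect_then_nonblock() is hypothetical and
is not taken from rapid_connects.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: connect a unix-domain stream socket to `path`
 * with an ordinary (blocking) connect(), and only then switch the
 * descriptor to non-blocking mode. Returns the fd, or -1 with errno set. */
int connect_then_nonblock(const char *path)
{
    struct sockaddr_un addr;
    int fd, flags, saved;

    if (strlen(path) >= sizeof(addr.sun_path)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path); /* length checked above */

    /* Blocking connect: returns only once the connection is established
     * (or has failed), so no select() on the socket is needed here. */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    /* Connection is up: make all subsequent I/O non-blocking. */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

Since each connect() finishes before the call returns, a client using
this should never have two non-blocking connects outstanding before
select(). The trade-off, as noted above, is that the caller may block
briefly inside connect().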