
Re: cygwin on W7: stalled scp (openssh-6.8p1), lost ssh-scp-pipe data


Hi Corinna,

sorry for the delayed answer, and for most likely breaking the thread by replying to my old email; I just subscribed to the ML. Your email was:

> On Jun  1 20:53, Corinna Vinschen wrote:
> > On Jun  1 17:03, Theodor.Kazhan@gmx.de wrote:
> > > Hi folks,
> > > 
> > > unfortunately, my issue does not seem to be interesting enough to trigger someone's attention... ;)
> > > 
> > > Was my issue well described, or do you have further questions? I'd
> > > appreciate any help leading me forward, e.g. maybe some hints on how to
> > > instrument the Cygwin sources to be able to further trace the data
> > > through and debug into Cygwin's write/read functions...
> > 
> > I have no good idea how to debug this further, other than running under
> > strace (which takes a lot of patience), but I tried to do the same with
> > the stock scp 6.8 on Cygwin 2.0.2 x86_64, basically a loop:
> > 
> >   while true
> >   do
> >     scp -C my_linux_box:TEST.txt . && diff TEST.orig TEST.txt || break
> >   done
> > 
> > The TEST.orig/TEST.txt file was generated as described in your OP.  The
> > loop has now been running for a couple of hours without fail.
> > 
> > Any chance this is BLODA-induced?
> 
> Corinna


I was able to reproduce the issue using the same server but a different client system (also Win7 + Cygwin according to the attached cygcheck.REDACTED.txt, without changing scp/ssh), which certainly has far fewer BLODAs (e.g. no McApe ;)...

It took significantly longer to show up: ~8 h until the first hanging scp connection:

$ while true; do DATE=`date +%Y%m%d_%H%M%S`; echo -n "started: $DATE"; scp -v -C -o BatchMode=yes tester@10.IP2.IP3.IP4:"logDownload_m/TEST.txt" . > scp_$DATE.txt 2>&1; echo " - finished: `date +%Y%m%d_%H%M%S`"; rm -v `ls -1 scp_2015* | head -n -3`; sleep 5; done
started: 20150603_103328 - finished: 20150603_103542
‘scp_20150603_102652.txt’ removed
‘scp_20150603_102912.txt’ removed
‘scp_20150603_103251.txt’ removed
‘scp_20150603_103256.txt’ removed
started: 20150603_103547 - finished: 20150603_103759
‘scp_20150603_103303.txt’ removed
. . .
started: 20150603_182649 - finished: 20150603_182802
‘scp_20150603_182240.txt’ removed
started: 20150603_182807 - finished: 20150603_182922
‘scp_20150603_182355.txt’ removed
started: 20150603_182927 - finished: 20150603_183036
‘scp_20150603_182509.txt’ removed
started: 20150603_183041 - finished: 20150603_183151
‘scp_20150603_182649.txt’ removed
started: 20150603_183156

I just found this last connection (started: 20150603_183156) still active today, waiting inside the scp process for data to read... Looking at the received TEST.txt shows the same symptoms... one 16k block of data is missing:

$ sdiff -stw40 <(cat TEST.txt | sed -re "s/(B0.....-N001024-)/\1\n/g" | cut -d- -f-2) <(seq -w 1 1 94796 | sed -re "s/.*/B0&-N000001/g")
                            >  B026375-N000001
B094793-N000001    <

The "B026375" 16k block is missing (and, as TEST.txt is simply overwritten every time, the "B094793" 16k block appears duplicated: the first instance belongs to the current transmission, which is offset by -16k, while the second one sits at the correct position and is left over from the previous transmission of the file).
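
One quick way to confirm that the process is really stuck in a read (a sketch; it assumes Cygwin's procfs is available, and <PID> stands for the pid of the hung scp):

$ ps -ef | grep '[s]cp'      # find the pid of the hung scp
$ ls -l /proc/<PID>/fd       # fd 0 should still be the read end of the pipe from ssh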

Before starting to debug into scp/ssh I tried to use strace in two ways: attaching to the running scp/ssh processes, and starting the scp/ssh processes under strace from the beginning. I was only able to reproduce the issue using the attach approach (most likely attaching after the buffer was already lost)... Starting the transmission under strace from the beginning, or attaching early enough before <HUGEFILE> is pulled, did not reproduce it.
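
For reference, the two variants look like this (a sketch; this is Cygwin's own strace, and <PID> stands for the pid of an already running scp):

$ strace -o scp_from_start.strace /usr/local/bin/scp -v -C -o BatchMode=yes tester@10.IP2.IP3.IP4:logDownload_m/TEST.txt .
$ strace -o scp_attach.strace -p <PID>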

All of that looks to me like a subtle race condition... (I read some ML posts about Cygwin starting threads to ensure that data is written using M$Win IO functions, but that is far beyond my current know-how; not threads in general, I do have some experience in debugging really nasty race conditions.)
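
In case someone wants to chase such a race outside of ssh/scp entirely, below is a minimal self-contained sketch of the kind of harness I have in mind (my own construction, nothing from the openssh sources, so take it as an assumption about where the bug might live): a forked child writes sequence-numbered 16k blocks into a non-blocking pipe roughly the way ssh does, and the parent verifies the sequence, so a lost block would be flagged immediately.

/* pipe_race_test.c - stress a Cygwin pipe with numbered 16k blocks.
   Build: gcc -O2 -o pipe_race_test pipe_race_test.c
   The child writes blocks B000000.., the parent verifies the sequence;
   a lost or garbled block aborts the run with a message. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLK  16384                 /* same size the ssh channel writes */
#define NBLK 100000L               /* ~1.5 GB per run */

int main(void)
{
    int fd[2];
    static char buf[BLK];

    if (pipe(fd) < 0) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {                               /* child: writer */
        close(fd[0]);
        fcntl(fd[1], F_SETFL, O_NONBLOCK);        /* ssh's wfd is non-blocking */
        for (long b = 0; b < NBLK; b++) {
            memset(buf, '-', BLK);
            snprintf(buf, BLK, "B0%05ld", b);     /* stamp the block number */
            for (long off = 0; off < BLK; ) {     /* handle short writes */
                ssize_t n = write(fd[1], buf + off, BLK - off);
                if (n < 0) {
                    if (errno == EAGAIN || errno == EINTR)
                        continue;   /* busy-wait keeps the sketch short; ssh poll()s */
                    perror("write"); _exit(1);
                }
                off += n;
            }
        }
        close(fd[1]);
        _exit(0);
    }

    close(fd[1]);                                 /* parent: reader */
    for (long b = 0; b < NBLK; b++) {
        for (long off = 0; off < BLK; ) {         /* handle short reads */
            ssize_t n = read(fd[0], buf + off, BLK - off);
            if (n < 0 && errno == EINTR)
                continue;
            if (n <= 0) { perror("read"); return 1; }
            off += n;
        }
        long got;
        if (sscanf(buf, "B0%5ld", &got) != 1 || got != b) {
            fprintf(stderr, "lost/garbled block: expected %ld, got %.16s\n", b, buf);
            return 1;
        }
    }
    wait(NULL);
    printf("all %ld blocks arrived intact\n", NBLK);
    return 0;
}

Of course, if the race depends on ssh's select/thread timing, a plain harness like this may never trigger it.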


How should I proceed with this? Would "we" suspect Cygwin/M$Win to be the one to blame...? Who could we ask for help?


Thanks and BR,
T.


-----Original Message-----
Sent: Monday, June 1, 2015 at 17:03:40
From: Theodor.Kazhan@gmx.de
To: Theodor.Kazhan@gmx.de, cygwin@cygwin.com
Subject: Re: cygwin on W7: stalled scp (openssh-6.8p1), lost ssh-scp-pipe data
Hi folks,

unfortunately, my issue does not seem to be interesting enough to trigger someone's attention... ;)

Was my issue well described, or do you have further questions? I'd appreciate any help leading me forward, e.g. maybe some hints on how to instrument the Cygwin sources to be able to further trace the data through and debug into Cygwin's write/read functions...

Thanks,
Theo

-----Original Message-----
Sent: Friday, May 22, 2015 at 09:52:20
From: Theodor.Kazhan@gmx.de
To: cygwin@cygwin.com
Subject: cygwin on W7: stalled scp (openssh-6.8p1), lost ssh-scp-pipe data
Hi ML-followers,

I occasionally observe stalled scp connections while copying logfile archives from a Debian server to Cygwin on Windows 7. Besides https://sourceware.org/ml/cygwin/2015-02/msg00575.html I did not find similar issues, and in my case there is no blocking/non-blocking switch involved; some data is simply lost in the middle of the transmission through the pipe, at random places in subsequent tests (please see below).

Please find the redacted output of my "cygcheck -svr" attached. I noticed the issue before upgrading my Cygwin to the latest version as well.

As I'm not into the details of the read/write functions, I was only able to track down the issue as follows (I hope it is reproducible or otherwise helpful), so now I need your help:


1) I added some instrumentation to scp/ssh (both binaries copied to /usr/local/bin/ and used below):

user@host /usr/src/openssh-6.8p1-1.src/openssh-6.8p1-debug
$ grep -n -C4 TK: *.c
channels.c-1724-
channels.c-1725-                if (c->datagram) {
channels.c-1726-                        /* ignore truncated writes, datagrams might get lost */
channels.c-1727-                        len = write(c->wfd, buf, dlen);
channels.c:1728:logit("TK: channels.c: channel_handle_wfd(1): write: len=%d, buf=%.16s", len, buf);
channels.c-1729-                        free(data);
channels.c-1730-                        if (len < 0 && (errno == EINTR || errno == EAGAIN ||
channels.c-1731-                            errno == EWOULDBLOCK))
channels.c-1732-                                return 1;
--
channels.c-1745-                        dlen = MIN(dlen, 8*1024);
channels.c-1746-#endif
channels.c-1747-
channels.c-1748-                len = write(c->wfd, buf, dlen);
channels.c:1749:logit("TK: channels.c: channel_handle_wfd(2): write: len=%d, buf=%.16s", len, buf);
channels.c-1750-                if (len < 0 &&
channels.c-1751-                    (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK))
channels.c-1752-                        return 1;
channels.c-1753-                if (len <= 0) {
--
channels.c-2368-                return 0;
channels.c-2369-
channels.c-2370-        /* Get the data. */
channels.c-2371-        data = packet_get_string_ptr(&data_len);
channels.c:2372:logit("TK: channels.c: channel_input_data: data_len=%d, buf=%.16s", data_len, data);
channels.c-2373-        win_len = data_len;
channels.c-2374-        if (c->datagram)
channels.c-2375-                win_len += 4;  /* string length header */
channels.c-2376-
--
dispatch.c-107-                         return r;
dispatch.c-108-                 if (type == SSH_MSG_NONE)
dispatch.c-109-                         return 0;
dispatch.c-110-         }
dispatch.c:111:logit("TK: dispatch.c: ssh_dispatch_run: type=%d", type);
dispatch.c-112-         if (type > 0 && type < DISPATCH_MAX &&
dispatch.c-113-             ssh->dispatch[type] != NULL) {
dispatch.c-114-                 if (ssh->dispatch_skip_packets) {
dispatch.c-115-                         debug2("skipped packet (type %u)", type);
--
scp.c-1110-                     count += amt;
scp.c-1111-                     do {
scp.c-1112-                             j = atomicio6(read, remin, cp, amt,
scp.c-1113-                                 scpio, &statbytes);
scp.c:1114:logit("TK: scp.c: sink: amt=%d, j=%d, i=%d, size=%d, count=%d, buf=%.16s", amt, j, i, size, count, cp);
scp.c-1115-                             if (j == 0) {
scp.c-1116-                                     run_err("%s", j != EPIPE ?
scp.c-1117-                                         strerror(errno) :
scp.c-1118-                                         "dropped connection");


2) Generated an easily debuggable HUGEFILE (numbered 16k ASCII blocks) on a Debian server (file size taken from an original log archive):

$ for BLOCK in `seq -w 1 1 94795`; do for ELEM in `seq -w 1 1 1024`; do echo -n "B0$BLOCK-N00$ELEM-"; done; done > TEST.txt
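
(Side note: the nested shell loop takes quite a while for ~1.5 GB; an equivalent awk one-liner should produce byte-identical output much faster, if anyone wants to regenerate the file:)

$ awk 'BEGIN { for (b = 1; b <= 94795; b++) for (e = 1; e <= 1024; e++) printf "B0%05d-N00%04d-", b, e }' > TEST.txt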


3) Pulled that file in a loop from the Debian server to the local Cygwin on Windows 7:

$ while true; do /usr/local/bin/scp -v -C -o BatchMode=yes user@10.IP2.IP3.IP4:logDownload_m/TEST.txt 93_KO_TEST.txt 2>&1 | tee 93_KO_scp_trace.txt; echo; date; echo; sleep 15; done


4) In case of a stalled scp (in my env it does not take that much time to hit one, maybe 10 to 30 tries), collect and compare the "KO" output to a previously recorded and prepared "OK" output:

$ split -b 163840 -a 5 93_KO_TEST.txt 93_KO_TEST.txt.split_


$ for FILE in 93_OK_TEST.txt.split_*; do echo "$FILE"; diff -q $FILE `echo $FILE | sed -re "s/93_OK_/93_KO_/g"` || break; done
. . .
93_OK_TEST.txt.split_aanqv
93_OK_TEST.txt.split_aanqw
Files 93_OK_TEST.txt.split_aanqw and 93_KO_TEST.txt.split_aanqw differ
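
(Alternatively, a single pass over the whole KO file finds jumps in the block sequence without splitting; a sketch, reusing the sed decoding from the next step:)

$ sed -re "s/(B......-N001024-)/\1\n/g" 93_KO_TEST.txt | awk -F- 'NF { n = substr($1, 3) + 0; if (p && n != p + 1) print "sequence jump: " p " -> " n; p = n }'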


$ sdiff <(cat 93_OK_TEST.txt.split_aanqw | sed -re "s/(B......-N001024-)/\1\n/g") <(cat 93_KO_TEST.txt.split_aanqw | sed -re "s/(B......-N001024-)/\1\n/g")
B092261-N000001-B092261-N000002-B092261-N000003-B092261-N0000   B092261-N000001-B092261-N000002-B092261-N000003-B092261-N0000
B092262-N000001-B092262-N000002-B092262-N000003-B092262-N0000   B092262-N000001-B092262-N000002-B092262-N000003-B092262-N0000
B092263-N000001-B092263-N000002-B092263-N000003-B092263-N0000   B092263-N000001-B092263-N000002-B092263-N000003-B092263-N0000
B092264-N000001-B092264-N000002-B092264-N000003-B092264-N0000   B092264-N000001-B092264-N000002-B092264-N000003-B092264-N0000
B092265-N000001-B092265-N000002-B092265-N000003-B092265-N0000   B092265-N000001-B092265-N000002-B092265-N000003-B092265-N0000
B092266-N000001-B092266-N000002-B092266-N000003-B092266-N0000   B092266-N000001-B092266-N000002-B092266-N000003-B092266-N0000
B092267-N000001-B092267-N000002-B092267-N000003-B092267-N0000 <
B092268-N000001-B092268-N000002-B092268-N000003-B092268-N0000   B092268-N000001-B092268-N000002-B092268-N000003-B092268-N0000
B092269-N000001-B092269-N000002-B092269-N000003-B092269-N0000   B092269-N000001-B092269-N000002-B092269-N000003-B092269-N0000
B092270-N000001-B092270-N000002-B092270-N000003-B092270-N0000   B092270-N000001-B092270-N000002-B092270-N000003-B092270-N0000
                                                              > B092271-N000001-B092271-N000002-B092271-N000003-B092271-N0000


... So, the 16k block "B092267" is missing in the "KO" case. Let's check the scp/ssh logs, where we have 3 steps at which the HUGEFILE data is handled:

a) Writing it to the channel buffer in the ssh process after reading it from the socket and after decryption and decompression: "TK: channels.c: channel_input_data:"
b) Reading it from the channel buffer and writing it to the pipe towards the scp parent process, in the ssh process: "TK: channels.c: channel_handle_wfd(2):"
c) Reading it from the pipe in the scp process to write it to the target file: "TK: scp.c: sink:"

The logs show that the block is handled in a) and b), but it does not show up in c); it is simply lost...

$ for TRACKER in TK:.channels.c:.channel_input_data: TK:.channels.c:.channel_handle_wfd.2.: TK:.scp.c:.sink:; do echo; cat 93_KO_scp_trace.txt | grep "$TRACKER" | grep "B0922[67]"; done

TK: channels.c: channel_input_data: data_len=16384, buf=B092260-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092261-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092262-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092263-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092264-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092265-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092266-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092267-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092268-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092269-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092270-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092271-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092272-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092273-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092274-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092275-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092276-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092277-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092278-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B092279-N000001-

TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092260-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092261-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092262-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092263-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092264-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092265-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092266-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092267-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092268-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092269-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092270-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092271-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092272-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092273-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092274-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092275-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=32768, buf=B092276-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092278-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092279-N000001-

TK: scp.c: sink: amt=65536, j=65536, i=1511587840, size=1553121293, count=65536, buf=B092261-N000001-
TK: scp.c: sink: amt=65536, j=65536, i=1511653376, size=1553121293, count=65536, buf=B092265-N000001-
TK: scp.c: sink: amt=65536, j=65536, i=1511718912, size=1553121293, count=65536, buf=B092270-N000001-
TK: scp.c: sink: amt=65536, j=65536, i=1511784448, size=1553121293, count=65536, buf=B092274-N000001-
TK: scp.c: sink: amt=65536, j=65536, i=1511849984, size=1553121293, count=65536, buf=B092278-N000001-

The 3rd line of c) "TK: scp.c: sink:" should read "buf=B092269-N000001-" if all 64k of data following the 2nd line ("count=65536, buf=B092265-N000001-") had been read correctly from the pipe, but it reads "buf=B092270-N000001-". I.e., one 16k block is written to the pipe by ssh ("TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B092267-N000001-") but does not show up on the other end of the pipe in scp.
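
For readers not familiar with scp's read path: sink() requests the full amount via atomicio6(), which keeps calling read() until everything requested has arrived. Roughly like this (a simplified sketch of openssh's atomicio.c; the real code poll()s on EAGAIN and handles errors and the progress callback differently):

/* simplified: read exactly n bytes from fd, or return short on error/EOF */
size_t
atomicio_read(int fd, void *_s, size_t n)
{
        char *s = _s;
        size_t pos = 0;

        while (pos < n) {
                ssize_t res = read(fd, s + pos, n - pos);
                if (res < 0) {
                        if (errno == EINTR || errno == EAGAIN)
                                continue;       /* the real code poll()s here */
                        return pos;             /* hard error */
                }
                if (res == 0)
                        return pos;             /* EOF */
                pos += res;
        }
        return pos;
}

So the scp side consumes whatever the pipe delivers, strictly in order: if one 16k chunk never arrives, atomicio6() simply keeps waiting for the outstanding bytes (the observed hang), and everything after the lost chunk lands 16k too early (the observed diffs).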


When bigger blocks are written to the pipe by ssh (e.g. "len=32768"), the whole bigger block gets lost and does not show up in scp, e.g.:

93_KOOK_TEST.txt/93_OK_TEST.txt.split_aacmz VS 92_KO_TEST.txt/92_KO_TEST.txt.split_aacmz
B016891-N000001-B016891-N000002-B016891-N000003-B016891-N0000 <
B016892-N000001-B016892-N000002-B016892-N000003-B016892-N0000 <
B016893-N000001-B016893-N000002-B016893-N000003-B016893-N0000   B016893-N000001-B016893-N000002-B016893-N000003-B016893-N0000
. . . 
B016900-N000001-B016900-N000002-B016900-N000003-B016900-N0000   B016900-N000001-B016900-N000002-B016900-N000003-B016900-N0000
                                                              > B016901-N000001-B016901-N000002-B016901-N000003-B016901-N0000
                                                              > B016902-N000001-B016902-N000002-B016902-N000003-B016902-N0000

92_KO_TEST.txt/92_KO_scp_trace.txt
TK: channels.c: channel_input_data: data_len=16384, buf=B016890-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B016891-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B016892-N000001-
TK: channels.c: channel_input_data: data_len=16384, buf=B016893-N000001-

TK: channels.c: channel_handle_wfd(2): write: len=16384, buf=B016890-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=32768, buf=B016891-N000001-
TK: channels.c: channel_handle_wfd(2): write: len=32768, buf=B016893-N000001-

TK: scp.c: sink: amt=65536, j=65536, i=276627456, size=1553121293, count=65536, buf=B016885-N000001-
TK: scp.c: sink: amt=65536, j=65536, i=276692992, size=1553121293, count=65536, buf=B016889-N000001-
TK: scp.c: sink: amt=65536, j=65536, i=276758528, size=1553121293, count=65536, buf=B016895-N000001-
TK: scp.c: sink: amt=65536, j=65536, i=276824064, size=1553121293, count=65536, buf=B016899-N000001-


Any help is much appreciated, thanks in advance!
T.

Attachment: cygcheck.REDACTED.txt
Description: Text document

