This is the mail archive of the cygwin mailing list for the Cygwin project.

gawk: Bad File Descriptor error with concurrent readonly access to a network file

We run thousands of tiles through the same time-consuming processing task. We use a multi-core Windows 7 workstation, processing several tiles simultaneously in separate shell windows (parallel processing). A batch script controls the workflow of the task, with gawk interpreting a number of setup / definition files at run time for each tile / working step. From time to time we get "Bad File Descriptor" errors in gawk (and, e.g., in cat, head, tail) when accessing these setup files (they are only ever read). The full error line reads similar to:

(With job.awk and first access to datafile.txt at gawk source line 31:)
   "gawk: job.awk:31: fatal: error reading input file `datafile.txt': Bad file descriptor"
(With inline gawk scripts typically:)
   "gawk: fatal: error reading input file `datafile.txt': Bad file descriptor"
(With something like "cat datafile.txt > destination":)
   "cat: datafile.txt: Bad file descriptor"

We use MS-Windows shell cmd.exe with batch scripts executing the gawk and other commands.
I tried to use gawk's BEGINFILE rule to trap that error. However, the BEGINFILE block is never entered; instead, gawk immediately aborts with the "Bad File Descriptor" error.

I found nothing helpful on the web about this. Several updates to the latest versions over the last few years brought no change in this behaviour.

Isolating and tracking down the problem with the test case included below, I found the following:

1) Concurrent read access to the setup files was possible and worked fine with local files (24 hrs testing with millions of file accesses in 4 parallel jobs).
2) However, when the file to be read (datafile.txt) is stored on a network share on a file server - which is the case in our working environment - the error could be reproduced. The number of Bad file descriptor errors seems to be related to the work load at the server where the file resides.
3) The MS copy command shows no such error, even with network files. So we can substitute copy for cat. For gawk, however, there is no shell alternative.

It looks like there is a small time window while a file is being opened during which the server file is inaccessible to other processes. If a parallel job happens to access the same file within that short period while another process is opening it, the "Bad File Descriptor" error is thrown.
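If the failure really is such a transient window, one mitigation short of copying everything locally is simply to retry the read a few times. Below is only a sketch in POSIX shell (not cmd.exe, and not part of the test case); the function name, retry count, and back-off interval are my own assumptions:

```shell
# read_with_retry FILE [TRIES]
# Hypothetical wrapper: retry `cat` on failure, assuming the
# "Bad file descriptor" condition is transient and clears quickly.
read_with_retry() {
    file=$1
    tries=${2:-5}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # cat succeeds once the server-side open window has passed
        if cat "$file" 2>/dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep 0.2   # short back-off before the next attempt (assumption)
    done
    echo "read_with_retry: giving up on $file after $tries attempts" >&2
    return 1
}
```

In the batch environment the same idea could be expressed with a cmd.exe for /L loop around the cat call, checking %errorlevel% after each attempt.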

I would at least expect such a file-opening error to be delivered to a BEGINFILE rule (as included in the test example) in gawk; beyond that, I had hoped that Cygwin could cope with these situations.
Microsoft obviously is able to cope with them (assuming it is a concurrent file access problem, which I am sure it is), since with copy instead of cat (or gawk) I never experienced such access problems.

Here is the test case I have used. It consists of 3 files:

datafile.txt    A datafile filled with dummy content
chkParallelError.bat     The control job, which has to be started in a cmd.exe shell window; it takes 2 optional parameters:
First parameter: the datafile name, defaulting to datafile.txt (optionally prepend a path if it is stored in a different directory, e.g. a network share).
Second parameter: the number of parallel jobs to run; it defaults to the cmd.exe environment variable NUMBER_OF_PROCESSORS, or to 4 if that is not set. This should be chosen in accordance with the number of cores available (e.g. not exceeding 2 * number of processors).
The scripts use MS Windows 7 cmd.exe shell syntax.

The job chkParallelErrorJob.bat is started as many times as the given number of parallel jobs, in separate shell windows (cmd.exe). Three calls are included that are currently all commented out with rem, namely gawk, cat, and copy (in our case the MS Windows cmd.exe command). To run one of them, erase the respective rem.
Each job creates a logfile "chkParallelError_1.log", "chkParallelError_2.log", etc. in the local directory, to which the output (stderr and stdout) is redirected; these logfiles can be parsed for "Bad". Additionally the output is partly shown in the shell windows. In my environment I experienced roughly one "Bad file descriptor" error every 5 seconds to 10 minutes; if possible, a server that has some workload should be used.
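To tally the errors afterwards, the per-instance logfiles can simply be grepped. A minimal sketch, again in POSIX shell rather than cmd.exe (the helper name is my own):

```shell
# count_bad_fd LOG...
# Sum the "Bad file descriptor" occurrences over the given logfiles.
count_bad_fd() {
    total=0
    for log in "$@"; do
        # grep -c prints 0 when there is no match; missing files count as 0
        n=$(grep -c "Bad file descriptor" "$log" 2>/dev/null || true)
        total=$((total + ${n:-0}))
    done
    echo "$total"
}
```

Usage: count_bad_fd chkParallelError_*.log prints the total number of hits across all instance logs.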

Remark 1: Operating a similar test case with a gawk script file instead of inline source, I experienced occasional "Bad File Descriptor" errors even when accessing the gawk script source itself, if that script was also stored on the network share. With the gawk script stored locally, that error did not occur during 24 hrs of testing.

Remark 2: The attached cygcheck_150924.out was edited: several shell variables, computer names, network shares, etc. were deleted.

Remark 3: Since screen output is produced for each call to gawk, it may help to minimize the shell windows that pop up (parallel processes) in order to speed up the runs so that the errors become more frequent.

Remark 4: For the time being we have a workaround: "copy"-ing the setup files to a local SSD, into separate directories for each parallel process, before they are accessed by any Cygwin tool. This is not convenient, but it works.
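For completeness, the workaround from Remark 4 sketched as POSIX shell (in our batch environment it is done with MS copy; the directory layout and names here are assumptions for illustration):

```shell
# stage_locally INSTANCE FILE...
# Copy the setup files into a private local directory for one parallel
# instance and print that directory, so each job reads only local copies.
stage_locally() {
    instance=$1
    shift
    dest="${TMPDIR:-/tmp}/tilejob_$instance"   # stands in for the local SSD
    mkdir -p "$dest"
    for f in "$@"; do
        cp "$f" "$dest/"
    done
    echo "$dest"
}
```

Each parallel instance then points its gawk/cat calls at the directory printed by stage_locally instead of at the network share.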

=== datafile.txt: ==============================
This is line 1
This is line 2
This is line 3
=== chkParallelError.bat =======================
@echo off
    set lis=%1
    set njobs=%2
    if "%lis%"   == "" set lis=datafile.txt
    if "%njobs%" == "" set njobs=%NUMBER_OF_PROCESSORS%
    if "%njobs%" == "" set njobs=4
    for /L %%I in (1,1,%njobs%) do echo start chkParallelErrorJob %%I %lis%&start chkParallelErrorJob %%I %lis%
=== chkParallelErrorJob.bat ====================
@echo off
    set instance=%1
    set lis=%~dpnx2
    echo instance=%instance% lis=%lis%
    set n=0
rem Loop endlessly calling a gawk script that simply counts the lines
rem Write stdout and stderr to a logfile
rem After each call, write timestamp and number of call to logfile and stdout.
:loop
    set/a n=n+1
rem !!! Clear one of the following rems in order to activate that particular command !!!
rem     gawk 'BEGINFILE{if(ERRNO)print "Trapped error",ERRNO,"opening file";}{n++}END{print "%date% %time% call %n%",n,"entries read"}' %lis%   >> chkParallelError_%instance%.log 2>&1
rem     cat  %lis% > %lis%_%instance%                  2>> chkParallelError_%instance%.log
rem     copy %lis%   %lis%_%instance%                   >> chkParallelError_%instance%.log 2>&1
rem In case of error write a note and the (last line of the) error message
    if %errorlevel% neq 0 echo Error: %errorlevel%&tail -1 chkParallelError_%instance%.log
rem Write timestamp and count mark to logfile and screen
    echo %date% %time% Instance %instance% Call %n% >> chkParallelError_%instance%.log 2>&1
    echo %date% %time% Instance %instance% Call %n% copy
    goto loop

Kind regards,

Attachment: cygcheck_150924.out
