This is the mail archive of the mailing list for the Cygwin project.


Re: Wget ignores robot.txt entry

Randall R Schulz wrote:

What's in your "~/.wgetrc" file? If it contains this:

robots = off

Then wget will not respect a "robots.txt" file on the host from which it is retrieving files.

Before I learned of this option (accessible _only_ via this directive in the .wgetrc file), I did something too clever by half to get robots.txt ignored, so I know that wget does respect it.

I have only two wgetrc-related files: /etc/wgetrc and /usr/doc/wget-1.8.2/sample.wgetrc.


NB: I use win98 and these are under my cygwin directory i:\cygwin (i.e. /cygdrive/i).

I have never changed either file; I just accept the defaults installed by setup. The two files differ by a few lines, but those lines are only comments. That is, doing:

$ diff /etc/wgetrc /usr/doc/wget-1.8.2/sample.wgetrc
< # You can set the default proxy for Wget to use. It will override the
< # value in the environment.
> # You can set the default proxies for Wget to use for http and ftp.
> # They will override the value in the environment.
> #ftp_proxy =

shows this. Moreover,

$ grep robot /etc/wgetrc
# Setting this to off makes Wget not download /robots.txt. Be sure to
# know *exactly* what /robots.txt is and how it is used before changing
#robots = on

shows the only references to "robot" are also comments.
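The grep check above can also be done programmatically. The following is a minimal sketch in Python, not part of wget itself: the helper name active_wgetrc_values is hypothetical, and it assumes the simple "name = value" wgetrc syntax with "#" starting a comment line. It returns only the settings that are actually in effect, skipping commented-out ones.

```python
def active_wgetrc_values(text, key):
    """Return the values of uncommented 'key = value' lines in wgetrc text.

    Assumption: one setting per line, '#' at the start of a line marks a
    comment, and setting names are compared case-insensitively.
    """
    values = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines such as "#robots = on".
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip().lower() == key.lower():
            values.append(value.strip())
    return values


if __name__ == "__main__":
    # The three lines reported by "grep robot /etc/wgetrc" in this message.
    sample = (
        "# Setting this to off makes Wget not download /robots.txt. Be sure to\n"
        "# know *exactly* what /robots.txt is and how it is used before changing\n"
        "#robots = on\n"
    )
    print(active_wgetrc_values(sample, "robots"))
```

An empty result for both /etc/wgetrc and ~/.wgetrc would mean no file-based setting overrides the default.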

The stated default for wget is "robots=on", which I have seen honored for quite a number of other downloads. Since I didn't use "-e robots=off", that can't explain it either. The only other thing I have found that might be related is not under my control, and I haven't yet figured out how to check it. The wget documentation states:

The second, less known mechanism, enables the author of an individual document to specify whether they want the links from the file to be followed by a robot. This is achieved using the META tag, like this:

<meta name="robots" content="nofollow">

This is explained in some detail at <>. Wget supports this method of robot exclusion in addition to the usual /robots.txt exclusion.

Perhaps there is a counterpart to the above, i.e., <meta name="robots" content="follow">, that is being invoked; someone from Red Hat could check into this and rule it out.
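One way to check whether a downloaded page carries such a META tag is to scan its HTML. The sketch below uses Python's standard html.parser module; the class and function names are hypothetical, and it only reports what the page declares. It says nothing about how wget itself weighs a "follow" directive against /robots.txt.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (e.g. <meta name>), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for item in attr.get("content", "").split(","):
                item = item.strip().lower()
                if item:
                    self.directives.append(item)


def robots_meta_directives(html_text):
    """Return the robots META directives found in html_text, in order."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="nofollow"></head></html>'
    print(robots_meta_directives(page))
```

Running it over the saved copies of the pages in question would show whether any robots META tag is present at all.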

Thanks (and still puzzled)!

Lowell Anderson

Randall Schulz

At 18:14 2003-02-13, L Anderson wrote:

Using the latest of things Cygwin, I downloaded some stuff with wget from <> to peruse off-line and noticed a problem I can't explain:

The <> file has the entries:

User-agent: *
Disallow: /snapshots/
Disallow: /cgi-bin/
Disallow: /cgi2-bin/

so wget should not download /cgi-bin/.
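That reading of the rules can be double-checked independently of wget. The following is a minimal sketch using Python's standard urllib.robotparser; it applies generic robots.txt matching to the four lines quoted above and is not wget's own implementation. The user-agent string "Wget" and the sample paths are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt entries quoted in this message.
ROBOTS_TXT = """\
User-agent: *
Disallow: /snapshots/
Disallow: /cgi-bin/
Disallow: /cgi2-bin/
"""


def is_allowed(path, user_agent="Wget"):
    """Return True if the quoted rules permit user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Hypothetical paths, used only to exercise the rules.
    for path in ("/cgi-bin/example", "/snapshots/", "/ml/", "/index.html"):
        print(path, "allowed" if is_allowed(path) else "disallowed")
```

Under these rules, anything below /cgi-bin/ is reported as disallowed for every user agent, which matches the expectation stated above.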

However, "wget -o cygwincom.log -m -p --no-parent -X /cygwin,/ml" downloads /cgi-bin anyway.

NB. "wget -o cygwincom.log -m -p --no-parent -X /cgi-bin,/cygwin,/ml" doesn't download /cgi-bin.

I ran a validity check on <> and found no errors.

Is this a bug in wget or am I doing something wrong?


Lowell Anderson
