Wget ignores robots.txt entry

Randall R Schulz rrschulz@cris.com
Fri Feb 14 02:57:00 GMT 2003


Lowell,

What's in your "~/.wgetrc" file? If it contains this:

robots = off

then wget will not respect the "robots.txt" file on the host from which
it is retrieving files.
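
If it helps to check the file mechanically, here is a minimal Python sketch
that scans wgetrc-style text for a "robots" setting. The function name
robots_setting is made up for illustration, and the parsing rules are
assumptions based on the documented "variable = value" wgetrc syntax (names
treated as case-, dash- and underscore-insensitive, "#" comment lines
skipped, the last setting taken as the effective one). It is not wget's own
parser.

```python
import re


def robots_setting(wgetrc_text):
    """Return True/False for the last 'robots' setting found, or None if absent.

    Hypothetical helper: an approximation of wgetrc parsing, not wget's code.
    """
    setting = None
    for line in wgetrc_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue
        # Assumption: command names ignore case, dashes and underscores.
        name = re.sub(r"[-_]", "", name.strip().lower())
        if name == "robots":
            # Assumption: "on", "1" and "yes" mean enabled; anything else is off.
            setting = value.strip().lower() in ("on", "1", "yes")
    return setting


# Example with the directive quoted above.
print(robots_setting("robots = off"))  # prints False
```

To check a real file, pass in the contents of "~/.wgetrc", for example via
pathlib.Path.home() / ".wgetrc". A result of None means no robots line was
found in that text.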

Before I learned of this option (a .wgetrc directive; it can also be given
on the command line as "-e robots=off"), I did something too clever by half
to get robots.txt ignored, so I know that wget does respect it by default.

Randall Schulz


At 18:14 2003-02-13, L Anderson wrote:
>Using the latest of things Cygwin, I downloaded some stuff with wget 
>from <http://cygwin.com> to peruse off-line and noticed a problem I 
>can't explain:
>
>The <http://cygwin.com/robots.txt> file has the entries:
>
>User-agent: *
>Disallow: /snapshots/
>Disallow: /cgi-bin/
>Disallow: /cgi2-bin/
>
>so wget should not download /cgi-bin/.
>
>However, "wget -o cygwincom.log -m -p --no-parent -X /cygwin,/ml 
>http://cygwin.com/" downloads /cgi-bin anyway.
>
>NB. "wget -o cygwincom.log -m -p --no-parent -X /cgi-bin,/cygwin,/ml 
>http://cygwin.com/" doesn't download /cgi-bin.
>
>I ran a validity check on <http://cygwin.com/robots.txt> and found no errors.
>
>Is this a bug in wget or am I doing something wrong?
>
>Thanks,
>
>Lowell Anderson
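
For reference, the robots.txt entries quoted above can be checked
independently of wget. The following is a rough sketch using Python's
standard urllib.robotparser module. The helper name allowed, the agent
string "Wget" and the sample paths are assumptions chosen for illustration.
It shows how a compliant client would read these rules, not how wget
implements them.

```python
from urllib.robotparser import RobotFileParser

# The rules as quoted in the message above.
RULES = """\
User-agent: *
Disallow: /snapshots/
Disallow: /cgi-bin/
Disallow: /cgi2-bin/
"""


def allowed(path, agent="Wget"):
    """Return True if the quoted rules permit `agent` to fetch `path`."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, "http://cygwin.com" + path)


# Hypothetical paths, used only to exercise the rules.
for path in ("/cgi-bin/example", "/snapshots/example", "/faq/"):
    print(path, allowed(path))
```

Under these rules anything below /cgi-bin/ is reported as disallowed, which
is the behaviour to expect from wget when its robots setting is left on.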

