web
# Doesn't handle subdomains very well: bob.steve.com goes into steve.com
# (see the grouping sketch below).
#
# Bugs - the following files won't be fetched:
#  1) Any files that aren't linked to
#  2) CGI scripts (unless the source is linked somewhere)
#  3) Any files only accessed by CGI scripts
#  4) Any files only loaded by Java/JavaScript (such as by mouseover, etc.)
#  5) Schemes other than http:// (such as ftp:// or gopher://)
#  6) Weird stuff: <img dynsrc=..>, style sheets, ...
#  7) Misses 'README' and 'HEADER' files in directory indexes, and anything
#     else the server doesn't want us to see (IndexIgnore in Apache's srm.conf)
#  8) bob.com:80/page and bob.com/page will be treated as different pages
#     (see the URL-normalization sketch below)
#  9) Requires that the server implement the HEAD method
# 10) Trimming /www./ isn't good for -www because it makes us falsely
#     report external links as NOT FOUND (e.g. www.yahoo.com)
#
# Apache doesn't give 'Last-Modified' information for .shtml files, so we
# have to load them every time (the HEAD sketch below shows the check).
#
# Ignores robots.txt; while this *is* a web crawler, the first step must
# be manual and is assumed to be intentional.
#
# Also:
#  Assumes that http://domain.com/dir/ == http://domain.com/dir/index.html
#  Doesn't ignore <!-- html comments --> (probably for the best?)
#  Ignores the <base href=..> setting
#  Doesn't know about symbolic links on the server, so it may fetch
#  multiple copies of files, or even get stuck in infinite loops
#
# Weird problem - say you want everything inside of bob.com/a/b, but the
# start page is bob.com/a/b/b.html and bob.com/a/b itself is unreadable:
#  1) web -R bob.com/a/b                     <- won't work, can't read start page
#  2) web -R bob.com/a/b/b.html              <- won't work, only gets b.html
#  3) web -R bob.com/a/b bob.com/a/b/b.html  <- works
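To make the subdomain issue concrete: if the crawler groups pages by only the last two labels of the hostname, bob.steve.com and steve.com fall into the same bucket. A minimal sketch in Python (the tool's internals aren't shown here, so site_key is a hypothetical stand-in, not the tool's actual function):

    def site_key(host):
        # Keep only the last two dot-separated labels, so
        # 'bob.steve.com' and 'steve.com' both map to 'steve.com'.
        return ".".join(host.split(".")[-2:])

    assert site_key("bob.steve.com") == site_key("steve.com") == "steve.com"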
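Bug 8 and the dir/index.html assumption are both URL-equivalence problems: without canonicalization, two spellings of the same resource compare unequal, or the same page gets fetched under two names. One way to normalize, sketched in Python; normalize is a hypothetical helper, not part of the tool:

    from urllib.parse import urlsplit, urlunsplit

    def normalize(url):
        # Collapse equivalent spellings of the same http URL: drop an
        # explicit default port (bob.com:80 -> bob.com) and expand a
        # bare directory to its index page (/dir/ -> /dir/index.html).
        parts = urlsplit(url)
        netloc = parts.netloc
        if parts.scheme == "http" and parts.port == 80:
            netloc = parts.hostname
        path = parts.path or "/"
        if path.endswith("/"):
            path += "index.html"
        return urlunsplit((parts.scheme, netloc, path, parts.query, ""))

    assert normalize("http://bob.com:80/page") == normalize("http://bob.com/page")
    assert normalize("http://domain.com/dir/") == normalize("http://domain.com/dir/index.html")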
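Bug 9 and the .shtml note share one mechanism: before re-fetching a page, the crawler asks for its headers with a HEAD request and compares Last-Modified. A sketch of that check using Python's standard urllib (an assumption about how such a check could look, not the tool's actual code):

    import urllib.request

    def last_modified(url):
        # HEAD returns only the headers. A server that doesn't implement
        # HEAD (bug 9), or omits Last-Modified (Apache for .shtml pages),
        # forces a full, unconditional re-fetch every time.
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            return resp.headers.get("Last-Modified")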