bew
 

What is it?

A webcrawler/robot for mirroring entire web sites or web trees ("bew" is "web" mirrored, ain't I clever?). Just give it a URL and it fetches everything underneath it. It uses the HEAD mechanism to download only files that have changed, and as a helpful side-effect it can tell you about broken links in a site. It's fairly similar to the Perl webcopy command written by Víctor Parada. Bew is also an incredibly uncommon last name for Americans. Go figure. If you want something more stable but don't need the recursion, see findex.
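The "only fetch what changed" idea works roughly like this: issue a HEAD request, read the Last-Modified header, and compare it against the mtime of the local mirror copy. Here is a minimal sketch of that decision in Python (bew itself is a Perl script that shells out to lynx; the function and file names here are hypothetical, for illustration only):

```python
# Sketch of a HEAD-based "fetch only if changed" check, the idea bew uses.
# Names here (needs_fetch, mirror/index.html) are made up for illustration.
import os
from email.utils import parsedate_to_datetime

def needs_fetch(head_headers: dict, local_path: str) -> bool:
    """Return True if the remote copy looks newer than the local mirror copy."""
    if not os.path.exists(local_path):
        return True  # never mirrored: must fetch
    last_mod = head_headers.get("Last-Modified")
    if last_mod is None:
        return True  # no Last-Modified (e.g. Apache .shtml): re-fetch every time
    remote_ts = parsedate_to_datetime(last_mod).timestamp()
    return remote_ts > os.path.getmtime(local_path)

# Example: headers as parsed from an HTTP HEAD response
headers = {"Last-Modified": "Wed, 01 Sep 1999 12:00:00 GMT"}
print(needs_fetch(headers, "mirror/index.html"))  # prints True when no local copy exists yet
```

Note this is exactly why the server must implement HEAD (bug #9 below), and why .shtml files get re-fetched every run: Apache sends no Last-Modified for them, so the check can never say "unchanged".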

Features:

License:

This software is essentially free, but please read my payment spiel.
Please read the full license.

Download:

It's a single Perl script.

Documentation?

Maybe someday. Until then, either read the source or check the usage with the '-h' flag.

Requires:

  1. Perl, which kicks ass
  2. Lynx, a text browser (or rewrite to use lwp-request or somesuch)

Install

It's just a Perl script. No install required.

Revision History:

See the CHANGELOG

Freshmeat?

You bet.

Known Bugs

Plenty. This is a work in progress. For a current list of bugs, see the source. For 0.89a:

# Doesn't handle subdomains very well.  bob.steve.com goes into steve.com
#
# Bugs - the following files won't be fetched:
# 1) Any files that aren't linked to
# 2) CGI scripts (unless the source is linked somewhere)
# 3) Any files only accessed by CGI scripts
# 4) Any files only loaded by java/javascript (such as by mouseover, etc..)
# 5) Schemes other than http:// (such as ftp://, gopher://)
# 6) Weird stuff:  <img dynsrc=..>, style sheets, ...
# 7) Misses 'README' and 'HEADER' files in directory indexes, and anything
#    else the server doesn't want us to see (IndexIgnore in Apache:srm.conf)
# 8) bob.com:80/page and bob.com/page will be treated as different pages
# 9) Requires that the server implements the HEAD mechanism
# 10) Trimming /www./ isn't good for -www because it means we falsely
#     see external links as NOT FOUND (i.e.:  www.yahoo.com)
#
# Apache doesn't give 'Last-Modified' information for .shtml files, so
# we have to load them every time.
#
# Ignores robots.txt, while it *is* a web crawler, the first step must be
# manual and is assumed to be intended.
#
# Also:
#   Assumes that http://domain.com/dir/ == http://domain.com/dir/index.html
#   Doesn't ignore <!-- html comments -->    (probably for the best?)
#   Ignores <base=..> setting
#   Doesn't know about symbolic links on the server, may fetch multiple
#     copies of files, or even get stuck in infinite loops
#
# Weird problem - say you want everything inside of bob.com/a/b but the
# start page is bob.com/a/b/b.html and bob.com/a/b is unreadable:
# 1)  bew -R bob.com/a/b         <- won't work, can't read start page
# 2)  bew -R bob.com/a/b/b.html  <- won't work, only gets b.html page
# 3)  bew -R bob.com/a/b bob.com/a/b/b.html      <- works