What is it?

A web crawler/robot for mirroring entire web sites or web trees ("bew" is "web" mirrored, ain't I clever?). Just give it a URL and it gets everything underneath it. It also uses the HTTP HEAD mechanism to download only changed files if you already have the mirror, and it can tell you about broken links in a site as a helpful side effect. It's fairly similar to the Perl webcopy command written by Víctor Parada.
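The changed-file check can be pictured roughly like this. The sketch is Python rather than the script's own Perl, and it is not taken from the bew source: the function name `needs_download` is made up for illustration, and comparing `Last-Modified` against the local file's modification time is an assumption about how such a check could work. Re-fetching when the header is missing matches the .shtml note under Known Bugs.

```python
import os
from email.utils import parsedate_to_datetime


def needs_download(last_modified, local_path):
    """Decide whether a mirrored file should be fetched again.

    last_modified: value of the Last-Modified header from a HEAD
                   response, or None if the server did not send one.
    local_path:    where the mirrored copy lives on disk.
    """
    # No local copy yet: always fetch.
    if not os.path.exists(local_path):
        return True
    # No Last-Modified header: we cannot tell, so fetch every time.
    if not last_modified:
        return True
    try:
        remote_time = parsedate_to_datetime(last_modified).timestamp()
    except (TypeError, ValueError):
        # Unparseable date: treat it like a missing header.
        return True
    # Fetch only if the server copy is newer than the local copy.
    return remote_time > os.path.getmtime(local_path)
```

The header value would come from whatever issues the HEAD request (Lynx, lwp-request, or anything else); the sketch leaves that part out on purpose.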

"Bew" is also an incredibly uncommon last name for Americans. Go figure.

If you want something more stable but don't need the recursion, see findex.



This software is essentially free, but please read my payment spiel.
Please read the full license.


It's a single Perl script.


There is no separate documentation yet. Maybe someday. Until then, either read the source or check the usage with the '-h' flag.


It needs two things to run:

  1. Perl, which kicks ass
  2. Lynx, a text browser (or rewrite the script to use lwp-request or somesuch)


It's just a Perl script. No install required.

Known Bugs

Plenty. This is a work in progress. For a current list of bugs, see the source. For 0.89a:

# Doesn't handle subdomains very well. goes into
# Bugs - the following files won't be fetched:
# 1) Any files that aren't linked to
# 2) CGI scripts (unless the source is linked somewhere)
# 3) Any files only accessed by CGI scripts
# 4) Any files only loaded by java/javascript (such as by mouseover, etc..)
# 5) Schemes other than http:// (such as ftp://, gopher://)
# 6) Weird stuff:  <img dynsrc=..>, style sheets, ...
# 7) Misses 'README' and 'HEADER' files in directory indexes, and anything
#    else the server doesn't want us to see (IndexIgnore in Apache:srm.conf)
# 8) and will be treated as different pages
# 9) Requires that the server implements the HEAD mechanism
# 10) Trimming /www./ isn't good for -www because it means we falsely
#     see external links as NOT FOUND (i.e.:
# Apache doesn't give 'Last-Modified' information for .shtml files, so
# we have to load them every time.
# Ignores robots.txt. While it *is* a web crawler, the first step must be
# manual and is assumed to be intended.
# Also:
#   Assumes that ==
#   Doesn't ignore <!-- html comments -->    (probably for the best?)
#   Ignores <base href=..> setting
#   Doesn't know about symbolic links on the server, may fetch multiple
#     copies of files, or even get stuck in infinite loops
# Weird problem - say you want everything inside of but the
# start page is and is unreadable:
# 1)  web-R         <- won't work, can't read start page
# 2)  web-R  <- won't work, only gets b.html page
# 3)  web-R      <- works
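Several of the items above fall out naturally from the way a simple crawler picks links off a page. Here is a minimal sketch in Python (not the script's Perl, and not its actual code). The names `LinkCollector` and `links_to_fetch`, the example.com URLs, and the plain prefix test for "underneath" are all assumptions made for illustration. It looks only at `href` and `src` attributes, keeps only `http://` links, and never consults `<base>`, so it misses `dynsrc`, anything loaded by script, and other schemes in the same way the list describes. One difference: Python's `HTMLParser` skips links inside HTML comments, which the list says this script does not.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect href/src attribute values from any tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Only href and src are examined; dynsrc, onmouseover
            # handlers, and the like are never seen.
            if name in ("href", "src") and value:
                self.links.append(value)


def links_to_fetch(page_url, html, start_url):
    """Return absolute http URLs on a page that sit underneath start_url."""
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for link in collector.links:
        # Resolved against the page's own URL; <base> is not consulted.
        absolute = urljoin(page_url, link)
        if urlsplit(absolute).scheme != "http":
            continue  # ftp://, gopher://, mailto:, ...
        # Naive "underneath" test: a plain string prefix match.
        if absolute.startswith(start_url) and absolute not in found:
            found.append(absolute)
    return found


page = (
    '<a href="b.html">b</a>'
    '<img src="pics/x.gif">'
    '<img dynsrc="movie.avi">'
    '<a href="ftp://example.com/a/file.zip">zip</a>'
    '<a href="http://example.com/other/c.html">c</a>'
)
print(links_to_fetch("http://example.com/a/index.html", page,
                     "http://example.com/a/"))
# ['http://example.com/a/b.html', 'http://example.com/a/pics/x.gif']
```

The `dynsrc` movie, the ftp link, and the page outside the starting tree are all dropped without any warning, which is why files reachable only that way never end up in the mirror.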