b e w
What is it?
A webcrawler/robot for mirroring entire web sites or web trees. ("bew" is
"web" mirrored, ain't I clever?)
Just give it a URL and it gets everything underneath it. It also
uses the HEAD mechanism to fetch only files that have changed, and
it can tell you about broken links in a site as a helpful side-effect.
It's fairly similar to the perl webcopy command written by Víctor Parada.
Bew is also an incredibly uncommon last name for Americans. Go figure.
If you want something more stable but don't need the recursion, see findex.
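The "only fetch changed files" idea can be sketched roughly as follows. This is not bew's code (bew is a perl script that drives Lynx); it is a minimal Python illustration that assumes the server answers HEAD requests with a Last-Modified header. The function names head_last_modified and needs_fetch are made up for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional
import urllib.request


def head_last_modified(url: str) -> Optional[str]:
    """Send a HEAD request and return the Last-Modified header, if any."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.headers.get("Last-Modified")


def needs_fetch(last_modified: Optional[str], local_mtime: Optional[float]) -> bool:
    """Decide whether the remote file should be downloaded again.

    last_modified: raw Last-Modified header value, or None if absent.
    local_mtime:   mtime of the mirrored copy (seconds since the epoch),
                   or None if there is no local copy yet.
    """
    if local_mtime is None:
        return True  # nothing mirrored yet
    if not last_modified:
        return True  # no timestamp from the server: fetch every time
    try:
        remote = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        return True  # unparseable date: play it safe and fetch
    if remote.tzinfo is None:
        remote = remote.replace(tzinfo=timezone.utc)
    local = datetime.fromtimestamp(local_mtime, tz=timezone.utc)
    return remote > local
```

A caller would pass the result of head_last_modified(url) together with os.path.getmtime() of the local copy. When the server sends no Last-Modified header at all, this sketch falls back to fetching the page every time.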
Features:
- Simple, small, and easy to modify or improve.
License:
This software is essentially free, but please read my payment spiel.
Please read the full license.
Download:
It's a single perl script.
Documentation?
Maybe someday. Until then, either read the source or check the
usage with the '-h' flag.
Requires:
- Perl, which kicks ass
- Lynx, a text browser (or rewrite to use lwp-request or somesuch)
Install:
It's just a perl script. No install required.
Revision History:
See the CHANGELOG
Freshmeat?
You bet.
Known Bugs:
Plenty. This is a work in progress. For a current list of bugs,
see the source. For 0.89a:
# Doesn't handle subdomains very well. bob.steve.com goes into steve.com
#
# Bugs - the following files won't be fetched:
# 1) Any files that aren't linked to
# 2) CGI scripts (unless the source is linked somewhere)
# 3) Any files only accessed by CGI scripts
# 4) Any files only loaded by java/javascript (such as by mouseover, etc..)
# 5) Schemes other than http:// (such as ftp:/ gopher:/)
# 6) Weird stuff: <img dynsrc=..>, style sheets, ...
# 7) Misses 'README' and 'HEADER' files in directory indexes, and anything
# else the server doesn't want us to see (IndexIgnore in Apache:srm.conf)
# 8) bob.com:80/page and bob.com/page will be treated as different pages
# 9) Requires that the server implements the HEAD mechanism
# 10) Trimming /www./ isn't good for -www because it means we falsely
# see external links as NOT FOUND (i.e.: www.yahoo.com)
#
# Apache doesn't give 'Last-Modified' information for .shtml files, so
# we have to load them every time.
#
# Ignores robots.txt. While it *is* a web crawler, the first step must be
# manual, so the crawl is assumed to be intended.
#
# Also:
# Assumes that http://domain.com/dir/ == http://domain.com/dir/index.html
# Doesn't ignore <!-- html comments --> (probably for the best?)
# Ignores <base=..> setting
# Doesn't know about symbolic links on the server, may fetch multiple
# copies of files, or even get stuck in infinite loops
#
# Weird problem - say you want everything inside of bob.com/a/b but the
# start page is bob.com/a/b/b.html and bob.com/a/b is unreadable:
# 1) web-R bob.com/a/b <- won't work, can't read start page
# 2) web-R bob.com/a/b/b.html <- won't work, only gets b.html page
# 3) web-R bob.com/a/b bob.com/a/b/b.html <- works
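Bug 8 (bob.com:80/page vs. bob.com/page) and the dir/ == dir/index.html assumption both come down to how URLs are compared. One possible way to canonicalize URLs before comparing them is sketched below in Python. This is an illustration only, not how bew works or a patch for it; the normalize function and its index_name parameter are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the target.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(url: str, index_name: str = "index.html") -> str:
    """Return a canonical form of url for "have we seen this page?" checks."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{port}"
    path = parts.path or "/"
    # Same assumption as listed above: dir/ means dir/index.html.
    if path.endswith("/"):
        path += index_name
    # Drop the #fragment; it never names a different file on the server.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

With this, normalize("http://bob.com:80/page") and normalize("http://bob.com/page") give the same string, so the page would only be recorded once. The index.html guess is still just a guess, since the server may be configured to use a different index file.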
Other Software?
Besides MarginalHacks, you can also find plenty of software at
Free Downloads Center and www.AllWorldSoft.com.