[lug] wget page-requisites
will.sterling at gmail.com
Wed Jan 12 13:57:55 MST 2011
wget is really your best option. curl is much better for grabbing a large
number of files using regular expressions, cookies, forms, etc., but it
needs a wrapper script to mirror a website.
Here is an example from the cURL project.
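(The attached example didn't survive the archive. As a rough stand-in, not the cURL project's script, a wrapper might look something like this; the helper names and URL handling are made up, and it only covers inline images:)

```shell
# Hypothetical wrapper sketch: fetch a page with curl, scrape its <img>
# tags, then fetch each image alongside it -- approximating what wget's
# --page-requisites does in one step.

# extract_img_urls: print the src attribute of every <img> tag on stdin
extract_img_urls() {
  grep -o '<img[^>]*src="[^"]*"' | sed 's/.*src="//; s/"$//'
}

# mirror_page: download one page plus its inline images (network required)
mirror_page() {
  url=$1
  curl -sS -O "$url"                     # the page itself
  curl -sS "$url" | extract_img_urls |   # its inline images
    while read -r img; do
      curl -sS -O "$img"
    done
}
```

A real wrapper would also need to handle relative src paths, stylesheets, and scripts, which is exactly why wget is the easier tool here.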
On Wed, Jan 12, 2011 at 1:05 PM, Davide Del Vento <
davide.del.vento at gmail.com> wrote:
> >> $ wget --page-requisites
> >> (snip)
> >> Downloaded: 1 files, 49K in 0.1s (404 KB/s)
> >> It is not downloading any of the 5 inline images (1 in the header, 4
> >> in the body). What am I doing wrong?
> > It probably has to do with the fact that the images are hosted on
> > other domains, and wget doesn't want to follow them.
> Yes, I was suspecting this.
> > Try the "--span-hosts"
> > option, but you may also want to play with "--convert-links" if you want
> > to view everything locally.
> Thanks. This solves the simple single-page example, but of course life
> is always harder than simple examples. My actual wget is doing a
> --mirror of the whole domain, and adding --span-hosts messes that up.
> What I want is a --span-host that works only for the --page-requisites
> and not for the recursion. It doesn't seem like a weird request at
> all, I want the pages that I am downloading to be complete with their
> requisites (images) even if they are hosted somewhere else, but I
> don't want to recurse the whole web (as it happens if I do a
> span-host). Any ideas?
> I guess I could count the deepest level of the domain I am mirroring,
> and use that as recursion level instead of the infinite that mirror
> uses. But if I get that wrong, I don't mirror the whole site. And then
> I have to continuously maintain that number, which is a pain. And
> then, even if not the whole internet-for-sure I am still downloading
> the world and his dog. This must be possible, mustn't it?
> Using curl or anything else instead of wget is an option, if they are
> more flexible than wget.
> Web Page: http://lug.boulder.co.us
> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: irc.hackingsociety.org port=6667 channel=#hackingsociety
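
One commonly suggested workaround is a two-pass approach: mirror the domain first without --span-hosts, then re-visit each saved page with --page-requisites --span-hosts, which by itself does not recurse, so the span stays contained. A rough sketch, with example.com standing in for the real site (pages saved as index.html for directory URLs won't map back perfectly, so treat this as an outline, not a turnkey script):

```shell
# Pass 1: mirror only the target domain, no cross-host recursion.
mirror_site() {
  wget --mirror --no-parent "http://$1/"
}

# Rebuild each mirrored page's original URL from its path on disk
# (the mirror tree mirrors the URL structure).
page_urls() {
  find "$1" -name '*.html' | sed 's|^|http://|'
}

# Pass 2: fetch just the requisites of each page, other hosts allowed.
# --page-requisites alone does not recurse, so --span-hosts is safe here.
fetch_requisites() {
  while read -r url; do
    wget --page-requisites --span-hosts --no-clobber "$url"
  done
}

# Usage (network required):
#   mirror_site example.com
#   page_urls example.com | fetch_requisites
```

--no-clobber keeps pass 2 from re-downloading the pages themselves; add --convert-links if you want the local copies rewritten to point at the fetched images.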