[lug] Web crawler advice
nate at natetech.com
Mon May 5 22:11:44 MDT 2008
On May 5, 2008, at 7:06 PM, Bear Giles wrote:
> karl horlen wrote:
>> Can you say more about how you detect that people are leeching your
>> site content and how you prevent it. For instance what specific
>> rewrite rules or other techniques do you use to help defeat this
>> type of behavior?
> One standard technique is to look at the REFERER (sic) header. It
> contains the URL of the page referring to the graphic/page/whatever.
> Like all headers it's trivially manipulated by a knowledgeable
> person, but it's a good approach for the casual user.
> It's a little confusing at first. Say you're "pooh at woods.com" and
> you visit the page "badbear.com/lunch.html" that contains a link to
> the image honeypot.com/daisy.jpg. The server at honeypot.com will
> see a "remote addr" of woods.com and a REFERER header of
> It can then decide what to do. Many sites block deep linking by
> checking the REFERER and blocking queries from outside of its own
> domain. More casual approaches would redirect queries with a REFERER
> link from specific blacklisted domains.
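For anyone who wants the logic spelled out, here is a minimal Python sketch of the check described above. It is not what my server actually runs (that is done with Apache rewrite rules); the names OWN_HOSTS, BLACKLIST, and decide() are made up for illustration, and the host list simply reuses the honeypot.com example. It treats a missing REFERER as allowed, which is an assumption, since plenty of clients and proxies never send the header.

```python
from urllib.parse import urlsplit

# Hypothetical configuration, for illustration only.
OWN_HOSTS = {"honeypot.com"}
BLACKLIST = {"myspace.com", "blogspot.com", "livejournal.com"}


def referer_host(referer):
    """Return the lowercase host named in a Referer value, or '' if none."""
    if not referer:
        return ""
    return (urlsplit(referer).hostname or "").lower()


def matches(host, domains):
    """True if host is one of the domains or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def decide(referer, strict=True):
    """Return 'allow', 'block', or 'redirect' for a request.

    strict=True  blocks every referrer outside our own domain.
    strict=False only acts on blacklisted domains (the casual approach).
    """
    host = referer_host(referer)
    if not host:
        return "allow"      # assumption: no usable Referer means let it through
    if matches(host, BLACKLIST):
        return "redirect"   # send them somewhere else entirely
    if matches(host, OWN_HOSTS):
        return "allow"
    return "block" if strict else "allow"
```

With this sketch, decide("http://badbear.com/lunch.html") blocks the request in strict mode and allows it in casual mode, while anything referred from a blacklisted domain gets redirected either way. Since the header is trivially forged, treat this as a deterrent for casual users and nothing more.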
Yep, that's how I found it. I could care less about "casual" deep-
linking to my personal site, but when you're getting bombarded by the
crappy MySpace stuff (and the browser sends the REFERER stuff
correctly) it's pretty obvious... the web server logs are pounded.
I've since sent not only myspace referrals but also blogspot and
livejournal to the bit-bucket. Could care less if people linking from
those sites see what they want to see on my pages.
I even had a guy COMPLAIN that he had been SELLING people "custom
MySpace pages" that included deep-links to my site, and that I had
"broke" them. What a tard.
I suppose I could have turned that into an opportunity of some kind,
but I just replied saying he was welcome to find the same funny photos
and things I had on my webserver out on the net and host them on his
own webservers to deal with the crushing load he'd put on a box on a
residential connection, that was never meant to service half of the
world's MySpace teenie boppers saying, "Dude - UR sooo HOOTTT!" to
some girl they don't know.
I have stuff I don't even know for sure is not copyrighted, up on the
blog... I would never make a buck on any of it. It's just posted as a
"ha-ha funny" type of thing on my blog pages and I always copy it down
(to save their server from load) and give credit for where it was
"found" with a link, if it wasn't e-mailed to me.
Anyway... since someone else shared, I redirect them to this:
[Of course, publishing2 appears to have problems of their own...]
And the graphic comes from this article:
Where there's bitching about MySpace, talk of some anti-MySpace site
called "LostCherry", and then even more bitching about Digg "burying"
the "Lost Cherry Story"...
Basically, I redirect the cesspool back to the cesspool, I figure.
Plus it just continues the "controversy chain" ad nauseam. Might as
well. These sites love this kind of crap. More traffic to claim to
their advertisers, else they wouldn't have a business model.
The ADD Poster Children who don't understand HTML or browsers who want
to "investigate" why they're getting a "new" graphic some way they
don't understand, end up chasing around wondering who publishing2 is,
find the article, and say "ooh, shiny!" and dive into the comment
sections of publishing2, LostCherry, MySpace and Digg to continue the
Of course, it's a never-ending game. I wonder how many redirects from
Apache a browser will follow before it gives up. Might be fun to
redirect to a pool of high-bandwidth servers in a circular redirect,
where one hands to the other, which hands to a third, which hands back
to the original... but I'm not THAT evil. If the browsers don't stop
the chain, and I bet they don't... you could probably lock up
someone's browser bad enough that they would have to close all of
their tabs and start over. Imagine that happening in an image link on
some doofus's MySpace page.
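To make the circular idea concrete without pointing it at anybody, here is a small Python simulation of a client chasing a redirect ring. Nothing here touches the network: the redirects table, the server names, and the follow() function are all hypothetical, and the cap of 20 hops is an assumption picked for illustration, not a measured browser setting. HTTP clients typically do enforce some cap like this, which is what would end the game.

```python
def follow(start, redirects, max_hops=20):
    """Follow redirects from start until a final page or the hop cap.

    redirects maps a URL to the URL it redirects to; a URL that is not
    in the table is treated as a final page.
    Returns (last_url, hops_taken, status).
    """
    url, hops = start, 0
    while url in redirects:
        if hops == max_hops:
            return url, hops, "gave up"   # cap reached, stop chasing
        url = redirects[url]
        hops += 1
    return url, hops, "ok"


# Three made-up servers handing the request around in a circle.
ring = {
    "http://a.example/img.jpg": "http://b.example/img.jpg",
    "http://b.example/img.jpg": "http://c.example/img.jpg",
    "http://c.example/img.jpg": "http://a.example/img.jpg",
}

print(follow("http://a.example/img.jpg", ring))
```

A client with a cap like this burns a fixed number of requests per image and then shows a broken image, so the ring costs the servers in the pool about as much as it costs the visitor.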
Game over. He who dies with the most bandwidth wins.
nate at natetech.com