[lug] Web crawler advice

Zan Lynx zlynx at acm.org
Mon May 5 10:36:06 MDT 2008

On Mon, 2008-05-05 at 12:18 -0400, gordongoldin at aim.com wrote:
> I'm doing a project to analyze text content on the web:
> i need to:
> start with a list of URLs
> for each URL in the URL list
>    fetch the page
>    throw away non-English pages
>    extract the sentence text content, (not hidden text, menus, lists,
> etc.)

The place I work at has done a lot of this in the last few months as
part of a web classification project (blocking unwanted sites).

All I can say is, "Good Luck!"  You'll need it.

Just figuring out what is English and what isn't is tough.  Sites in
some places, like Russia (lumping all those Russian-speaking countries
together here), serve pages marked as Latin-1 or Windows-12xx
(whatever) that are really encoded in KOI8-R or KOI-7.
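
A minimal sketch of one way to second-guess the declared charset for
the Russian case, using only the Python standard library.  The function
names and the candidate list are assumptions made for illustration, and
the scoring rule is a crude heuristic: decoding KOI8-R bytes as
Windows-1251 (or the reverse) turns most lowercase letters into
uppercase ones, so the decoding with the most lowercase Cyrillic is
probably the right one.  It will misfire on pages that are not Russian.

```python
# Hypothetical helper: guess which encoding a Russian page really uses,
# ignoring whatever charset the server or <meta> tag claims.
CANDIDATES = ("koi8_r", "cp1251")  # assumed candidate list, not exhaustive


def lowercase_cyrillic_count(text):
    """Count characters in the basic lowercase Cyrillic range."""
    return sum(1 for ch in text if "\u0430" <= ch <= "\u044f")


def guess_cyrillic_encoding(raw):
    """Return the name of the codec that most plausibly decodes `raw`."""
    # Bytes that decode cleanly as UTF-8 are rarely anything else.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Otherwise pick the candidate that yields the most lowercase Cyrillic.
    return max(
        CANDIDATES,
        key=lambda enc: lowercase_cyrillic_count(
            raw.decode(enc, errors="replace")
        ),
    )
```

For example, `guess_cyrillic_encoding("Привет, как дела?".encode("koi8_r"))`
picks `koi8_r` no matter what the page header claims.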

Chinese and Japanese sites are bad about this too.  It doesn't seem to
matter to anyone using a Chinese version of Windows, because it tries
the native charset first, but to us Westerners it comes out as goop.

We did try checking the percentage of English words against a dictionary
but that isn't very reliable either.
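
The dictionary check can be sketched roughly like this.  The tiny word
list and the function name are stand-ins invented for the example; a
real run would use a full dictionary, and as noted above the result is
only a weak signal (short pages, product names and mixed-language pages
all throw it off).

```python
import re

# Assumption: a tiny stand-in word list; a real check would load a
# full English dictionary instead.
COMMON_ENGLISH = frozenset(
    "the be is are was to of and a in that have it for not on with "
    "as you do at this but by from they we or an will".split()
)


def english_ratio(text):
    """Return the fraction of ASCII word tokens found in the word list."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in COMMON_ENGLISH)
    return hits / len(words)
```

A page would then be kept only if `english_ratio(page_text)` is above
some threshold chosen by experiment.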

Fancier sites will auto-detect and serve pages in your language, so make
sure your web spider is sending the right language headers
(Accept-Language).
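
A sketch of setting that header with Python's standard urllib.  The
function names, the User-Agent string and the default language list are
assumptions for illustration, not anything a particular site requires.

```python
import urllib.request


def build_request(url, languages="en-US,en;q=0.9"):
    """Build a request that asks the server for English content."""
    return urllib.request.Request(
        url,
        headers={
            "Accept-Language": languages,
            # Hypothetical spider name; use something that identifies you.
            "User-Agent": "example-spider/0.1",
        },
    )


def fetch(url, timeout=10):
    """Return (body bytes, declared charset, Content-Language header)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        declared = resp.headers.get_content_charset()  # may be None or wrong
        language = resp.headers.get("Content-Language")  # often absent
        return resp.read(), declared, language
```

The declared charset and Content-Language come back alongside the body
so they can be treated as hints rather than trusted outright.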

Just finding out what text is actually visible to the user is very
difficult.  If the site uses AJAX techniques, it is nearly impossible.
If you really want to work on AJAX sites, I'd say the best thing to try
is to write a Firefox plugin to get Firefox to load the site, then scan
the DOM tree for visible objects not overlaid with non-transparent
objects, with contrasting front/back colors, that aren't overridden by
higher level OBJECT or EMBED tags.
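
For static pages, a much cruder approximation than the browser approach
is to parse the HTML and drop the obviously invisible parts.  The sketch
below uses Python's standard html.parser; the class and function names
are made up for the example.  It assumes reasonably well-nested markup,
only looks at inline style attributes and the hidden attribute, and
knows nothing about external stylesheets, scripts, overlays or colors.

```python
from html.parser import HTMLParser

# Elements whose text content is never shown as page text.
SKIP_TAGS = {"script", "style", "noscript", "title", "template"}
# Void elements have no end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def _is_hidden(attrs):
    """True if inline attributes hide the element (a partial check)."""
    for name, value in attrs:
        if name == "hidden":
            return True
        if name == "style" and value:
            css = value.replace(" ", "").lower()
            if "display:none" in css or "visibility:hidden" in css:
                return True
    return False


class VisibleText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or tag in SKIP_TAGS or _is_hidden(attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text chunks that are not obviously hidden."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Anything a page builds with JavaScript after loading is missed entirely,
which is why the browser route is the only real option for AJAX sites.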

Those tags remind me of another problem: Flash and Silverlight and Java
plugins.  There's not much hope of getting text out of those either.

Zan Lynx <zlynx at acm.org>