[lug] ADD-ON to Web crawler advice
gsexton at mhsoftware.com
Mon May 5 15:44:40 MDT 2008
Bear Giles wrote:
> George Sexton wrote:
>> gordongoldin at aim.com wrote:
>>> See question below - can one get only text - to speed up the
>>> text-only search?
>>> To get only English - how reliable is the lang="en" ?
>> You could spot check, but I'm guessing that 99% of the pages don't set
>> it. Charset really won't be helpful. I use UTF-8, so there's no telling
>> from it.
>> I suppose if it's a non-US charset like Windows-1255, or ISO-8859-x
>> with x other than 1, that might be slightly helpful.
> All of the ISO-8859-x have the same ASCII subset so that doesn't help.
Actually it does. ISO-8859-8 does have the same characters in the low
(ASCII) half, but it's fair to assume when you see it that the content of
the page is Hebrew. As you point out, it's not necessarily non-English,
but anyone creating a web page with that encoding is either used to
writing Hebrew pages, or has Hebrew on that page...
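A rough sketch of that heuristic in Python, assuming the crawler already
has the Content-Type header value in hand. The function name script_hint
and the CHARSET_HINTS table are made up for illustration, and the table is
nowhere near complete. A missing or UTF-8 charset yields no hint at all.

```python
from typing import Optional

# Hypothetical lookup table: declared charset -> script it was designed for.
# Only a handful of entries, for illustration.
CHARSET_HINTS = {
    "iso-8859-5": "Cyrillic",
    "iso-8859-6": "Arabic",
    "iso-8859-7": "Greek",
    "iso-8859-8": "Hebrew",
    "windows-1251": "Cyrillic",
    "windows-1255": "Hebrew",
}


def script_hint(content_type: str) -> Optional[str]:
    """Return a likely script for a Content-Type value, or None if no hint."""
    # Skip the media type itself and look only at the parameters.
    for param in content_type.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            charset = value.strip().strip("\"'").lower()
            return CHARSET_HINTS.get(charset)
    return None


print(script_hint("text/html; charset=ISO-8859-8"))
print(script_hint("text/html; charset=UTF-8"))
```

The hint only says which script the author's tools were set up for. The
page text itself may still be plain English.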
> (Remember that ASCII is a 7-bit code, with the high bit clear when
> pushed into an 8-bit character. The ISO-8859-x codes are designed as
> extensions of ASCII, not replacements for it.)
I understand character sets pretty well. The real answer is to use UTF-8
and then you don't have to worry about it. If you fool around with the
ISO-8859- series, then you can't mix scripts (say, Hebrew and Cyrillic)
on the same page.
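To make both points concrete, here is a small Python check using only the
standard codecs. The helper can_encode and the sample strings are made up
for illustration. It shows that plain ASCII text comes out byte-for-byte
identical under the ISO-8859 parts, and that a page mixing Hebrew and
Russian fits in UTF-8 but in neither ISO-8859-8 nor ISO-8859-5.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


hebrew = "hello \u05e9\u05dc\u05d5\u05dd"                  # English + Hebrew
mixed = hebrew + " \u043f\u0440\u0438\u0432\u0435\u0442"  # ... + Russian

# The ASCII half is shared, so plain English encodes identically.
assert "hello".encode("iso-8859-5") == "hello".encode("iso-8859-8") == b"hello"

print(can_encode(hebrew, "iso-8859-8"))  # True
print(can_encode(mixed, "iso-8859-8"))   # False: no Cyrillic in part 8
print(can_encode(mixed, "iso-8859-5"))   # False: no Hebrew in part 5
print(can_encode(mixed, "utf-8"))        # True
```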
> Web Page: http://lug.boulder.co.us
> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: lug.boulder.co.us port=6667 channel=#colug
MH Software, Inc.
Voice: +1 303 438 9585