[lug] Web crawler advice

Jason Vallery jason at vallery.net
Mon May 5 10:21:47 MDT 2008

Hi Gordon,

I did something similar, harvesting content with RSS feeds as my
source.  For my application I started with the fantastic PHP Sphider



On Mon, May 5, 2008 at 10:18 AM,  <gordongoldin at aim.com> wrote:
>  I'm doing a project to analyze text content on the web.
>  I need to:
>  start with a list of URLs
>  for each URL in the URL list
>     fetch the page
>     throw away non-English pages
>     extract the sentence text content (not hidden text, menus, lists, etc.)
>        write that content to a file
>     extract all the links
>        add just the new links to the URL list (not those already in the list)
>  I could just use Java, but then I would have to write everything.
>  Beautiful Soup (written in Python) would probably work well to parse the
> pages, but I don't see that it can fetch pages.
>  I can't tell to what extent Nutch can parse the pages. I know it can give
> me the links, but I don't know if it can extract just the text I care about.
>  Gordon Golding
> _______________________________________________
>  Web Page:  http://lug.boulder.co.us
>  Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
>  Join us on IRC: lug.boulder.co.us port=6667 channel=#colug
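
For what it's worth, the loop described in the quoted message can be sketched in Python using only the standard library. Everything named here (PageParser, looks_english, crawl, STOPWORDS, the SKIP tag set) is made up for illustration, and the stop-word test is only a crude stand-in for real language detection. Beautiful Soup could replace the parser class; the fetching is done with urllib, which ships with Python.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

# Assumption: a handful of common English words is enough for a rough check.
STOPWORDS = {"the", "and", "of", "to", "in", "is", "that", "it", "for", "with"}


class PageParser(HTMLParser):
    """Collects every href plus the text found inside <p> elements."""

    # Assumption: text inside these containers is not "sentence" content.
    SKIP = {"script", "style", "nav", "ul", "ol", "select"}

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p":
            self._in_p = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)  # links are kept even inside menus

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def looks_english(text, threshold=0.1):
    """Crude heuristic: share of words that are common English stop words."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for word in words if word in STOPWORDS)
    return hits / len(words) >= threshold


def fetch(url):
    """Download one page and decode it to text."""
    with urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl(seeds, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl; returns {url: extracted text} for English pages."""
    queue = deque(seeds)
    seen = set(seeds)  # every URL ever queued, so only new links are added
    results = {}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        try:
            html = fetch_page(url)
        except OSError:
            continue  # unreachable page: skip it
        parser = PageParser()
        parser.feed(html)
        text = " ".join(parser.text_parts)
        if not looks_english(text):
            continue  # throw away non-English pages, links included
        results[url] = text
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return results
```

Calling crawl(["http://lug.boulder.co.us"]) would return a dict that can then be written out one file per URL. A real crawler would also want to honor robots.txt and pause between requests, which this sketch leaves out.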

Jason Vallery
jason at vallery.net

mobile: +1.720.352.8822
home: +1.303.993.3712
web: http://vallery.net/
