Web Crawler to Mahout
Pat Ferrel 2012-04-16, 17:08
Bixo I've heard about but did not realize it has gotten such good
traction. I will definitely grab it and take a deeper look. Thanks

On 4/13/12 7:49 PM, Ted Dunning wrote:
> Nutch has scalability limits that other crawlers are able to avoid so it
> isn't quite as fashionable lately.
>
> Ken Krugler's work with common crawl and Bixo is a bit more current.
>
> On Fri, Apr 13, 2012 at 6:36 PM, Pat Ferrel<[EMAIL PROTECTED]> wrote:
>
>> Thanks I'll check that out.
>>
>> Actually it was pretty easy to write a custom
>> SequenceFilesFromDirectoryFilter. I'm just a little surprised no one is
>> using crawled data from Nutch already.
>>
>> On 4/13/12 4:22 PM, Peyman Mohajerian wrote:
>>
>>> One solution is to use Solr, which integrates nicely with Nutch. Read data
>>> off Solr using SolrReader API.
>>>
>>> On Fri, Apr 13, 2012 at 7:03 AM, Pat Ferrel<[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> I'd like to use Nutch to gather data to process with Mahout. Nutch creates
>>>> parsed text for the pages it crawls. Nutch also has several cl tools to
>>>> turn the data into a text file (readseg for instance). The tools I've found
>>>> either create one big text file with markers in it for records or allow you
>>>> to get one record from the big text file. Mahout expects a sequence file or
>>>> a directory full of text files and includes at least one special purpose
>>>> reader for wikipedia dump files.
>>>>
>>>> Does anyone have a simple way to turn the nutch data into sequence files?
>>>> I'd ideally like to preserve the urls for use with named vectors later in
>>>> the pipeline. It seems a simple tool to write but maybe it's already there
>>>> somewhere?
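For readers landing on this thread, the "one big text file with markers" problem from the original question can be sketched roughly as follows. This is not the custom SequenceFilesFromDirectoryFilter mentioned above and not an existing Nutch or Mahout tool. It is a minimal Python illustration that splits a marker-delimited dump into one text file per record (the "directory full of text files" form) and keeps a separate file-name-to-URL index so the URLs survive for later use. The marker strings `Recno::` and `URL::` are assumptions about what the dump looks like and should be checked against real output. The function names and the index layout are hypothetical.

```python
import hashlib
from pathlib import Path

# ASSUMPTION: each record starts with a line beginning "Recno::" and carries
# its address on a line beginning "URL::". Adjust to match the actual dump.
RECORD_MARKER = "Recno::"
URL_MARKER = "URL::"


def split_dump(dump_text):
    """Yield (url, text) pairs from a marker-delimited dump string."""
    url, lines = None, []
    for line in dump_text.splitlines():
        if line.startswith(RECORD_MARKER):
            # A new record begins: emit the previous one if it had a URL.
            if url is not None:
                yield url, "\n".join(lines).strip()
            url, lines = None, []
        elif line.startswith(URL_MARKER) and url is None:
            url = line[len(URL_MARKER):].strip()
        elif url is not None:
            # Everything after the URL line is treated as record text.
            lines.append(line)
    if url is not None:
        yield url, "\n".join(lines).strip()


def write_records(records, out_dir, index_path):
    """Write one .txt file per record and a tab-separated name-to-URL index.

    The index is written outside out_dir's file set (at index_path) so a
    directory reader does not mistake it for a document.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    index = {}
    for url, text in records:
        # Hash the URL to get a file name that is safe on any filesystem.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".txt"
        (out / name).write_text(text, encoding="utf-8")
        index[name] = url
    with open(index_path, "w", encoding="utf-8") as f:
        for name, url in sorted(index.items()):
            f.write(f"{name}\t{url}\n")
    return index
```

Something like `write_records(split_dump(open("dump.txt", encoding="utf-8").read()), "docs", "urls.tsv")` would then produce a directory that a directory-to-sequence-file step can consume, with `urls.tsv` available to map document names back to URLs when naming vectors. Hashing the URL for the file name is a design choice made here to avoid characters that are illegal in paths; the index file is what actually preserves the URL.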