Web Crawler to Mahout
Pat Ferrel 2012-04-16, 17:08
Bixo I've heard about, but did not realize it has gotten such good
traction. I will definitely grab it and take a deeper look. Thanks.
On 4/13/12 7:49 PM, Ted Dunning wrote:
> Nutch has scalability limits that other crawlers are able to avoid so it
> isn't quite as fashionable lately.
> Ken Krugler's work with common crawl and Bixo is a bit more current.
> On Fri, Apr 13, 2012 at 6:36 PM, Pat Ferrel <[EMAIL PROTECTED]> wrote:
>> Thanks I'll check that out.
>> Actually it was pretty easy to write a custom
>> SequenceFilesFromDirectoryFilter. I'm just a little surprised no one is
>> using crawled data from Nutch already.
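For reference, such a filter mostly amounts to a path predicate: keep the data files under a segment's parse_text directory and skip the crawl bookkeeping (crawl_fetch, crawl_parse, parse_data, and so on). A standalone sketch of the idea, with the caveat that a real filter would plug into Mahout's SequenceFilesFromDirectory machinery and operate on Hadoop Path objects rather than plain strings (the class name and string-based signature below are illustrative only):

```java
// Sketch of a crawl-output filter: accept only the MapFile data files
// under a Nutch segment's parse_text directory. The class name and the
// String-based signature are illustrative, not Mahout's actual API.
class ParseTextPathFilter {
    static boolean accept(String path) {
        // A segment holds parse_text plus bookkeeping directories
        // (crawl_fetch, crawl_parse, parse_data, ...); only parse_text
        // holds the parsed page text Mahout should ingest.
        return path.contains("/parse_text/") && path.endsWith("/data");
    }
}
```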
>> On 4/13/12 4:22 PM, Peyman Mohajerian wrote:
>>> One solution is to use Solr, which integrates nicely with Nutch. Read data
>>> off Solr using SolrReader API.
>>> On Fri, Apr 13, 2012 at 7:03 AM, Pat Ferrel <[EMAIL PROTECTED]> wrote:
>>>> I'd like to use Nutch to gather data to process with Mahout. Nutch
>>>> stores parsed text for the pages it crawls. Nutch also has several
>>>> command line tools to turn the data into a text file (readseg for
>>>> instance). The tools I've seen either create one big text file with
>>>> markers in it for records or allow you to get one record from the big
>>>> text file. Mahout expects a sequence file or a directory full of text
>>>> files and includes at least one special purpose reader for wikipedia
>>>> dump files.
>>>> Does anyone have a simple way to turn the nutch data into sequence
>>>> files? I'd ideally like to preserve the urls for use with named
>>>> vectors later in the pipeline. It seems a simple tool to write but
>>>> maybe it's already been done.
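In case it helps anyone, the marker-splitting half of such a tool is straightforward. Assuming the dump uses `Recno::` and `URL::` record markers like readseg -dump output (treat those marker names as my assumption), a sketch that breaks the big text file into (url, text) pairs:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split a Nutch readseg-style dump into (url, text) records.
// Assumes each record starts with a "Recno::" line and carries a
// "URL::" line; every other line in the record is treated as page text.
class DumpSplitter {

    static class Record {
        final String url;
        final String text;
        Record(String url, String text) { this.url = url; this.text = text; }
    }

    static List<Record> split(String dump) {
        List<Record> records = new ArrayList<>();
        String url = null;
        StringBuilder text = new StringBuilder();
        for (String line : dump.split("\n")) {
            if (line.startsWith("Recno::")) {
                if (url != null) {                 // flush the previous record
                    records.add(new Record(url, text.toString().trim()));
                }
                url = null;
                text.setLength(0);
            } else if (line.startsWith("URL::")) {
                url = line.substring("URL::".length()).trim();
            } else if (url != null) {
                text.append(line).append('\n');    // body line of current record
            }
        }
        if (url != null) {                         // flush the last record
            records.add(new Record(url, text.toString().trim()));
        }
        return records;
    }
}
```

Writing each pair out is then one writer.append(new Text(rec.url), new Text(rec.text)) per record on a SequenceFile.Writer keyed by Text, which keeps the url around for named vectors later in the pipeline.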