Mahout, mail # user - Web Crawler to Mahout


Web Crawler to Mahout
Pat Ferrel 2012-04-16, 17:08
I've heard of Bixo but did not realize it had gotten such good
traction. I will definitely grab it and take a deeper look. Thanks.

On 4/13/12 7:49 PM, Ted Dunning wrote:
> Nutch has scalability limits that other crawlers are able to avoid, so it
> isn't quite as fashionable lately.
>
> Ken Krugler's work with Common Crawl and Bixo is a bit more current.
>
> On Fri, Apr 13, 2012 at 6:36 PM, Pat Ferrel <[EMAIL PROTECTED]> wrote:
>
>> Thanks, I'll check that out.
>>
>> Actually it was pretty easy to write a custom
>> SequenceFilesFromDirectoryFilter. I'm just a little surprised no one is
>> using crawled data from Nutch already.
>>
>> On 4/13/12 4:22 PM, Peyman Mohajerian wrote:
>>
>>> One solution is to use Solr, which integrates nicely with Nutch. Read data
>>> off Solr using SolrReader API.
>>>
>>> On Fri, Apr 13, 2012 at 7:03 AM, Pat Ferrel <[EMAIL PROTECTED]> wrote:
>>>
>>>> I'd like to use Nutch to gather data to process with Mahout. Nutch
>>>> creates parsed text for the pages it crawls. Nutch also has several
>>>> command-line tools to turn the data into a text file (readseg, for
>>>> instance). The tools I've found either create one big text file with
>>>> markers in it for records or allow you to get one record from the big
>>>> text file. Mahout expects a sequence file or a directory full of text
>>>> files and includes at least one special-purpose reader for Wikipedia
>>>> dump files.
>>>>
>>>> Does anyone have a simple way to turn the Nutch data into sequence
>>>> files? I'd ideally like to preserve the URLs for use with named vectors
>>>> later in the pipeline. It seems a simple tool to write, but maybe it's
>>>> already there somewhere?
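
One way to bridge the gap described in the original question, sketched in Python rather than as a Mahout class: split the single big readseg text dump into one text file per page, so that Mahout's directory-of-text-files input path can pick it up. This is only a rough sketch under stated assumptions. It assumes each record in the dump starts with a `Recno::` line and carries `URL::` and `ParseText::` markers; check that against the output of your Nutch version. The function names `parse_dump` and `write_text_files` are made up for illustration, and naming each file after the percent-encoded URL is just one possible way to keep the URL recoverable later.

```python
import re
from pathlib import Path
from urllib.parse import quote

# ASSUMED dump layout (verify against your own readseg output):
#   Recno:: 0
#   URL:: http://example.com/
#   ...other sections...
#   ParseText::
#   the parsed page text
RECORD_START = re.compile(r"^Recno:: \d+[ \t]*$", re.MULTILINE)
SECTION = re.compile(r"^([A-Za-z]+)::[ \t]*(.*)$")


def parse_dump(dump_text):
    """Yield (url, parsed_text) for each record that has both fields."""
    for chunk in RECORD_START.split(dump_text):
        url, text_lines, in_text = None, [], False
        for line in chunk.splitlines():
            match = SECTION.match(line)
            if match:
                # A "Name::" line starts a new section of the record.
                name, rest = match.group(1), match.group(2)
                in_text = name == "ParseText"
                if name == "URL":
                    url = rest.strip()
                elif in_text and rest:
                    text_lines.append(rest)
            elif in_text:
                text_lines.append(line)
        text = "\n".join(text_lines).strip()
        if url and text:
            yield url, text


def write_text_files(records, out_dir):
    """Write one UTF-8 text file per record, named after the encoded URL."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for url, text in records:
        # Percent-encode everything unsafe so the URL is a legal file name;
        # urllib.parse.unquote reverses this later in the pipeline.
        path = out / quote(url, safe="")
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths
```

The section matching is a simple heuristic: a line of page text that itself looks like `Word:: ...` would be mistaken for a section marker, and very long URLs can exceed file-name length limits, so a real tool would need to handle both. The resulting directory could then be handed to Mahout's seqdirectory step, which, as far as I understand it, derives each sequence-file key from the file name, so the encoded URL would travel along with the text.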