Results 91 to 100 of 133 (0.089s).
Re: Getting html pages through a Nutch crawl (for a dataset) - Nutch - [mail # user]
...If you need the URLs, then yes, you just need to further process that file.  If you need the content of those HTML files, then I'm not sure how to do that.  On Monday, January 23, 2...
   Author: remi tassing, 2012-01-23, 16:17
Re: Getting html pages through a Nutch crawl (for a dataset) - Nutch - [mail # user]
...Hi,  in your output directory, you should see two files: 1. .part-00000.crc 2. part-00000  Open the second one with a text editor and you should be able to see the crawled URLs. Per...
   Author: remi tassing, 2012-01-23, 14:32
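The "further processing" of that file mentioned in this thread could look roughly like the following Python sketch. It assumes the dump is plain text in which each record starts with the URL followed by whitespace; the function name `extract_urls` and the sample lines are hypothetical, and the real layout of part-00000 may differ between Nutch versions.

```python
from typing import Iterable, List


def extract_urls(lines: Iterable[str]) -> List[str]:
    """Collect the URL at the start of each record line.

    Assumption: a record line starts with "http://" or "https://" and the
    URL ends at the first whitespace character. All other lines are ignored.
    """
    urls = []
    for line in lines:
        if line.startswith(("http://", "https://")):
            urls.append(line.split()[0])
    return urls


# Hypothetical sample in the assumed layout (not real crawl output).
sample = [
    "http://www.example.com/\tVersion: 7\n",
    "Status: 2 (db_fetched)\n",
    "\n",
    "http://www.example.com/about.html\tVersion: 7\n",
    "Status: 1 (db_unfetched)\n",
]

print(extract_urls(sample))
```

To run it on a real dump, pass an open file object instead of the sample list, for example `extract_urls(open("part-00000", encoding="utf-8"))`.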
Re: Getting html pages through a Nutch crawl (for a dataset) - Nutch - [mail # user]
...Hi Sameendra,  read this page:  http://wiki.apache.org/nutch/bin/nutch_readdb  For instance, the following command will read your database and output the crawled URLs to the d...
   Author: remi tassing, 2012-01-23, 06:44
Re: concurrent Nutch instances in parallel - Nutch - [mail # user]
...Thanks Markus! I'll merge segments for now and try Hadoop when it gets more serious. Remi  On Sunday, January 22, 2012, Markus Jelsma  wrote: quite look...
   Author: remi tassing, 2012-01-22, 16:47
concurrent Nutch instances in parallel - Nutch - [mail # user]
...Hi, is it safe to run concurrent instances of Nutch on different machines and just merge the segments later on?  I believe Hadoop is recommended for this purpose, but I'm not ready to l...
   Author: remi tassing, 2012-01-21, 16:04
Re: Partly remove already crawled urls - Nutch - [mail # user]
...The main purpose is to remove URLs matching a certain pattern from the Nutch segments (or database).  Remi  On Thursday, January 19, 2012, Lewis John Mcgibbney  wrote: wrote: m...
   Author: remi tassing, 2012-01-19, 20:26
Re: Regex help - exclude a url - Nutch - [mail # user]
...Your homepage is probably http://www.homepage.com/index.html, so try -^http://www.homepage.com/index.html +^http://www.homepage.com  On Thursday, January 19, 2012, Dean Del Ponte  ...
   Author: remi tassing, 2012-01-19, 17:14
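The order of the two rules in this reply matters, because Nutch's regex URL filter applies rules top to bottom and the first matching rule decides. The Python sketch below only simulates that behaviour to illustrate the idea; `parse_rules` and `accepts` are hypothetical helper names, not part of Nutch, and Python's `re.search` stands in for the Java regex matching that Nutch actually uses.

```python
import re
from typing import List, Tuple

# A rule is (allow, compiled pattern): "+" means accept, "-" means reject.
Rule = Tuple[bool, re.Pattern]


def parse_rules(text: str) -> List[Rule]:
    """Parse lines in the '+regex' / '-regex' style of regex-urlfilter.txt."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules: List[Rule], url: str) -> bool:
    """First matching rule wins; a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# The two rules suggested in the reply, in the same order.
rules = parse_rules(
    "-^http://www.homepage.com/index.html\n"
    "+^http://www.homepage.com\n"
)

for url in (
    "http://www.homepage.com/index.html",
    "http://www.homepage.com/products.html",
):
    print(url, accepts(rules, url))
```

If the two rules were swapped, the broader `+` rule would match first and the `-` rule for index.html would never take effect.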
Re: Partly remove already crawled urls - Nutch - [mail # user]
...Please advise on a maintenance tool for Nutch.  I heard of Luke for Solr, I'll try it.  Remi  On Thu, Jan 19, 2012 at 4:00 PM, Lewis John Mcgibbney  wrote:  ...
   Author: remi tassing, 2012-01-19, 14:19
Partly remove already crawled urls - Nutch - [mail # user]
...Hi,  Let's say my filters in regex-urlfilter.txt weren't well written and I crawled outside my wanted boundaries. Now I noticed it and want to remove those URLs.  What would you re...
   Author: remi tassing, 2012-01-19, 13:43
Re: invalid uri with "three dots" - Nutch - [mail # user]
...I posted a question on this JIRA: https://issues.apache.org/jira/browse/HTTPCLIENT-858?focusedCommentId=13188481#comment-13188481   It looks like the same problem.  On Tue, Jan 17, 2...
   Author: remi tassing, 2012-01-18, 14:51