Bai Shen 2012-07-30, 17:12
I set up Nutch 2.x with a new instance of HBase. I ran the following commands.
bin/nutch inject urls
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -all
When looking at the parse log, I'm seeing a bunch of "different batch id" messages. These are all on urls that I did not inject into the database.
Any ideas what's causing this?
Thanks.
Lewis John Mcgibbney 2012-07-30, 17:14
Can you turn on debug logging and see what the batch IDs actually are?
-- Lewis
Bai Shen 2012-07-31, 17:18
Is there a specific place it's located? I turned on debugging, but I'm not seeing a batch id.
alxsss@... 2012-07-31, 17:44
Hi,
Most likely you ran the generate command a few times without running updatedb in between, so each generate run assigned a different batchId to its own set of URLs.
Alex.
Bai Shen 2012-07-31, 18:45
Nope. I ran exactly the listed commands. And like I said, the ones that show a different batch id were urls that I didn't inject. So no idea how they got in there.
Bai Shen 2012-08-02, 12:59
I just tried running this with the actual batch ID instead of using -all, and I'm still getting similar results.
alxsss@... 2012-08-02, 18:47
Hi,
I have found that after
    bin/nutch generate -topN 1000
only 1000 of the URLs are marked with gnmrk.
Then
    bin/nutch fetch -all
skips all URLs that do not have gnmrk, according to this code:
    Utf8 mark = Mark.GENERATE_MARK.checkMark(page);
    if (!NutchJob.shouldProcess(mark, batchId)) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; different batch id (" + mark + ")");
      }
      return;
    }
since shouldProcess(mark, batchId) returns false if mark is null.
Then
    bin/nutch parse -all
skips all URLs that do not have a fetch mark, according to this code:
    Utf8 mark = Mark.FETCH_MARK.checkMark(page);
    String unreverseKey = TableUtil.unreverseUrl(key);
    if (!NutchJob.shouldProcess(mark, batchId)) {
      LOG.info("Skipping " + unreverseKey + "; different batch id");
      return;
    }
This message is logged at INFO level, and these are the entries you see in the log file.
So it seems to me that the -all option of fetch, parse and solrindex does not work as expected.
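The skip rule described above can be tried in isolation. This is only a minimal sketch, not the Nutch source: the class name BatchFilter, the ALL sentinel, and the use of plain String instead of Utf8 are assumptions made for the example. The only part taken from the code quoted above is that a null mark is never processed, even when all batches are requested.

```java
public class BatchFilter {
    // Hypothetical sentinel standing in for the "-all" argument.
    public static final String ALL = "-all";

    // Assumed shape of the check: a row with no mark is skipped,
    // even when the caller asked for all batches.
    public static boolean shouldProcess(String mark, String batchId) {
        if (mark == null) {
            return false;
        }
        return ALL.equals(batchId) || mark.equals(batchId);
    }

    public static void main(String[] args) {
        // Row never picked up by generate or fetch: skipped.
        System.out.println(shouldProcess(null, ALL));            // false
        // Row marked by an earlier step: processed under "-all".
        System.out.println(shouldProcess("batch-1", ALL));       // true
        // Row marked with another batch: skipped for a specific batch id.
        System.out.println(shouldProcess("batch-1", "batch-2")); // false
    }
}
```

Under this reading, the "different batch id" wording of the log message is misleading for rows whose mark is simply missing.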
Alex.
Ferdy Galema 2012-08-03, 08:30
Hi,
It depends on the expectation ;)
I agree that it may be confusing, but currently the -all option in the various Nutch tools only processes "all with a mark". There is a separate option that processes "all, regardless of whether a mark is present". For the parser this is -reparse; for the indexer, -reindex (at least in the current branch). There is no such option for the fetcher. It is up for discussion whether a "-refetch" option would be useful here; if such an option existed, the generator would no longer serve a purpose.
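To make the two selection modes concrete, here is a rough sketch. It does not reproduce any Nutch class: MarkSelection, select, and the ignoreMarks flag are hypothetical names, and the table is modelled as a plain map from URL to mark. With ignoreMarks set to false it models "-all" (only rows that carry a mark); with true it models a "-reparse" or "-reindex" style run.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MarkSelection {
    // Returns the URLs a job would pick up from a url -> mark table.
    // A null value means the row carries no mark from the previous step.
    public static List<String> select(Map<String, String> marksByUrl, boolean ignoreMarks) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> row : marksByUrl.entrySet()) {
            if (ignoreMarks || row.getValue() != null) {
                selected.add(row.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("http://example.com/a", "batch-1"); // row with a mark
        rows.put("http://example.com/b", null);      // row without a mark

        System.out.println(select(rows, false)); // "-all" style: only /a
        System.out.println(select(rows, true));  // "-reparse" style: /a and /b
    }
}
```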
Ferdy.