Have yet to complete a very large filesystem crawl
webdev1977 2010-08-10, 17:55

Wow.. this is very frustrating! I just downloaded and configured the 1.2 tagged version from SVN and I STILL can not complete a file system crawl using the nutch crawl command.

Has anyone been able to complete a crawl using the nutch crawl command and using the file: protocol? I have a very very large shared drive that I am crawling (300,000 + files). I have very little memory to use, about 2GB total. I am running this as a prototype on my Win XP box.

Any ideas based on the stack trace what might be causing this?

--------hadoop.log Snippet------------------------------------------------------------
2010-08-10 13:16:03,438 WARN mapred.LocalJobRunner - job_local_0025
java.lang.OutOfMemoryError
    at java.io.FileInputStream.readBytes(Native Method)
    at java.io.FileInputStream.read(Unknown Source)
    at org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.read(RawLocalFileSystem.java:83)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:136)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at java.io.DataInputStream.read(Unknown Source)
    at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:149)
    at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
    at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:328)
    at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:358)
    at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
    at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:404)
    at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
    at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
    at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
    at org.apache.hadoop.mapred.Task$ValuesIterator.readNextKey(Task.java:973)
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:932)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
    at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:42)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2010-08-10 13:16:03,672 INFO mapred.JobClient - Job complete: job_local_0025
2010-08-10 13:16:03,672 INFO mapred.JobClient - Counters: 17
2010-08-10 13:16:03,672 INFO mapred.JobClient - ParserStatus
2010-08-10 13:16:03,672 INFO mapred.JobClient - failed=59
2010-08-10 13:16:03,672 INFO mapred.JobClient - success=905
2010-08-10 13:16:03,672 INFO mapred.JobClient - FileSystemCounters
2010-08-10 13:16:03,672 INFO mapred.JobClient - FILE_BYTES_READ=19515258622
2010-08-10 13:16:03,672 INFO mapred.JobClient - FILE_BYTES_WRITTEN=25431386296
2010-08-10 13:16:03,672 INFO mapred.JobClient - FetcherStatus
2010-08-10 13:16:03,672 INFO mapred.JobClient - exception=34
2010-08-10 13:16:03,672 INFO mapred.JobClient - success=964
2010-08-10 13:16:03,672 INFO mapred.JobClient - Map-Reduce Framework
2010-08-10 13:16:03,672 INFO mapred.JobClient - Reduce input groups=260
2010-08-10 13:16:03,672 INFO mapred.JobClient - Combine output records=0
2010-08-10 13:16:03,672 INFO mapred.JobClient - Map input records=1000
2010-08-10 13:16:03,672 INFO mapred.JobClient - Reduce shuffle bytes=0
2010-08-10 13:16:03,672 INFO mapred.JobClient - Reduce output records=741
2010-08-10 13:16:03,672 INFO mapred.JobClient - Spilled Records=5856
2010-08-10 13:16:03,672 INFO mapred.JobClient - Map output bytes=309514931
2010-08-10 13:16:03,672 INFO mapred.JobClient - Map input bytes=145708
2010-08-10 13:16:03,672 INFO mapred.JobClient - Combine input records=0
2010-08-10 13:16:03,672 INFO mapred.JobClient - Map output records=2928
2010-08-10 13:16:03,672 INFO mapred.JobClient - Reduce input records=742

View this message in context: http://lucene.472066.n3.nabble.com/Have-yet-to-complete-a-very-large-filesystem-crawl-tp1076547p1076547.html
Sent from the Nutch - User mailing list archive at Nabble.com.
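A side note on reading the JobClient counters in the log above: dividing "Map output bytes" by "Map output records" gives a rough average size per map output record, which can hint at whether a few very large documents are flowing through the job. The following is only a minimal sketch of that arithmetic. The class name JobCounters and its parse helper are made up for illustration and are not part of Nutch or Hadoop, and the sketch ignores counter group names, so counters that share a name across groups (such as "success") overwrite each other.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls "name=value" counters out of JobClient log lines.
final class JobCounters {
    // Matches e.g. "... INFO mapred.JobClient - Map output bytes=309514931"
    private static final Pattern COUNTER =
            Pattern.compile("mapred\\.JobClient\\s+-\\s+(.+?)=(\\d+)\\s*$");

    static Map<String, Long> parse(List<String> logLines) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = COUNTER.matcher(line);
            if (m.find()) {
                // Group names are not tracked; a later counter with the same name wins.
                counters.put(m.group(1).trim(), Long.parseLong(m.group(2)));
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        // Two lines copied from the hadoop.log snippet in the post above.
        List<String> lines = List.of(
            "2010-08-10 13:16:03,672 INFO mapred.JobClient - Map output bytes=309514931",
            "2010-08-10 13:16:03,672 INFO mapred.JobClient - Map output records=2928");
        Map<String, Long> c = parse(lines);
        long average = c.get("Map output bytes") / c.get("Map output records");
        System.out.println("average map output record: " + average + " bytes");
    }
}
```

For the figures posted, that works out to about 105,708 bytes per map output record on average. An average says nothing about the largest single record, so it is a starting point for investigation rather than a diagnosis.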

Re: Have yet to complete a very large filesystem crawl
Eddie Drapkin 2010-08-10, 23:04

On 8/10/2010 12:55 PM, webdev1977 wrote:
> Wow.. this is very frustrating! I just downloaded and configured the 1.2
> tagged version from SVN and I STILL can not complete a file system crawl
> using the nutch crawl command.
>
> Has anyone been able to complete a crawl using the nutch crawl command and
> using the file: protocol? I have a very very large shared drive that I am
> crawling (300,000 + files).
> I have very little memory to use, about 2GB total. I am running this as a
> prototype on my Win XP box.
>
> Any ideas based on the stack trace what might be causing this?
>
> [hadoop.log snippet snipped]

You ran out of memory; give Java more heap space. What is it now? Try giving it as much more as you can.
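One way to answer "What is it now?" is to ask the JVM directly. The sketch below uses only the standard Runtime API; the class name HeapReport is made up for illustration. It reports the heap of the JVM it runs in, so it is only meaningful when started with the same heap options as the crawl. As far as I recall, the 1.x bin/nutch script takes its heap size (in MB) from the NUTCH_HEAPSIZE environment variable, but treat that as an assumption and check your copy of the script.

```java
// Hypothetical helper: prints the heap limits of the JVM it is running in.
final class HeapReport {
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();         // most the JVM will attempt to use (roughly -Xmx)
        long committed = rt.totalMemory(); // heap currently reserved by the JVM
        long used = committed - rt.freeMemory();
        return "max=" + toMegabytes(max) + "MB, committed=" + toMegabytes(committed)
                + "MB, used=" + toMegabytes(used) + "MB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The log above shows LocalJobRunner, which runs the map and reduce work inside the client JVM, so the heap given to that one process is the figure that matters here.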

Re: Have yet to complete a very large filesystem crawl
webdev1977 2010-08-11, 10:03

That would make sense, but I am pretty sure this is not the issue. In this config, I am running with 1024mb of memory. I kind of thought that nutch was able to run on this amount of memory? It would just take much longer.

I tried to run the same crawl using the SMB plugin on a Linux machine with 8GB of memory. Of course it ran longer, but in the end, I got the same error. I have turned on various levels of logging and debugging, and I have had no luck figuring out what might be causing it.

Re: Have yet to complete a very large filesystem crawl
Claudio Martella 2010-08-11, 13:56

On 8/11/10 12:03 PM, webdev1977 wrote:
> That would make sense, but I am pretty sure this is not the issue. In this
> config, I am running with 1024mb of memory. I kind of thought that nutch
> was able to run on this amount of memory? It would just take much longer.
>
> I tried to run the same crawl using the SMB plugin on a Linux machine with
> 8GB of memory. Of course it ran longer, but in the end, I got the same
> error. I have turned on various levels of logging and debugging, and I have
> had no luck figuring out what might be causing it.

What kind of parsers are you using? I had problems with Tika parser threads staying hung and eating all my memory.

--
Claudio Martella
Digital Technologies Unit
Research & Development - Analyst
TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax +39 0471 068 129
[EMAIL PROTECTED]
http://www.tis.bz.it

Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to [EMAIL PROTECTED] in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it.

Re: Have yet to complete a very large filesystem crawl
webdev1977 2010-08-11, 15:23

I am using tika... should I not be? The problem is that this shared drive has such a diverse set of documents, I was trying to include as many document types as possible. There are some really, really old Office documents that can't be opened by the newer versions of Office. I was having problems in nutch 1.0 with parsing them. hmm.. maybe I should turn off tika?

Re: Have yet to complete a very large filesystem crawl
Julien Nioche 2010-08-11, 15:39

What about profiling the application to find where the memory leak comes from?

On 11 August 2010 16:23, webdev1977 <[EMAIL PROTECTED]> wrote:
>
> I am using tika... should I not be? The problem is that this shared drive
> has such a diverse set of documents, I was trying to include as many
> document types as possible. There are some really really office documents
> that can't be open by the newer versions of office. I was having problems
> in nutch 1.0 with parsing them. hmm.. maybe I should turn off tika?

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
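Short of attaching a full profiler, a cheap first step is to log heap usage at intervals from inside the JVM and see whether it climbs steadily or jumps at one point. Below is a minimal sketch using the standard java.lang.management API. The class name HeapSampler and its methods are made up for illustration, and wiring it into a crawl (for example from a small wrapper around the job) is left as an assumption.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper: periodically prints heap usage of the current JVM.
final class HeapSampler {
    static String sample() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        return "heap used=" + heap.getUsed()
                + " committed=" + heap.getCommitted()
                + " max=" + heap.getMax(); // max is -1 if undefined
    }

    // Starts a daemon thread so the sampler never keeps the JVM alive on its own.
    static Thread startDaemon(long intervalMillis) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    System.err.println(sample());
                    Thread.sleep(intervalMillis);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop sampling when interrupted
            }
        }, "heap-sampler");
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        startDaemon(1000);
        Thread.sleep(3500); // stand-in for the real workload
    }
}
```

On HotSpot JVMs the -XX:+HeapDumpOnOutOfMemoryError option is another low-effort route: it writes a heap dump when the error occurs, which can then be opened in a heap analyzer.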

Re: Have yet to complete a very large filesystem crawl
Doğacan Güney 2010-08-11, 15:44

On Wed, Aug 11, 2010 at 18:23, webdev1977 <[EMAIL PROTECTED]> wrote:
>
> I am using tika... should I not be? The problem is that this shared drive
> has such a diverse set of documents, I was trying to include as many
> document types as possible. There are some really really office documents
> that can't be open by the newer versions of office. I was having problems
> in nutch 1.0 with parsing them. hmm.. maybe I should turn off tika?

Can you check this issue?

https://issues.apache.org/jira/browse/NUTCH-356

Maybe it can help.

--
Doğacan Güney

Re: Have yet to complete a very large filesystem crawl
webdev1977 2010-08-11, 16:59

Doğacan Güney-3 wrote:
>
> On Wed, Aug 11, 2010 at 18:23, webdev1977 <[EMAIL PROTECTED]> wrote:
>
>> I am using tika... should I not be? The problem is that this shared
>> drive has such a diverse set of documents, I was trying to include as many
>> document types as possible. There are some really really office
>> documents that can't be open by the newer versions of office. I was having
>> problems in nutch 1.0 with parsing them. hmm.. maybe I should turn off tika?
>
> Can you check this issue?
>
> https://issues.apache.org/jira/browse/NUTCH-356
>
> Maybe it can help.
>
> --
> Doğacan Güney

Thanks for the suggestion, I am trying it out as we speak!

Re: Have yet to complete a very large filesystem crawl
Claudio Martella 2010-08-11, 16:03

On 8/11/10 5:23 PM, webdev1977 wrote:
> I am using tika... should I not be? The problem is that this shared drive
> has such a diverse set of documents, I was trying to include as many
> document types as possible. There are some really really office documents
> that can't be open by the newer versions of office. I was having problems
> in nutch 1.0 with parsing them. hmm.. maybe I should turn off tika?

Personally, I solved this by applying this patch:

https://issues.apache.org/jira/browse/NUTCH-696

It will kill the hung threads. This is not ideal, but it will avoid eating up all your memory.

--
Claudio Martella
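For readers unfamiliar with the general idea behind giving up on a hung parse, here is a minimal sketch of running a task with a time limit using java.util.concurrent. This is not the code from the NUTCH-696 patch; the class TimedTask and the method runWithTimeout are made up for illustration. It also shows why such an approach is "not ideal": cancelling only interrupts the worker, so a task that ignores interruption keeps running in the background.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a task and gives up after a timeout.
final class TimedTask {
    static <T> T runWithTimeout(Callable<T> task, long timeout, TimeUnit unit, T fallback) {
        // Daemon worker thread, so a stuck task cannot keep the JVM from exiting.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-task");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; it must cooperate to actually stop
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself threw an exception
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a parse that hangs: sleeps far longer than the limit.
        String result = runWithTimeout(() -> {
            Thread.sleep(5000);
            return "parsed";
        }, 200, TimeUnit.MILLISECONDS, "timed out");
        System.out.println(result);
    }
}
```

A caller would wrap each document's parse in such a call and record a failure for that document when the fallback comes back, instead of letting one bad file stall the whole fetch.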

Re: Have yet to complete a very large filesystem crawl
webdev1977 2010-08-11, 17:00

Claudio Martella wrote:
>
> personally, i solved this by applying this patch:
>
> https://issues.apache.org/jira/browse/NUTCH-696
>
> it will kill the hangup threads. this is not ideal, but it will avoid
> eating up all your memory.

I am using the tagged 1.2 version from SVN.. I thought that this patch was committed to this version. Maybe I am wrong?

Re: Have yet to complete a very large filesystem crawl
webdev1977 2010-08-11, 14:02

Some more info. Seems to be hung on the MapReduce task.

console output:
finishing thread FetcherThread, activeThreads=9
finishing thread FetcherThread, activeThreads=8
finishing thread FetcherThread, activeThreads=7
activeThreads=7, spinWaiting=0, fetchQueues.totalSize=0
finishing thread FetcherThread, activeThreads=6
finishing thread FetcherThread, activeThreads=5
finishing thread FetcherThread, activeThreads=4
finishing thread FetcherThread, activeThreads=3
finishing thread FetcherThread, activeThreads=2
finishing thread FetcherThread, activeThreads=1
activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
finishing thread FetcherThread, activeThreads=0
activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
activeThreads=0

After about 20 mins of this, I run out of memory:

hadoop log snippet: (SAME MESSAGE OVER AND OVER and OVER again during that 20+ mins.)
2010-08-11 09:57:38,566 INFO mapred.LocalJobRunner - reduce > reduce
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Creating group ParserStatus with nothing
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding failed
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding success
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Creating group FileSystemCounters with nothing
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding FILE_BYTES_READ
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding FILE_BYTES_WRITTEN
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Creating group FetcherStatus with nothing
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding exception
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding success
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding SPILLED_RECORDS
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding REDUCE_INPUT_GROUPS
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding REDUCE_SHUFFLE_BYTES
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding REDUCE_OUTPUT_RECORDS
2010-08-11 09:57:38,566 DEBUG mapred.Counters - Adding REDUCE_INPUT_RECORDS