|
|
-
Can't retrieve Tika parser for mime-type text/javascript
forwardswing 2012-05-14, 03:24
When I use Nutch 1.2, the following error always occurs:

dtree.js: failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
main.js: failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
Progress.js: failed(2,0): Can't retrieve Tika parser for mime-type text/javascript

My parse-plugins.xml is:

  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/rss+xml">
    <plugin id="parse-rss" />
    <plugin id="feed" />
  </mimeType>
  <mimeType name="application/x-bzip2">
    <plugin id="parse-zip" />
  </mimeType>
  <mimeType name="application/x-gzip">
    <plugin id="parse-zip" />
  </mimeType>
  <mimeType name="application/x-javascript">
    <plugin id="parse-js" />
  </mimeType>
  <mimeType name="application/x-shockwave-flash">
    <plugin id="parse-swf" />
  </mimeType>
  <mimeType name="application/zip">
    <plugin id="parse-zip" />
  </mimeType>
  <mimeType name="text/xml">
    <plugin id="parse-html" />
    <plugin id="parse-rss" />
    <plugin id="feed" />
  </mimeType>
  <mimeType name="application/vnd.nutch.example.cat">
    <plugin id="parse-ext" />
  </mimeType>
  <mimeType name="application/vnd.nutch.example.md5sum">
    <plugin id="parse-ext" />
  </mimeType>
  <mimeType name="application/javascript">
    <plugin id="parse-tika" />
  </mimeType>
  <mimeType name="text/javascript">
    <plugin id="parse-tika" />
  </mimeType>

  <aliases>
    <alias name="parse-tika" extension-id="org.apache.nutch.parse.tika.Parser" />
    <alias name="parse-ext" extension-id="ExtParser" />
    <alias name="parse-html" extension-id="org.apache.nutch.parse.html.HtmlParser" />
    <alias name="parse-js" extension-id="JSParser" />
    <alias name="parse-msexcel" extension-id="org.apache.nutch.parse.msexcel.MSExcelParser" />
    <alias name="parse-mspowerpoint" extension-id="org.apache.nutch.parse.mspowerpoint.MSPowerPointParser" />
    <alias name="parse-msword" extension-id="org.apache.nutch.parse.msword.MSWordParser" />
    <alias name="parse-oo" extension-id="org.apache.nutch.parse.oo.OpenDocument.Text" />
    <alias name="parse-pdf" extension-id="org.apache.nutch.parse.pdf.PdfParser" />
    <alias name="parse-rss" extension-id="org.apache.nutch.parse.rss.RSSParser" />
    <alias name="feed" extension-id="org.apache.nutch.parse.feed.FeedParser" />
    <alias name="parse-swf" extension-id="org.apache.nutch.parse.swf.SWFParser" />
    <alias name="parse-text" extension-id="org.apache.nutch.parse.text.TextParser" />
    <alias name="parse-zip" extension-id="org.apache.nutch.parse.zip.ZipParser" />
  </aliases>

and nutch-site.xml is:

  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

Who can help me?

--
View this message in context: http://lucene.472066.n3.nabble.com/Can-t-retrieve-Tika-parser-for-mime-type-text-javascript-tp3983599.html
Sent from the Nutch - User mailing list archive at Nabble.com.
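
A quick way to rule out plugin.includes as the culprit is to test plugin names against the expression directly. The following is a minimal sketch, assuming (this is not stated in the thread) that a plugin is enabled when its directory name fully matches the expression; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Sketch only: assumes a plugin is enabled when its directory name fully
// matches the plugin.includes expression. Names here are hypothetical.
final class PluginIncludesCheck {
    // Value copied from the nutch-site.xml in the message above.
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)"
        + "|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)"
        + "|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)");

    // True when the whole plugin id matches one alternative of the expression.
    static boolean isIncluded(String pluginId) {
        return INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"parse-js", "parse-tika", "parse-pdf"};
        for (String id : ids) {
            System.out.println(id + " included: " + isIncluded(id));
        }
    }
}
```

Under that assumption both parse-js and parse-tika are enabled by this value, so the failure is more likely to come from the MIME-type mapping than from the plugin list.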
-
Re: Can't retrieve Tika parser for mime-type text/javascript
Markus Jelsma 2012-05-14, 07:50
You have text/javascript mapped to Tika, but Tika does not have a parser for this MIME type. Remove the parse-tika mappings but keep the type mapped to parse-js. That should work; that is, the proper parser should be invoked.

--
Markus Jelsma - CTO - Openindex
-
Re: Can't retrieve Tika parser for mime-type text/javascript
forwardswing 2012-05-14, 08:35
I modified the parse-plugins.xml snippet from:

  <mimeType name="text/javascript">
    <plugin id="parse-tika" />
  </mimeType>

to:

  <mimeType name="text/javascript">
    <plugin id="parse-js" />
  </mimeType>

but now another error occurs:

Error parsing: http://10.31.8.29:8080/AWIsys/dtree.js: UNKNOWN!(-53,0): Content not JavaScript: 'text/javascript'
fetch of http://10.31.8.29:8080/AWIsys/dtree.js failed with: java.lang.ArrayIndexOutOfBoundsException: -53

Error parsing: http://10.31.8.29:8080/AWIsys/main.js: UNKNOWN!(-53,0): Content not JavaScript: 'text/javascript'
fetch of http://10.31.8.29:8080/AWIsys/main.js failed with: java.lang.ArrayIndexOutOfBoundsException: -53

Error parsing: http://10.31.8.29:8080/AWIsys/Progress.js: UNKNOWN!(-53,0): Content not JavaScript: 'text/javascript'
fetch of http://10.31.8.29:8080/AWIsys/Progress.js failed with: java.lang.ArrayIndexOutOfBoundsException: -53

Error parsing: http://10.31.8.29:8080/AWIsys/table_sorter_script.js: UNKNOWN!(-53,0): Content not JavaScript: 'text/javascript'
fetch of http://10.31.8.29:8080/AWIsys/table_sorter_script.js failed with: java.lang.ArrayIndexOutOfBoundsException: -53

What is the meaning of "-53"?

If necessary, I can provide the JS files.

Thank you for your help.
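
On the "-53" question: one plausible reading is that a failure code of 203 is being stored in a signed byte, which wraps to -53. The JSParser excerpt quoted later in this thread passes FAILED_INVALID_FORMAT as the first argument of the ParseStatus constructor; the value 203 for that constant and the byte-sized major code are assumptions here, not facts from the thread. A minimal sketch of the arithmetic, with hypothetical class and method names:

```java
// Sketch only: the value 203 and the byte-sized major code are assumptions
// about Nutch's ParseStatus, not taken from this thread.
final class MinusFiftyThree {
    static final short FAILED_INVALID_FORMAT = 203; // assumed value
    static final String[] MAJOR_CODES = {"notparsed", "success", "failed"};

    // Narrowing an int to a byte keeps the low 8 bits, so 203 becomes 203 - 256.
    static byte asMajorCode(int code) {
        return (byte) code;
    }

    // Guarded lookup: anything outside the table is reported as UNKNOWN!.
    static String describe(byte major) {
        return (major >= 0 && major < MAJOR_CODES.length) ? MAJOR_CODES[major] : "UNKNOWN!";
    }

    public static void main(String[] args) {
        byte major = asMajorCode(FAILED_INVALID_FORMAT);
        System.out.println(major);           // the wrapped major code
        System.out.println(describe(major)); // the label a guarded lookup would print
    }
}
```

If that reading is right, an unguarded array lookup with the same index would also account for the ArrayIndexOutOfBoundsException: -53 reported by the fetcher.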
-
Re: Can't retrieve Tika parser for mime-type text/javascript
forwardswing 2012-05-15, 04:50
I am sincerely waiting for your reply.
-
Re: Can't retrieve Tika parser for mime-type text/javascript
Markus Jelsma 2012-05-15, 11:04
I see, it doesn't work. The JSParser is known not to work very well, or at all. Why do you want to parse JS anyway? It's not a very common practice to do so.

--
Markus Jelsma - CTO - Openindex
-
Re: Can't retrieve Tika parser for mime-type text/javascript
forwardswing 2012-05-16, 06:22
I have a page which is mainly controlled by JavaScript and Ajax, so I need to parse it.

Thanks a lot.
-
Re: Can't retrieve Tika parser for mime-type text/javascript
forwardswing 2012-05-16, 14:15
Is there a way to resolve this?
-
Re: Can't retrieve Tika parser for mime-type text/javascript
Lewis John Mcgibbney 2012-05-17, 09:50
I see some problems from the thread.

1) Please ensure both of the following, currently mapped to parse-tika, are mapped to parse-js as Markus suggested:

  <mimeType name="application/javascript">
    <plugin id="parse-tika" />
  </mimeType>
  <mimeType name="text/javascript">
    <plugin id="parse-tika" />
  </mimeType>

2) Your alias for the parse-js plugin class is incorrect. You can find the correct path here [0].

3) Please ensure that your regex-urlfilter configuration does NOT skip URLs ending in .JS or .js.

4) I tried fetching and parsing one of the links you provided in your thread, which did not work. Is there maybe something else at play here?

[0] http://svn.apache.org/repos/asf/nutch/tags/release-1.2/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/

--
Lewis
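
Point 3 can be checked outside Nutch by testing a URL against the skip rule. This is a rough sketch, assuming that a rule prefixed with "-" in regex-urlfilter.txt rejects any URL in which the pattern is found; both suffix patterns below are hypothetical examples, not the poster's actual configuration.

```java
import java.util.regex.Pattern;

// Sketch only: both suffix rules are hypothetical examples of a "-" (skip)
// rule; they are not taken from the poster's regex-urlfilter.txt.
final class UrlFilterCheck {
    // A skip rule that still lists js and JS among the rejected suffixes.
    static final Pattern SKIP_WITH_JS =
        Pattern.compile("\\.(gif|GIF|jpg|JPG|css|CSS|js|JS)$");
    // The same rule with js and JS removed.
    static final Pattern SKIP_WITHOUT_JS =
        Pattern.compile("\\.(gif|GIF|jpg|JPG|css|CSS)$");

    // Assumed semantics: the URL is skipped if the pattern is found anywhere in it.
    static boolean isSkipped(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "http://10.31.8.29:8080/AWIsys/dtree.js";
        System.out.println("with js in rule:    skipped=" + isSkipped(SKIP_WITH_JS, url));
        System.out.println("without js in rule: skipped=" + isSkipped(SKIP_WITHOUT_JS, url));
    }
}
```

If the script URLs are reported as skipped with your own rule, they never reach the parser, whatever parse-plugins.xml says.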
-
Re: Can't retrieve Tika parser for mime-type text/javascript
forwardswing 2012-05-18, 05:12
First of all, thank you very much for your reply.

I have followed your suggestion and made the following modifications:

  <mimeType name="application/javascript">
    <plugin id="parse-js" />
  </mimeType>
  <mimeType name="text/javascript">
    <plugin id="parse-js" />
  </mimeType>

  <alias name="parse-js" extension-id="org.apache.nutch.parse.js.JSParser" />

There is still an error:

dtree.js: failed(2,0): Can't retrieve Tika parser for mime-type text/javascript

Here is the JS file to be parsed. Could you please try it in your environment?

http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js dtree.js
-
Re: Can't retrieve Tika parser for mime-type text/javascript
Lewis John Mcgibbney 2012-05-18, 10:09
I tried configuring my instance to fetch and parse your page, with the following result:

lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local/bin$ ./nutch parsechecker http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js
fetching: http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js
parsing: http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js
contentType: application/javascript
signature: 4bf7aa15c0e79cb2330bc80c417f0a55
---------
Url
---------------
http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js
---------
ParseData
---------
Version: 5
Status: UNKNOWN!(-53,0): Content not JavaScript: 'application/javascript'
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:

So I tried a small experiment to see if I could hack a solution, but unfortunately the furthest I got was finding that, beginning on line 152 of the JSParserFilter class, we see:

  public ParseResult getParse(Content c) {
    String type = c.getContentType();
    if (type != null && !type.trim().equals("") &&
        !type.toLowerCase().startsWith("application/x-javascript"))
      return new ParseStatus(ParseStatus.FAILED_INVALID_FORMAT,
          "Content not JavaScript: '" + type + "'").getEmptyParseResult(c.getUrl(), getConf());

The ParserChecker output suggests this is the branch being taken: the parser returns the FAILED_INVALID_FORMAT status together with the "Content not JavaScript" message we get. If you are going to focus on getting the plugin to actually parse your files, I would begin there. However, I wouldn't expect miracles from the parser if it is geared specifically to the MIME type application/x-javascript.

hth

Lewis
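
To see why both text/javascript and application/javascript end up in that branch, the condition from the getParse excerpt can be exercised on its own. A minimal sketch that copies only the boolean test; the class and method names are hypothetical and no Nutch classes are involved:

```java
// Sketch only: mirrors the condition in the getParse excerpt above;
// JsMimeGuard and isRejected are hypothetical names.
final class JsMimeGuard {
    // True when the excerpt's condition would return the
    // "Content not JavaScript" status for this content type.
    static boolean isRejected(String type) {
        return type != null
            && !type.trim().equals("")
            && !type.toLowerCase().startsWith("application/x-javascript");
    }

    public static void main(String[] args) {
        String[] types = {
            "application/x-javascript",
            "application/javascript",
            "text/javascript"
        };
        for (String type : types) {
            System.out.println(type + " rejected: " + isRejected(type));
        }
    }
}
```

Only content types that start with application/x-javascript pass the test, so remapping the other two types to parse-js in parse-plugins.xml cannot help until this check is relaxed in the plugin source.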
-
Re: Can't retrieve Tika parser for mime-type text/javascript
Lewis John Mcgibbney 2012-05-18, 10:14
One final point which I forgot.

The point of the parse-js plugin is to extract outlinks from JS pages. The page you supplied contained only one outlink, to a page which no longer exists, so depending on what your purposes are you may not find the parse-js plugin of much help.

Lewis
-
Can't retrieve Tika parser for mime-type text/csv
Olivier LEVILLAIN 2012-06-18, 08:46
Hi,

I get the following error message when I try to parse a CSV file:

Can't retrieve Tika parser for mime-type text/csv...

I use Nutch 1.4 and Solr 3.6.

The parsechecker gives the same message:

bin/nutch parsechecker http://dsiwikis/documents/forms/open_source_decls.csv
fetching: http://dsiwikis/documents/forms/open_source_decls.csv
parsing: http://dsiwikis/documents/forms/open_source_decls.csv
contentType: text/csv
---------
Url
---------------
http://dsiwikis/documents/forms/open_source_decls.csv
---------
ParseData
---------
Version: 5
Status: failed(2,0): Can't retrieve Tika parser for mime-type text/csv
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:

My parse-plugins.xml file contains:

  <mimeType name="text/csv">
    <plugin id="parse-tika" />
  </mimeType>

and my nutch-default.xml contains:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to include.
    Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.
    </description>
  </property>

I searched the list and found something about this error, but the thread changed direction and provided no answer to the original problem.

Any idea?

--
View this message in context: http://lucene.472066.n3.nabble.com/Can-t-retrieve-Tika-parser-for-mime-type-text-csv-tp3990071.html
Sent from the Nutch - User mailing list archive at Nabble.com.
|