Pull request #205 was recently merged into the master branch of Nutch 1.x, in fulfillment of NUTCH-1129 ("microdata for Nutch 1.x").
I am new to Nutch and Solr and have just started crawling and indexing a few select websites. Using the built-in HTML parsing/indexing, I am getting searchable fields like url, content, host, sometimes a title, and a few other indexing-related fields like digest, boost, segment, and tstamp. That said, I realized very quickly that I need better results. While exploring the source of one of the websites, I noticed references to schema.org and got excited by what I saw. That’s how I stumbled upon NUTCH-1129.
I’ve built apache-nutch-1.15-SNAPSHOT, which includes the Any23 parser/indexer.
Q: Now what? How do I gain Any23 microdata parsing / indexing capabilities introduced by NUTCH-1129?
Q: Do I replace parse-(html|tika)|index-(basic|anchor) in plugin.includes with something like parse-(html|tika|any23)|index-(basic|anchor|any23)?
Q: How do I expose the discovered microdata structure / items to a downstream consumer such as Solr? For example, what are the microdata items called, and do I need to map them to Solr fields in solrindex-mapping.xml?
I’d also be interested to learn how to point Nutch at a specific URL and see how it sees the microdata (best case), then learn how to carry that through Nutch and finally into Solr.
Thanks for any guidance.