Pull request #205 was recently merged into the master branch for Nutch 1.x in fulfillment of NUTCH-1129, "microdata for Nutch 1.x".

I am new to Nutch and Solr and have just started crawling and indexing a few select websites. Using the built-in HTML parsing/indexing, I am getting searchable fields like url, content, host, sometimes a title, and a few other indexing-related fields like digest, boost, segment, and tstamp. That said, I realized very quickly that I need better results. While exploring the source of one of the websites, I noticed references to schema.org and got excited by what I saw. That’s how I stumbled upon NUTCH-1129.

I’ve built apache-nutch-1.15-SNAPSHOT, which includes the Any23 parser/indexer.

Q: Now what? How do I gain the Any23 microdata parsing/indexing capabilities introduced by NUTCH-1129?
Q: Do I replace parse-(html | tika)|index-(basic | anchor) in plugin.includes with something like parse-(html | tika | any23)|index-(basic | anchor | any23)?
Q: How do I expose the discovered microdata structure/items to a downstream consumer such as Solr? For example, what are the microdata items, and do I need to map them to Solr fields in solrindex-mapping.xml?

I’d also be interested to learn how to point Nutch at a specific URL and see how it sees the microdata (best case), and then how to carry that through Nutch and finally into Solr.

Thanks for any guidance.
-David