tl;dr: If I wanted to learn about the Nutch pipeline at a high level, then write a custom parser / indexer of my own, where would a good starting point be?

I have used the latest 1.x Nutch to crawl a few specific websites and been disappointed with the results. Even after experimenting with the new html-microdata capabilities from the Any23 updates incorporated into Nutch, I am still not (yet) excited. The bottom line is that website data is not well structured and not very friendly to algorithmic consumption (but you already knew that). To that end, I am interested in developing custom parsers per internet domain, in an effort to capture domain-specific data. It currently looks like plugin.includes does not allow choosing a parser / indexer on a per-domain basis. Could someone guide me toward a high-level view of the Nutch data pipeline, and then toward where to get started creating custom parsers that might support a per-domain approach?
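To make the per-domain idea concrete, here is a minimal, self-contained sketch of the dispatch logic I have in mind: a registry mapping a host name to a parser, with a generic fallback. The class and method names are my own invention for illustration, not Nutch APIs; in Nutch terms this logic would presumably live inside a custom parse or indexing plugin.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch (not a Nutch API): route a page to a parser
// registered for its host, falling back to a generic parser otherwise.
public class DomainParserDispatcher {
    private final Map<String, Function<String, String>> parsers = new HashMap<>();
    private final Function<String, String> fallback;

    public DomainParserDispatcher(Function<String, String> fallback) {
        this.fallback = fallback;
    }

    // Associate a host name with a domain-specific parser.
    public void register(String host, Function<String, String> parser) {
        parsers.put(host, parser);
    }

    // Extract the host from the URL and apply the matching parser,
    // or the generic fallback if no domain-specific parser exists.
    public String parse(String url, String html) {
        String host = URI.create(url).getHost();
        return parsers.getOrDefault(host, fallback).apply(html);
    }
}
```

The question, essentially, is where in the Nutch parse/index pipeline this kind of host-based branching would fit cleanly.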

Thanks,
David