Hello Nutchians,
I need to query a Nutch 1.x crawldb for read-only search, sort, and summarize purposes, based on combinations of status, fetch_time, score, and similar fields. What is a good tool or process for this?
Up until now, I've been running readdb -dump and then processing the output with Python code that I wrote. But this is slow and clunky, and my code probably has bugs. Would Hive be a good tool for this?
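For context, here is a minimal sketch of the kind of dump post-processing described above. It is not the actual script. It assumes the plain-text dump layout in which each record starts with a "<url><TAB>Version: <n>" line, continues with "Key: value" lines such as Status and Score, and ends with a blank line. The names parse_dump, select, and SAMPLE are hypothetical, and the sample data is made up for illustration.

```python
from typing import Dict, Iterable, Iterator, List


def parse_dump(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per record from an (assumed) plain-text crawldb dump."""
    record: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current record.
            if record:
                yield record
                record = {}
            continue
        if not record:
            # Assumed first line of a record: "<url>\tVersion: <n>".
            url, _, rest = line.partition("\t")
            record["url"] = url
            line = rest
        key, sep, value = line.partition(": ")
        if sep:
            record[key.strip()] = value.strip()
        # Lines without "Key: value" shape (e.g. metadata continuations) are skipped.
    if record:
        yield record


def select(records: Iterable[Dict[str, str]],
           status_name: str,
           min_score: float) -> List[Dict[str, str]]:
    """Keep records whose Status mentions status_name and whose Score is at
    least min_score, sorted by Score in descending order."""
    hits = [
        r for r in records
        if status_name in r.get("Status", "")
        and float(r.get("Score", "0")) >= min_score
    ]
    return sorted(hits, key=lambda r: float(r.get("Score", "0")), reverse=True)


# Made-up sample in the assumed layout, for illustration only.
SAMPLE = (
    "http://example.com/a\tVersion: 7\n"
    "Status: 2 (db_fetched)\n"
    "Fetch time: Thu Jan 01 00:00:00 UTC 2015\n"
    "Score: 0.5\n"
    "\n"
    "http://example.com/b\tVersion: 7\n"
    "Status: 1 (db_unfetched)\n"
    "Score: 1.0\n"
    "\n"
    "http://example.com/c\tVersion: 7\n"
    "Status: 2 (db_fetched)\n"
    "Score: 2.0\n"
)

if __name__ == "__main__":
    for rec in select(parse_dump(SAMPLE.splitlines()), "db_fetched", 0.1):
        print(rec["url"], rec["Status"], rec["Score"])
```

Everything here runs in a single process over the whole dump, which is a large part of why this approach gets slow on a big crawldb.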