I need to query a Nutch 1.x crawldb for read-only searching, sorting, and summarizing, based on combinations of fields such as status, fetch_time, and score. What is a good tool or process for this?
Up until now, I've been running readdb -dump and then processing the output with Python code that I wrote. But this is slow and clunky, and my code probably has bugs. Would Hive be a good tool for this?
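
To make the kind of query concrete, here is a minimal sketch of the dump-then-query idea, using Python's built-in sqlite3 module instead of hand-written filtering. It assumes the crawldb has already been dumped to CSV and reduced to four columns. The column names (url, status, fetch_time, score), the ISO-style timestamps, the sample rows, and the helper names are all hypothetical and only for illustration; the real dump header should be checked before adapting this.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a CSV dump of the crawldb.
# The real header, column names, and date format may differ.
SAMPLE_CSV = """url,status,fetch_time,score
http://example.com/a,db_fetched,2015-03-01T10:00:00,1.5
http://example.com/b,db_unfetched,2015-03-02T09:30:00,0.2
http://example.com/c,db_fetched,2015-03-03T12:00:00,0.7
http://example.com/d,db_gone,2015-03-01T08:00:00,0.1
"""


def load_crawldb_csv(text):
    """Load CSV text into an in-memory SQLite table named crawldb."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE crawldb "
        "(url TEXT, status TEXT, fetch_time TEXT, score REAL)"
    )
    # DictReader yields one dict per row, keyed by the header names,
    # which matches the named placeholders below.
    rows = csv.DictReader(io.StringIO(text))
    conn.executemany(
        "INSERT INTO crawldb VALUES (:url, :status, :fetch_time, :score)",
        rows,
    )
    conn.commit()
    return conn


def count_by_status(conn):
    """Summarize: number of URLs per status, sorted by status name."""
    return conn.execute(
        "SELECT status, COUNT(*) FROM crawldb "
        "GROUP BY status ORDER BY status"
    ).fetchall()


def top_scored(conn, status, fetched_after, limit=10):
    """Search and sort: URLs with a given status, fetched on or after
    a timestamp, ordered by score from highest to lowest."""
    cursor = conn.execute(
        "SELECT url FROM crawldb "
        "WHERE status = ? AND fetch_time >= ? "
        "ORDER BY score DESC LIMIT ?",
        (status, fetched_after, limit),
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    conn = load_crawldb_csv(SAMPLE_CSV)
    print(count_by_status(conn))
    print(top_scored(conn, "db_fetched", "2015-03-01T00:00:00"))
```

The timestamp filter relies on ISO-formatted strings comparing correctly as plain text. If the dump uses a different date format, the values would need to be converted while loading. For a crawldb too large for a single machine, the same SQL-style queries are what a tool like Hive would run over the dumped files.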