Greetings Nutchlings,
I have been using readseg -dump successfully to retrieve content crawled by Nutch, but I have one significant problem: many non-ASCII characters appear as '???' in the dumped text file. This happens fairly often in the headlines of the news sites that I crawl, for characters such as quotes, apostrophes, and dashes.
Am I doing something wrong, or is this a known bug? I read the dump with a Python UTF-8 decoder, so it would be nice if everything were UTF-8.
Here is the command that I use to dump each segment (using Nutch 1.12):

    bin/nutch readseg -dump segPath destPath -noparse -noparsedata -noparsetext -nogenerate
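In case it helps with diagnosis, here is a minimal sketch of how one could check whether the question marks are literal '?' bytes already present in the dump file, or an artifact of decoding on the Python side. The function name classify_dump_bytes and the file path passed on the command line are hypothetical, and the "two or more question marks in a row" rule is only a rough heuristic, since ordinary text can also contain '??'.

```python
import re
import sys


def classify_dump_bytes(raw: bytes) -> dict:
    """Summarize the raw bytes of a dump file (hypothetical helper).

    Reports whether the bytes decode cleanly as UTF-8, how many runs of
    two or more literal '?' bytes they contain, and how many bytes are
    outside the ASCII range.
    """
    try:
        raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False

    # Runs of literal 0x3F bytes suggest the characters were replaced
    # before the file was written, not while Python was decoding it.
    question_runs = len(re.findall(rb"\?{2,}", raw))

    # Any byte above 0x7F means some non-ASCII data did survive.
    non_ascii_bytes = sum(1 for b in raw if b > 0x7F)

    return {
        "valid_utf8": valid_utf8,
        "question_runs": question_runs,
        "non_ascii_bytes": non_ascii_bytes,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed usage: pass the path of the dumped text file as the argument.
    with open(sys.argv[1], "rb") as handle:
        print(classify_dump_bytes(handle.read()))
```

If question_runs is high while non_ascii_bytes is zero, the characters were most likely lost before or while the file was written, and no choice of decoder on the Python side would bring them back.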
It is so close to working perfectly!