I have been using readseg -dump successfully to retrieve content crawled by Nutch, but I have one significant problem: many non-ASCII characters appear as '???' in the dumped text file. This happens fairly frequently in the headlines of the news sites that I crawl, for characters such as curly quotes, apostrophes, and dashes.
Am I doing something wrong, or is this a known bug? I use a Python UTF-8 decoder, so it would be nice if everything were UTF-8.
Here is the command that I use to dump each segment (using Nutch 1.12):
bin/nutch readseg -dump segPath destPath -noparse -noparsedata -noparsetext -nogenerate
It is so close to working perfectly!
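In case it helps narrow this down, here is a minimal sketch of a check that could tell whether the '???' are literal question-mark bytes already present in the dump file (which would mean the characters were lost before Python ever reads it) or whether the file contains non-ASCII bytes that simply fail to decode as UTF-8. The function name `classify_dump_bytes` is hypothetical, and the path to the dump file is passed on the command line; treating a run of three or more '?' as suspicious is only a heuristic.

```python
import re
import sys


def classify_dump_bytes(raw: bytes) -> dict:
    """Summarize the raw bytes of a dump file (hypothetical helper)."""
    # Bytes >= 0x80 can only come from non-ASCII characters that survived.
    non_ascii = sum(1 for b in raw if b >= 0x80)
    # Heuristic: runs of three or more literal '?' (0x3F) hint that the
    # characters were replaced before the file was written.
    question_runs = len(re.findall(rb"\?{3,}", raw))
    try:
        raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {
        "non_ascii_bytes": non_ascii,
        "question_runs": question_runs,
        "valid_utf8": valid_utf8,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Read in binary mode so no decoding happens before the check.
    with open(sys.argv[1], "rb") as f:
        print(classify_dump_bytes(f.read()))
```

If `question_runs` is high and `non_ascii_bytes` is zero, the replacement happened when the dump was written, not in the Python decoder.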