As for "does for directories"... yes, I've been thinking about a modification of -z for tar/zip files, PST and, I guess, now WARC: files that can be so enormous that you'd want to unpack them before indexing. No one would really want to index the Enron PST (if it actually existed) as a single file; rather, they'd want to unpack it and index the individual files. And, while you can attach a bunch of files inside a PDF or MSOffice file, in practice there seems to be a fundamental difference between how users might want to deal with embedded files in, say, a PDF and how they'd want to deal with them in a PST.
Depending on interest, it might make sense to add disk images (e.g. AFF) to the list of zip/PST/etc.?
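
To make the unpack-then-index idea above concrete, here is a rough sketch. It is not Tika code and does not reflect how -z or Tika Batch behave: it uses Python's standard zipfile module purely as a stand-in for any container format (tar, PST, WARC), and the names iter_container_entries, index_container and index_one are hypothetical, chosen for illustration only.

```python
import io
import zipfile
from typing import Callable, Iterator, Tuple


def iter_container_entries(container: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (entry name, entry bytes) for each regular file in a zip container."""
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content to index
            yield info.filename, archive.read(info)


def index_container(container: bytes,
                    index_one: Callable[[str, bytes], None]) -> int:
    """Hand each embedded file to index_one as its own document.

    index_one is a hypothetical callback standing in for whatever
    parses and indexes a single file. Returns the number of entries seen.
    """
    count = 0
    for name, data in iter_container_entries(container):
        index_one(name, data)
        count += 1
    return count
```

The point of the split is that the container itself never reaches the indexer as one giant document; only its children do. A PDF with a few attachments would presumably still go through as a single document with its embedded files handled in place.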
From: Nick Burch [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 10, 2017 2:45 PM
To: [EMAIL PROTECTED]
Subject: Re: Adding a WARC parser to Tika
On Mon, 10 Jul 2017, Allison, Timothy B. wrote:
> Sorry, I can't tell if this is tongue-in-cheek...
No, I do think we should add a WARC parser to Tika Parsers.
Once done, I'd suggest we figure out a way for Tika Batch to run over a collection of WARC files just as it does for directories, to make it easier to run over crawl collections without having to unpack them first!