+1 from me, makes sense.

Giuseppe is interested in this too, FWIW.

On 7/10/17, 2:59 PM, "Allison, Timothy B." <[EMAIL PROTECTED]> wrote:

    Oh, ok...
   
    As for "does for directories"... yes, I've been thinking about a modification of -z for tar/zip files, PST and, I guess, now WARC: files that can be so enormous that you'd want to unpack them before indexing.  No one would really want to index the Enron PST (if it actually existed) as a single file; rather, they'd want to unpack it and index the individual files.  And, while you can attach a bunch of files inside a PDF or MSOffice file, in practice there seems to be a fundamental difference between how users might want to deal with embedded files in, say, a PDF versus a PST.
   
    Depending on interest, it might make sense to add disk images (e.g. AFF) to the list of zip/pst/etc.?
   
   
   
    -----Original Message-----
    From: Nick Burch [mailto:[EMAIL PROTECTED]]
    Sent: Monday, July 10, 2017 2:45 PM
    To: [EMAIL PROTECTED]
    Subject: Re: Adding a WARC parser to Tika
   
    On Mon, 10 Jul 2017, Allison, Timothy B. wrote:
    > Sorry, I can't tell if this is tongue-in-cheek...
   
    No, I do think we should add a WARC parser to Tika Parsers.
   
    Once done, I'd suggest we figure out a way for Tika Batch to run over a collection of WARC files just as it does for directories, to make it easier to run over crawl collections without having to unpack them first!
   
    Nick