In case it helps, I'll try to summarise what we've done in this area.

Currently our webarchive-discovery indexing tool parses the WARC and then passes the payload to Tika:

https://github.com/ukwa/webarchive-discovery
https://github.com/ukwa/webarchive-discovery/blob/master/warc-indexer/src/main/java/uk/bl/wa/solr/TikaExtractor.java

This works fine, but along the way we've also experimented with adding WARC parsing to Tika directly. The code is an extremely messy proof-of-concept but I've pushed it here so you can see how it works:

https://github.com/ukwa/tika/tree/experimental-warc-parsing

The parser itself is fairly straightforward:

https://github.com/ukwa/tika/blob/5d89169151257a2696ceac2a4897527ea1b227a7/tika-parsers/src/main/java/org/apache/tika/parser/warc/WARCParser.java#L94

but it did require a few changes elsewhere...

1. Needed to teach Tika to spot ARC/WARC:
https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-a7a8080db8d7c69d9a66b875b4c5b9e7

2. Added webarchive-commons as a dependency:
https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-2426935affac837a5f8f7a84a15939f7

3. Enabled concatenated-block gunzip in order to parse WARC.GZ:
https://github.com/apache/tika/compare/master...ukwa:experimental-warc-parsing#diff-5ae41a78b18e2ca8481960cd5e02b860
(given this was explicitly disabled before, this may be contentious?)
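To illustrate why this matters, here is a rough sketch using only the Python standard library (it is not the Tika change itself, and the sample records are made up). A WARC.GZ is a series of gzip members joined end to end, so a reader that stops at the end of the first member only ever sees the first record:

```python
import gzip
import zlib

def build_concatenated_gzip(records):
    """Compress each record as its own gzip member and join the members."""
    return b"".join(gzip.compress(r) for r in records)

def iter_gzip_members(data):
    """Yield the decompressed bytes of each gzip member in turn."""
    while data:
        # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        member = d.decompress(data) + d.flush()
        yield member
        # Whatever follows the end of this member is the next member (if any).
        data = d.unused_data

# Made-up, minimal record stubs, just to have something to compress.
records = [
    b"WARC/1.1\r\nWARC-Type: warcinfo\r\n\r\n",
    b"WARC/1.1\r\nWARC-Type: response\r\n\r\n",
]
blob = build_concatenated_gzip(records)

# A reader that handles only one member stops after the first record.
first_only = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16).decompress(blob)

# A reader that continues across members recovers every record.
members = list(iter_gzip_members(blob))
```

Keeping the member boundaries (rather than decompressing the whole file as one stream) is also what allows a single record to be read by offset.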

There are also a couple of bigger issues that would need resolving.

Firstly, the WARC format is not a file archive, but primarily an HTTP request/response archive. There are 8 different record types (see https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-record-types for details) that may or may not be of interest. The HTTP request and the response are stored as separate records, and of course the response might be a 303 or a 404, not just a 200. One fairly widely used strategy is to simply ignore anything that is not a 200 response, but that does discard quite a lot of information.
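As a rough sketch of that "200 responses only" strategy (plain Python, not taken from webarchive-discovery or the Tika branch; the helper names and sample records are hypothetical, and the header parsing is simplified to exact-case names and ignores Content-Length):

```python
def split_record(record: bytes):
    """Split a raw WARC record into (warc_headers, content_block)."""
    head, _, block = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    headers = {}
    for line in lines[1:]:  # lines[0] is the version line, e.g. WARC/1.1
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers, block

def http_status(block: bytes):
    """Return the status code from an HTTP status line, or None."""
    status_line = block.split(b"\r\n", 1)[0].decode("iso-8859-1")
    parts = status_line.split(" ", 2)
    if len(parts) >= 2 and parts[0].startswith("HTTP/") and parts[1].isdigit():
        return int(parts[1])
    return None

def keep_record(record: bytes) -> bool:
    """Keep only 'response' records whose HTTP status is 200."""
    headers, block = split_record(record)
    return headers.get("WARC-Type") == "response" and http_status(block) == 200

def make_record(warc_type: str, block: bytes) -> bytes:
    """Build a minimal, made-up WARC record for the demonstration."""
    head = f"WARC/1.1\r\nWARC-Type: {warc_type}\r\nContent-Length: {len(block)}\r\n\r\n"
    return head.encode("utf-8") + block

ok = make_record("response", b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>hi</p>")
missing = make_record("response", b"HTTP/1.1 404 Not Found\r\n\r\n")
request = make_record("request", b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\n")

kept = [keep_record(r) for r in (ok, missing, request)]
```

Here the 404 response and the request record are both dropped, which is exactly the information loss mentioned above.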

Secondly, I'm not sure how many layers of embedded documents are appropriate. Going by the spec, I would argue that these are the layers:

- archive.warc.gz (a series of block-concatenated gzip records)
- archive.warc.gz/record.warc (an individual WARC record)
- archive.warc.gz/record.warc/http.response (the message/http in its entirety)
- archive.warc.gz/record.warc/http.response/entity.body (the actual resource)
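To make the layering concrete, here is a small sketch that walks down those four levels for a single-record archive (plain Python with made-up sample data; the function names are hypothetical and this is not how the Tika branch is structured). A response sent with Content-Encoding: gzip adds one more unwrapping step at the bottom:

```python
import gzip

def header_value(head: bytes, name: str):
    """Look up a header by case-insensitive name, skipping the first line."""
    for line in head.split(b"\r\n")[1:]:
        key, _, value = line.partition(b":")
        if key.strip().lower() == name.lower().encode("ascii"):
            return value.strip().decode("iso-8859-1")
    return None

def peel(archive_gz: bytes):
    """Return each layer of a one-record WARC.GZ, outermost first."""
    # archive.warc.gz -> record.warc
    record = gzip.decompress(archive_gz)
    # record.warc -> http.response (the block is Content-Length bytes long)
    warc_head, _, rest = record.partition(b"\r\n\r\n")
    http_message = rest[: int(header_value(warc_head, "Content-Length"))]
    # http.response -> entity.body
    http_head, _, body = http_message.partition(b"\r\n\r\n")
    if header_value(http_head, "Content-Encoding") == "gzip":
        body = gzip.decompress(body)
    return {"record.warc": record, "http.response": http_message, "entity.body": body}

# Build a tiny, made-up archive containing one gzipped HTTP response.
html = b"<p>hello</p>"
http = (b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n"
        b"Content-Encoding: gzip\r\n\r\n" + gzip.compress(html))
record = (b"WARC/1.1\r\nWARC-Type: response\r\nContent-Length: %d\r\n\r\n" % len(http)
          + http + b"\r\n\r\n")
archive = gzip.compress(record)

layers = peel(archive)
```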

This is probably overkill (and gets worse if it's a gzipped HTTP response!). We could just use:

- archive.warc.gz (a series of block-concatenated gzip records)
- archive.warc.gz/record.warc (the parsed entity.body, with all relevant info from WARC and HTTP headers attached as metadata)

Collapsing the layers down does make it less clear where some of the metadata is coming from, but it’s probably worth it.
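One possible way to keep the origin of each value visible after collapsing, sketched in plain Python: prefix every key with the layer it came from. The "warc:" and "http:" prefixes and the sample headers are purely illustrative assumptions, not names used by Tika or by the experimental branch.

```python
def flatten_metadata(warc_headers: dict, http_headers: dict) -> dict:
    """Merge both header sets into one map, prefixing keys by their source."""
    merged = {}
    for name, value in warc_headers.items():
        merged[f"warc:{name}"] = value
    for name, value in http_headers.items():
        merged[f"http:{name}"] = value
    return merged

# Both layers carry a Content-Type and a Content-Length, so an unprefixed
# merge would silently overwrite one with the other.
warc_headers = {
    "Content-Type": "application/http; msgtype=response",
    "Content-Length": "1234",
}
http_headers = {
    "Content-Type": "text/html",
    "Content-Length": "1100",
}

merged = flatten_metadata(warc_headers, http_headers)
```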

One final note - I've not put the test WARC files in that repo yet as I need to create some new ones from an Apache 2 source.

I hope this is useful.

Best,
Andy
=-=-=-=-=-=-=-=
Dr Andrew N. Jackson
Web Archiving Technical Lead
01937 546602
@UKWebArchive
@anjacks0n
Blog: http://britishlibrary.typepad.co.uk/webarchive/

-----Original Message-----
From: Nick Burch [mailto:[EMAIL PROTECTED]]
Sent: 10 July 2017 19:45
To: [EMAIL PROTECTED]
Subject: Re: Adding a WARC parser to Tika

On Mon, 10 Jul 2017, Allison, Timothy B. wrote:
> Sorry, I can't tell if this is tongue-in-cheek...

No, I do think we should add a WARC parser to Tika Parsers.

Once done, I'd suggest we figure out a way for Tika Batch to run over a collection of WARC files just as it does for directories, to make it easier to run over crawl collections without having to unpack them first!

Nick