Hi,

I asked this on Stack Overflow
<https://stackoverflow.com/questions/48102004/how-to-implement-an-inputstream-that-dynamically-guesses-the-extension-of-a-file>
and was pointed here, in the hope that more people would be able to help.

We have a custom InputStream implementation that updates multiple
MessageDigest instances while the data is being read. This lets us read
and process the data in a single pass, without having to re-read files in
order to calculate their checksums. This is efficient and saves time (the
implementation is here
<https://github.com/strongbox/strongbox/blob/9dcb13255512cd396e63f712bb5ce82bb632726c/strongbox-storage/strongbox-storage-core/src/main/java/org/carlspring/strongbox/io/ArtifactInputStream.java>
).
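
To illustrate the idea for readers who don't want to open the link, here
is a minimal sketch of a digest-updating stream. It is not the actual
ArtifactInputStream code; the class name MultiDigestInputStream and the
hexDigest helper are hypothetical and only use the JDK.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: updates several digests as bytes pass through. */
class MultiDigestInputStream extends FilterInputStream {

    private final Map<String, MessageDigest> digests = new LinkedHashMap<>();

    MultiDigestInputStream(InputStream in, String... algorithms)
            throws NoSuchAlgorithmException {
        super(in);
        for (String algorithm : algorithms) {
            digests.put(algorithm, MessageDigest.getInstance(algorithm));
        }
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            for (MessageDigest digest : digests.values()) {
                digest.update((byte) b);
            }
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            // Only the bytes actually read are fed to the digests.
            for (MessageDigest digest : digests.values()) {
                digest.update(buf, off, n);
            }
        }
        return n;
    }

    /** mark/reset would feed the same bytes twice, so it is disabled. */
    @Override
    public boolean markSupported() {
        return false;
    }

    /** Finishes the named digest and returns it as lowercase hex. */
    String hexDigest(String algorithm) {
        StringBuilder hex = new StringBuilder();
        for (byte b : digests.get(algorithm).digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Note that bytes passed over with skip() never reach the digests in this
sketch, so callers would have to read the stream to the end.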

As a follow-up step, we'd like to use Apache Tika to guess the file
extension from the stream, which is sent over HTTP. I know some of you will
suggest simply requiring that the Content-Type header is set but,
unfortunately, for various reasons we can neither rely on it nor enforce
it. Hence, I'm looking for a way to guess the extension from the
InputStream itself, while it's being sent.

We also need to be able to guess compound extensions (such as tar.gz,
tar.bz2 and similar ones, which cannot be derived by simply taking the
substring from the last dot to the end of the file name).
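
To make the compound case concrete, here is a rough sketch of the kind of
content-based check meant here. It is a hand-rolled stand-in that uses only
the JDK, not Tika; PrefixSniffer and its methods are hypothetical names. It
recognises gzip, bzip2, zip and tar by their magic bytes and, for gzip,
inflates the first 512 bytes to see whether a tar header is inside.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Optional;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper: guesses an extension from the first bytes of a file. */
final class PrefixSniffer {

    private static final byte[] GZIP = {(byte) 0x1f, (byte) 0x8b};
    private static final byte[] BZIP2 = {'B', 'Z', 'h'};
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    private static final byte[] USTAR = {'u', 's', 't', 'a', 'r'};
    private static final int TAR_MAGIC_OFFSET = 257;
    private static final int TAR_BLOCK_SIZE = 512;

    private PrefixSniffer() {
    }

    static Optional<String> guessExtension(byte[] prefix) {
        if (startsWith(prefix, 0, GZIP)) {
            byte[] inner = inflatePrefix(prefix, TAR_BLOCK_SIZE);
            return Optional.of(isTar(inner) ? "tar.gz" : "gz");
        }
        if (startsWith(prefix, 0, BZIP2)) {
            // The JDK has no bzip2 decoder, so tar.bz2 cannot be told apart
            // from plain bz2 here without a third-party library.
            return Optional.of("bz2");
        }
        if (startsWith(prefix, 0, ZIP)) {
            return Optional.of("zip");
        }
        if (isTar(prefix)) {
            return Optional.of("tar");
        }
        return Optional.empty();
    }

    /** POSIX tar headers carry "ustar" at offset 257 of the first block. */
    static boolean isTar(byte[] data) {
        return startsWith(data, TAR_MAGIC_OFFSET, USTAR);
    }

    static boolean startsWith(byte[] data, int offset, byte[] magic) {
        if (data.length < offset + magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Inflates at most 'wanted' bytes from a possibly truncated gzip prefix. */
    static byte[] inflatePrefix(byte[] gzipPrefix, int wanted) {
        byte[] out = new byte[wanted];
        int total = 0;
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipPrefix))) {
            int n;
            while (total < wanted && (n = in.read(out, total, wanted - total)) > 0) {
                total += n;
            }
        } catch (IOException e) {
            // A truncated prefix ends in an exception; keep what was decoded.
        }
        return Arrays.copyOf(out, total);
    }
}
```

With this sketch, a gzip-compressed tar header yields "tar.gz", while a
gzip stream with any other payload yields "gz". The bzip2 branch shows the
limitation: without a decoder for the outer layer, the inner type stays
unknown.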

What is the most efficient way to do this? We cannot afford to read whole
files into memory, as the application has to be able to handle a large
number of concurrent requests. Could somebody please provide an example of
how this could be done?
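
One direction we have been considering, sketched below under stated
assumptions, is to keep only a bounded prefix of the stream in memory while
the rest passes through untouched. PrefixCapturingInputStream is a
hypothetical name, and the prefix limit is an arbitrary choice. The
captured bytes could then be handed to a detector; as far as I can tell,
the Tika facade has a detect(byte[] prefix) method that would fit, but
that call is not part of this sketch.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical sketch: remembers the first 'limit' bytes read through it. */
class PrefixCapturingInputStream extends FilterInputStream {

    private final byte[] prefix;
    private int captured;

    PrefixCapturingInputStream(InputStream in, int limit) {
        super(in);
        this.prefix = new byte[limit];
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1 && captured < prefix.length) {
            prefix[captured++] = (byte) b;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0 && captured < prefix.length) {
            // Copy only as much as still fits into the bounded buffer.
            int toCopy = Math.min(n, prefix.length - captured);
            System.arraycopy(buf, off, prefix, captured, toCopy);
            captured += toCopy;
        }
        return n;
    }

    /** mark/reset would capture the same bytes twice, so it is disabled. */
    @Override
    public boolean markSupported() {
        return false;
    }

    /** True once 'limit' bytes have been seen. */
    boolean prefixComplete() {
        return captured == prefix.length;
    }

    /** Returns a copy of the bytes captured so far. */
    byte[] capturedPrefix() {
        return Arrays.copyOf(prefix, captured);
    }
}
```

The memory cost per request is fixed by the limit passed to the
constructor, regardless of the size of the upload. What we don't know is
how large the prefix has to be for reliable detection, especially for the
compressed archive types mentioned above.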

We have an open issue <https://github.com/strongbox/strongbox/issues/370> and
a pull request here
<https://github.com/strongbox/strongbox/pull/468/files#diff-8024b836036b6f5fb567a3ce48c2a4d6R221>,
if anyone would like to have a closer look and help out.

Looking forward to your suggestions and replies!
Kind regards,

Martin Todorov