Tika, mail # user - Problem extracting date from Outlook 2007 .msg file


Problem extracting date from Outlook 2007 .msg file
Joe Wicentowski 2012-06-25, 20:14
Hi all,

I'm building an application that uses Tika to extract text from
Outlook 2007 .msg files, among other things.  While experimenting with
some sample .msg files, I noticed that Tika is not returning the date
of most messages.  For example, Outlook indicates that the following
message was sent on "Fri 6/22/2012 8:11 AM", but no date appears in
the HTML head or in the early portion of the body of the Tika output
[1].  I produced this output with Tika 1.1 on Windows XP using the
following command:

  java -jar tika-app-1.1.jar "C:\Documents and Settings\wicentowskijc\Desktop\portal\outlook\RE  Inquiry.msg" > inquiry.html

If anyone has suggestions for ensuring that the date can be preserved
in Tika's output, I'd be grateful.

Thanks,
Joe
[1] Tika output showing no date

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta name="Message-Bcc" content="" />
        <meta name="subject" content="Inquiry" />
        <meta name="Content-Length" content="40960" />
        <meta name="Message-Recipient-Address" content="[EMAIL PROTECTED]" />
        <meta name="Message-From" content="History Mailbox" />
        <meta name="Author" content="History Mailbox" />
        <meta name="Message-To" content="'Snip'" />
        <meta name="Message-Cc" content="" />
        <meta name="Content-Type" content="application/vnd.ms-outlook" />
        <meta name="resourceName" content="RE  Inquiry.msg" />
    </head>
    <body>
        <h1>RE: Inquiry</h1>
        <dl>
            <dt>From</dt>
            <dd>History Mailbox</dd>
            <dt>To</dt>
            <dd>'Snip'</dd>
            <dt>Recipients</dt>
            <dd>[EMAIL PROTECTED]</dd>
        </dl>
        <p>Dear Snip</p>
...
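
To run the same check across many sample files, the inspection of the head could be scripted. The following is only a minimal sketch, assuming the Tika XHTML output has been saved to a file such as inquiry.html. The helper name `date_meta_tags` and the keyword list `DATE_HINTS` are illustrative assumptions, not part of Tika, and the UTF-8 encoding is a guess about how the redirected output was written.

```python
import sys
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags in Tika's XHTML output."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here
        # by HTMLParser's default handle_startendtag.
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            if name is not None:
                self.meta[name] = attributes.get("content") or ""


# Hypothetical keywords; adjust to whatever metadata names are of interest.
DATE_HINTS = ("date", "created", "modified", "saved")


def date_meta_tags(xhtml_text):
    """Return the meta entries whose name looks date-related."""
    parser = MetaCollector()
    parser.feed(xhtml_text)
    parser.close()
    return {
        name: value
        for name, value in parser.meta.items()
        if any(hint in name.lower() for hint in DATE_HINTS)
    }


if __name__ == "__main__":
    # Usage: python check_dates.py inquiry.html
    # The encoding is an assumption; replace undecodable bytes rather than fail.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        found = date_meta_tags(handle.read())
    if found:
        for name, value in found.items():
            print(name, "=", value)
    else:
        print("no date-like metadata found")
```

Run against the head shown in [1], none of the meta names match the keyword list, so the script would report that no date-like metadata was found.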