Re: [jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility
Lance Norskog 2012-03-01, 02:52
integrations/org.apache.mahout.text.SequenceFilesFromMailArchives
Used in examples/bin/asf-email-examples.sh

On Wed, Feb 29, 2012 at 6:27 AM, Frank Scholten <[EMAIL PROTECTED]> wrote:
> Ah of course! Good one.
>
> Do you know if there is an existing tool to index those emails?
>
> On Sat, Feb 25, 2012 at 4:10 AM, Lance Norskog <[EMAIL PROTECTED]> wrote:
>> Apache mail files? You need an AWS account to pull them.
>>
>> http://www.lucidimagination.com/search/document/1ab0374bd10d8d89
>>
>> On Fri, Feb 24, 2012 at 8:56 AM, Frank Scholten (Commented) (JIRA)
>> <[EMAIL PROTECTED]> wrote:
>>>
>>> [ https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215734#comment-13215734 ]
>>>
>>> Frank Scholten commented on MAHOUT-944:
>>> ---------------------------------------
>>>
>>> Renamed config to LuceneStorageConfig and simplified serialization. Added AbstractLuceneStorageTest with helper methods for indexing documents.
>>>
>>> https://github.com/frankscholten/mahout/commit/41ee459d075c3aa2e10eab4eb5580cbf505fcbf6
>>>
>>> Does anyone know of a large index I can use for testing? Wikipedia is not that big, the sequential lucene2seq version takes only 3,5 minutes on my machine to convert it into a sequence file.
>>>
>>>> LuceneIndexToSequenceFiles (lucene2seq) utility
>>>> -----------------------------------------------
>>>>
>>>>                 Key: MAHOUT-944
>>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-944
>>>>             Project: Mahout
>>>>          Issue Type: New Feature
>>>>          Components: Integration
>>>>    Affects Versions: 0.5
>>>>            Reporter: Frank Scholten
>>>>            Assignee: Grant Ingersoll
>>>>            Priority: Minor
>>>>             Fix For: 0.7
>>>>
>>>>         Attachments: MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch
>>>>
>>>>
>>>> Here is a lucene2seq tool I used in a project. It creates sequence files based on the stored fields of a lucene index.
>>>> The output from this tool can be then fed into seq2sparse and from there you can do text clustering.
>>>> Comes with Java bean configuration.
>>>> Let me know what you think. Some CLI code can be added later on. I used this for a small-scale project +- 100.000 docs. Is a MR version useful or is that overkill?
>>>> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and review comments from Simon Willnauer (Thanks Simon!)
>>>> or the attached patch.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> [EMAIL PROTECTED]
>>

--
Lance Norskog
[EMAIL PROTECTED]