-
Calculating IDF value more efficiently
Kasun Perera 2012-04-28, 02:38
This is my program to calculate the TF-IDF value for a document in a collection of documents. It works fine, but it takes a lot of time to calculate the "IDF" values (finding the number of documents that contain a particular term).
Is there a more efficient way to find the number of documents that contain a particular term?
freq = termsFreq.getTermFrequencies();
terms = termsFreq.getTerms();
int noOfTerms = terms.length;
score = new float[noOfTerms];
DefaultSimilarity simi = new DefaultSimilarity();
for (i = 0; i < noOfTerms; i++) {
    int noofDocsContainTerm = noOfDocsContainTerm(terms[i]);
    float tf = simi.tf(freq[i]);
    float idf = simi.idf(noofDocsContainTerm, noOfDocs);
    score[i] = tf * idf;
}
////
public int noOfDocsContainTerm(String querystr) throws CorruptIndexException, IOException, ParseException {
    QueryParser qp = new QueryParser(Version.LUCENE_35, "docuemnt", new StandardAnalyzer(Version.LUCENE_35));
    Query q = qp.parse(querystr);
    int hitsPerPage = docNames.length; // minimum number of search results
    IndexSearcher searcher = new IndexSearcher(ramMemDir, true);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    return hits.length;
}
--
Regards
Kasun Perera
-
Re: Calculating IDF value more efficiently
Robert Muir 2012-05-06, 23:32
Look at IndexReader.docFreq
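To illustrate the suggestion: in Lucene 3.x, IndexReader.docFreq(Term) returns the document count for a term directly from the index, so a call such as reader.docFreq(new Term("docuemnt", terms[i])) can stand in for parsing and running a query per term. The plain-Java sketch below shows the same idea without Lucene: count document frequencies once, then look each term up. The class name DocFreqSketch, the helper names, and the sample documents are hypothetical, and the tf and idf formulas are assumed to match Lucene 3.x DefaultSimilarity (square root of the frequency, and 1 + ln(numDocs / (docFreq + 1))).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: precompute document frequencies in one pass
// instead of running one search per term.
class DocFreqSketch {

    /** Maps each term to the number of documents that contain it. */
    public static Map<String, Integer> buildDocFreq(List<List<String>> docs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            // A set ensures each document counts at most once per term.
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        return docFreq;
    }

    /** Assumed to mirror DefaultSimilarity.tf in Lucene 3.x. */
    public static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    /** Assumed to mirror DefaultSimilarity.idf in Lucene 3.x. */
    public static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Made-up sample collection of three tokenized documents.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "index", "search"),
                Arrays.asList("lucene", "lucene", "score"),
                Arrays.asList("index", "term"));

        Map<String, Integer> docFreq = buildDocFreq(docs);
        int numDocs = docs.size();

        // Score "lucene" in the second document, where it occurs twice.
        float score = tf(2) * idf(docFreq.get("lucene"), numDocs);
        System.out.println("tf-idf(lucene, doc 1) = " + score);
    }
}
```

Each lookup is then a single map access, whereas the original noOfDocsContainTerm parses a query, opens a searcher, and collects every hit for each term.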
On Fri, Apr 27, 2012 at 10:38 PM, Kasun Perera <[EMAIL PROTECTED]> wrote:
> This is my program to calculate TF-IDF value for a document in a collection
> of documents. This is working fine, but takes lot of time when calculating
> the "IDF" values (finding the no of documents which contains particular
> term).
>
> Is there a more efficient way of finding the no of documents which contains
> a particular term?
>
> freq = termsFreq.getTermFrequencies();
>
> terms = termsFreq.getTerms();
>
> int noOfTerms = terms.length;
>
> score = new float[noOfTerms];
> DefaultSimilarity simi = new DefaultSimilarity();
>
> for (i = 0; i < noOfTerms; i++) {
>
> int noofDocsContainTerm = noOfDocsContainTerm(terms[i]);
>
> float tf = simi.tf(freq[i]);
>
> float idf = simi.idf(noofDocsContainTerm, noOfDocs);
>
> score[i] = tf * idf ;
>
> }
>
> ////
>
> public int noOfDocsContainTerm(String querystr) throws
> CorruptIndexException, IOException, ParseException{
>
> QueryParser qp=new QueryParser(Version.LUCENE_35, "docuemnt", new
> StandardAnalyzer(Version.LUCENE_35));
>
> Query q=qp.parse(querystr);
>
> int hitsPerPage = docNames.length; //minumum number or search results
> IndexSearcher searcher = new IndexSearcher(ramMemDir, true);
> TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
>
> searcher.search(q, collector);
>
> ScoreDoc[] hits = collector.topDocs().scoreDocs;
>
> return hits.length;
> }
>
>
> --
> Regards
>
> Kasun Perera
-- lucidimagination.com