-
Getting the frequencies by corresponding order of documents were indexed
Kasun Perera 2012-05-11, 07:58
I have a collection of documents (say 10 documents) and I'm indexing them this way, storing the term vector:
StringReader strRdElt = new StringReader(content);
Document doc = new Document();
String docname = docNames[docNo];
doc.add(new Field("doccontent", strRdElt, Field.TermVector.YES));
IndexWriter iW;
try {
    NIOFSDirectory dir = new NIOFSDirectory(new File(pathToIndex));
    iW = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_35,
            new StandardAnalyzer(Version.LUCENE_35)));
    iW.addDocument(doc);
    iW.close();
}
After indexing all the documents, I'm getting the term frequencies of each document this way:
IndexReader re = IndexReader.open(FSDirectory.open(new File(pathToIndex)), true);
TermFreqVector[] termsFreq = new TermFreqVector[noOfDocs];
for (int i = 0; i < noOfDocs; i++) {
    termsFreq[i] = re.getTermFreqVector(i, "doccontent");
}
My problem is that the term frequency vectors don't come back in the corresponding order. For example, for the 2nd document that I indexed, I'm getting its terms and term frequencies at "termsFreq[9]".
What is the reason for that? How can I get the term frequencies in the order in which I indexed the documents?
-- Regards
Kasun Perera
-
Re: Getting the frequencies by corresponding order of documents were indexed
Ian Lea 2012-05-11, 11:22
I can't spot anything obviously wrong in your code, and what you are trying to do should work. Are you positive that what you think is the second doc is really being added second? You only show one doc being added. Are there already 7 docs in the index before you start?
-- Ian.
-
Re: Getting the frequencies by corresponding order of documents were indexed
Kasun Perera 2012-05-11, 11:35
On Fri, May 11, 2012 at 4:52 PM, Ian Lea <[EMAIL PROTECTED]> wrote:
> Can't spot anything obviously wrong in your code and what you are
> trying to do should work. Are you positive that what you think is the
> second doc is really being added second? You only show one doc being
> added. Are there already 7 docs in the index before you start?
Hi Ian,
Yes, I'm sure the 2nd doc is added second; I have used the debugger several times to confirm it. If I index 10 documents, I get 10 term frequency vectors, but their positions are changed. I gave doc #2 as an example; the 5th term frequency vector likewise corresponds to a different doc, and so on.
I figured out a way to overcome this, but it may not be efficient. I store another field at indexing time, and based on the content of that field I'm able to map each doc to its term frequency vector. Is there a more efficient way? Could this be a bug in Lucene?
Thanks
-- Regards
Kasun Perera
-
Re: Getting the frequencies by corresponding order of documents were indexed
Ian Lea 2012-05-11, 12:50
What version of Lucene are you using? If not the latest, try that. If you really think there is a Lucene bug, post a small self-contained test case that demonstrates the problem.
-- Ian.
-
Re: Getting the frequencies by corresponding order of documents were indexed
Erick Erickson 2012-05-14, 11:30
In general you can't rely on anything like this. I admit the merge stuff isn't my area of expertise, but when segments are merged, there's no guarantee that they're merged in order. In general the internal Lucene doc ID should be treated as predictable only for closed segments.
Your solution of using your own unique ID is much better.
Best,
Erick
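The approach both posters settle on, carrying your own ID with each document and using it to line results up, could be sketched roughly as follows. This is a minimal illustration in plain Java, not code from the thread: the `Entry` holder, the `align` helper, and the sample terms are all hypothetical. It assumes the application has already read, for each internal doc number, the ID it stored at indexing time (in Lucene 3.5 something like `re.document(i).get("docid")` for an assumed stored field named "docid") together with that document's terms.

```java
import java.util.ArrayList;
import java.util.List;

/** Re-aligns per-document data read in internal doc-ID order to the application's own numbering. */
final class AlignByStoredId {

    /** Hypothetical holder: the ID the application stored with the document, plus that document's terms. */
    static final class Entry {
        final int storedId;
        final String[] terms;

        Entry(int storedId, String[] terms) {
            this.storedId = storedId;
            this.terms = terms;
        }
    }

    /**
     * Puts each entry at the slot named by its stored ID, so the result follows
     * the application's numbering whatever order the entries arrive in.
     */
    static String[][] align(List<Entry> inReadOrder, int noOfDocs) {
        String[][] aligned = new String[noOfDocs][];
        for (Entry e : inReadOrder) {
            if (e.storedId < 0 || e.storedId >= noOfDocs) {
                throw new IllegalArgumentException("Unexpected stored ID: " + e.storedId);
            }
            aligned[e.storedId] = e.terms;
        }
        return aligned;
    }

    public static void main(String[] args) {
        // Simulated read order: the document the application numbered 1 comes back first.
        List<Entry> inReadOrder = new ArrayList<Entry>();
        inReadOrder.add(new Entry(1, new String[] {"lucene", "index"}));
        inReadOrder.add(new Entry(0, new String[] {"term", "vector"}));

        String[][] aligned = align(inReadOrder, 2);
        System.out.println(String.join(",", aligned[0])); // prints "term,vector"
        System.out.println(String.join(",", aligned[1])); // prints "lucene,index"
    }
}
```

The extra cost is one stored-field read per document and a single pass over the results, so the mapping should not be a real efficiency concern for a collection of this size.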