-
Scaling Lucene to 1bln docs | Shelly_Singh | 2010-08-10, 06:54
Hi,
I am developing an application that uses Lucene to index and search 1 billion documents. The documents are very small, though: each has a single field of 5-10 words, so I believe my data size is within the tested limits.

I am using the following configuration:
1. 1.5 GB of RAM for the JVM
2. 100 GB of disk space
3. Index creation tuning factors:
   a. mergeFactor = 10
   b. maxFieldLength = 10
   c. maxMergeDocs = 5000000 (with a larger value, I get an out-of-memory error)

With these settings, I can create an index of 100 million docs (10^8) in 15 minutes, using 2.5 GB of disk space. That is quite satisfactory for me, but I would still like to know what else can be done to tune it further. Please help.

Also, with these settings, can I expect the time and size to grow linearly for 1 billion (10^9) documents?

Thanks and Regards,
Shelly Singh
Center For Knowledge Driven Information Systems, Infosys
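
For reference, the purely linear baseline implied by the question can be worked out as below. This is only a sketch of the arithmetic: the class and method names are made up for illustration, and real indexing time may deviate from the baseline once larger merges come into play.

```java
class LinearEstimate {
    // Scale a measured quantity in proportion to the number of documents.
    static double scale(double measured, long measuredDocs, long targetDocs) {
        return measured * ((double) targetDocs / measuredDocs);
    }

    public static void main(String[] args) {
        long measuredDocs = 100_000_000L;   // 10^8 docs, as measured in the post
        long targetDocs = 1_000_000_000L;   // 10^9 docs, the stated goal

        double minutes = scale(15.0, measuredDocs, targetDocs); // 15 minutes measured
        double diskGb = scale(2.5, measuredDocs, targetDocs);   // 2.5 GB measured

        // Linear baseline only; it does not model merge overhead.
        System.out.printf("linear baseline: %.0f minutes, %.1f GB%n", minutes, diskGb);
    }
}
```

Any measured result for 10^9 documents can then be compared against this baseline to see how far from linear the growth actually is.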
-
Re: Scaling Lucene to 1bln docs | Anshum | 2010-08-10, 07:25
Hi Shelly,
That seems like a reasonable data set size. I'd suggest you increase your mergeFactor, as a mergeFactor of 10 means you are only buffering 10 docs in memory before writing them to a file (and incurring I/O). You could actually flush by RAM usage instead of by doc count. Turn off the compound file structure for indexing, as creating a CFS index generally takes more time.

Also, the time would not grow linearly: the larger the segments get, the more time it takes to add more docs and merge them together intermittently.

You may also use a multithreaded approach if reading the source takes time in your case, though the IndexWriter would have to be shared among all threads.

--
Anshum Gupta
http://ai-cafe.blogspot.com
-
Re: Scaling Lucene to 1bln docs | Danil ŢORIN | 2010-08-10, 07:36
The problem actually won't be the indexing part.
Searching such a large dataset will require a LOT of memory. If you need sorting or faceting on one of the fields, the JVM will explode ;)
Also, GC times on a large JVM heap are pretty disturbing (if you care about your search performance).
So I'd advise you to split the index into shards, each in its own JVM. This way you'll improve both indexing and search performance.
-
RE: Scaling Lucene to 1bln docs | Shelly_Singh | 2010-08-10, 08:01
Hi Anshum,
I am already running with the 'setCompoundFile' option off. And thanks for pointing out mergeFactor. I had tried a higher mergeFactor a couple of days ago but got an OOM, so I discarded it. Later I figured out that the OOM was because maxMergeDocs was unlimited and I was using MMap. You are right, I should try a higher mergeFactor.

With regard to the multithreaded approach, I was considering creating 10 different threads, each indexing 100 million docs, coupled with a MultiSearcher to which I will feed these 10 indices. Do you think this will improve performance?

And just FYI, I have the latest readings for 1 billion docs: indexing time is 2 hours and search time is 15 seconds. I can live with the indexing time, but the search time is highly unacceptable.

Help again.
-
RE: Scaling Lucene to 1bln docs | Shelly_Singh | 2010-08-10, 08:05
Hi Danil,
I get your point. In fact, the latest readings I have for 1 billion docs confirm the same thing. Index creation time is 2 hours, which is fine by me, but search time is 15 seconds, which is too high for any application.

I am planning to shard the indices and then use a MultiSearcher for searching. Will that help?
-
Re: Scaling Lucene to 1bln docs | anshum.gupta@... | 2010-08-10, 08:24
I would like to know: are you using a particular type of sort? Do you need to sort on relevance? Can you shard and functionally restrict your search to a limited set of indexes?

--
Anshum
http://blog.anshumgupta.net
-
Re: Scaling Lucene to 1bln docs | Michael McCandless | 2010-08-10, 08:54
Correction: mergeFactor determines how many segments are merged at once.
It's IndexWriter's ramBufferSizeMB and/or maxBufferedDocs that determine how many docs are buffered in RAM before a new segment is flushed.

A higher mergeFactor will require more RAM during merging, will cause longer-running but fewer merges, requires more open files, and allows your index to have more segments for a given number of indexed docs.

Mike
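
The distinction can be illustrated with a toy model. The sketch below is not Lucene code and does not reproduce its actual merge policy. It assumes a simplified scheme in which every flush produces one level-0 segment and any level that accumulates mergeFactor segments is merged into a single segment on the next level. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class MergeSimulation {
    /**
     * Simplified model (an assumption, not Lucene's real policy): each flush adds one
     * level-0 segment; when a level holds mergeFactor segments, they are merged into
     * one segment on the next level. Returns the segments remaining at each level.
     */
    static List<Integer> segmentsPerLevel(long flushes, int mergeFactor) {
        if (mergeFactor < 2) throw new IllegalArgumentException("mergeFactor must be >= 2");
        List<Integer> levels = new ArrayList<>();
        for (long i = 0; i < flushes; i++) {
            int level = 0;
            while (true) {
                if (level == levels.size()) levels.add(0);
                int count = levels.get(level) + 1;
                if (count < mergeFactor) {
                    levels.set(level, count);   // no merge needed at this level
                    break;
                }
                levels.set(level, 0);           // mergeFactor segments merged away...
                level++;                        // ...into one segment one level up
            }
        }
        return levels;
    }

    static int totalSegments(long flushes, int mergeFactor) {
        int total = 0;
        for (int n : segmentsPerLevel(flushes, mergeFactor)) total += n;
        return total;
    }

    public static void main(String[] args) {
        long docs = 1_234_000L;
        int maxBufferedDocs = 1_000;            // docs buffered in RAM per flush
        long flushes = docs / maxBufferedDocs;  // number of flushed segments
        for (int mergeFactor : new int[] {10, 30}) {
            System.out.println("mergeFactor=" + mergeFactor
                + " -> segments per level " + segmentsPerLevel(flushes, mergeFactor)
                + ", total " + totalSegments(flushes, mergeFactor));
        }
    }
}
```

In this model the buffer size only decides how many flushes occur, while mergeFactor decides how those flushed segments collapse. With 1,234 flushes, a mergeFactor of 10 leaves 10 segments and a mergeFactor of 30 leaves 16, which matches the point that a higher mergeFactor allows more segments for the same number of docs.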
-
Re: Scaling Lucene to 1bln docs | findbestopensource | 2010-08-10, 11:46
Hi Shelly,
You need to reduce your maxMergeDocs. Set ramBufferSizeMB to 100, which will help you use less RAM during indexing.

> search time is 15 secs..

How are you calculating this time? Are you just taking the time difference before and after the search method, or does it include the time to parse the document objects and display them in the UI? Is this your first or second search? Usually the first couple of searches take more time, as the index is still being warmed. Benchmark the first 10-20 searches and see whether the time has come down.

Is your index optimized? An optimized index may take less time to search.

Regards
Aditya
www.findbestopensource.com
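
One way to take that kind of measurement is sketched below. It times only the search call itself, once per consecutive run, so the cold first searches can be compared with later warmed ones. `SearchTimer` and the placeholder workload are assumptions made for illustration; in a real test the supplier would wrap the actual search call and nothing else (no document parsing, no UI work).

```java
import java.util.function.Supplier;

class SearchTimer {
    /** Calls the search `runs` times in a row and returns each call's latency in nanoseconds. */
    static long[] timeEach(Supplier<?> search, int runs) {
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            search.get();                        // only the search call is timed
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for a real search call (assumption).
        Supplier<Integer> fakeSearch = () -> {
            int hits = 0;
            for (int i = 0; i < 1_000_000; i++) hits += (i % 7 == 0) ? 1 : 0;
            return hits;
        };
        long[] nanos = timeEach(fakeSearch, 20);
        for (int i = 0; i < nanos.length; i++) {
            System.out.printf("search %2d: %.3f ms%n", i + 1, nanos[i] / 1_000_000.0);
        }
    }
}
```

Printing every run rather than a single average makes it easy to see whether the 15 seconds is a cold-start cost or the steady-state latency.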
-
RE: Scaling Lucene to 1bln docs | Shelly_Singh | 2010-08-10, 12:19
No sort. I will need relevance based on TF. If I shard, I will have to search in all indices.
-
Re: Scaling Lucene to 1bln docs | Anshum | 2010-08-10, 12:28
Searching all indices, instead of a single huge index, shouldn't be that bad an idea, especially considering you have a constraint on usable memory. You could use a ParallelMultiSearcher, which spawns threads to query across indexes and merges the results.

What I asked was: is there a way for you to leave out a few indexes each time you query? For example, when designing an engine for timeline-based search, you would shard the index by time period, and since a query would be associated with a particular period, you would only query the indexes containing data for that period. This keeps the data manageable and searchable within a reasonable time.

--
Anshum Gupta
http://ai-cafe.blogspot.com
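
The timeline example could be sketched roughly as follows. This assumes one shard per calendar year. The router class and the shard names are hypothetical, and the code only picks which shards to query, leaving the actual search out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class TimeShardRouter {
    // Year covered by a shard -> shard name (hypothetical naming scheme).
    private final TreeMap<Integer, String> shardsByYear = new TreeMap<>();

    void addShard(int year, String shardName) {
        shardsByYear.put(year, shardName);
    }

    /** Returns the shards covering the inclusive range [fromYear, toYear], oldest first. */
    List<String> shardsFor(int fromYear, int toYear) {
        if (fromYear > toYear) return new ArrayList<>();
        return new ArrayList<>(shardsByYear.subMap(fromYear, true, toYear, true).values());
    }

    public static void main(String[] args) {
        TimeShardRouter router = new TimeShardRouter();
        for (int year = 2007; year <= 2010; year++) {
            router.addShard(year, "index-" + year);
        }
        // A query limited to 2009-2010 touches two of the four shards.
        System.out.println(router.shardsFor(2009, 2010));
    }
}
```

This only helps when queries naturally carry such a restriction. If every query must consider all the data, every shard still has to be searched.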
-
RE: Scaling Lucene to 1bln docsDan OConnor 2010-08-10, 12:32
Shelly:
You wouldn't necessarily have to use a MultiSearcher. A suggested alternative is:

- shard into 10 indices. If you need the concept of a date range search, I would assign the documents to the shard by date, otherwise random assignment is fine.
- have a pool of IndexSearchers for each index
- when a search comes in, allocate a Searcher from each index to the search.
- perform the search in parallel across all indices.
- merge the results in your own code using an efficient merging algorithm.

Regards,
Dan
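The steps Dan lists (one searcher per shard, parallel search, merge in your own code) can be sketched roughly as follows. This is only an illustration under stated assumptions: `ShardSearcher`, `Hit` and `ShardedSearch` are hypothetical names, not Lucene classes, and a real `ShardSearcher` would wrap an IndexSearcher opened on one shard. The merge uses a bounded min-heap, which is one reasonable choice of "efficient merging algorithm", not necessarily the one Dan had in mind.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical stand-in for one shard's searcher; in a real setup this would
// wrap a Lucene IndexSearcher opened on that shard's directory.
interface ShardSearcher {
    // Returns this shard's best hits for the query, at most topN of them.
    List<Hit> search(String query, int topN) throws Exception;
}

// A scored hit tagged with the shard it came from (doc ids are per shard).
final class Hit {
    final int shard;
    final int doc;
    final float score;

    Hit(int shard, int doc, float score) {
        this.shard = shard;
        this.doc = doc;
        this.score = score;
    }
}

final class ShardedSearch {
    // Submits the query to every shard in parallel, then merges the answers.
    static List<Hit> searchAll(List<ShardSearcher> shards, ExecutorService pool,
                               String query, int topN) throws Exception {
        List<Future<List<Hit>>> pending = new ArrayList<>();
        for (ShardSearcher shard : shards) {
            Callable<List<Hit>> task = () -> shard.search(query, topN);
            pending.add(pool.submit(task));
        }
        List<List<Hit>> perShard = new ArrayList<>();
        for (Future<List<Hit>> f : pending) {
            perShard.add(f.get()); // blocks until that shard has answered
        }
        return mergeTopN(perShard, topN);
    }

    // Keeps the topN highest-scoring hits overall using a min-heap of size topN.
    static List<Hit> mergeTopN(List<List<Hit>> perShard, int topN) {
        Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (List<Hit> hits : perShard) {
            for (Hit h : hits) {
                heap.offer(h);
                if (heap.size() > topN) {
                    heap.poll(); // evict the current lowest score
                }
            }
        }
        List<Hit> merged = new ArrayList<>(heap);
        merged.sort(byScore.reversed()); // best hit first
        return merged;
    }
}
```

One caveat worth keeping in mind: scores computed against separate indices are only roughly comparable, because each shard has its own term statistics. With random assignment of documents to shards the statistics should be similar across shards, so a plain score merge like this is usually a fair approximation.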
-
RE: Scaling Lucene to 1bln docsShelly_Singh 2010-08-10, 12:35
I do not see a way to optimally decide how to shard the data. It's very difficult for my purpose, so the safe bet is to assume that all indices will need to be searched.

Okay, I can try ParallelMultiSearcher in addition to MultiSearcher.
-
RE: Scaling Lucene to 1bln docsShelly_Singh 2010-08-10, 12:38
- shard into 10 indices. If you need the concept of a date range search, I would assign the documents to the shard by date, otherwise random assignment is fine.
Shelly - Actually my documents are originally database records with each being equally important.

- have a pool of IndexSearchers for each index
- when a search comes in, allocate a Searcher from each index to the search.
- perform the search in parallel across all indices.
Shelly - Is it different from MultiSearcher or ParallelMultiSearcher?

- merge the results in your own code using an efficient merging algorithm.

**************** CAUTION - Disclaimer *****************
This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely for the use of the addressee(s). If you are not the intended recipient, please notify the sender by e-mail and delete the original message. Further, you are not to copy, disclose, or distribute this e-mail or its contents to any other person and any such actions are unlawful. This e-mail may contain viruses. Infosys has taken every reasonable precaution to minimize this risk, but is not liable for any damage you may sustain as a result of any virus in this e-mail. You should carry out your own virus checks before opening the e-mail or attachment. Infosys reserves the right to monitor and review the content of all messages sent to or from this e-mail address. Messages sent to or from this e-mail address may be stored on the Infosys e-mail system.
***INFOSYS******** End of Disclaimer ********INFOSYS***
-
Re: Scaling Lucene to 1bln docsDanil ŢORIN 2010-08-10, 12:41
I'd second that.
It doesn't have to be date for sharding. Maybe every query has some specific field, like UserId or something, so you can redirect to a specific shard instead of hitting all 10 indices.

You have to have some kind of narrowing: searching 1bn documents with queries that may hit all documents is useless. A user won't look at more than, let's say, 100 results (if presented properly, maybe 1000).

Those fields that narrow the result set are good candidates for sharding keys.
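As a rough illustration of the sharding-key idea above: if every query carries a field such as a user id, that field can be hashed to pick a single shard, and only queries without it need to fan out to all indices. `ShardRouter` and its methods are hypothetical names made up for this sketch, and hashing the key is an assumption; the message itself does not prescribe how the key maps to a shard.

```java
import java.util.ArrayList;
import java.util.List;

final class ShardRouter {
    private final int numShards;

    ShardRouter(int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        this.numShards = numShards;
    }

    // Maps a sharding key (e.g. a user id) to one shard. floorMod keeps the
    // result in [0, numShards) even when hashCode() is negative.
    int shardFor(String key) {
        return Math.floorMod(key.hashCode(), numShards);
    }

    // A query that carries the key touches one shard; a query without it
    // (null here) has to be sent to every shard.
    List<Integer> shardsToSearch(String keyOrNull) {
        List<Integer> targets = new ArrayList<>();
        if (keyOrNull != null) {
            targets.add(shardFor(keyOrNull));
        } else {
            for (int i = 0; i < numShards; i++) {
                targets.add(i);
            }
        }
        return targets;
    }
}
```

The same router has to be used at indexing time and at query time, otherwise a document and the queries meant to find it end up on different shards.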
-
Re: Scaling Lucene to 1bln docsprashant ullegaddi 2010-08-10, 12:43
You might want to take a look at RemoteSearchable (http://lucene.apache.org/java/2_9_2/api/contrib-remote/org/apache/lucene/search/RemoteSearchable.html) -- it'll be helpful if you place shards on different servers.

Thanks and Regards,
Prashant Ullegaddi,
Search and Information Extraction Lab,
IIIT-Hyderabad, India.
-
RE: Scaling Lucene to 1bln docsShelly_Singh 2010-08-10, 12:55
Hmm..I get the point. But, in my application, the document is basically a descriptive name of a particular thing. The user will search by name (or part of name) and I need to pull out all info pointed to by that name. This info is externalized in a db.
One option I can think of is:
I can shard based on the starting letter of a name. So, "Alan Mathur of New Delhi" may go to shard "A". But since the name will have 'n' tokens, and the user may type any one token, this will not work. I can further tweak this such that I index the same document into multiple indices (one for each token). So, the same document may be indexed into shards "A", "M", "N" and "D".
I am not able to think of another option.

Comments welcome.
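To get a feel for how much duplication the one-index-per-token scheme above implies, the set of first-letter shards for a name can be computed as in the sketch below. `FirstLetterShards` is a hypothetical helper written for illustration. It splits on whitespace and does no stop-word removal, so "of" also contributes a shard ("O"), which the list in the message leaves out.

```java
import java.util.Set;
import java.util.TreeSet;

final class FirstLetterShards {
    // Returns the first-letter shards a name would be indexed into: one per
    // distinct (upper-cased) initial among its whitespace-separated tokens.
    static Set<Character> shardsFor(String name) {
        Set<Character> shards = new TreeSet<>();
        for (String token : name.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                shards.add(Character.toUpperCase(token.charAt(0)));
            }
        }
        return shards;
    }
}
```

Every distinct initial means one more copy of the document, so the total index size grows with the average number of distinct initials per name, up to 26 copies in the worst case.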
-
Re: Scaling Lucene to 1bln docsDanil ŢORIN 2010-08-10, 13:22
That won't work...if you have something like "A Basic Crazy Document E-something F-something G-something....you get the point" it will go to all shards, so the whole point of sharding will be compromised...you'll have a 26 billion document index ;)

Looks like the only way is to search all shards. Depending on available hardware (1 Azul...50 EC2), expected traffic (1qps...1000qps), expected query time (10 msec ... 3 sec), redundancy (it's a large dataset, I don't think you want to lose it), and so on...you'll have to decide how many partitions you want. It may work with 8-10, it may need 50-64. (I usually use 2^n as it's easier to split each shard in 2 when the index grows too much.)

On such large datasets it's a lot of tuning, custom code, and no one-size-fits-all solution. Lucene is just a tool (a fine one) but you need to use it wisely to achieve great results.
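The remark about using 2^n partitions can be illustrated with a small sketch. It assumes documents are assigned to shards by the low bits of a hash, which is an assumption of this example; the message does not say how its author assigns documents. Under that assumption, doubling the shard count from N to 2N sends every document of shard s either to s or to s + N, so each shard splits cleanly in two and nothing moves between unrelated shards. `PowerOfTwoShards` is a hypothetical name.

```java
final class PowerOfTwoShards {
    // Shard for a hash when the shard count is a power of two: keep the low
    // bits of the hash. The result is always in [0, numShards).
    static int shardFor(int hash, int numShards) {
        if (numShards <= 0 || Integer.bitCount(numShards) != 1) {
            throw new IllegalArgumentException("numShards must be a power of two");
        }
        return hash & (numShards - 1);
    }
}
```

For example, with 8 shards a document in shard 3 lands in shard 3 or shard 11 after growing to 16 shards, depending on one extra bit of its hash.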
-
RE: Scaling Lucene to 1bln docsShelly_Singh 2010-08-10, 13:32
Hi All,
Some very promising findings... for 100 mln (a factor of 10 less than my goal), I could bring the search speed down to 'single-digit' milliseconds. The major change is that I am now optimizing the index, which I was shying away from doing earlier.

For fun, I am planning to take a reading of 1 bln without sharding, and then look at sharding.

Dan! I got your point about why the first letter of tokens is not a good key for deciding shards, but as I foresee, my word length will generally be between 2 and 5. So, it may still be worth a try.
-
RE: Scaling Lucene to 1bln docsShelly_Singh 2010-08-10, 13:41
Hi folks,
Thanks for the excellent support and guidance on my very first day on this mailing list.

At the end of the day, I have very encouraging results: searching 100 mln documents takes less than 1 ms, and the index creation time is not huge either (close to 15 minutes).

I am now approaching the 1 bln mark with roughly the same settings. But I want to understand Norms and TermFilters. Can someone explain why one should or should not use each of these, and what the tradeoffs are?

Regards,
Shelly
-
Re: Scaling Lucene to 1bln docs anshum.gupta@...) 2010-08-10, 13:49
Hey Shelly,
If you want to learn more about Lucene, I'd recommend getting a copy of Lucene in Action, 2nd Ed. It'll help you get the hang of a lot of things! :)

--
Anshum
http://blog.anshumgupta.net
-
Re: Scaling Lucene to 1bln docs Pablo Mendes 2010-08-10, 13:51
Shelly,
Do you mind sharing with the list the final settings you used for your best results?

Cheers,
Pablo
-
RE: Scaling Lucene to 1bln docs Shelly_Singh 2010-08-11, 04:58
My final settings are:
1. 1.5 GB of RAM for the JVM, out of the 2 GB available on my desktop
2. 100 GB disk space
3. Index creation and search tuning factors:
   a. mergeFactor = 10
   b. maxFieldLength = 10
   c. maxMergeDocs = 5000000
   d. full optimize at the end of index creation
   e. readChunkSize = 1000000
   f. TermInfosIndexDivisor = 10
   g. No sharding; single machine.

But Pablo, my document is a single-field document, with the field being 2-5 words long. So you can probably reduce it by a factor of 100 directly if you want to compare with regular docs.
-
Re: Scaling Lucene to 1bln docs Anshum 2010-08-11, 05:07
So, you didn't really use the setRamBuffer.. ?
Any reasons for that?

--
Anshum Gupta
http://ai-cafe.blogspot.com
-
RE: Scaling Lucene to 1bln docs Shelly_Singh 2010-08-11, 05:24
I experimented with it, but somehow (I am not sure why) I got poorer indexing performance with a larger RAM buffer. That was an initial experiment and I did not dig into it. For the time being, I have acceptable indexing speed, so I am only focusing on reducing search time.

Thanks and Regards,
Shelly Singh
-
RE: Scaling Lucene to 1bln docs Shelly_Singh 2010-08-16, 07:12
Hi,
While I get excellent search times on 1 bln documents in Lucene, I am facing a problem when I try to retrieve the documents. If the number of documents returned by Lucene is large (in my example it is 32000), then the document retrieval time is 3 seconds.

My Lucene document is not big; it has 3 fields of 1-2 terms each.
From my code, I can see that most of those 3 seconds go into "reader.getDoc(docId)".
Is there a better way to do this?

Thanks and Regards,
Shelly Singh
-
Re: Scaling Lucene to 1bln docs Danil ŢORIN 2010-08-16, 09:32
Nope, getDoc is the right way to do it.
Those 3 seconds are actually spent finding the proper position to read each document from, and then on IO (disk spinning, head positioning, etc.).

32k documents is quite a lot. A user won't look at all these documents, at least not all at once. Maybe you could add paging: returning a page of 1000 will cut your retrieval time proportionally, to ~100 msec.

If you use the results in some kind of post-processing, maybe you can rework your code to use some kind of queue, so you can start serving documents as soon as possible and the post-processing thread won't wait until all results are available.
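The paging and queue suggestions in the reply above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code from this thread: the class PagedFetch, its methods, and the loader function are hypothetical names, the loader stands in for the per-document read that the thread refers to as getDoc, and no Lucene API is used. It shows (1) slicing the list of matching doc ids into one page, and (2) a producer that loads documents and puts them on a bounded queue so a consumer can start working before the whole page is loaded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.IntFunction;

/** Sketch only: page through hits and stream loaded documents through a queue. */
class PagedFetch {

    /** Sentinel compared by identity; tells the consumer that no more documents follow. */
    static final String END = new String("END");

    /** Returns the doc ids that belong to the given zero-based page (empty past the end). */
    static List<Integer> page(List<Integer> docIds, int pageNumber, int pageSize) {
        if (pageNumber < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageNumber must be >= 0 and pageSize > 0");
        }
        long start = (long) pageNumber * pageSize;
        int from = (int) Math.min(start, docIds.size());
        int to = (int) Math.min(start + pageSize, docIds.size());
        return new ArrayList<>(docIds.subList(from, to));
    }

    /** Loads each document with the (hypothetical) loader and queues it as soon as it is read. */
    static void produce(List<Integer> docIds, IntFunction<String> loader, BlockingQueue<String> queue) {
        try {
            for (int id : docIds) {
                queue.put(loader.apply(id));
            }
            queue.put(END);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while queueing documents", e);
        }
    }

    /** Takes documents off the queue until the sentinel arrives; post-processing would go here. */
    static List<String> consume(BlockingQueue<String> queue) {
        List<String> docs = new ArrayList<>();
        try {
            for (String doc = queue.take(); doc != END; doc = queue.take()) {
                docs.add(doc);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading documents", e);
        }
        return docs;
    }

    public static void main(String[] args) throws InterruptedException {
        // 32000 hits, as in the question; only the first page of 1000 is loaded.
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < 32000; i++) {
            hits.add(i);
        }
        List<Integer> firstPage = page(hits, 0, 1000);

        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        Thread producer = new Thread(() -> produce(firstPage, id -> "doc-" + id, queue));
        producer.start();
        List<String> docs = consume(queue); // runs while the producer is still loading
        producer.join();
        System.out.println(docs.size() + " documents loaded");
    }
}
```

The bounded queue (capacity 100 here, an arbitrary choice) keeps the loader from running far ahead of the consumer. With real disk reads the per-document cost stays the same; paging only reduces how many reads happen per request.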