|
Ali S Kureishy
2012-04-12, 13:04
Darren Govoni
2012-04-12, 13:27
Ali S Kureishy
2012-04-12, 14:21
Darren Govoni
2012-04-12, 14:31
Otis Gospodnetic
2012-04-13, 02:23
Jan Høydahl
2012-04-13, 20:45
Ali S Kureishy
2012-04-13, 23:16
Jan Høydahl
2012-04-14, 11:17
Otis Gospodnetic
2012-04-14, 21:27
Lance Norskog
2012-04-14, 23:11
Jason Rutherglen
2012-04-15, 15:19
Jason Rutherglen
2012-04-17, 00:42
Jan Høydahl
2012-04-17, 09:03
Otis Gospodnetic
2012-04-17, 20:33
Jason Rutherglen
2012-04-17, 20:50
Lukáš Vlček
2012-04-17, 23:00
Jason Rutherglen
2012-04-18, 12:22
Lukáš Vlček
2012-04-18, 13:40
Jason Rutherglen
2012-04-18, 15:41
|
-
Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentAli S Kureishy 2012-04-12, 13:04
Hi,
I'm trying to setup a large scale *Crawl + Index + Search *infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks, *with a search latency of less than 0.5 seconds. Needless to mention, the search index needs to scale to 5Billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above. Lastly, how much hardware (assuming a medium sized EC2 instance) would you estimate my needing with this setup, for regular web-data (HTML text) at this scale? Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :). Many many thanks in advance. Thanks, Safdar
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentDarren Govoni 2012-04-12, 13:27
You could use SolrCloud (for the automatic scaling) and just mount a
fuse[1] HDFS directory and configure solr to use that directory for its data. [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote: > Hi, > > I'm trying to setup a large scale *Crawl + Index + Search *infrastructure > using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, > crawled + indexed every *4 weeks, *with a search latency of less than 0.5 > seconds. > > Needless to mention, the search index needs to scale to 5Billion pages. It > is also possible that I might need to store multiple indexes -- one for > crawled content, and one for ancillary data that is also very large. Each > of these indices would likely require a logically distributed and > replicated index. > > However, I would like for such a system to be homogenous with the Hadoop > infrastructure that is already installed on the cluster (for the crawl). In > other words, I would much prefer if the replication and distribution of the > Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of > using another scalability framework (such as SolrCloud). In addition, it > would be ideal if this environment was flexible enough to be dynamically > scaled based on the size requirements of the index and the search traffic > at the time (i.e. if it is deployed on an Amazon cluster, it should be easy > enough to automatically provision additional processing power into the > cluster without requiring server re-starts). > > However, I'm not sure which Solr-based tool in the Hadoop ecosystem would > be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, > Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is > mature enough and would be the right architectural choice to go along with > a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects > above. > > Lastly, how much hardware (assuming a medium sized EC2 instance) would you > estimate my needing with this setup, for regular web-data (HTML text) at > this scale? > > Any architectural guidance would be greatly appreciated. The more details > provided, the wider my grin :). > > Many many thanks in advance. > > Thanks, > Safdar
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentAli S Kureishy 2012-04-12, 14:21
Thanks Darren.
Actually, I would like the system to be homogenous - i.e., use Hadoop based tools that already provide all the necessary scaling for the lucene index (in terms of throughput, latency of writes/reads etc). Since SolrCloud adds its own layer of sharding/replication that is outside Hadoop, I feel that using SolrCloud would be redundant, and a step in the opposite direction, which is what I'm trying to avoid in the first place. Or am I mistaken? Thanks, Safdar On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni <[EMAIL PROTECTED]> wrote: > You could use SolrCloud (for the automatic scaling) and just mount a > fuse[1] HDFS directory and configure solr to use that directory for its > data. > > [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS > > On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote: > > Hi, > > > > I'm trying to setup a large scale *Crawl + Index + Search *infrastructure > > using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, > > crawled + indexed every *4 weeks, *with a search latency of less than 0.5 > > seconds. > > > > Needless to mention, the search index needs to scale to 5Billion pages. > It > > is also possible that I might need to store multiple indexes -- one for > > crawled content, and one for ancillary data that is also very large. Each > > of these indices would likely require a logically distributed and > > replicated index. > > > > However, I would like for such a system to be homogenous with the Hadoop > > infrastructure that is already installed on the cluster (for the crawl). > In > > other words, I would much prefer if the replication and distribution of > the > > Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of > > using another scalability framework (such as SolrCloud). In addition, it > > would be ideal if this environment was flexible enough to be dynamically > > scaled based on the size requirements of the index and the search traffic > > at the time (i.e. if it is deployed on an Amazon cluster, it should be > easy > > enough to automatically provision additional processing power into the > > cluster without requiring server re-starts). > > > > However, I'm not sure which Solr-based tool in the Hadoop ecosystem would > > be ideal for this scenario. I've heard mention of Solr-on-HBase, > Solandra, > > Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these > is > > mature enough and would be the right architectural choice to go along > with > > a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling > aspects > > above. > > > > Lastly, how much hardware (assuming a medium sized EC2 instance) would > you > > estimate my needing with this setup, for regular web-data (HTML text) at > > this scale? > > > > Any architectural guidance would be greatly appreciated. The more details > > provided, the wider my grin :). > > > > Many many thanks in advance. > > > > Thanks, > > Safdar > > >
-
RE: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentDarren Govoni 2012-04-12, 14:31
Solrcloud or any other tech specific replication isnt going to 'just work' with hadoop replication. But with some significant custom coding anything should be possible. Interesting idea.
br><br><br>------- Original Message ------- On 4/12/2012 09:21 AM Ali S Kureishy wrote:<br>Thanks Darren. <br> <br>Actually, I would like the system to be homogenous - i.e., use Hadoop based <br>tools that already provide all the necessary scaling for the lucene index <br>(in terms of throughput, latency of writes/reads etc). Since SolrCloud adds <br>its own layer of sharding/replication that is outside Hadoop, I feel that <br>using SolrCloud would be redundant, and a step in the opposite <br>direction, which is what I'm trying to avoid in the first place. Or am I <br>mistaken? <br> <br>Thanks, <br>Safdar <br> <br> <br>On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni <[EMAIL PROTECTED]> wrote: <br> <br>> You could use SolrCloud (for the automatic scaling) and just mount a <br>> fuse[1] HDFS directory and configure solr to use that directory for its <br>> data. <br>> <br>> [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS <br>> <br>> On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote: <br>> > Hi, <br>> > <br>> > I'm trying to setup a large scale *Crawl + Index + Search *infrastructure <br>> > using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, <br>> > crawled + indexed every *4 weeks, *with a search latency of less than 0.5 <br>> > seconds. <br>> > <br>> > Needless to mention, the search index needs to scale to 5Billion pages. <br>> It <br>> > is also possible that I might need to store multiple indexes -- one for <br>> > crawled content, and one for ancillary data that is also very large. Each <br>> > of these indices would likely require a logically distributed and <br>> > replicated index. <br>> > <br>> > However, I would like for such a system to be homogenous with the Hadoop <br>> > infrastructure that is already installed on the cluster (for the crawl). <br>> In <br>> > other words, I would much prefer if the replication and distribution of <br>> the <br>> > Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of <br>> > using another scalability framework (such as SolrCloud). In addition, it <br>> > would be ideal if this environment was flexible enough to be dynamically <br>> > scaled based on the size requirements of the index and the search traffic <br>> > at the time (i.e. if it is deployed on an Amazon cluster, it should be <br>> easy <br>> > enough to automatically provision additional processing power into the <br>> > cluster without requiring server re-starts). <br>> > <br>> > However, I'm not sure which Solr-based tool in the Hadoop ecosystem would <br>> > be ideal for this scenario. I've heard mention of Solr-on-HBase, <br>> Solandra, <br>> > Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these <br>> is <br>> > mature enough and would be the right architectural choice to go along <br>> with <br>> > a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling <br>> aspects <br>> > above. <br>> > <br>> > Lastly, how much hardware (assuming a medium sized EC2 instance) would <br>> you <br>> > estimate my needing with this setup, for regular web-data (HTML text) at <br>> > this scale? <br>> > <br>> > Any architectural guidance would be greatly appreciated. The more details <br>> > provided, the wider my grin :). <br>> > <br>> > Many many thanks in advance. <br>> > <br>> > Thanks, <br>> > Safdar <br>> <br>> <br>> <br>
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentOtis Gospodnetic 2012-04-13, 02:23
Hello Ali,
> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure > using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, > crawled + indexed every *4 weeks, *with a search latency of less than 0.5 > seconds. That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things. > Needless to mention, the search index needs to scale to 5Billion pages. It > is also possible that I might need to store multiple indexes -- one for > crawled content, and one for ancillary data that is also very large. Each > of these indices would likely require a logically distributed and > replicated index. Yup, OK. > However, I would like for such a system to be homogenous with the Hadoop > infrastructure that is already installed on the cluster (for the crawl). In > other words, I would much prefer if the replication and distribution of the > Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of > using another scalability framework (such as SolrCloud). In addition, it > would be ideal if this environment was flexible enough to be dynamically > scaled based on the size requirements of the index and the search traffic > at the time (i.e. if it is deployed on an Amazon cluster, it should be easy > enough to automatically provision additional processing power into the > cluster without requiring server re-starts). There is no such thing just yet. There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase. > However, I'm not sure which Solr-based tool in the Hadoop ecosystem would > be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, > Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is > mature enough and would be the right architectural choice to go along with > a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects > above. Here is a summary on all of them: * Search on HBase - I assume you are referring to the same thing I mentioned above. Not ready. * Solandra - uses Cassandra+Solr, plus DataStax now has a different (commercial) offering that combines search and Cassandra. Looks good. * Lily - data stored in HBase cluster gets indexed to a separate Solr instance(s) on the side. Not really integrated the way you want it to be. * ElasticSearch - solid at this point, the most dynamic solution today, can scale well (we are working on a maaaany-B documents index and hundreds of nodes with ElasticSearch right now), etc. But again, not integrated with Hadoop the way you want it. * IndexTank - has some technical weaknesses, not integrated with Hadoop, not sure about its future considering LinkedIn uses Zoie and Sensei already. * And there is SolrCloud, which is coming soon and will be solid, but is again not integrated. If I were you and I had to pick today - I'd pick ElasticSearch if I were completely open. If I had Solr bias I'd give SolrCloud a try first. > Lastly, how much hardware (assuming a medium sized EC2 instance) would you > estimate my needing with this setup, for regular web-data (HTML text) at > this scale? I don't know off the topic of my head, but I'm guessing several hundred for serving search requests. HTH, Otis -- Search Analytics - http://sematext.com/search-analytics/index.html Scalable Performance Monitoring - http://sematext.com/spm/index.html > Any architectural guidance would be greatly appreciated. The more details > provided, the wider my grin :). > > Many many thanks in advance. > > Thanks, > Safdar >
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentJan Høydahl 2012-04-13, 20:45
Hi,
For a web crawl+search like this you will probably need a lot of additional Big Data crunching, so a Hadoop based solution is wise. In addition to those products mentioned we also now have Amazon's own CloudSearch http://aws.amazon.com/cloudsearch/ It's new, is not as cool as Solr (not even Lucene based), but gives you the elasticity you request I guess. If you run your Hadoop cluster in EC2 already it would be quite efficient to batch-load the crawled and processed data into a "SearchDomain" in the same availability zone. However, both cost and features may prohibit this as a realistic choice for you. It would be cool to explore a Hadoop/HDFS + SolrCloud integration. SolrCloud would not build the indexes, but be pulling pre-built indexes from HDFS down to local disk every time it's told to. Or perhaps the SolrCloud nodes could be part of the hadoop cluster, being responsible for the Reduce part building the indexes? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 13. apr. 2012, at 04:23, Otis Gospodnetic wrote: > Hello Ali, > >> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure > >> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, >> crawled + indexed every *4 weeks, *with a search latency of less than 0.5 >> seconds. > > > That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things. > >> Needless to mention, the search index needs to scale to 5Billion pages. It >> is also possible that I might need to store multiple indexes -- one for >> crawled content, and one for ancillary data that is also very large. Each >> of these indices would likely require a logically distributed and >> replicated index. > > > Yup, OK. > >> However, I would like for such a system to be homogenous with the Hadoop >> infrastructure that is already installed on the cluster (for the crawl). In >> other words, I would much prefer if the replication and distribution of the >> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of >> using another scalability framework (such as SolrCloud). In addition, it >> would be ideal if this environment was flexible enough to be dynamically >> scaled based on the size requirements of the index and the search traffic >> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy >> enough to automatically provision additional processing power into the >> cluster without requiring server re-starts). > > > There is no such thing just yet. > There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase. > >> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would >> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, >> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is >> mature enough and would be the right architectural choice to go along with >> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects >> above. > > > Here is a summary on all of them: > * Search on HBase - I assume you are referring to the same thing I mentioned above. Not ready. > * Solandra - uses Cassandra+Solr, plus DataStax now has a different (commercial) offering that combines search and Cassandra. Looks good. > * Lily - data stored in HBase cluster gets indexed to a separate Solr instance(s) on the side. Not really integrated the way you want it to be. > * ElasticSearch - solid at this point, the most dynamic solution today, can scale well (we are working on a maaaany-B documents index and hundreds of nodes with ElasticSearch right now), etc. But again, not integrated with Hadoop the way you want it. > * IndexTank - has some technical weaknesses, not integrated with Hadoop, not sure about its future considering LinkedIn uses Zoie and Sensei already.
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentAli S Kureishy 2012-04-13, 23:16
Thanks Otis.
I really appreciate the details offered here. This was very helpful information. I'm going to go through Solandra and Elastic Search and see if those make sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's two recommendations for SolrCloud so far), so I will give that a shot when it is available. However, do you know when SolrCloud IS expected to be available? Thanks again! Warm regards, Safdar On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic < [EMAIL PROTECTED]> wrote: > Hello Ali, > > > I'm trying to setup a large scale *Crawl + Index + Search *infrastructure > > > using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, > > crawled + indexed every *4 weeks, *with a search latency of less than 0.5 > > seconds. > > > That's fine. Whether it's doable with any tech will depend on how much > hardware you give it, among other things. > > > Needless to mention, the search index needs to scale to 5Billion pages. > It > > is also possible that I might need to store multiple indexes -- one for > > crawled content, and one for ancillary data that is also very large. Each > > of these indices would likely require a logically distributed and > > replicated index. > > > Yup, OK. > > > However, I would like for such a system to be homogenous with the Hadoop > > infrastructure that is already installed on the cluster (for the crawl). > In > > other words, I would much prefer if the replication and distribution of > the > > Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of > > using another scalability framework (such as SolrCloud). In addition, it > > would be ideal if this environment was flexible enough to be dynamically > > scaled based on the size requirements of the index and the search traffic > > at the time (i.e. if it is deployed on an Amazon cluster, it should be > easy > > enough to automatically provision additional processing power into the > > cluster without requiring server re-starts). > > > There is no such thing just yet. > There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to > automatically index HBase content, but that was either not completed or not > committed into HBase. > > > However, I'm not sure which Solr-based tool in the Hadoop ecosystem would > > be ideal for this scenario. I've heard mention of Solr-on-HBase, > Solandra, > > Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these > is > > mature enough and would be the right architectural choice to go along > with > > a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling > aspects > > above. > > > Here is a summary on all of them: > * Search on HBase - I assume you are referring to the same thing I > mentioned above. Not ready. > * Solandra - uses Cassandra+Solr, plus DataStax now has a different > (commercial) offering that combines search and Cassandra. Looks good. > * Lily - data stored in HBase cluster gets indexed to a separate Solr > instance(s) on the side. Not really integrated the way you want it to be. > * ElasticSearch - solid at this point, the most dynamic solution today, > can scale well (we are working on a maaaany-B documents index and hundreds > of nodes with ElasticSearch right now), etc. But again, not integrated > with Hadoop the way you want it. > * IndexTank - has some technical weaknesses, not integrated with Hadoop, > not sure about its future considering LinkedIn uses Zoie and Sensei already. > * And there is SolrCloud, which is coming soon and will be solid, but is > again not integrated. > > If I were you and I had to pick today - I'd pick ElasticSearch if I were > completely open. If I had Solr bias I'd give SolrCloud a try first. > > > Lastly, how much hardware (assuming a medium sized EC2 instance) would > you > > estimate my needing with this setup, for regular web-data (HTML text) at > > this scale? > > I don't know off the topic of my head, but I'm guessing several hundred > for serving search requests.
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentJan Høydahl 2012-04-14, 11:17
Hi,
This won't give you the performance you need, unless you have enough RAM on the Solr box to cache the whole index in memory. Have you tested this yourself? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 12. apr. 2012, at 15:27, Darren Govoni wrote: > You could use SolrCloud (for the automatic scaling) and just mount a > fuse[1] HDFS directory and configure solr to use that directory for its > data. > > [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS > > On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote: >> Hi, >> >> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure >> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, >> crawled + indexed every *4 weeks, *with a search latency of less than 0.5 >> seconds. >> >> Needless to mention, the search index needs to scale to 5Billion pages. It >> is also possible that I might need to store multiple indexes -- one for >> crawled content, and one for ancillary data that is also very large. Each >> of these indices would likely require a logically distributed and >> replicated index. >> >> However, I would like for such a system to be homogenous with the Hadoop >> infrastructure that is already installed on the cluster (for the crawl). In >> other words, I would much prefer if the replication and distribution of the >> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of >> using another scalability framework (such as SolrCloud). In addition, it >> would be ideal if this environment was flexible enough to be dynamically >> scaled based on the size requirements of the index and the search traffic >> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy >> enough to automatically provision additional processing power into the >> cluster without requiring server re-starts). >> >> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would >> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, >> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is >> mature enough and would be the right architectural choice to go along with >> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects >> above. >> >> Lastly, how much hardware (assuming a medium sized EC2 instance) would you >> estimate my needing with this setup, for regular web-data (HTML text) at >> this scale? >> >> Any architectural guidance would be greatly appreciated. The more details >> provided, the wider my grin :). >> >> Many many thanks in advance. >> >> Thanks, >> Safdar > >
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentOtis Gospodnetic 2012-04-14, 21:27
Hello,
Unfortunately I don't know when exactly SolrCloud release will be ready, but we've used trunk versions in the past and didn't have major issues. Otis ---- Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html >________________________________ > From: Ali S Kureishy <[EMAIL PROTECTED]> >To: Otis Gospodnetic <[EMAIL PROTECTED]> >Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> >Sent: Friday, April 13, 2012 7:16 PM >Subject: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment > > >Thanks Otis. > > >I really appreciate the details offered here. This was very helpful information. > > >I'm going to go through Solandra and Elastic Search and see if those make sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's two recommendations for SolrCloud so far), so I will give that a shot when it is available. However, do you know when SolrCloud IS expected to be available? > > >Thanks again! > > >Warm regards, >Safdar > > > > > >On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > >Hello Ali, >> >> >>> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure >> >>> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, >>> crawled + indexed every *4 weeks, *with a search latency of less than 0.5 >>> seconds. >> >> >>That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things. >> >> >>> Needless to mention, the search index needs to scale to 5Billion pages. It >>> is also possible that I might need to store multiple indexes -- one for >>> crawled content, and one for ancillary data that is also very large. Each >>> of these indices would likely require a logically distributed and >>> replicated index. >> >> >>Yup, OK. >> >> >>> However, I would like for such a system to be homogenous with the Hadoop >>> infrastructure that is already installed on the cluster (for the crawl). In >>> other words, I would much prefer if the replication and distribution of the >>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of >>> using another scalability framework (such as SolrCloud). In addition, it >>> would be ideal if this environment was flexible enough to be dynamically >>> scaled based on the size requirements of the index and the search traffic >>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy >>> enough to automatically provision additional processing power into the >>> cluster without requiring server re-starts). >> >> >>There is no such thing just yet. >>There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase. >> >> >>> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would >>> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, >>> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is >>> mature enough and would be the right architectural choice to go along with >>> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects >>> above. >> >> >>Here is a summary on all of them: >>* Search on HBase - I assume you are referring to the same thing I mentioned above. Not ready. >>* Solandra - uses Cassandra+Solr, plus DataStax now has a different (commercial) offering that combines search and Cassandra. Looks good. >>* Lily - data stored in HBase cluster gets indexed to a separate Solr instance(s) on the side. Not really integrated the way you want it to be. >>* ElasticSearch - solid at this point, the most dynamic solution today, can scale well (we are working on a maaaany-B documents index and hundreds of nodes with ElasticSearch right now), etc. But again, not integrated with Hadoop the way you want it.
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentLance Norskog 2012-04-14, 23:11
It sounds like you really want the final map/reduce phase to put Solr
index files into HDFS. Solr has a feature to do this called 'Embedded Solr'. This packages Solr as a library instead of an HTTP servlet. The Solr committers mostly hate it and want it to go away, but it is useful for exactly this problem. There is some integration work here, both to bolt ES to the Hadoop output libraries and also some trickery to write out the HDFS files. HDFS only appends and most of the codecs (Lucene segment formats) like to seek a lot. Then at the end it needs a way to tell SolrCloud about the files. If someone wants a great Summer Of Code project, Hadoop->Lucene indexes->SolrCloud would be a lot of fun and make you widely loved by people with money. I'm not kidding. Do a good job of this and write clean code, and you'll get offers for very cool jobs. On Sat, Apr 14, 2012 at 2:27 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > Hello, > > Unfortunately I don't know when exactly SolrCloud release will be ready, but we've used trunk versions in the past and didn't have major issues. > > Otis > ---- > Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html > > > >>________________________________ >> From: Ali S Kureishy <[EMAIL PROTECTED]> >>To: Otis Gospodnetic <[EMAIL PROTECTED]> >>Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> >>Sent: Friday, April 13, 2012 7:16 PM >>Subject: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment >> >> >>Thanks Otis. >> >> >>I really appreciate the details offered here. This was very helpful information. >> >> >>I'm going to go through Solandra and Elastic Search and see if those make sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's two recommendations for SolrCloud so far), so I will give that a shot when it is available. However, do you know when SolrCloud IS expected to be available? >> >> >>Thanks again! >> >> >>Warm regards, >>Safdar >> >> >> >> >> >>On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: >> >>Hello Ali, >>> >>> >>>> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure >>> >>>> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, >>>> crawled + indexed every *4 weeks, *with a search latency of less than 0.5 >>>> seconds. >>> >>> >>>That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things. >>> >>> >>>> Needless to mention, the search index needs to scale to 5Billion pages. It >>>> is also possible that I might need to store multiple indexes -- one for >>>> crawled content, and one for ancillary data that is also very large. Each >>>> of these indices would likely require a logically distributed and >>>> replicated index. >>> >>> >>>Yup, OK. >>> >>> >>>> However, I would like for such a system to be homogenous with the Hadoop >>>> infrastructure that is already installed on the cluster (for the crawl). In >>>> other words, I would much prefer if the replication and distribution of the >>>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of >>>> using another scalability framework (such as SolrCloud). In addition, it >>>> would be ideal if this environment was flexible enough to be dynamically >>>> scaled based on the size requirements of the index and the search traffic >>>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy >>>> enough to automatically provision additional processing power into the >>>> cluster without requiring server re-starts). >>> >>> >>>There is no such thing just yet. >>>There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase. >>> >>> >>>> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would Lance Norskog [EMAIL PROTECTED]
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentJason Rutherglen 2012-04-15, 15:19
This was done in SOLR-1301 going on several years ago now.
On Sat, Apr 14, 2012 at 4:11 PM, Lance Norskog <[EMAIL PROTECTED]> wrote: > It sounds like you really want the final map/reduce phase to put Solr > index files into HDFS. Solr has a feature to do this called 'Embedded > Solr'. This packages Solr as a library instead of an HTTP servlet. The > Solr committers mostly hate it and want it to go away, but it is > useful for exactly this problem. > > There is some integration work here, both to bolt ES to the Hadoop > output libraries and also some trickery to write out the HDFS files. > HDFS only appends and most of the codecs (Lucene segment formats) like > to seek a lot. Then at the end it needs a way to tell SolrCloud about > the files. > > If someone wants a great Summer Of Code project, Hadoop->Lucene > indexes->SolrCloud would be a lot of fun and make you widely loved by > people with money. I'm not kidding. Do a good job of this and write > clean code, and you'll get offers for very cool jobs. > > On Sat, Apr 14, 2012 at 2:27 PM, Otis Gospodnetic > <[EMAIL PROTECTED]> wrote: >> Hello, >> >> Unfortunately I don't know when exactly SolrCloud release will be ready, but we've used trunk versions in the past and didn't have major issues. >> >> Otis >> ---- >> Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html >> >> >> >>>________________________________ >>> From: Ali S Kureishy <[EMAIL PROTECTED]> >>>To: Otis Gospodnetic <[EMAIL PROTECTED]> >>>Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> >>>Sent: Friday, April 13, 2012 7:16 PM >>>Subject: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment >>> >>> >>>Thanks Otis. >>> >>> >>>I really appreciate the details offered here. This was very helpful information. >>> >>> >>>I'm going to go through Solandra and Elastic Search and see if those make sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's two recommendations for SolrCloud so far), so I will give that a shot when it is available. However, do you know when SolrCloud IS expected to be available? >>> >>> >>>Thanks again! >>> >>> >>>Warm regards, >>>Safdar >>> >>> >>> >>> >>> >>>On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: >>> >>>Hello Ali, >>>> >>>> >>>>> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure >>>> >>>>> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, >>>>> crawled + indexed every *4 weeks, *with a search latency of less than 0.5 >>>>> seconds. >>>> >>>> >>>>That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things. >>>> >>>> >>>>> Needless to mention, the search index needs to scale to 5Billion pages. It >>>>> is also possible that I might need to store multiple indexes -- one for >>>>> crawled content, and one for ancillary data that is also very large. Each >>>>> of these indices would likely require a logically distributed and >>>>> replicated index. >>>> >>>> >>>>Yup, OK. >>>> >>>> >>>>> However, I would like for such a system to be homogenous with the Hadoop >>>>> infrastructure that is already installed on the cluster (for the crawl). In >>>>> other words, I would much prefer if the replication and distribution of the >>>>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of >>>>> using another scalability framework (such as SolrCloud). In addition, it >>>>> would be ideal if this environment was flexible enough to be dynamically >>>>> scaled based on the size requirements of the index and the search traffic >>>>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy >>>>> enough to automatically provision additional processing power into the >>>>> cluster without requiring server re-starts). >>>> >>>> >>>>There is no such thing just yet. >>>>There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase.
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentJason Rutherglen 2012-04-17, 00:42
One of big weaknesses of Solr Cloud (and ES?) is the lack of the
ability to redistribute shards across servers. Meaning, as a single shard grows too large, splitting the shard, while live updates. How do you plan on elastically adding more servers without this feature? Cassandra and HBase handle elasticity in their own ways. Cassandra has successfully implemented the Dynamo model and HBase uses the traditional BigTable 'split'. Both systems are complex though are at a singular level of maturity. Also Cassandra [successfully] implements multiple data center support, is that available in SC or ES? On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > Hello Ali, > >> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure > >> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, >> crawled + indexed every *4 weeks, *with a search latency of less than 0.5 >> seconds. > > > That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things. > >> Needless to mention, the search index needs to scale to 5Billion pages. It >> is also possible that I might need to store multiple indexes -- one for >> crawled content, and one for ancillary data that is also very large. Each >> of these indices would likely require a logically distributed and >> replicated index. > > > Yup, OK. > >> However, I would like for such a system to be homogenous with the Hadoop >> infrastructure that is already installed on the cluster (for the crawl). In >> other words, I would much prefer if the replication and distribution of the >> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of >> using another scalability framework (such as SolrCloud). In addition, it >> would be ideal if this environment was flexible enough to be dynamically >> scaled based on the size requirements of the index and the search traffic >> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy >> enough to automatically provision additional processing power into the >> cluster without requiring server re-starts). > > > There is no such thing just yet. > There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase. > >> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would >> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, >> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is >> mature enough and would be the right architectural choice to go along with >> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects >> above. > > > Here is a summary on all of them: > * Search on HBase - I assume you are referring to the same thing I mentioned above. Not ready. > * Solandra - uses Cassandra+Solr, plus DataStax now has a different (commercial) offering that combines search and Cassandra. Looks good. > * Lily - data stored in HBase cluster gets indexed to a separate Solr instance(s) on the side. Not really integrated the way you want it to be. > * ElasticSearch - solid at this point, the most dynamic solution today, can scale well (we are working on a maaaany-B documents index and hundreds of nodes with ElasticSearch right now), etc. But again, not integrated with Hadoop the way you want it. > * IndexTank - has some technical weaknesses, not integrated with Hadoop, not sure about its future considering LinkedIn uses Zoie and Sensei already. > * And there is SolrCloud, which is coming soon and will be solid, but is again not integrated. > > If I were you and I had to pick today - I'd pick ElasticSearch if I were completely open. If I had Solr bias I'd give SolrCloud a try first. > >> Lastly, how much hardware (assuming a medium sized EC2 instance) would you >> estimate my needing with this setup, for regular web-data (HTML text) at
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentJan Høydahl 2012-04-17, 09:03
Hi,
I think Katta integration is nice, but it is not very real-time. What if you want both? Perhaps a Katta/SolrCloud integration could make the two frameworks play together, so that some shards in SolrCloud may be marked as "static" while others are "realtime". SolrCloud will handle indexing the realtime shards as today, but indexing the static shards will be handled by Katta. If Katta adds a shard it will tell SolrCloud by updating the ZK tree, and SolrCloud will pick up the shard and start serving search for it.. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 17. apr. 2012, at 02:42, Jason Rutherglen wrote: > One of big weaknesses of Solr Cloud (and ES?) is the lack of the > ability to redistribute shards across servers. Meaning, as a single > shard grows too large, splitting the shard, while live updates. > > How do you plan on elastically adding more servers without this feature? > > Cassandra and HBase handle elasticity in their own ways. Cassandra > has successfully implemented the Dynamo model and HBase uses the > traditional BigTable 'split'. Both systems are complex though are at > a singular level of maturity. > > Also Cassandra [successfully] implements multiple data center support, > is that available in SC or ES? > > On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic > <[EMAIL PROTECTED]> wrote: >> Hello Ali, >> >>> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure >> >>> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, >>> crawled + indexed every *4 weeks, *with a search latency of less than 0.5 >>> seconds. >> >> >> That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things. >> >>> Needless to mention, the search index needs to scale to 5Billion pages. It >>> is also possible that I might need to store multiple indexes -- one for >>> crawled content, and one for ancillary data that is also very large. Each >>> of these indices would likely require a logically distributed and >>> replicated index. >> >> >> Yup, OK. >> >>> However, I would like for such a system to be homogenous with the Hadoop >>> infrastructure that is already installed on the cluster (for the crawl). In >>> other words, I would much prefer if the replication and distribution of the >>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of >>> using another scalability framework (such as SolrCloud). In addition, it >>> would be ideal if this environment was flexible enough to be dynamically >>> scaled based on the size requirements of the index and the search traffic >>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy >>> enough to automatically provision additional processing power into the >>> cluster without requiring server re-starts). >> >> >> There is no such thing just yet. >> There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase. >> >>> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would >>> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, >>> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is >>> mature enough and would be the right architectural choice to go along with >>> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects >>> above. >> >> >> Here is a summary on all of them: >> * Search on HBase - I assume you are referring to the same thing I mentioned above. Not ready. >> * Solandra - uses Cassandra+Solr, plus DataStax now has a different (commercial) offering that combines search and Cassandra. Looks good. >> * Lily - data stored in HBase cluster gets indexed to a separate Solr instance(s) on the side. Not really integrated the way you want it to be. >> * ElasticSearch - solid at this point, the most dynamic solution today, can scale well (we are working on a maaaany-B documents index and hundreds of nodes with ElasticSearch right now), etc. But again, not integrated with Hadoop the way you want it.
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentOtis Gospodnetic 2012-04-17, 20:33
I think Jason is right - there is no index splitting in ES and SolrCloud, so one has to think ahead, "overshard", and then count on redistributing shards from oversubscribed nodes to other nodes. No resharding on demand and no index/shard splitting yet.
Otis ---- Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html >________________________________ > From: Jason Rutherglen <[EMAIL PROTECTED]> >To: [EMAIL PROTECTED] >Sent: Monday, April 16, 2012 8:42 PM >Subject: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment > >One of big weaknesses of Solr Cloud (and ES?) is the lack of the >ability to redistribute shards across servers. Meaning, as a single >shard grows too large, splitting the shard, while live updates. > >How do you plan on elastically adding more servers without this feature? > >Cassandra and HBase handle elasticity in their own ways. Cassandra >has successfully implemented the Dynamo model and HBase uses the >traditional BigTable 'split'. Both systems are complex though are at >a singular level of maturity. > >Also Cassandra [successfully] implements multiple data center support, >is that available in SC or ES? > >On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic ><[EMAIL PROTECTED]> wrote: >> Hello Ali, >> >>> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure >> >>> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, >>> crawled + indexed every *4 weeks, *with a search latency of less than 0.5 >>> seconds. >> >> >> That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things. >> >>> Needless to mention, the search index needs to scale to 5Billion pages. It >>> is also possible that I might need to store multiple indexes -- one for >>> crawled content, and one for ancillary data that is also very large. Each >>> of these indices would likely require a logically distributed and >>> replicated index. >> >> >> Yup, OK. >> >>> However, I would like for such a system to be homogenous with the Hadoop >>> infrastructure that is already installed on the cluster (for the crawl). In >>> other words, I would much prefer if the replication and distribution of the >>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of >>> using another scalability framework (such as SolrCloud). In addition, it >>> would be ideal if this environment was flexible enough to be dynamically >>> scaled based on the size requirements of the index and the search traffic >>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy >>> enough to automatically provision additional processing power into the >>> cluster without requiring server re-starts). >> >> >> There is no such thing just yet. >> There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase. >> >>> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would >>> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, >>> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is >>> mature enough and would be the right architectural choice to go along with >>> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects >>> above. >> >> >> Here is a summary on all of them: >> * Search on HBase - I assume you are referring to the same thing I mentioned above. Not ready. >> * Solandra - uses Cassandra+Solr, plus DataStax now has a different (commercial) offering that combines search and Cassandra. Looks good. >> * Lily - data stored in HBase cluster gets indexed to a separate Solr instance(s) on the side. Not really integrated the way you want it to be. >> * ElasticSearch - solid at this point, the most dynamic solution today, can scale well (we are working on a maaaany-B documents index and hundreds of nodes with ElasticSearch right now), etc. But again, not integrated with Hadoop the way you want it.
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentJason Rutherglen 2012-04-17, 20:50
> redistributing shards from oversubscribed nodes to other nodes
Redistributing shards on a live system is not possible however because the updates in-flight will likely be lost. Also it is not simple technology to build from the ground-up. As is today, one would need to schedule downtime, for multi-terabyte live realtime systems, that not acceptable and will cause the system to not meet SLAs. Solr Cloud seems limited to a simple hashing algorithm for sending updates to the appropriate shard. This is precisely what Dynamo (and Cassandra) solves, eg, elastically and dynamically rearranging the hash 'ring' both logically and physically. In addition, there is the potential for data loss which Cassandra has the technology for. On Tue, Apr 17, 2012 at 1:33 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > I think Jason is right - there is no index splitting in ES and SolrCloud, so one has to think ahead, "overshard", and then count on redistributing shards from oversubscribed nodes to other nodes. No resharding on demand and no index/shard splitting yet. > > Otis > ---- > Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html > > > >>________________________________ >> From: Jason Rutherglen <[EMAIL PROTECTED]> >>To: [EMAIL PROTECTED] >>Sent: Monday, April 16, 2012 8:42 PM >>Subject: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment >> >>One of big weaknesses of Solr Cloud (and ES?) is the lack of the >>ability to redistribute shards across servers. Meaning, as a single >>shard grows too large, splitting the shard, while live updates. >> >>How do you plan on elastically adding more servers without this feature? >> >>Cassandra and HBase handle elasticity in their own ways. Cassandra >>has successfully implemented the Dynamo model and HBase uses the >>traditional BigTable 'split'. Both systems are complex though are at >>a singular level of maturity. >> >>Also Cassandra [successfully] implements multiple data center support, >>is that available in SC or ES? >> >>On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic >><[EMAIL PROTECTED]> wrote: >>> Hello Ali, >>> >>>> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure >>> >>>> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, >>>> crawled + indexed every *4 weeks, *with a search latency of less than 0.5 >>>> seconds. >>> >>> >>> That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things. >>> >>>> Needless to mention, the search index needs to scale to 5Billion pages. It >>>> is also possible that I might need to store multiple indexes -- one for >>>> crawled content, and one for ancillary data that is also very large. Each >>>> of these indices would likely require a logically distributed and >>>> replicated index. >>> >>> >>> Yup, OK. >>> >>>> However, I would like for such a system to be homogenous with the Hadoop >>>> infrastructure that is already installed on the cluster (for the crawl). In >>>> other words, I would much prefer if the replication and distribution of the >>>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of >>>> using another scalability framework (such as SolrCloud). In addition, it >>>> would be ideal if this environment was flexible enough to be dynamically >>>> scaled based on the size requirements of the index and the search traffic >>>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy >>>> enough to automatically provision additional processing power into the >>>> cluster without requiring server re-starts). >>> >>> >>> There is no such thing just yet. >>> There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase. >>> >>>> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentLukáš Vlček 2012-04-17, 23:00
Hi,
speaking about ES I think it would be fair to mention that one has to specify number of shards upfront when the index is created - that is correct, however, it is possible to give index one or more aliases which basically means that you can add new indices on the fly and give them same alias which is then used to search against. Given that you can add/remove indices, nodes and aliases on the fly I think there is a way how to handle growing data set with ease. If anyone is interested such scenario has been discussed in detail in ES mail list. Regards, Lukas On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen < [EMAIL PROTECTED]> wrote: > One of big weaknesses of Solr Cloud (and ES?) is the lack of the > ability to redistribute shards across servers. Meaning, as a single > shard grows too large, splitting the shard, while live updates. > > How do you plan on elastically adding more servers without this feature? > > Cassandra and HBase handle elasticity in their own ways. Cassandra > has successfully implemented the Dynamo model and HBase uses the > traditional BigTable 'split'. Both systems are complex though are at > a singular level of maturity. > > Also Cassandra [successfully] implements multiple data center support, > is that available in SC or ES? > > On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic > <[EMAIL PROTECTED]> wrote: > > Hello Ali, > > > >> I'm trying to setup a large scale *Crawl + Index + Search > *infrastructure > > > >> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web > pages*, > >> crawled + indexed every *4 weeks, *with a search latency of less than > 0.5 > >> seconds. > > > > > > That's fine. Whether it's doable with any tech will depend on how much > hardware you give it, among other things. > > > >> Needless to mention, the search index needs to scale to 5Billion pages. > It > >> is also possible that I might need to store multiple indexes -- one for > >> crawled content, and one for ancillary data that is also very large. > Each > >> of these indices would likely require a logically distributed and > >> replicated index. > > > > > > Yup, OK. > > > >> However, I would like for such a system to be homogenous with the Hadoop > >> infrastructure that is already installed on the cluster (for the > crawl). In > >> other words, I would much prefer if the replication and distribution of > the > >> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead > of > >> using another scalability framework (such as SolrCloud). In addition, it > >> would be ideal if this environment was flexible enough to be dynamically > >> scaled based on the size requirements of the index and the search > traffic > >> at the time (i.e. if it is deployed on an Amazon cluster, it should be > easy > >> enough to automatically provision additional processing power into the > >> cluster without requiring server re-starts). > > > > > > There is no such thing just yet. > > There is no Search+Hadoop/HDFS in a box just yet. There was an attempt > to automatically index HBase content, but that was either not completed or > not committed into HBase. > > > >> However, I'm not sure which Solr-based tool in the Hadoop ecosystem > would > >> be ideal for this scenario. I've heard mention of Solr-on-HBase, > Solandra, > >> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of > these is > >> mature enough and would be the right architectural choice to go along > with > >> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling > aspects > >> above. > > > > > > Here is a summary on all of them: > > * Search on HBase - I assume you are referring to the same thing I > mentioned above. Not ready. > > * Solandra - uses Cassandra+Solr, plus DataStax now has a different > (commercial) offering that combines search and Cassandra. Looks good. > > * Lily - data stored in HBase cluster gets indexed to a separate Solr > instance(s) on the side. Not really integrated the way you want it to be.
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentJason Rutherglen 2012-04-18, 12:22
I'm curious how on the fly updates are handled as a new shard is added
to an alias. Eg, how does the system know to which shard to send an update? On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček <[EMAIL PROTECTED]> wrote: > Hi, > > speaking about ES I think it would be fair to mention that one has to > specify number of shards upfront when the index is created - that is > correct, however, it is possible to give index one or more aliases which > basically means that you can add new indices on the fly and give them same > alias which is then used to search against. Given that you can add/remove > indices, nodes and aliases on the fly I think there is a way how to handle > growing data set with ease. If anyone is interested such scenario has been > discussed in detail in ES mail list. > > Regards, > Lukas > > On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen < > [EMAIL PROTECTED]> wrote: > >> One of big weaknesses of Solr Cloud (and ES?) is the lack of the >> ability to redistribute shards across servers. Meaning, as a single >> shard grows too large, splitting the shard, while live updates. >> >> How do you plan on elastically adding more servers without this feature? >> >> Cassandra and HBase handle elasticity in their own ways. Cassandra >> has successfully implemented the Dynamo model and HBase uses the >> traditional BigTable 'split'. Both systems are complex though are at >> a singular level of maturity. >> >> Also Cassandra [successfully] implements multiple data center support, >> is that available in SC or ES? >> >> On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic >> <[EMAIL PROTECTED]> wrote: >> > Hello Ali, >> > >> >> I'm trying to setup a large scale *Crawl + Index + Search >> *infrastructure >> > >> >> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web >> pages*, >> >> crawled + indexed every *4 weeks, *with a search latency of less than >> 0.5 >> >> seconds. >> > >> > >> > That's fine. Whether it's doable with any tech will depend on how much >> hardware you give it, among other things. >> > >> >> Needless to mention, the search index needs to scale to 5Billion pages. >> It >> >> is also possible that I might need to store multiple indexes -- one for >> >> crawled content, and one for ancillary data that is also very large. >> Each >> >> of these indices would likely require a logically distributed and >> >> replicated index. >> > >> > >> > Yup, OK. >> > >> >> However, I would like for such a system to be homogenous with the Hadoop >> >> infrastructure that is already installed on the cluster (for the >> crawl). In >> >> other words, I would much prefer if the replication and distribution of >> the >> >> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead >> of >> >> using another scalability framework (such as SolrCloud). In addition, it >> >> would be ideal if this environment was flexible enough to be dynamically >> >> scaled based on the size requirements of the index and the search >> traffic >> >> at the time (i.e. if it is deployed on an Amazon cluster, it should be >> easy >> >> enough to automatically provision additional processing power into the >> >> cluster without requiring server re-starts). >> > >> > >> > There is no such thing just yet. >> > There is no Search+Hadoop/HDFS in a box just yet. There was an attempt >> to automatically index HBase content, but that was either not completed or >> not committed into HBase. >> > >> >> However, I'm not sure which Solr-based tool in the Hadoop ecosystem >> would >> >> be ideal for this scenario. I've heard mention of Solr-on-HBase, >> Solandra, >> >> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of >> these is >> >> mature enough and would be the right architectural choice to go along >> with >> >> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling >> aspects >> >> above. >> > >> > >> > Here is a summary on all of them: >> > * Search on HBase - I assume you are referring to the same thing I
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentLukáš Vlček 2012-04-18, 13:40
AFAIK it can not. You can only add new shards by creating a new index and
you will then need to index new data into that new index. Index aliases are useful mainly for searching part. So it means that you need to plan for this when you implement your indexing logic. On the other hand the query logic does not need to change as you only add new indices and give them all the same alias. I am not an expert on this but I think that index splitting and re-sharding can be expensive for [near] real-time search system and the point is that you can probably use different techniques to support your large scale needs. Index aliasing and routing in elasticsearch can help a lot in supporting various large scale data scenarios, check the following thread in ES ML for some examples: https://groups.google.com/forum/#!msg/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ Just to sum it up, the fact that elasticsearch does have fixed number of shards per index and does not support resharding and index splitting does not mean you can not scale your data easily. (I was not following this whole thread in every detail. So may be you may have specific needs that can be solved only by splitting or resharding, in such case I would recommend you to ask on ES ML with further questions, I do not want to run into system X vs system Y flame here...) Regards, Lukas On Wed, Apr 18, 2012 at 2:22 PM, Jason Rutherglen < [EMAIL PROTECTED]> wrote: > I'm curious how on the fly updates are handled as a new shard is added > to an alias. Eg, how does the system know to which shard to send an > update? > > On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček <[EMAIL PROTECTED]> > wrote: > > Hi, > > > > speaking about ES I think it would be fair to mention that one has to > > specify number of shards upfront when the index is created - that is > > correct, however, it is possible to give index one or more aliases which > > basically means that you can add new indices on the fly and give them > same > > alias which is then used to search against. Given that you can add/remove > > indices, nodes and aliases on the fly I think there is a way how to > handle > > growing data set with ease. If anyone is interested such scenario has > been > > discussed in detail in ES mail list. > > > > Regards, > > Lukas > > > > On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen < > > [EMAIL PROTECTED]> wrote: > > > >> One of big weaknesses of Solr Cloud (and ES?) is the lack of the > >> ability to redistribute shards across servers. Meaning, as a single > >> shard grows too large, splitting the shard, while live updates. > >> > >> How do you plan on elastically adding more servers without this feature? > >> > >> Cassandra and HBase handle elasticity in their own ways. Cassandra > >> has successfully implemented the Dynamo model and HBase uses the > >> traditional BigTable 'split'. Both systems are complex though are at > >> a singular level of maturity. > >> > >> Also Cassandra [successfully] implements multiple data center support, > >> is that available in SC or ES? > >> > >> On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic > >> <[EMAIL PROTECTED]> wrote: > >> > Hello Ali, > >> > > >> >> I'm trying to setup a large scale *Crawl + Index + Search > >> *infrastructure > >> > > >> >> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web > >> pages*, > >> >> crawled + indexed every *4 weeks, *with a search latency of less than > >> 0.5 > >> >> seconds. > >> > > >> > > >> > That's fine. Whether it's doable with any tech will depend on how > much > >> hardware you give it, among other things. > >> > > >> >> Needless to mention, the search index needs to scale to 5Billion > pages. > >> It > >> >> is also possible that I might need to store multiple indexes -- one > for > >> >> crawled content, and one for ancillary data that is also very large. > >> Each > >> >> of these indices would likely require a logically distributed and > >> >> replicated index. > >> > > >> >
-
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environmentJason Rutherglen 2012-04-18, 15:41
The main point being made is established NoSQL solutions (eg,
Cassandra, HBase, et al) have solved the update problem (among many other scalability issues, for several years). If an update is being performed and it is not known where the record exists, the update capability of the system is inefficient. In addition, in a production system, the mere possibility of losing data, or inaccurate updates is usually a red flag. On Wed, Apr 18, 2012 at 6:40 AM, Lukáš Vlček <[EMAIL PROTECTED]> wrote: > AFAIK it can not. You can only add new shards by creating a new index and > you will then need to index new data into that new index. Index aliases are > useful mainly for searching part. So it means that you need to plan for > this when you implement your indexing logic. On the other hand the query > logic does not need to change as you only add new indices and give them all > the same alias. > > I am not an expert on this but I think that index splitting and re-sharding > can be expensive for [near] real-time search system and the point is that > you can probably use different techniques to support your large scale > needs. Index aliasing and routing in elasticsearch can help a lot in > supporting various large scale data scenarios, check the following thread > in ES ML for some examples: > https://groups.google.com/forum/#!msg/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ > > Just to sum it up, the fact that elasticsearch does have fixed number of > shards per index and does not support resharding and index splitting does > not mean you can not scale your data easily. > > (I was not following this whole thread in every detail. So may be you may > have specific needs that can be solved only by splitting or resharding, in > such case I would recommend you to ask on ES ML with further questions, I > do not want to run into system X vs system Y flame here...) > > Regards, > Lukas > > On Wed, Apr 18, 2012 at 2:22 PM, Jason Rutherglen < > [EMAIL PROTECTED]> wrote: > >> I'm curious how on the fly updates are handled as a new shard is added >> to an alias. Eg, how does the system know to which shard to send an >> update? >> >> On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček <[EMAIL PROTECTED]> >> wrote: >> > Hi, >> > >> > speaking about ES I think it would be fair to mention that one has to >> > specify number of shards upfront when the index is created - that is >> > correct, however, it is possible to give index one or more aliases which >> > basically means that you can add new indices on the fly and give them >> same >> > alias which is then used to search against. Given that you can add/remove >> > indices, nodes and aliases on the fly I think there is a way how to >> handle >> > growing data set with ease. If anyone is interested such scenario has >> been >> > discussed in detail in ES mail list. >> > >> > Regards, >> > Lukas >> > >> > On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen < >> > [EMAIL PROTECTED]> wrote: >> > >> >> One of big weaknesses of Solr Cloud (and ES?) is the lack of the >> >> ability to redistribute shards across servers. Meaning, as a single >> >> shard grows too large, splitting the shard, while live updates. >> >> >> >> How do you plan on elastically adding more servers without this feature? >> >> >> >> Cassandra and HBase handle elasticity in their own ways. Cassandra >> >> has successfully implemented the Dynamo model and HBase uses the >> >> traditional BigTable 'split'. Both systems are complex though are at >> >> a singular level of maturity. >> >> >> >> Also Cassandra [successfully] implements multiple data center support, >> >> is that available in SC or ES? >> >> >> >> On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic >> >> <[EMAIL PROTECTED]> wrote: >> >> > Hello Ali, >> >> > >> >> >> I'm trying to setup a large scale *Crawl + Index + Search >> >> *infrastructure >> >> > >> >> >> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web >> >> pages*, >> >> >> crawled + indexed every *4 weeks, *with a search latency of less than |