-
[nutchgora] - proposal to support distributed indexing
SUJIT PAL 2012-02-22, 03:45

Hi,

I need to move the SOLR based search platform to a distributed setup, and therefore need to be able to write to multiple SOLR servers from Nutch (working on the nutchgora branch, so this may be specific to this branch). Here is what I think I need to do...

Currently, SolrIndexerJob writes to Solr in the IndexerReducer, where it converts the WebPage to a NutchDocument, then passes the NutchDocument to the appropriate NutchIndexWriter (SolrWriter in this case). The SolrWriter adds the NutchDocument to a queue and when the commit size is exceeded, writes out the queue and does a commit (and another one in the shutdown step).

My proposal is to specify the SolrConstants.SERVER_URL parameter as a comma-separated list of URLs. The SolrWriter splits this parameter by "," and creates an array of server URLs and the same size array of inputDocs queue. It then takes the URL, runs it through a hashMod partitioner and writes it out to the inputDocs queue pointed by the partition.

Then my pages get split up into a number of SOLR servers, where I can query them in a distributed fashion (according to the SOLR docs, it is advisable to do this in a random manner to make sure the (unreliable) idf values do not influence scores from one server too much).

Is this a reasonable way to go about this? Or is there a simpler method I am overlooking?

TIA for any help you can provide.

-sujit
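The routing step in the proposal above can be sketched roughly as follows. This is a minimal illustration only, not the actual SolrWriter code: the class name `ShardedQueues`, its methods, and the example server URLs are all hypothetical, and plain strings stand in for NutchDocument instances.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of per-server queues; not the real Nutch SolrWriter. */
class ShardedQueues {
    private final String[] serverUrls;
    private final List<List<String>> queues = new ArrayList<>();

    ShardedQueues(String commaSeparatedUrls) {
        // Split the configured value on "," and drop whitespace around each URL.
        serverUrls = commaSeparatedUrls.trim().split("\\s*,\\s*");
        // One queue per server, same size as the URL array.
        for (int i = 0; i < serverUrls.length; i++) {
            queues.add(new ArrayList<>());
        }
    }

    /** Hash-mod partition; the sign bit is masked so the result is never negative. */
    static int partition(String pageUrl, int numShards) {
        return (pageUrl.hashCode() & Integer.MAX_VALUE) % numShards;
    }

    /** Queues a page URL (standing in for a document) and returns its shard index. */
    int add(String pageUrl) {
        int shard = partition(pageUrl, serverUrls.length);
        queues.get(shard).add(pageUrl);
        return shard;
    }

    String serverFor(int shard) { return serverUrls[shard]; }

    int queued(int shard) { return queues.get(shard).size(); }

    int shardCount() { return serverUrls.length; }

    public static void main(String[] args) {
        // The server URLs below are made up for the example.
        ShardedQueues q = new ShardedQueues("http://solr1:8983/solr, http://solr2:8983/solr");
        int shard = q.add("http://example.com/page.html");
        System.out.println("page routed to " + q.serverFor(shard));
    }
}
```

Masking the sign bit matters because `String.hashCode()` can be negative, and a negative remainder would index outside the queue array. A real writer would also flush each queue to its own server once the commit size is reached.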
-
Re: [nutchgora] - proposal to support distributed indexing
Julien Nioche 2012-02-22, 06:12

Hi Sujit,

Sounds good. A nice way of doing it would be to make it so that people can define how to partition over the SOLR instances in the way they want, e.g. consistent hashing, URL range or crawldb metadata, by taking a class name as a parameter. It does not need to be pluggable, I think. I had implemented something along these lines some time ago for a customer but could not release it as open source.

Feel free to open a JIRA to comment on this issue and attach a patch.

Thanks

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
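Choosing the partitioning strategy by class name, as suggested above, might look something like the sketch below. Everything here is an assumption made for illustration: the `ShardPartitioner` interface, both implementations, and `PartitionerLoader` are hypothetical names, and the host-based strategy is just one possible alternative, not one of the strategies listed in the message.

```java
import java.net.URI;

/** Hypothetical strategy interface; name and signature are assumptions. */
interface ShardPartitioner {
    int shardFor(String url, int numShards);
}

/** Default strategy: hash of the whole URL modulo the shard count. */
class HashModPartitioner implements ShardPartitioner {
    public int shardFor(String url, int numShards) {
        return (url.hashCode() & Integer.MAX_VALUE) % numShards;
    }
}

/** Alternative strategy: hash only the host, so one host's pages share a shard. */
class HostPartitioner implements ShardPartitioner {
    public int shardFor(String url, int numShards) {
        // URI.create throws IllegalArgumentException on malformed input.
        String host = URI.create(url).getHost();
        String key = (host != null) ? host : url;
        return (key.hashCode() & Integer.MAX_VALUE) % numShards;
    }
}

class PartitionerLoader {
    /** Instantiates the named class; it needs a no-argument constructor. */
    static ShardPartitioner load(String className) throws ReflectiveOperationException {
        return Class.forName(className)
                .asSubclass(ShardPartitioner.class)
                .getDeclaredConstructor()
                .newInstance();
    }

    public static void main(String[] args) throws Exception {
        // In Nutch the class name would come from configuration; here it is hard-coded.
        ShardPartitioner p = load(HostPartitioner.class.getName());
        System.out.println(p.shardFor("http://example.com/a.html", 4));
    }
}
```

The `asSubclass` call makes a misconfigured class name fail early with a `ClassCastException` instead of later, when the first document is routed.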
-
Re: [nutchgora] - proposal to support distributed indexing
Lewis John Mcgibbney 2012-02-22, 11:01

Hi.

There was an issue [0] opened for this some time ago and it looks like, apart from the (bare minimal) description, there has been no work done on it.

It would be a really nice feature to have.

[0] https://issues.apache.org/jira/browse/NUTCH-945

--
Lewis
-
Re: [nutchgora] - proposal to support distributed indexing
SUJIT PAL 2012-02-22, 18:16

Thanks Julien and Lewis.

Being able to specify the partitioner class sounds good - I am thinking that perhaps they could all be impls of the Hadoop org.apache.hadoop.mapreduce.Partitioner interface.

Would it be okay if I annotated NUTCH-945 saying that I am working on providing a patch for the NutchGora branch initially (I haven't looked at the head code yet, it's likely to be slightly different), and then try to port the change over to the head?

-sujit
-
Re: [nutchgora] - proposal to support distributed indexing
Markus Jelsma 2012-02-22, 18:24

Hi,

We're in the process of testing Solr trunk's cloud features, which recently gained initial work for distributed indexing. With it, there is no longer any need to do the partitioning on the client side, because Solr will forward the input documents to the proper shard. Solr uses the MurmurHash algorithm to decide the target shard, so I would stick to that in any case.

Anyway, with Solr being able to handle incoming documents on any node and distribute them appropriately, there is no need for hashing at all. What we do need is to select a target server from a pool per batch. Committing is not needed if soft autocommitting is enabled, which is quite useful for Solr's new NRT features.

If Solr 4.0 is released in the coming months (and that's what it looks like), I would suggest patching Nutch to allow a list of Solr server URLs instead of doing partitioning on the client side.

In our case we don't even need a pool of Solr servers in Nutch to select from because we pass the documents through a proxy that is aware of running and offline servers.

Markus
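Selecting a target server from a pool per batch, as described above, could be done with a simple round-robin. The sketch below is only one possible policy, not anything from Nutch or Solr: `ServerPool` and `nextServer` are hypothetical names, and there is no health checking of the kind the proxy mentioned above would provide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical round-robin pool of Solr server URLs; not part of Nutch or SolrJ. */
class ServerPool {
    private final List<String> urls;
    private final AtomicInteger next = new AtomicInteger();

    ServerPool(List<String> urls) {
        if (urls.isEmpty()) {
            throw new IllegalArgumentException("need at least one server URL");
        }
        this.urls = new ArrayList<>(urls); // defensive copy
    }

    /** Returns the server for the next batch, cycling through the pool in order. */
    String nextServer() {
        // floorMod keeps the index valid even after the counter overflows.
        int i = Math.floorMod(next.getAndIncrement(), urls.size());
        return urls.get(i);
    }

    public static void main(String[] args) {
        // The server URLs below are made up for the example.
        ServerPool pool = new ServerPool(
                Arrays.asList("http://solr1:8983/solr", "http://solr2:8983/solr"));
        for (int batch = 0; batch < 3; batch++) {
            System.out.println("batch " + batch + " goes to " + pool.nextServer());
        }
    }
}
```

Because the server itself decides the target shard in this setup, the client only needs to spread batches across nodes; no document-level hashing is involved.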
-
Re: [nutchgora] - proposal to support distributed indexing
SUJIT PAL 2012-02-22, 19:27

Thanks Markus, I guess I'll probably still need to build Nutch-side partitioning for myself since I am on Solr 3.5. It would be throw-away code, to be changed when I get on to 4.x.

-sujit
-
Re: [nutchgora] - proposal to support distributed indexing
Markus Jelsma 2012-02-22, 22:04

In that case the algorithm doesn't matter as you still need to reindex the corpus if you upgrade to 4.x.

Cheers!
-
Re: [nutchgora] - proposal to support distributed indexing
Lewis John Mcgibbney 2012-02-23, 11:43

Hi Sujit,

On Wed, Feb 22, 2012 at 6:16 PM, SUJIT PAL <[EMAIL PROTECTED]> wrote:

> Being able to specify the partitioner class sounds good - I am thinking
> that perhaps they could all be impls of the Hadoop
> org.apache.hadoop.mapreduce.Partitioner interface.

Sounds good!

> Would it be okay if I annotated NUTCH-945 saying that I am working on
> providing a patch for the NutchGora branch initially (I haven't looked at
> the head code yet, its likely to be slightly different), and then try to
> port the change over to the head?

Yes, please fire ahead, and if you are able to implement this feature then please attach your patch and we can hopefully review it. Based on Markus' comments, although things in Solr 4.x development are subject to change in the 'near' future, I think this would be useful for people in the meantime.

Thank you
-
Re: [nutchgora] - proposal to support distributed indexing
SUJIT PAL 2012-02-24, 06:35

Hi Lewis,

Ok, thanks, I will attach the patch to NUTCH-945 after I am done with it, and update this thread as well...

-sujit
-
Re: [nutchgora] - proposal to support distributed indexing
SUJIT PAL 2012-02-29, 01:31

I have updated a patch for NUTCH-945. It works locally as described in the JIRA.

-sujit