|
Ning Li
2008-02-06, 18:57
Clay Webster
2008-02-06, 19:33
J. Delgado
2008-02-06, 19:44
Ian Holsman
2008-02-06, 20:57
Ning Li
2008-02-07, 00:08
Ning Li
2008-02-07, 00:08
Ning Li
2008-02-07, 00:17
Ian Holsman
2008-02-07, 01:31
Doug Cutting
2008-02-07, 18:37
Andrzej Bialecki
2008-02-07, 19:53
Doug Cutting
2008-02-07, 20:33
Ning Li
2008-02-07, 22:34
Srikant Jakilinki
2008-02-09, 07:15
Stefan Groschupf
2008-04-04, 02:44
Stefan Groschupf
2008-04-04, 02:51
Richard Marr
2008-08-21, 15:36
marcus clemens
2008-08-21, 15:51
Stefan Groschupf
2008-08-21, 17:41
Richard Marr
2008-08-22, 10:01
Stefan Groschupf
2008-08-22, 10:42
Richard Marr
2008-08-22, 14:04
Richard Marr
2008-08-22, 16:17
|
-
Lucene-based Distributed Index Leveraging HadoopNing Li 2008-02-06, 18:57
There have been several proposals for a Lucene-based distributed index
architecture. 1) Doug Cutting's "Index Server Project Proposal" at http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html 2) Solr's "Distributed Search" at http://wiki.apache.org/solr/DistributedSearch 3) Mark Butler's "Distributed Lucene" at http://wiki.apache.org/hadoop/DistributedLucene We have also been working on a Lucene-based distributed index architecture. Our design differs from the above proposals in the way it leverages Hadoop as much as possible. In particular, HDFS is used to reliably store Lucene instances, Map/Reduce is used to analyze documents and update Lucene instances in parallel, and Hadoop's IPC framework is used. Our design is geared for applications that require a highly scalable index and where batch updates to each Lucene instance are acceptable (verses finer-grained document at a time updates). We have a working implementation of our design and are in the process of evaluating its performance. An overview of our design is provided below. We welcome feedback and would like to know if you are interested in working on it. If so, we would be happy to make the code publicly available. At the same time, we would like to collaborate with people working on existing proposals and see if we can consolidate our efforts. TERMINOLOGY A distributed "index" is partitioned into "shards". Each shard corresponds to a Lucene instance and contains a disjoint subset of the documents in the index. Each shard is stored in HDFS and served by one or more "shard servers". Here we only talk about a single distributed index, but in practice multiple indexes can be supported. A "master" keeps track of the shard servers and the shards being served by them. An "application" updates and queries the global index through an "index client". An index client communicates with the shard servers to execute a query. KEY RPC METHODS This section lists the key RPC methods in our design. To simplify the discussion, some of their parameters have been omitted. On the Shard Servers // Execute a query on this shard server's Lucene instance. // This method is called by an index client. SearchResults search(Query query); On the Master // Tell the master to update the shards, i.e., Lucene instances. // This method is called by an index client. boolean updateShards(Configuration conf); // Ask the master where the shards are located. // This method is called by an index client. LocatedShards getShardLocations(); // Send a heartbeat to the master. This method is called by a // shard server. In the response, the master informs the // shard server when to switch to a newer version of the index. ShardServerCommand sendHeartbeat(); QUERYING THE INDEX To query the index, an application sends a search request to an index client. The index client then calls the shard server search() method for each shard of the index, merges the results and returns them to the application. The index client caches the mapping between shards and shard servers by periodically calling the master's getShardLocations() method. UPDATING THE INDEX USING MAP/REDUCE To update the index, an application sends an update request to an index client. The index client then calls the master's updateShards() method, which schedules a Map/Reduce job to update the index. The Map/Reduce job updates the shards in parallel and copies the new index files of each shard (i.e., Lucene instance) to HDFS. The updateShards() method includes a "configuration", which provides information for updating the shards. More specifically, the configuration includes the following information: - Input path. This provides the location of updated documents, e.g., HDFS files or directories, or HBase tables. - Input formatter. This specifies how to format the input documents. - Analysis. This defines the analyzer to use on the input. The analyzer determines whether a document is being inserted, updated, or deleted. For inserts or updates, the analyzer also converts each input document into a Lucene document. The Map phase of the Map/Reduce job formats and analyzes the input (in parallel), while the Reduce phase collects and applies the updates to each Lucene instance (again in parallel). The updates are applied using the local file system where a Reduce task runs and then copied back to HDFS. For example, if the updates caused a new Lucene segment to be created, the new segment would be created on the local file system first, and then copied back to HDFS. When the Map/Reduce job completes, a "new version" of the index is ready to be queried. It is important to note that the new version of the index is not derived from scratch. By leveraging Lucene's update algorithm, the new version of each Lucene instance will share as many files as possible as the previous version. ENSURING INDEX CONSISTENCY At any point in time, an index client always has a consistent view of the shards in the index. The results of a search query include either all or none of a recent update to the index. The details of the algorithm to accomplish this are omitted here, but the basic flow is pretty simple. After the Map/Reduce job to update the shards completes, the master will tell each shard server to "prepare" the new version of the index. After all the shard servers have responded affirmatively to the "prepare" message, the new index is ready to be queried. An index client will then lazily learn about the new index when it makes its next getShardLocations() call to the master. In essence, a lazy two-phase commit protocol is used, with "prepare" and "commit" messages piggybacked on heartbeats. After a shard has switched to the new index, the Lucene files in the old index that are no longer needed can safely be deleted. ACHIEVING FAULT-TOLERANCE We rely on the fault-tolerance of Map/Reduce to guarantee that an index up
-
RE: Lucene-based Distributed Index Leveraging HadoopClay Webster 2008-02-06, 19:33
There seem to be a few other players in this space too. Are you from Rackspace? (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop- query-terabytes-data) AOL also has a Hadoop/Solr project going on. CNET does not have much brewing there. Although Yonik and I had talked about it a bunch -- but that was long ago. --cw Clay Webster tel:1.908.541.3724 Associate VP, Platform Infrastructure http://www.cnet.com CNET, Inc. (Nasdaq:CNET) mailto:[EMAIL PROTECTED] > -----Original Message----- > From: Ning Li [mailto:[EMAIL PROTECTED]] > Sent: Wednesday, February 06, 2008 1:57 PM > To: [EMAIL PROTECTED]; [EMAIL PROTECTED]; solr- > [EMAIL PROTECTED] > Subject: Lucene-based Distributed Index Leveraging Hadoop > > There have been several proposals for a Lucene-based distributed index > architecture. > 1) Doug Cutting's "Index Server Project Proposal" at > http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html > 2) Solr's "Distributed Search" at > http://wiki.apache.org/solr/DistributedSearch > 3) Mark Butler's "Distributed Lucene" at > http://wiki.apache.org/hadoop/DistributedLucene > > We have also been working on a Lucene-based distributed index > architecture. > Our design differs from the above proposals in the way it leverages > Hadoop > as much as possible. In particular, HDFS is used to reliably store > Lucene > instances, Map/Reduce is used to analyze documents and update Lucene > instances > in parallel, and Hadoop's IPC framework is used. Our design is geared > for > applications that require a highly scalable index and where batch > updates > to each Lucene instance are acceptable (verses finer-grained document > at > a time updates). > > We have a working implementation of our design and are in the process > of evaluating its performance. An overview of our design is provided > below. > We welcome feedback and would like to know if you are interested in > working > on it. If so, we would be happy to make the code publicly available. At > the > same time, we would like to collaborate with people working on existing > proposals and see if we can consolidate our efforts. > > TERMINOLOGY > A distributed "index" is partitioned into "shards". Each shard > corresponds > to > a Lucene instance and contains a disjoint subset of the documents in > the > index. > Each shard is stored in HDFS and served by one or more "shard servers". > Here > we only talk about a single distributed index, but in practice multiple > indexes > can be supported. > > A "master" keeps track of the shard servers and the shards being served > by > them. An "application" updates and queries the global index through an > "index client". An index client communicates with the shard servers to > execute a query. > > KEY RPC METHODS > This section lists the key RPC methods in our design. To simplify the > discussion, some of their parameters have been omitted. > > On the Shard Servers > // Execute a query on this shard server's Lucene instance. > // This method is called by an index client. > SearchResults search(Query query); > > On the Master > // Tell the master to update the shards, i.e., Lucene instances. > // This method is called by an index client. > boolean updateShards(Configuration conf); > > // Ask the master where the shards are located. > // This method is called by an index client. > LocatedShards getShardLocations(); > > // Send a heartbeat to the master. This method is called by a > // shard server. In the response, the master informs the > // shard server when to switch to a newer version of the index. > ShardServerCommand sendHeartbeat(); > > QUERYING THE INDEX > To query the index, an application sends a search request to an index > client. > The index client then calls the shard server search() method for each > shard > of the index, merges the results and returns them to the application. index switched index around
-
Re: Lucene-based Distributed Index Leveraging HadoopJ. Delgado 2008-02-06, 19:44
I assume that Google also has distributed index over their
GFS/MapReduce implementation. Any idea how they achieve this? J.D. On Feb 6, 2008 11:33 AM, Clay Webster <[EMAIL PROTECTED]> wrote: > > There seem to be a few other players in this space too. > > Are you from Rackspace? > (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop- > query-terabytes-data) > > AOL also has a Hadoop/Solr project going on. > > CNET does not have much brewing there. Although Yonik and I had > talked about it a bunch -- but that was long ago. > > --cw > > Clay Webster tel:1.908.541.3724 > Associate VP, Platform Infrastructure http://www.cnet.com > CNET, Inc. (Nasdaq:CNET) mailto:[EMAIL PROTECTED] > > > > -----Original Message----- > > From: Ning Li [mailto:[EMAIL PROTECTED]] > > Sent: Wednesday, February 06, 2008 1:57 PM > > To: [EMAIL PROTECTED]; [EMAIL PROTECTED]; solr- > > [EMAIL PROTECTED] > > Subject: Lucene-based Distributed Index Leveraging Hadoop > > > > There have been several proposals for a Lucene-based distributed index > > architecture. > > 1) Doug Cutting's "Index Server Project Proposal" at > > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html > > 2) Solr's "Distributed Search" at > > http://wiki.apache.org/solr/DistributedSearch > > 3) Mark Butler's "Distributed Lucene" at > > http://wiki.apache.org/hadoop/DistributedLucene > > > > We have also been working on a Lucene-based distributed index > > architecture. > > Our design differs from the above proposals in the way it leverages > > Hadoop > > as much as possible. In particular, HDFS is used to reliably store > > Lucene > > instances, Map/Reduce is used to analyze documents and update Lucene > > instances > > in parallel, and Hadoop's IPC framework is used. Our design is geared > > for > > applications that require a highly scalable index and where batch > > updates > > to each Lucene instance are acceptable (verses finer-grained document > > at > > a time updates). > > > > We have a working implementation of our design and are in the process > > of evaluating its performance. An overview of our design is provided > > below. > > We welcome feedback and would like to know if you are interested in > > working > > on it. If so, we would be happy to make the code publicly available. > At > > the > > same time, we would like to collaborate with people working on > existing > > proposals and see if we can consolidate our efforts. > > > > TERMINOLOGY > > A distributed "index" is partitioned into "shards". Each shard > > corresponds > > to > > a Lucene instance and contains a disjoint subset of the documents in > > the > > index. > > Each shard is stored in HDFS and served by one or more "shard > servers". > > Here > > we only talk about a single distributed index, but in practice > multiple > > indexes > > can be supported. > > > > A "master" keeps track of the shard servers and the shards being > served > > by > > them. An "application" updates and queries the global index through an > > "index client". An index client communicates with the shard servers to > > execute a query. > > > > KEY RPC METHODS > > This section lists the key RPC methods in our design. To simplify the > > discussion, some of their parameters have been omitted. > > > > On the Shard Servers > > // Execute a query on this shard server's Lucene instance. > > // This method is called by an index client. > > SearchResults search(Query query); > > > > On the Master > > // Tell the master to update the shards, i.e., Lucene instances. > > // This method is called by an index client. > > boolean updateShards(Configuration conf); > > > > // Ask the master where the shards are located. > > // This method is called by an index client. > > LocatedShards getShardLocations(); > > > > // Send a heartbeat to the master. This method is called by a > > // shard server. In the response, the master informs the
-
Re: Lucene-based Distributed Index Leveraging HadoopIan Holsman 2008-02-06, 20:57
Clay Webster wrote:
> There seem to be a few other players in this space too. > > Are you from Rackspace? > (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop- > query-terabytes-data) > > AOL also has a Hadoop/Solr project going on. > > CNET does not have much brewing there. Although Yonik and I had > talked about it a bunch -- but that was long ago. > Hi. AOL has a couple of projects going on in the lucene/hadoop/solr space, and we will be pushing more stuff out as we can. We don't have anything going with solr over hadoop at the moment. I'm not sure if this would be better than what SOLR-303 does, but you should have a look at the work being done there. One of the things you mentioned is that the data sets are disjoint. SOLR-303 doesn't require this, and allows us to have a document stored in multiple shards (with different caching/update characteristics). > --cw > > Clay Webster tel:1.908.541.3724 > Associate VP, Platform Infrastructure http://www.cnet.com > CNET, Inc. (Nasdaq:CNET) mailto:[EMAIL PROTECTED] > > > > >
-
Re: Lucene-based Distributed Index Leveraging HadoopNing Li 2008-02-07, 00:08
No. I'm curious too. :)
On Feb 6, 2008 11:44 AM, J. Delgado <[EMAIL PROTECTED]> wrote: > I assume that Google also has distributed index over their > GFS/MapReduce implementation. Any idea how they achieve this? > > J.D. >
-
Re: Lucene-based Distributed Index Leveraging HadoopNing Li 2008-02-07, 00:08
I work for IBM Research. I read the Rackspace article. Rackspace's Mailtrust
has a similar design. Happy to see an existing application on such a system. Do they plan to open-source it? Is the AOL project an open source project? On Feb 6, 2008 11:33 AM, Clay Webster <[EMAIL PROTECTED]> wrote: > > There seem to be a few other players in this space too. > > Are you from Rackspace? > (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop- > query-terabytes-data) > > AOL also has a Hadoop/Solr project going on. > > CNET does not have much brewing there. Although Yonik and I had > talked about it a bunch -- but that was long ago. > > --cw > > Clay Webster tel:1.908.541.3724 > Associate VP, Platform Infrastructure http://www.cnet.com > CNET, Inc. (Nasdaq:CNET) mailto:[EMAIL PROTECTED] > >
-
Re: Lucene-based Distributed Index Leveraging HadoopNing Li 2008-02-07, 00:17
One main focus is to provide fault-tolerance in this distributed index
system. Correct me if I'm wrong, I think SOLR-303 is focusing on merging results from multiple shards right now. We'd like to start an open source project for a fault-tolerant distributed index system (or join if one already exists) if there is enough interest. Making Solr work on top of such a system could be an important goal and SOLR-303 is a big part of it in that case. I should have made it clear that disjoint data sets are not a requirement of the system. On Feb 6, 2008 12:57 PM, Ian Holsman <[EMAIL PROTECTED]> wrote: > Hi. > AOL has a couple of projects going on in the lucene/hadoop/solr space, > and we will be pushing more stuff out as we can. We don't have anything > going with solr over hadoop at the moment. > > I'm not sure if this would be better than what SOLR-303 does, but you > should have a look at the work being done there. > > One of the things you mentioned is that the data sets are disjoint. > SOLR-303 doesn't require this, and allows us to have a document stored > in multiple shards (with different caching/update characteristics). > >
-
Re: Lucene-based Distributed Index Leveraging HadoopIan Holsman 2008-02-07, 01:31
Ning Li wrote:
> One main focus is to provide fault-tolerance in this distributed index > system. Correct me if I'm wrong, I think SOLR-303 is focusing on merging > results from multiple shards right now. We'd like to start an open source > project for a fault-tolerant distributed index system (or join if one > already exists) if there is enough interest. Making Solr work on top of such > a system could be an important goal and SOLR-303 is a big part of it in that > case. > I guess it depends on how you set up your shards in 303. We plan on having a master/slave relationship on each shard, so that each shard would sync the same way solr does currently. regards Ian > I should have made it clear that disjoint data sets are not a requirement of > the system. > > > On Feb 6, 2008 12:57 PM, Ian Holsman <[EMAIL PROTECTED]> wrote: > > >> Hi. >> AOL has a couple of projects going on in the lucene/hadoop/solr space, >> and we will be pushing more stuff out as we can. We don't have anything >> going with solr over hadoop at the moment. >> >> I'm not sure if this would be better than what SOLR-303 does, but you >> should have a look at the work being done there. >> >> One of the things you mentioned is that the data sets are disjoint. >> SOLR-303 doesn't require this, and allows us to have a document stored >> in multiple shards (with different caching/update characteristics). >> >> >> > >
-
Re: Lucene-based Distributed Index Leveraging HadoopDoug Cutting 2008-02-07, 18:37
Ning,
I am also interested in starting a new project in this area. The approach I have in mind is slightly different, but hopefully we can come to some agreement and collaborate. My current thinking is that the Solr search API is the appropriate model. Solr's facets are an important feature that require low-level support to be practical. Thus a useful distributed search system should support facets from the outset, rather than attempt to graft them on later. In particular, I believe this requirement mandates disjoint shards. My primary difference with your proposal is that I would like to support online indexing. Documents could be inserted and removed directly, and shards would synchronize changes amongst replicas, with an "eventual consistency" model. Indexes would not be stored in HDFS, but directly on the local disk of each node. Hadoop would perhaps not play a role. In many ways this would resemble CouchDB, but with explicit support for sharding and failover from the outset. A particular client should be able to provide a consistent read/write view by bonding to particular replicas of a shard. Thus a user who makes a modification should be able to generally see that modification in results immediately, while other users, talking to different replicas, may not see it until synchronization is complete. There are many unresolved issues in my mind around sharding and replication that I hope to reach some clarity on before beginning implementation. Does this sound like it could be of interest to you? Cheers, Doug Ning Li wrote: > There have been several proposals for a Lucene-based distributed index > architecture. > 1) Doug Cutting's "Index Server Project Proposal" at > http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html > 2) Solr's "Distributed Search" at > http://wiki.apache.org/solr/DistributedSearch > 3) Mark Butler's "Distributed Lucene" at > http://wiki.apache.org/hadoop/DistributedLucene > > We have also been working on a Lucene-based distributed index architecture. > Our design differs from the above proposals in the way it leverages Hadoop > as much as possible. In particular, HDFS is used to reliably store Lucene > instances, Map/Reduce is used to analyze documents and update Lucene > instances > in parallel, and Hadoop's IPC framework is used. Our design is geared for > applications that require a highly scalable index and where batch updates > to each Lucene instance are acceptable (verses finer-grained document at > a time updates). > > We have a working implementation of our design and are in the process > of evaluating its performance. An overview of our design is provided below. > We welcome feedback and would like to know if you are interested in working > on it. If so, we would be happy to make the code publicly available. At the > same time, we would like to collaborate with people working on existing > proposals and see if we can consolidate our efforts. > > TERMINOLOGY > A distributed "index" is partitioned into "shards". Each shard corresponds > to > a Lucene instance and contains a disjoint subset of the documents in the > index. > Each shard is stored in HDFS and served by one or more "shard servers". Here > we only talk about a single distributed index, but in practice multiple > indexes > can be supported. > > A "master" keeps track of the shard servers and the shards being served by > them. An "application" updates and queries the global index through an > "index client". An index client communicates with the shard servers to > execute a query. > > KEY RPC METHODS > This section lists the key RPC methods in our design. To simplify the > discussion, some of their parameters have been omitted. > > On the Shard Servers > // Execute a query on this shard server's Lucene instance. > // This method is called by an index client. > SearchResults search(Query query); > > On the Master > // Tell the master to update the shards, i.e., Lucene instances.
-
Re: Lucene-based Distributed Index Leveraging HadoopAndrzej Bialecki 2008-02-07, 19:53
Doug Cutting wrote:
> Ning, > > I am also interested in starting a new project in this area. The > approach I have in mind is slightly different, but hopefully we can come > to some agreement and collaborate. I'm interested in this too. > My current thinking is that the Solr search API is the appropriate > model. Solr's facets are an important feature that require low-level > support to be practical. Thus a useful distributed search system should > support facets from the outset, rather than attempt to graft them on > later. In particular, I believe this requirement mandates disjoint shards. I agree - shards should be disjoint also because if we eventually want to manage multiple replicas of each shard across the cluster (for reliability and performance) then overlapping documents would complicate both the query dispatching process and the merging of partial result sets. > My primary difference with your proposal is that I would like to support > online indexing. Documents could be inserted and removed directly, and > shards would synchronize changes amongst replicas, with an "eventual > consistency" model. Indexes would not be stored in HDFS, but directly > on the local disk of each node. Hadoop would perhaps not play a role. > In many ways this would resemble CouchDB, but with explicit support for > sharding and failover from the outset. It's true that searching over HDFS is slow - but I'd hate to lose all other HDFS benefits and have to start from scratch ... I wonder what would be the performance of FsDirectory over an HDFS index that is "pinned" to a local disk, i.e. a full local replica is available, with block size of each index file equal to the file size. > A particular client should be able to provide a consistent read/write > view by bonding to particular replicas of a shard. Thus a user who > makes a modification should be able to generally see that modification > in results immediately, while other users, talking to different > replicas, may not see it until synchronization is complete. This requires that we use versioning, and that we have a "shard manager" that knows the latest versions of each shard among the whole active set - or that clients discover this dynamically by querying the shard servers every now and then. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __________________________________ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
-
Re: Lucene-based Distributed Index Leveraging HadoopDoug Cutting 2008-02-07, 20:33
[ No longer cross-posting to java-dev and solr-user. ]
Andrzej Bialecki wrote: >> A particular client should be able to provide a consistent read/write >> view by bonding to particular replicas of a shard. Thus a user who >> makes a modification should be able to generally see that modification >> in results immediately, while other users, talking to different >> replicas, may not see it until synchronization is complete. > > This requires that we use versioning, and that we have a "shard manager" > that knows the latest versions of each shard among the whole active set > - or that clients discover this dynamically by querying the shard > servers every now and then. Yes, there needs to be a master that knows the shard hash function. However, I'm not sure what you mean by "versioning". In general, there is no "latest" version of a shard. Different shards have had different updates, and must, between themselves, resolve conflicts. A client would generally talk to just one replica of each shard. This is like CouchDB. If different fields of a document are modified on different shards, then the changes can be merged. Edits to a text field might sometimes even be mergable. But, in general, if two shards both contain unmergable changes to the same field, one will win and one will lose. Similarly, if a document id is deleted in one shard and added in another at approximately the same time, then the addition would generally win. Thus if a single client switches which shard replica it talks to, then it could possibly lose deletions. Or if different clients attempt to modify the same document, one clients changes may be overwritten by the other. This is similar to the way that Amazon's Dynamo works: in the case of failures, shopping cart deletions can be lost, and deleted things may thus re-appear in one's shopping cart. This happens rarely, and confirmation is required before final sale, so it is not a big problem. Perhaps conflicts can be flagged and manually resolved by the application. Or perhaps clocks can be sufficiently synchronized that the vast majority of conflicts can be automatically resolved correctly. Doug
-
Re: Lucene-based Distributed Index Leveraging HadoopNing Li 2008-02-07, 22:34
Doug,
I'm looking forward to the collaboration! > My current thinking is that the Solr search API is the appropriate > model. Solr's facets are an important feature that require low-level I'm thinking, can we make the type of shard updater/searcher and result merger configurable in a general distributed index system? Vanilla Lucene is one type. Solr is another. Nutch could have one. Applications can write their customized type (must be Lucene-based). In case of a Solr-typed system, for example, an application sends a search request to an index client. The index client sends the search request to shard servers which host Solr searchers. The index client uses the Solr result merger to merge the results from all the shards and returns the merged result to the application. > My primary difference with your proposal is that I would like to support > online indexing. Documents could be inserted and removed directly, and > shards would synchronize changes amongst replicas, with an "eventual > consistency" model. I've been thinking about batch update vs. online update. :) Is it possible to support both efficiently in one system? We may say that a system which supports online update can handle batch update. However, it depends on whether the updates on a shard server are lost when the server goes down. In a system targeting batch update, the entirety of a batch update can simply be guaranteed by a map/reduce job. Your thoughts? The online update you described here is different from the one you described in the Index Server Project proposal a while ago. It was multi-reader single-writer before. Now it's multi-reader multi-writer with eventual consistency. Is it because it is a more general usage scenario that you think that latter supports? Regards, Ning
-
Re: Lucene-based Distributed Index Leveraging HadoopSrikant Jakilinki 2008-02-09, 07:15
Hi Ning,
In continuation with our offline conversation, here is a public expression of interest in your work and a description of our work. Sorry for the length in advance and I hope that the folk will be able to collaborate and/or share experiences and/or give us some pointers... 1) We are trying to leverage Lucene on Hadoop for blog archiving and searching i.e. ever-increasing data (in terabytes) on commodity hardware in a generic LAN. These machines are not hi-spec nor are dedicated but actually used within the lab by users for day to day tasks. Unfortunately, Nutch and Solr are not applicable to our situation - atleast directly. Think of us as an academic oriented Technorati 2) There are 2 aspects.One is that we want to archive the blogposts that we hit under a UUID/timestamp taxonomy. This archive can be used for many things like cached copies, diffing, surf acceleration etc. The other aspect is to archive the indexes. You see, the indexes have a lifecycle. For simplicity sake, an index consists of one days worth of blogposts (roughly, 15MM documents) and follow the <YYYYMMDD> taxonomy. Ideally, we want to store an indefinite archive of blogposts and their indexes side-by-side but 1 year or 365 days is a start 3) We want to use the taxonomical name of the post as a specific ID field in the Lucene index and want to get away with not storing the content of the post at all but only a file pointer/reference to it. This we hope will keep the index sizes low but the fact remains that this is a case of multiple threads on multiple JVMs handling multiple indexes on multiple machines. Further, the posts and indexes are mostly WORM but there may be situations where they have to be updated. For example, if some blog posts have edited content or have to be removed for copyright, or updated with metadata like rank. There is some duplication detection work that has to be done here but it is out of scope for now. And oh, the lab is a Linux-Windows environment 4) Our first port of call is to have Hadoop running on this group of machines (without clustering or load balancing or grid or master/slave mumbo jumbo) in the simplest way possible. The goal being to make applications see the bunch of machines as a reliable, scalable, fault-tolerant, average-performing file store with simple, file CRUD operations. For example, the blog crawler should be able to put the blogposts in this HDFS in live or in batch mode. With about 20 machines and each being installed with a 240GB drive for the experiment, we have about 4.5 TB of storage available 5) Next we want to handle Lucene and exploit the structure of its index and the algorithms behind it. Since a Lucene index is a directory of files, we intend to 'tag' the files as belonging to one index and store them on the HDFS. At any instant in time, an index can be regenerated and used. The regenerated index is however not used directly from HDFS but copied into the local filesystem of the indexer/searcher. This copy is subject to change and every once in a while, the constituent files in the HDFS are overwritten with the latest files. Hence, naming is quite important to us. Even so, we feel that the number of files that have to be updated are quite less and that we can use MD5 sums to make sure we only update the content changed files. However, this means that out of 4.5 TB available, we use half of it for archival and the other half for searching. Even so, we should be able to store a years worth of posts and indexes. Disks are no problem 6) Right then. So, we have (365*15MM) posts and (365*LFi) Lucene file segments on the HDFS. Suppose there are N machines online, then each machine will have to own 365/N indexes. N constantly keeps changing but at any instant the 365 indexes should be live and we are working on the best way to achieve this kind of 'fair' autonomic computing cloud where when a machine goes down, the other machines will add some indexes to their kitty. If a machine is added, then it relieves other machines of some indexes. The searcher runs on each of these machines and is a service (IP:port) and queries are served using a ParallelMultiSearch() [on the machines] and a MultiSearch() [within the machines] so that we need not have an unmanageable number of JVMs per machine. Atmost, 1 for Hadoop, 1 for Cloud and 1 for Search. We are wondering if Solr can be used for search if it supports multiple indexes available on the same machine As you can see, this is not a simple endeavour and it is obvious, I suppose, that we are still in theory stage and only now getting to know the Lucene projects better. There is a huge body of work, albeit not acknowledged in the scientific community as it should be, and I want to say kudos to all who have been responsible for it. I wish and hope to utilize the collective consciousness to mount our challenge. Any pointers, code, help, collaboration et al. for any of the 6 points above - it goes with saying/asking - is welcome and look forward to share our experiences in a formal written discourse as and when we have them. Cheers, Srikant Ning Li wrote: Find out how you can get spam free email. http://www.bluebottle.com/tag/3
-
Re: Lucene-based Distributed Index Leveraging HadoopStefan Groschupf 2008-04-04, 02:44
Hi All,
we are also very much interested in such a system and actually have to realize such a system for an project within the next 3 month. I would prefer to work on a open source solution instead of doing another one behind closed doors, though we would need to start coding pretty soon. We have 3 fulltime developers we could contribute for this time to such a project. I'm happy to do all the organisational work like setting up the complete infrastructure etc to get it started. I suggest we start with an sourceforge project since this is fast to setup and if we qualify for apache as an lucene or hadoop subproject migrate there later, or is it easy to start a apache incubator project? We might just need a nice name for the project. Doug, any idea? :-) Should we start from scratch or with a code contribution? Someone still want to contribute its implementation? Thanks. Stefan On Feb 6, 2008, at 10:57 AM, Ning Li wrote: > There have been several proposals for a Lucene-based distributed index > architecture. > 1) Doug Cutting's "Index Server Project Proposal" at > http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html > 2) Solr's "Distributed Search" at > http://wiki.apache.org/solr/DistributedSearch > 3) Mark Butler's "Distributed Lucene" at > http://wiki.apache.org/hadoop/DistributedLucene > > We have also been working on a Lucene-based distributed index > architecture. > Our design differs from the above proposals in the way it leverages > Hadoop > as much as possible. In particular, HDFS is used to reliably store > Lucene > instances, Map/Reduce is used to analyze documents and update Lucene > instances > in parallel, and Hadoop's IPC framework is used. Our design is > geared for > applications that require a highly scalable index and where batch > updates > to each Lucene instance are acceptable (verses finer-grained > document at > a time updates). > > We have a working implementation of our design and are in the process > of evaluating its performance. An overview of our design is provided > below. > We welcome feedback and would like to know if you are interested in > working > on it. If so, we would be happy to make the code publicly available. > At the > same time, we would like to collaborate with people working on > existing > proposals and see if we can consolidate our efforts. > > TERMINOLOGY > A distributed "index" is partitioned into "shards". Each shard > corresponds > to > a Lucene instance and contains a disjoint subset of the documents in > the > index. > Each shard is stored in HDFS and served by one or more "shard > servers". Here > we only talk about a single distributed index, but in practice > multiple > indexes > can be supported. > > A "master" keeps track of the shard servers and the shards being > served by > them. An "application" updates and queries the global index through an > "index client". An index client communicates with the shard servers to > execute a query. > > KEY RPC METHODS > This section lists the key RPC methods in our design. To simplify the > discussion, some of their parameters have been omitted. > > On the Shard Servers > // Execute a query on this shard server's Lucene instance. > // This method is called by an index client. > SearchResults search(Query query); > > On the Master > // Tell the master to update the shards, i.e., Lucene instances. > // This method is called by an index client. > boolean updateShards(Configuration conf); > > // Ask the master where the shards are located. > // This method is called by an index client. > LocatedShards getShardLocations(); > > // Send a heartbeat to the master. This method is called by a > // shard server. In the response, the master informs the > // shard server when to switch to a newer version of the index. > ShardServerCommand sendHeartbeat(); > > QUERYING THE INDEX > To query the index, an application sends a search request to an index ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
-
Re: Lucene-based Distributed Index Leveraging HadoopStefan Groschupf 2008-04-04, 02:51
> Should we start from scratch or with a code contribution?
> Someone still want to contribute its implementation? I just noticed - to late though - Ning already contributed the code to hadoop. So I guess my question should be rephrased what is the idea of moving this into a own project?
-
Re: Lucene-based Distributed Index Leveraging HadoopRichard Marr 2008-08-21, 15:36
> Stefan Groschupf (4 Apr) wrote:
> > I just noticed - to late though - Ning already contributed the code to > hadoop. So I guess my question should be rephrased what is the idea of > moving this into a own project? Hi all, It was interesting to hear Mark Butler present his implementation of Distributed Lucene at the Hadoop User Group meeting in London on Tuesday. There's obviously been quite a bit of discussion on the subject, and lots of interested parties. Mark, not sure if you're on this list but thanks for sharing. Is this the forum to ask about open projects? I'm interested in joining a project as long as it's goals aren't too distant to what I'm looking for. Based mostly on gut feeling I'd rather go for a stand-alone project that wasn't dependent on HDFS/Hadoop, but willing to be convinced otherwise. Rich
-
RE: Lucene-based Distributed Index Leveraging Hadoopmarcus clemens 2008-08-21, 15:51
message to Mark Butler . i am looking for candidates that have lucene exp for contract and permanent positions can you please send me your cv > Date: Thu, 21 Aug 2008 16:36:01 +0100> From: [EMAIL PROTECTED]> To: [EMAIL PROTECTED]> Subject: Re: Lucene-based Distributed Index Leveraging Hadoop> > > Stefan Groschupf (4 Apr) wrote:> >> > I just noticed - to late though - Ning already contributed the code to> > hadoop. So I guess my question should be rephrased what is the idea of> > moving this into a own project?> > > Hi all,> > It was interesting to hear Mark Butler present his implementation of> Distributed Lucene at the Hadoop User Group meeting in London on> Tuesday. There's obviously been quite a bit of discussion on the> subject, and lots of interested parties. Mark, not sure if you're on> this list but thanks for sharing.> > Is this the forum to ask about open projects? I'm interested in> joining a project as long as it's goals aren't too distant to what I'm> looking for. Based mostly on gut feeling I'd rather go for a> stand-alone project that wasn't dependent on HDFS/Hadoop, but willing> to be convinced otherwise.> > Rich _________________________________________________________________ Get Hotmail on your mobile from Vodafone http://clk.atdmt.com/UKM/go/107571435/direct/01/
-
Re: Lucene-based Distributed Index Leveraging HadoopStefan Groschupf 2008-08-21, 17:41
>
Hi All, Hi Mark, > It was interesting to hear Mark Butler present his implementation of > Distributed Lucene at the Hadoop User Group meeting in London on > Tuesday. There's obviously been quite a bit of discussion on the > subject, and lots of interested parties. Mark, not sure if you're on > this list but thanks for sharing. Is there any material published about this? I would be very interested to see Marks slides and hear about the discussion. > Is this the forum to ask about open projects? I'm interested in > joining a project as long as it's goals aren't too distant to what I'm > looking for. Based mostly on gut feeling I'd rather go for a > stand-alone project that wasn't dependent on HDFS/Hadoop, but willing > to be convinced otherwise. Rich, as you know there are a couple project in this area solar, compass, dlucene and katta and since all are open source I guess the easiest way to be involved is to join the mailing lists. I only can speak for katta - we are very interested in getting more people involved to get other perspective. There is quite some activity in our project since our project is part of a upcoming production system, but low traffic in mailing list (So far all developers work in the same room). You can find our mailing list on our source forge page: http://katta.wiki.sourceforge.net/ Please keep in mind that katta is very young and compass or solr might be more interesting if you need something working now, though they might have different goals and focus than dlucene or katta. Stefan Groschupf
-
Re: Lucene-based Distributed Index Leveraging HadoopRichard Marr 2008-08-22, 10:01
Stefan,
> Is there any material published about this? I would be very interested to > see Marks slides and hear about the discussion. I believe all the slides will be available. I'll post a link as soon as I have one. > Please keep in mind that katta is very young and compass or solr might be > more interesting if you need something working now, though they might have > different goals and focus than dlucene or katta. I am looking to have something working relatively quickly, but my performance needs and use cases are relatively modest (for now) so some degree of string and sticky tape in the implementation is okay in the short term. My main aim is to ensure that whatever I implement scales horizontally without too much drama. In terms of which project best fits my needs my gut feeling is that dlucene is pretty close. It supports incremental updates, and doesn't build in dependencies on systems like HDFS or Terracotta (I don't yet understand all the implications of those systems so would rather keep things simple if possible). The obvious drawback being that dlucene doesn't seem to be an active public project. Thanks for the reply Stefan. I'll certainly be taking a look through the code for Katta since no doubt there's a lot to learn in there. All the best... Rich
-
Re: Lucene-based Distributed Index Leveraging HadoopStefan Groschupf 2008-08-22, 10:42
Hi,
> In terms of which project best fits my needs my gut feeling is that > dlucene is pretty close. It supports incremental updates, and doesn't > build in dependencies on systems like HDFS or Terracotta (I don't yet > understand all the implications of those systems so would rather keep > things simple if possible). Upgrades... The way we solve this with katta is that we simply deploy a new small index and use * in the client instead of a fixed index name. Than once a night we merge all the small indexes (since this slows down things) together to a big new index. To solve the problem of duplicate documents each document gets a timestamp and in the client we do a simple dedub based on a key and use always the latest document with the latest time stamp. Dependencies... Katta is independent of those technologies, it is lucene, zookeeper and hadoop RPI (instead of RMI, http or Apache Mina). Though we support loading index shards from a hadoop file system, but you also can load them from a mounted remote hdd NAS or what ever you like > The obvious drawback being that dlucene > doesn't seem to be an active public project. Mark need to answer this but dlucene is checked in to the katta svn and I saw Marko checking in changes to dlucene. There was a discussion between Mark and me to bring dlucene and katta together and I really would love to see that happen but unfortunately we had a lot of pressure from our customer to deliver something so we had to focus on other things. More developers getting involved would clearly help here.. :-) > > > Thanks for the reply Stefan. I'll certainly be taking a look through > the code for Katta since no doubt there's a lot to learn in there. Katta will be deployed into a production system of our customer in less than 4 weeks - so we working hard to iron out issues. However katta is running since 6 weeks in a 10 node test environment with heavy load. Stefan
-
Re: Lucene-based Distributed Index Leveraging HadoopRichard Marr 2008-08-22, 14:04
Stefan,
I've got a lot of reading and learning to do :o) Thanks for the info, and good luck with your deployment. Rich
-
Re: Lucene-based Distributed Index Leveraging HadoopRichard Marr 2008-08-22, 16:17
2008/8/21 Stefan Groschupf <[EMAIL PROTECTED]>:
> > Is there any material published about this? I would be very interested to > see Marks slides and hear about the discussion. > In case anybody wants to see Marl's talk, the slides and video are here: http://skillsmatter.com/podcast/home/distributed-lucene-for-hadoop Rich |