|
cristi prodan
2010-03-19, 13:34
Ted Dunning
2010-03-19, 17:47
cristi prodan
2010-03-23, 08:35
Cristian Prodan
2010-04-01, 06:32
Robin Anil
2010-04-01, 06:36
Cristian Prodan
2010-04-01, 07:51
Tanya Gupta
2010-04-02, 09:14
Ankur C. Goel
2010-03-22, 08:03
Sean Owen
2010-03-19, 13:44
|
-
My ideas for GSoC 2010cristi prodan 2010-03-19, 13:34
Dear Mahout community,
My name is Cristi Prodan, I'm 23 years old and currently a 2nd year student pursuing a MSc degree in Computer Science. I started studying machine learning in the past year and during my research I found about the Mapreduce model. Then, I discovered hadoop and Mahout. I was very impressed by the power of these frameowrks and their great potential. For this reason I would like to submit a proposal for this year Google Summer of Code competition. I have looked at the proposals made by Robin on JIRA (https://issues.apache.org/jira/secure/IssueNavigator.jspa?mode=hide&requestId=12314021). I have stopped at two ideas. I would like to ask for your help in deciding which idea would be best to pick. Since I've never done GSoC before, I'm hoping someone would advise on the size of the project (too small or two big for the summer period) and mostly, it's importance for the Mahout framwork. After hearing your answers my intentions are to fully focus on the thourough research of a single idea. IDEA 1 - MinHash clustering --------------------------- The first idea come after taking a look at Google 's paper on collaborative filtering for their news system[2]. In that paper, I looked at MinHash clustering. My first question is: is MinHash clustering considered cool ? If yes, than I would like to take a stab at implementing it. The paper also describes the implementation in a MapReduce style. Since this is only a suggestion I will not elaborate very much on the solution now. I would like to ask you weather this might be considered a good choice (i.e. important for the framework to have something like this) and if this is a big enough project. IDEA 2 - Additions to Taste Recommender --------------------------------------- As a second idea for this competition, was to add some capabilities to the Taste framework. I have revised a couple of papers from the Netflix contest winning teams, read chapters 1 thourgh 6 from [1] and looked into Taste's code. My idea was to implement a parallel prediction blending support by using linear regression or any other machine learning method - but so far I didn't got to a point where I would have a clear solution of this. I'm preparing my disertation paper on recommender systems and this was the first idea I got when thinking about participating to GSoC. If you have any ideas on this and want to share them, I would be very thankful. Thank you in advance. Best regards, Cristi. BIBLIOGRAPHY: --------------- [1] Owen, Anil - Mahout in Action. Manning, 2010. [2] Abhinandan Das, Mayur Datar, Ashutosh Garg, Shyam Rajaram - Google News Personalization: Scalable Online Collaborative Filtering, WWW 2007. +
cristi prodan 2010-03-19, 13:34
-
Re: My ideas for GSoC 2010Ted Dunning 2010-03-19, 17:47
Minhash clustering is important for duplicate detection. You can also
do simhash clustering<http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf>which might be simpler to implement. I can imagine an implementation where the map generates the simhash and emits multiple copies keyed on part of the hash. Near duplicates would thus be grouped together in the reduce where an in-memory duplicate detection could be done. Sean's intuitions about project size are generally better than mine, but if you limit you are strict about not increasing scope, I think you could succeed with this project. On Fri, Mar 19, 2010 at 6:34 AM, cristi prodan <[EMAIL PROTECTED]>wrote: > IDEA 1 - MinHash clustering > --------------------------- > The first idea come after taking a look at Google 's paper on collaborative > filtering for their news system[2]. In that paper, I looked at MinHash > clustering. > My first question is: is MinHash clustering considered cool ? If yes, than > I would like to take a stab at implementing it. > The paper also describes the implementation in a MapReduce style. Since > this is only a suggestion I will not elaborate very much on the solution > now. I would like to ask you weather this might be considered a good choice > (i.e. important for the framework to have something like this) and if this > is a big enough project. > +
Ted Dunning 2010-03-19, 17:47
-
Re: My ideas for GSoC 2010cristi prodan 2010-03-23, 08:35
Hello again,
First of all, thank you all for taking time to answer my ideas. Based on your thoughts, I have been digging around, and the project I would very much like to propose is a MapReduce implementation of the simhash algorithm. Thank you Ted for the great idea! I'm keen on starting implementing a new algorithm for Mahout, which seems manageable and doable during the summer period and which has a clear scope. I have noticed quite a few comments, advising about not taking a project which would be to big and I intend to follow them. After taking a deep look at the article pointed out by Ted, as well as other articles and simlar ideas (Charikar's algorithm, Ankur's patch for Minhash), I've decided to write some notes on the algorithm and a schedule for implemeting the idea during the competition. I kindly ask for your feedback on this proposal. 1. Problem and outline of simhash --------------------------------------- Detecting similar files and cliassifying documents requires complex heuristics and/or O(n^2) pair-wise computations [1]. The simhash algorithm [1] relies on computing a hash function that hashes similar files to similar values. The file similarities would then be determined by comparing the pre-sorted hash key values (and reducing the complexity to O(n log n)). In order to futher improve the detection mechanism, the algorithm will also store some auxiliary data, used to compute the hash keys. This will be used as a second filter, i.e. after the hash key comparison indicates that two files are potentiallly similar. Properties for the similarity function and the algorithm: - very similar files map to very similar or even the same hash key; - distance between keys should be some measure of the difference between files. This would lead to keys proportional to file sizes and this would create false positives. Some auxiliary data will provide an easy and efficient way of refining the similarity detection. - the metric used will be a binary metric (simhash operats at byte level). - given a similarity metric, there needs to be a threshold to determine how close within the metric files need to be to count as similar. I would like my implementation of the simhash algorithm to answer two questions: - retrieving a set of files, similar to a given file; - retrieving all pairs of similar files. 2. simhash implementation ------------------------------ Chose 16, 8-bit tags - the bytes used for pattern matching. The idea is to count the occurances of these tags within the file under processing. 2.1 Computing the hash function for each file ------------------------------------------------- for each directory for each file within directory - scan through the file and look for matches for each tag; - detection window is moved one bit at a time - if a match if found - set a skip_counter and then decrement it for the following bits; - a count of matches is stored for each tag, and these are stored as SUM_TABLE_ENTRIES - the KEY for the current file is computed as a function of the SUM_TABLE_ENTRIES (a linear combination of the sum) [#1] - after this processing, we store: | file_name | file_path | file_size | key | SUM_TABLE_ENTRIES | The key function could also be implemented to take into account file extensions. 2.2 Finding similarities ---------------------------- for each FILE SIMILAR_FILES = files with keys within a tollerance range of KEY(FILE); for each SIMILAR_FILE if file_size(SIMILAR_FILE) differes a lot from file_size(FILE) discard the query else compute the distance between SIMLAR_FILE and FILE ' s SUM_TABLE_ENTRIES (i.e. "the sum of the absolute values of the diffs between their entries) if the computed distance is within a tollerance or is equal to 0 then "The files are identical" 3. Distributing simhash on hadoop -----------------------------------The algorithm above is very suited for a MapReduce implementation. 1. Map In this phase we do the computation of the hash for a file. It outputs (File, simhash_key). 2. Reduce Once every item has a simhash, group everyfile with the same simhash into a cluster, sorted by the simhashkey (the key is an integer [#1]) . QUESTIONS SO FAR: 1. For the distributed version, I believe there should be a prepocessing phase where data about files should be stored on HDFS, in a proper vector format ? Would it be better to store this in HBase, Cassandra or something similar ? 2. Would it be better in the reduce phase to actually eliminate the duplicates and create a single sorted-by-hash file ? 4. Implementation schedule *** May 24 - Jul 12: * Week 1: write documentation for the algorithm on the wiki and start planning the objects would interact (classes, methods etc.). * Week 2,3,4: write the algorithm for hashing of the files. * Week 5: Thoroughly test * Week 6: Document the MapReduce implementation - 2 days. Organize the data structures 3 - days. * Week 7,8,9: Write the distributed MapReduce Implementation of the algorithm along with tests. Test as much as possible on different datasets. *** Jul 16 - Aug 9: * Week 10, 11, 12, 13: Continue with tests on different datasets for the distributed implementation. Also if time permits, I would like to make graphs and publish results with the carried tests. I'm very open to suggestions regarding this plan, so any observation is highly appreciated. Thank you again for taking time to read this. Regards, Cristi. [1] Caitlin Sadowski, Greg Levin - SimHash: Hash-based Similarity Detection, December 13, 2007. On Mar 19, 2010, at 7:47 PM, Ted Dunning wrote: +
cristi prodan 2010-03-23, 08:35
-
Re: My ideas for GSoC 2010Cristian Prodan 2010-04-01, 06:32
Hi,
Can anyone please point me a good data set on which I might try SimHash clustering ? Thank you, Cristi On Tue, Mar 23, 2010 at 10:35 AM, cristi prodan <[EMAIL PROTECTED]>wrote: > Hello again, > > First of all, thank you all for taking time to answer my ideas. Based on > your thoughts, I have been digging around, and the project I would very > much like to propose is a MapReduce implementation of the simhash algorithm. > Thank you Ted for the great idea! I'm keen on starting implementing a new > algorithm for Mahout, which seems manageable and doable during the summer > period and which has a clear scope. I have noticed quite a few comments, > advising about not taking a project which would be to big and I intend to > follow them. > > After taking a deep look at the article pointed out by Ted, as well as > other articles and simlar ideas (Charikar's algorithm, Ankur's patch for > Minhash), I've decided to write some notes on the algorithm and a schedule > for implemeting the idea during the competition. I kindly ask for your > feedback on this proposal. > > > 1. Problem and outline of simhash > --------------------------------------- > > Detecting similar files and cliassifying documents requires complex > heuristics and/or O(n^2) pair-wise computations [1]. The simhash algorithm > [1] relies on computing a hash function that hashes similar files to similar > values. The file similarities would then be determined by comparing the > pre-sorted hash key values (and reducing the complexity to O(n log n)). > In order to futher improve the detection mechanism, the algorithm will also > store some auxiliary data, used to compute the hash keys. This will be used > as a second filter, i.e. after the hash key comparison indicates that two > files are potentiallly similar. > > Properties for the similarity function and the algorithm: > - very similar files map to very similar or even the same hash key; > - distance between keys should be some measure of the difference between > files. This would lead to keys proportional to file sizes and this would > create false positives. Some auxiliary data will provide an easy and > efficient way of refining the similarity detection. > - the metric used will be a binary metric (simhash operats at byte level). > - given a similarity metric, there needs to be a threshold to determine how > close within the metric files need to be to count as similar. > > I would like my implementation of the simhash algorithm to answer two > questions: > - retrieving a set of files, similar to a given file; > - retrieving all pairs of similar files. > > > 2. simhash implementation > ------------------------------ > Chose 16, 8-bit tags - the bytes used for pattern matching. The idea is to > count the occurances of these tags within the file under processing. > > > 2.1 Computing the hash function for each file > ------------------------------------------------- > for each directory > for each file within directory > - scan through the file and look for matches for each tag; > - detection window is moved one bit at a time > - if a match if found > - set a skip_counter and then decrement it for the following bits; > - a count of matches is stored for each tag, and these are stored as > SUM_TABLE_ENTRIES > - the KEY for the current file is computed as a function of the > SUM_TABLE_ENTRIES (a linear combination of the sum) [#1] > - after this processing, we store: > | file_name | file_path | file_size | key | SUM_TABLE_ENTRIES | > > The key function could also be implemented to take into account file > extensions. > > > 2.2 Finding similarities > ---------------------------- > for each FILE > SIMILAR_FILES = files with keys within a tollerance range of KEY(FILE); > for each SIMILAR_FILE > if file_size(SIMILAR_FILE) differes a lot from file_size(FILE) > discard the query > else > compute the distance between SIMLAR_FILE and FILE ' s +
Cristian Prodan 2010-04-01, 06:32
-
Re: My ideas for GSoC 2010Robin Anil 2010-04-01, 06:36
Why dont you try it on 20 newsgroups. There are about 17-18 unique topics
and couple of overlapping ones. You can easily find issues with the clustering code with that dataset. Once its done you can try bigger datasets like wikipedia Robin On Thu, Apr 1, 2010 at 12:02 PM, Cristian Prodan <[EMAIL PROTECTED]>wrote: > Hi, > > Can anyone please point me a good data set on which I might try SimHash > clustering ? > Thank you, > > Cristi > > On Tue, Mar 23, 2010 at 10:35 AM, cristi prodan > <[EMAIL PROTECTED]>wrote: > > > Hello again, > > > > First of all, thank you all for taking time to answer my ideas. Based on > > your thoughts, I have been digging around, and the project I would very > > much like to propose is a MapReduce implementation of the simhash > algorithm. > > Thank you Ted for the great idea! I'm keen on starting implementing a new > > algorithm for Mahout, which seems manageable and doable during the summer > > period and which has a clear scope. I have noticed quite a few comments, > > advising about not taking a project which would be to big and I intend to > > follow them. > > > > After taking a deep look at the article pointed out by Ted, as well as > > other articles and simlar ideas (Charikar's algorithm, Ankur's patch for > > Minhash), I've decided to write some notes on the algorithm and a > schedule > > for implemeting the idea during the competition. I kindly ask for your > > feedback on this proposal. > > > > > > 1. Problem and outline of simhash > > --------------------------------------- > > > > Detecting similar files and cliassifying documents requires complex > > heuristics and/or O(n^2) pair-wise computations [1]. The simhash > algorithm > > [1] relies on computing a hash function that hashes similar files to > similar > > values. The file similarities would then be determined by comparing the > > pre-sorted hash key values (and reducing the complexity to O(n log n)). > > In order to futher improve the detection mechanism, the algorithm will > also > > store some auxiliary data, used to compute the hash keys. This will be > used > > as a second filter, i.e. after the hash key comparison indicates that two > > files are potentiallly similar. > > > > Properties for the similarity function and the algorithm: > > - very similar files map to very similar or even the same hash key; > > - distance between keys should be some measure of the difference between > > files. This would lead to keys proportional to file sizes and this would > > create false positives. Some auxiliary data will provide an easy and > > efficient way of refining the similarity detection. > > - the metric used will be a binary metric (simhash operats at byte > level). > > - given a similarity metric, there needs to be a threshold to determine > how > > close within the metric files need to be to count as similar. > > > > I would like my implementation of the simhash algorithm to answer two > > questions: > > - retrieving a set of files, similar to a given file; > > - retrieving all pairs of similar files. > > > > > > 2. simhash implementation > > ------------------------------ > > Chose 16, 8-bit tags - the bytes used for pattern matching. The idea is > to > > count the occurances of these tags within the file under processing. > > > > > > 2.1 Computing the hash function for each file > > ------------------------------------------------- > > for each directory > > for each file within directory > > - scan through the file and look for matches for each tag; > > - detection window is moved one bit at a time > > - if a match if found > > - set a skip_counter and then decrement it for the following bits; > > - a count of matches is stored for each tag, and these are stored as > > SUM_TABLE_ENTRIES > > - the KEY for the current file is computed as a function of the > > SUM_TABLE_ENTRIES (a linear combination of the sum) [#1] > > - after this processing, we store: > > | file_name | file_path | file_size | key | SUM_TABLE_ENTRIES | +
Robin Anil 2010-04-01, 06:36
-
Re: My ideas for GSoC 2010Cristian Prodan 2010-04-01, 07:51
Thanks Robin, I will try have a look at that.
Cristi. On Thu, Apr 1, 2010 at 9:36 AM, Robin Anil <[EMAIL PROTECTED]> wrote: > Why dont you try it on 20 newsgroups. There are about 17-18 unique topics > and couple of overlapping ones. You can easily find issues with the > clustering code with that dataset. Once its done you can try bigger > datasets > like wikipedia > > Robin > > On Thu, Apr 1, 2010 at 12:02 PM, Cristian Prodan > <[EMAIL PROTECTED]>wrote: > > > Hi, > > > > Can anyone please point me a good data set on which I might try SimHash > > clustering ? > > Thank you, > > > > Cristi > > > > On Tue, Mar 23, 2010 at 10:35 AM, cristi prodan > > <[EMAIL PROTECTED]>wrote: > > > > > Hello again, > > > > > > First of all, thank you all for taking time to answer my ideas. Based > on > > > your thoughts, I have been digging around, and the project I would > very > > > much like to propose is a MapReduce implementation of the simhash > > algorithm. > > > Thank you Ted for the great idea! I'm keen on starting implementing a > new > > > algorithm for Mahout, which seems manageable and doable during the > summer > > > period and which has a clear scope. I have noticed quite a few > comments, > > > advising about not taking a project which would be to big and I intend > to > > > follow them. > > > > > > After taking a deep look at the article pointed out by Ted, as well as > > > other articles and simlar ideas (Charikar's algorithm, Ankur's patch > for > > > Minhash), I've decided to write some notes on the algorithm and a > > schedule > > > for implemeting the idea during the competition. I kindly ask for your > > > feedback on this proposal. > > > > > > > > > 1. Problem and outline of simhash > > > --------------------------------------- > > > > > > Detecting similar files and cliassifying documents requires complex > > > heuristics and/or O(n^2) pair-wise computations [1]. The simhash > > algorithm > > > [1] relies on computing a hash function that hashes similar files to > > similar > > > values. The file similarities would then be determined by comparing the > > > pre-sorted hash key values (and reducing the complexity to O(n log n)). > > > In order to futher improve the detection mechanism, the algorithm will > > also > > > store some auxiliary data, used to compute the hash keys. This will be > > used > > > as a second filter, i.e. after the hash key comparison indicates that > two > > > files are potentiallly similar. > > > > > > Properties for the similarity function and the algorithm: > > > - very similar files map to very similar or even the same hash key; > > > - distance between keys should be some measure of the difference > between > > > files. This would lead to keys proportional to file sizes and this > would > > > create false positives. Some auxiliary data will provide an easy and > > > efficient way of refining the similarity detection. > > > - the metric used will be a binary metric (simhash operats at byte > > level). > > > - given a similarity metric, there needs to be a threshold to determine > > how > > > close within the metric files need to be to count as similar. > > > > > > I would like my implementation of the simhash algorithm to answer two > > > questions: > > > - retrieving a set of files, similar to a given file; > > > - retrieving all pairs of similar files. > > > > > > > > > 2. simhash implementation > > > ------------------------------ > > > Chose 16, 8-bit tags - the bytes used for pattern matching. The idea is > > to > > > count the occurances of these tags within the file under processing. > > > > > > > > > 2.1 Computing the hash function for each file > > > ------------------------------------------------- > > > for each directory > > > for each file within directory > > > - scan through the file and look for matches for each tag; > > > - detection window is moved one bit at a time > > > - if a match if found > > > - set a skip_counter and then decrement it for the following +
Cristian Prodan 2010-04-01, 07:51
-
Re: My ideas for GSoC 2010Tanya Gupta 2010-04-02, 09:14
Hi
There are couple of questions I would like to ask. 1. What type of clustering would you like me to use? Is K-Means good enough? 2.Can you tell me more about the map reduce code that you would have written. Or first do I need to implement that as well using Hadoop? Thanking You Tanya On Thu, Apr 1, 2010 at 1:21 PM, Cristian Prodan <[EMAIL PROTECTED]>wrote: > Thanks Robin, I will try have a look at that. > > Cristi. > > On Thu, Apr 1, 2010 at 9:36 AM, Robin Anil <[EMAIL PROTECTED]> wrote: > > > Why dont you try it on 20 newsgroups. There are about 17-18 unique topics > > and couple of overlapping ones. You can easily find issues with the > > clustering code with that dataset. Once its done you can try bigger > > datasets > > like wikipedia > > > > Robin > > > > On Thu, Apr 1, 2010 at 12:02 PM, Cristian Prodan > > <[EMAIL PROTECTED]>wrote: > > > > > Hi, > > > > > > Can anyone please point me a good data set on which I might try SimHash > > > clustering ? > > > Thank you, > > > > > > Cristi > > > > > > On Tue, Mar 23, 2010 at 10:35 AM, cristi prodan > > > <[EMAIL PROTECTED]>wrote: > > > > > > > Hello again, > > > > > > > > First of all, thank you all for taking time to answer my ideas. Based > > on > > > > your thoughts, I have been digging around, and the project I would > > very > > > > much like to propose is a MapReduce implementation of the simhash > > > algorithm. > > > > Thank you Ted for the great idea! I'm keen on starting implementing a > > new > > > > algorithm for Mahout, which seems manageable and doable during the > > summer > > > > period and which has a clear scope. I have noticed quite a few > > comments, > > > > advising about not taking a project which would be to big and I > intend > > to > > > > follow them. > > > > > > > > After taking a deep look at the article pointed out by Ted, as well > as > > > > other articles and simlar ideas (Charikar's algorithm, Ankur's patch > > for > > > > Minhash), I've decided to write some notes on the algorithm and a > > > schedule > > > > for implemeting the idea during the competition. I kindly ask for > your > > > > feedback on this proposal. > > > > > > > > > > > > 1. Problem and outline of simhash > > > > --------------------------------------- > > > > > > > > Detecting similar files and cliassifying documents requires complex > > > > heuristics and/or O(n^2) pair-wise computations [1]. The simhash > > > algorithm > > > > [1] relies on computing a hash function that hashes similar files to > > > similar > > > > values. The file similarities would then be determined by comparing > the > > > > pre-sorted hash key values (and reducing the complexity to O(n log > n)). > > > > In order to futher improve the detection mechanism, the algorithm > will > > > also > > > > store some auxiliary data, used to compute the hash keys. This will > be > > > used > > > > as a second filter, i.e. after the hash key comparison indicates that > > two > > > > files are potentiallly similar. > > > > > > > > Properties for the similarity function and the algorithm: > > > > - very similar files map to very similar or even the same hash key; > > > > - distance between keys should be some measure of the difference > > between > > > > files. This would lead to keys proportional to file sizes and this > > would > > > > create false positives. Some auxiliary data will provide an easy and > > > > efficient way of refining the similarity detection. > > > > - the metric used will be a binary metric (simhash operats at byte > > > level). > > > > - given a similarity metric, there needs to be a threshold to > determine > > > how > > > > close within the metric files need to be to count as similar. > > > > > > > > I would like my implementation of the simhash algorithm to answer two > > > > questions: > > > > - retrieving a set of files, similar to a given file; > > > > - retrieving all pairs of similar files. > > > > > > > > > > > > 2. simhash implementation +
Tanya Gupta 2010-04-02, 09:14
-
Re: My ideas for GSoC 2010Ankur C. Goel 2010-03-22, 08:03
Since Sean already answered IDEA-2, I'll reply to IDEA 1.
Minhash (and Shingling in general) are very efficient clustering techniques that have traditionally been employed by Search engines for near-duplicate detection of web documents. They are known to be efficient and effective at web-scale. Also they have been applied successfully to "near-neighbor or similar ..." type of problems in recommendations, search, and various other domains on text, audio and image data. So I think they are quite "cool". I personally have experimented with them in recommendations and found them to be surprisingly more effective than I anticipated especially when dealing with high dimensional data. I have a basic implementation that I submitted to mahout, see - https://issues.apache.org/jira/browse/MAHOUT-344. You are welcome to work on it if you'd like. Improving and integrating it with our recommender would make up for a good GSOC project IMO. Regards -@nkur 3/19/10 7:04 PM, "cristi prodan" <[EMAIL PROTECTED]> wrote: Dear Mahout community, My name is Cristi Prodan, I'm 23 years old and currently a 2nd year student pursuing a MSc degree in Computer Science. I started studying machine learning in the past year and during my research I found about the Mapreduce model. Then, I discovered hadoop and Mahout. I was very impressed by the power of these frameowrks and their great potential. For this reason I would like to submit a proposal for this year Google Summer of Code competition. I have looked at the proposals made by Robin on JIRA (https://issues.apache.org/jira/secure/IssueNavigator.jspa?mode=hide&requestId=12314021). I have stopped at two ideas. I would like to ask for your help in deciding which idea would be best to pick. Since I've never done GSoC before, I'm hoping someone would advise on the size of the project (too small or two big for the summer period) and mostly, it's importance for the Mahout framwork. After hearing your answers my intentions are to fully focus on the thourough research of a single idea. IDEA 1 - MinHash clustering --------------------------- The first idea come after taking a look at Google 's paper on collaborative filtering for their news system[2]. In that paper, I looked at MinHash clustering. My first question is: is MinHash clustering considered cool ? If yes, than I would like to take a stab at implementing it. The paper also describes the implementation in a MapReduce style. Since this is only a suggestion I will not elaborate very much on the solution now. I would like to ask you weather this might be considered a good choice (i.e. important for the framework to have something like this) and if this is a big enough project. IDEA 2 - Additions to Taste Recommender --------------------------------------- As a second idea for this competition, was to add some capabilities to the Taste framework. I have revised a couple of papers from the Netflix contest winning teams, read chapters 1 thourgh 6 from [1] and looked into Taste's code. My idea was to implement a parallel prediction blending support by using linear regression or any other machine learning method - but so far I didn't got to a point where I would have a clear solution of this. I'm preparing my disertation paper on recommender systems and this was the first idea I got when thinking about participating to GSoC. If you have any ideas on this and want to share them, I would be very thankful. Thank you in advance. Best regards, Cristi. BIBLIOGRAPHY: --------------- [1] Owen, Anil - Mahout in Action. Manning, 2010. [2] Abhinandan Das, Mayur Datar, Ashutosh Garg, Shyam Rajaram - Google News Personalization: Scalable Online Collaborative Filtering, WWW 2007. +
Ankur C. Goel 2010-03-22, 08:03
-
Re: My ideas for GSoC 2010Sean Owen 2010-03-19, 13:44
I think that's a fine project indeed. It sounds even a little
ambitious for a GSoC project. Understanding, implementing, and parallelizing this approach is not trivial. If you want to propose it, sure, but scaling it back a little is probably OK too. As always it's best to propose a simple project that you can complete, document and test, rather than a big project you half-finish. On Fri, Mar 19, 2010 at 1:34 PM, cristi prodan <[EMAIL PROTECTED]> wrote: > IDEA 2 - Additions to Taste Recommender > --------------------------------------- > As a second idea for this competition, was to add some capabilities to the Taste framework. I have revised a couple of papers from the Netflix contest winning teams, read chapters 1 thourgh 6 from [1] and looked into Taste's code. My idea was to implement a parallel prediction blending support by using linear regression or any other machine learning method - but so far I didn't got to a point where I would have a clear solution of this. I'm preparing my disertation paper on recommender systems and this was the first idea I got when thinking about participating to GSoC. If you have any ideas on this and want to share them, I would be very thankful. +
Sean Owen 2010-03-19, 13:44
|