Sean Owen 2011-08-17, 09:48
Hi all, I'm again seeing the issue count tend to pile up. I try to run through regularly to resolve anything addressed to me, and even things that aren't but that I am confident enough to fix. It would be great if everyone could do the same in a spare 1-2 hours this week, if only to say "yes, go ahead on that patch" or "no I don't think this is a good idea". Especially the committers who have not been active in a while.
To me, this is the most essential work we can do, because without responses from those with power to commit, new community members get the message that their contributions are ignored, or that nobody's home. That's no good. Understanding that individuals may not have time to actively write their own new changes and improvements, it seems that the least we can all do is involve and respond to external input, to bring in those who want to make changes.
I'd also like to sweep through the issues that have not been touched in 6+ months and close some that just do not seem to be getting any traction or attention. The theory is that closing stuff that by all accounts won't get looked at better communicates what's coming in the project, and focuses attention on issues that might get looked at.
Before I start that, though, I would welcome anyone to peek at everything that's open and assign, comment on, or ping anything that needs to be kept alive.
Sebastian Schelter 2011-08-17, 10:13
I ran through the issues and compiled a list of tasks I'll care about, please keep them open.
MAHOUT-710 Implementing K-Trusses
I'll pick up work on that in a few weeks.
MAHOUT-777 Improve TransposeJob to use a Combiner
Patch needs review from Jake (or anyone else willing to do that).
MAHOUT-767 Improve RowSimilarityJob performance for count-based distance measures
Currently working on that.
MAHOUT-773 Implement Random Walk with Restarts
I have a working patch, but Ted suggested the use of random projections. I'll have a look into that if I get hold of a linear algebra guru :)
MAHOUT-737 Implicit Alternating Least Squares SVD
This one needs input from Tamas Jambor. The code looked good, yet he has to rework it to avoid a matrix inversion.
MAHOUT-609 Add an option to make RecommenderJob write out its computed item similarities
This should be a small change that I'll add in some spare time or before the next release.
--sebastian
Grant Ingersoll 2011-08-17, 10:44
More later, but…
On Aug 17, 2011, at 5:13 AM, Sebastian Schelter wrote:
> MAHOUT-767 Improve RowSimilarityJob performance for count-based distance measures
> Currently working on that.
I'm looking to test what you have on the ASF mail archives in the coming few weeks.
Shannon Quinn 2011-08-17, 14:25
My issues of interest:
MAHOUT-516: Eigencuts produces unexpected results
This is finding a decent heuristic for automatically determining the degree fed to the Lanczos solver.
MAHOUT-517: Eigencuts needs an output format
Something better than System.out.println()
MAHOUT-518: Implement Affinity Preprocessing for Eigencuts and Spectral KMeans
Have a map job sit in front of eigencuts/spectral k-means that converts some standard input format (perhaps CSV?) into the affinity matrix used by the algorithms.
MAHOUT-524: DisplaySpectralKMeans example fails
Something strange in the clusters shown in the example.
MAHOUT-537: Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
Somewhat on hold until a later version of Hadoop.
I'm sorry for my lack of activity; my summer internship with Google kept me busier than I'd expected. However, with the return of PhD research, my advisor wants to very quickly--as in, the next couple months--push out a prototype for our framework that uses Mahout, so these issues should be attended to very, very soon.
Shannon
Ted Dunning 2011-08-17, 14:51
Just holler.
On Wed, Aug 17, 2011 at 3:13 AM, Sebastian Schelter <[EMAIL PROTECTED]> wrote:
> MAHOUT-773 Implement Random Walk with Restarts
> I have a working patch, but Ted suggested the use of random projections,
> I'll have a look into that if I get hold of a linear algebra guru :)
Frank Scholten 2011-08-17, 15:57
MAHOUT-612: Simplify configuring and running Mahout MapReduce jobs from Java using Java bean configuration
Currently working on Fuzzy K-Means configuration
Sebastian Schelter 2011-08-17, 16:09
Thanks for offering your help, but I guess it needs something more than someone helping by mail...
I need someone to sit down with me in my office and answer lots of possibly very embarrassing questions :)
--sebastian
Dmitriy Lyubimov 2011-08-17, 19:01
I will take a look, although there seems to be a lot of new stuff that I don't have time to read the science for.
On top of that, I have been planning for some time some improvements to SSVD scaling that get rid of current limitations, such as:
-- SSVD-wide enhancements, to allow better wide scaling, in summary to billions of non-zero elements per row:
   -- Remove the limitation of at least k+p rows per map task without causing "supersplits", by allowing blocked QR pushdown to reducers (or perhaps even automatic pushdown; I am not sure if that is possible).
   -- I have already used SSVD code that equips a vector with a preprocessor via the Hadoop Configured interface, allowing on-the-fly random projection, which makes it possible to randomly project very long rows without ever loading them into memory.
-- "SSVD-tall" improvements, to allow more vertical scaling (currently thought to be at about a billion rows with a lot of memory) by introducing more bottom-up divide-and-conquer QR steps in the middle.
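As a rough illustration of the on-the-fly projection idea above: the sketch below streams the (column, value) pairs of one sparse row and accumulates a k-dimensional projection, regenerating the random matrix entries for each column from a seed so that neither the row nor the random matrix is ever held in memory. This is not the SSVD code, and the Hadoop preprocessor hook is not shown; the names `project_row` and `long_row`, the Gaussian entries, and the seeding scheme are all assumptions made for the example. In SSVD terms, `k` here would play the role of k+p.

```python
import random


def project_row(entries, k, seed=0):
    """Project one sparse row onto k dimensions in a single streaming pass.

    `entries` is any iterable of (column_index, value) pairs; it is consumed
    once and never stored, so memory use is O(k) regardless of row length.
    """
    projected = [0.0] * k
    for col, val in entries:
        # Re-create the random matrix entries for column `col` from
        # (seed, col), so the matrix never has to be materialized.
        rng = random.Random((seed << 64) | col)
        for j in range(k):
            projected[j] += val * rng.gauss(0.0, 1.0)
    return projected


def long_row(n):
    """Hypothetical stand-in for a very long row read from disk."""
    for col in range(n):
        yield col, 1.0 / (col + 1)


print(project_row(long_row(10_000), k=8, seed=42))
```

Because the random entries depend only on the seed and the column index, every map task that sees the same seed applies the same implicit projection matrix, whatever order the entries arrive in.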
Unfortunately, I see most of those improvements (except probably the preprocessor improvement, and perhaps QR pushdown) as a purely theoretical challenge, as I have yet to find a use case for them either myself or in public; hence it is merely a theoretical scaling interest right now. Even a dense matrix of a million by a million is already a 5 to 8 TB input file, which is a challenge for me to find, much less benchmark on a thousand-node cluster, and this case is thought to be already well covered even by the current code. A potential challenge to it is high deviation in the number of nonzero elements per row of the input (so that it may be a million on average with spikes to a billion or so, which would mean an 8 GB vector).
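The sizes quoted above can be checked with back-of-the-envelope arithmetic. The sketch below assumes 8-byte double-precision values and decimal units (1 TB = 10^12 bytes); the helper name `dense_size_bytes` is made up for the example. The upper figure of 8 TB follows directly from that assumption; the 5 TB lower figure would need a more compact encoding of about 5 bytes per value, which the message does not specify.

```python
TB = 10**12  # decimal terabyte
GB = 10**9   # decimal gigabyte


def dense_size_bytes(rows, cols, bytes_per_value=8):
    """Raw payload size of a dense rows x cols matrix (no headers or indices)."""
    return rows * cols * bytes_per_value


# A dense million-by-million matrix of doubles.
print(dense_size_bytes(10**6, 10**6) / TB, "TB")

# A single row that spikes to a billion non-zero doubles.
print(dense_size_bytes(1, 10**9) / GB, "GB")
```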
Given that I seem to be buried in ever-increasing work and household tasks, I don't see myself doing much of that in the next 6 months or so, except for the improvements that already exist on the side.
-d
Ted Dunning 2011-08-17, 19:13
Here are my thoughts so far: http://dl.dropbox.com/u/36863361/sd-2.pdf
and tex source: http://dl.dropbox.com/u/36863361/sd-2.tex
I think that this gets rid of the QR steps. I am still debugging the case of a singular matrix, but that shouldn't apply to any real cases.
Grant Ingersoll 2011-08-18, 13:37
On Aug 17, 2011, at 11:09 AM, Sebastian Schelter wrote:
> Thanks for offering your help, but I guess it needs something more than someone helping by mail...
> I need someone to sit down with me in my office and answer lots of possibly very embarrassing questions :)
I doubt you are alone in those questions.
--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Grant Ingersoll 2011-08-18, 13:44
I intend to get through M-688 and M-627 soon. I'd appreciate some other eyeballs on M-627. I think M-399 warrants more interest, but I also seem to recall Jake saying he has a pretty significant overhaul of LDA coming anyway, so it may not be worth the time.
Seems like with this push, we could get to 0.6 sometime in late Sept or Oct.?
Sean Owen 2011-08-18, 22:53
I think all the renewed activity is great. Next week, I will update JIRA to reflect these comments, and perhaps close out some items that are not mentioned here.
We seem to make a release after about 6 months or 150 JIRA issues. There are no hard rules about that, but it seems like a fine pace to date. That would put us at a new release around January next year by default. Hey, if there's a surge of activity, let's make it sooner.