I’ll create a feature branch of Mahout in my git repo for simplicity (we are in code freeze for Mahout right now). Then if you could peel off your changes and make a PR against it, everyone can have a look before any change is made to the ASF repos.

Do a PR against this <>; even if it’s not working, we can take a look. The branch right now is just a snapshot of the current master in code freeze.

Mahout has always had methods to work with different levels of sparsity and you may have found a missing point to optimize. Let’s hope so.
On Aug 21, 2017, at 11:47 AM, Andrew Palumbo <[EMAIL PROTECTED]> wrote:

I should mention that the density threshold is currently set quite high, and we've been discussing a user-defined setting for this.  Something that we have not worked in yet.

From: Andrew Palumbo <[EMAIL PROTECTED]>
Sent: Monday, August 21, 2017 2:44:35 PM
Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)
We do currently have optimizations based on density analysis in use, e.g. in AtB.

+1 to PR. thanks for pointing this out.

From: Pat Ferrel <[EMAIL PROTECTED]>
Sent: Monday, August 21, 2017 2:26:58 PM
Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

That looks like ancient code from the old mapreduce days. If it passes unit tests, create a PR.

Just a guess here, but there are times when this might not speed things up but slow them down. However, for the very sparse matrices that you might see in CF, this could work quite well. Some of the GPU optimizations will eventually be keyed off the density of a matrix, or selectable from knowing its characteristics.

I use this code all the time and would be very interested in a version that works with CF style very sparse matrices.

Long story short, create a PR so the optimizer guys can think through the implications. If I can also test it I have some large real-world data where I can test real-world speedup.
On Aug 21, 2017, at 10:53 AM, Pat Ferrel <[EMAIL PROTECTED]> wrote:

Interesting indeed. What is “massive”? Does the change pass all unit tests?
On Aug 17, 2017, at 1:04 PM, Scruggs, Matt <[EMAIL PROTECTED]> wrote:

Thanks for the remarks guys!

I profiled the code running locally on my machine and discovered that this loop is where the setQuick() and getQuick() calls originate (during matrix Kryo deserialization), and as you can see, the complexity of this 2D loop can be very high:
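The snippet didn't come through, but the pattern being described is the generic Matrix.assign(matrix, function) loop, which visits every cell of an n x m matrix regardless of sparsity. A minimal self-contained sketch of that pattern (plain Java arrays standing in for Mahout's Matrix, not the actual Mahout source):

```java
import java.util.function.DoubleBinaryOperator;

public class DenseAssignSketch {

    // Analogous to the generic assign(other, function): a 2D loop that
    // touches every cell via getQuick()/setQuick()-style access, so the
    // cost is O(rows * cols) even when almost all cells are zero.
    static void assignDense(double[][] target, double[][] other, DoubleBinaryOperator fn) {
        for (int row = 0; row < target.length; row++) {
            for (int col = 0; col < target[row].length; col++) {
                target[row][col] = fn.applyAsDouble(target[row][col], other[row][col]);
            }
        }
    }

    public static void main(String[] args) {
        double[][] a = {{1, 0, 0}, {0, 2, 0}};
        double[][] b = {{0, 0, 3}, {0, 0, 0}};
        // Touches all 6 cells even though only 3 are non-zero.
        assignDense(a, b, Double::sum);
        System.out.println(a[0][2]);
    }
}
```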
Recall that this algorithm uses SparseRowMatrix whose rows are SequentialAccessSparseVector, so all this looping seems unnecessary. I created a new subclass of SparseRowMatrix that overrides that assign(matrix, function) method, and instead of looping through all the columns of each row, it calls SequentialAccessSparseVector.iterateNonZero() so it only has to touch the cells with values. I also had to customize MahoutKryoRegistrator a bit with a new default serializer for this new matrix class. This yielded a massive performance boost, and I verified that the results match exactly for several test cases and datasets. I realize this could have side-effects in some cases, but I'm not using any other part of Mahout, only SimilarityAnalysis.cooccurrencesIDSs().
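To make the idea concrete, here is a hedged, self-contained sketch of the sparse-aware assign (not the actual patch; ordered column-to-value maps stand in for SequentialAccessSparseVector). Note the correctness caveat it carries, which is the likely source of the side-effects mentioned above: skipping zero cells is only valid when the function leaves a cell unchanged against zero, i.e. fn(x, 0) == x, as with the additive merge during deserialization.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.DoubleBinaryOperator;

public class SparseAssignSketch {

    // Merge 'other' into 'target', visiting only cells that are non-zero in
    // 'other' (the iterateNonZero() analogue). Cost drops from
    // O(rows * cols) to O(nnz). CAVEAT: only correct when fn(x, 0) == x,
    // otherwise cells that are zero in 'other' would be wrongly skipped.
    static void assignSparse(List<TreeMap<Integer, Double>> target,
                             List<TreeMap<Integer, Double>> other,
                             DoubleBinaryOperator fn) {
        for (int row = 0; row < other.size(); row++) {
            for (Map.Entry<Integer, Double> e : other.get(row).entrySet()) {
                double cur = target.get(row).getOrDefault(e.getKey(), 0.0);
                target.get(row).put(e.getKey(), fn.applyAsDouble(cur, e.getValue()));
            }
        }
    }

    public static void main(String[] args) {
        TreeMap<Integer, Double> row = new TreeMap<>();
        row.put(1, 2.0);                      // target row: only column 1 set
        TreeMap<Integer, Double> incoming = new TreeMap<>();
        incoming.put(1, 3.0);
        incoming.put(4, 5.0);                 // incoming row: columns 1 and 4
        // Only 2 cells are touched, however wide the row is.
        assignSparse(List.of(row), List.of(incoming), Double::sum);
        System.out.println(row.get(1) + " " + row.get(4));
    }
}
```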

Any thoughts / comments?

On 8/16/17, 8:29 PM, "Ted Dunning" <[EMAIL PROTECTED]> wrote:

> It is common with large numerical codes that things run faster in memory on