We do currently have optimizations based on density analysis in use e.g.: in AtB.https://github.com/apache/mahout/blob/08e02602e947ff945b9bd73ab5f0b45863df3e53/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/package.scala#L431
+1 to PR. thanks for pointing this out.
From: Pat Ferrel <[EMAIL PROTECTED]>
Sent: Monday, August 21, 2017 2:26:58 PM
To: [EMAIL PROTECTED]
Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)
That looks like ancient code from the old mapreduce days. If is passes unit tests create a PR.
Just a guess here but there are times when this might not speed up thing but slow them down. However for vey sparse matrixes that you might see in CF this could work quite well. Some of the GPU optimization will eventually be keyed off the density of a matrix, or selectable from knowing it’s characteristics.
I use this code all the time and would be very interested in a version that works with CF style very sparse matrices.
Long story short, create a PR so the optimizer guys can think through the implications. If I can also test it I have some large real-world data where I can test real-world speedup.
On Aug 21, 2017, at 10:53 AM, Pat Ferrel <[EMAIL PROTECTED]> wrote:
Interesting indeed. What is “massive”? Does the change pass all unit tests?
On Aug 17, 2017, at 1:04 PM, Scruggs, Matt <[EMAIL PROTECTED]> wrote:
Thanks for the remarks guys!
I profiled the code running locally on my machine and discovered this loop is where these setQuick() and getQuick() calls originate (during matrix Kryo deserialization), and as you can see the complexity of this 2D loop can be very high:https://github.com/apache/mahout/blob/08e02602e947ff945b9bd73ab5f0b45863df3e53/math/src/main/java/org/apache/mahout/math/AbstractMatrix.java#L240
Recall that this algorithm uses SparseRowMatrix whose rows are SequentialAccessSparseVector, so all this looping seems unnecessary. I created a new subclass of SparseRowMatrix that overrides that assign(matrix, function) method, and instead of looping through all the columns of each row, it calls SequentialAccessSparseVector.iterateNonZero() so it only has to touch the cells with values. I also had to customize MahoutKryoRegistrator a bit with a new default serializer for this new matrix class. This yielded a massive performance boost and I verified that the results match exactly for several test cases and datasets. I realize this could have side-effects in some cases, but I'm not using any other part of Mahout, only SimilaritAnalysis.cooccurrencesIDSs().
Any thoughts / comments?
On 8/16/17, 8:29 PM, "Ted Dunning" <[EMAIL PROTECTED]> wrote: