Interesting indeed. What is “massive”? Does the change pass all unit tests?
On Aug 17, 2017, at 1:04 PM, Scruggs, Matt <[EMAIL PROTECTED]> wrote:
Thanks for the remarks guys!
I profiled the code running locally on my machine and discovered this loop is where these setQuick() and getQuick() calls originate (during matrix Kryo deserialization), and as you can see the complexity of this 2D loop can be very high:https://github.com/apache/mahout/blob/08e02602e947ff945b9bd73ab5f0b45863df3e53/math/src/main/java/org/apache/mahout/math/AbstractMatrix.java#L240
Recall that this algorithm uses SparseRowMatrix whose rows are SequentialAccessSparseVector, so all this looping seems unnecessary. I created a new subclass of SparseRowMatrix that overrides that assign(matrix, function) method, and instead of looping through all the columns of each row, it calls SequentialAccessSparseVector.iterateNonZero() so it only has to touch the cells with values. I also had to customize MahoutKryoRegistrator a bit with a new default serializer for this new matrix class. This yielded a massive performance boost and I verified that the results match exactly for several test cases and datasets. I realize this could have side-effects in some cases, but I'm not using any other part of Mahout, only SimilaritAnalysis.cooccurrencesIDSs().
Any thoughts / comments?
On 8/16/17, 8:29 PM, "Ted Dunning" <[EMAIL PROTECTED]> wrote: