Is it possible to add it to Mahout so that the unit tests get run against it? If so, we also have a bunch of integration tests, as well as my real-world data.

Again, I don’t see anything wrong with skipping zeros in any case, but this method is known to be slower for certain types of math (IIRC). So I’d bet the unit tests will pass. But like I said in the other answer, we may need to figure out how best to use this. The first step is to measure its impact.

This is complicated by the GPU integration that is happening as we write, so those guys will have to help. They have already seen benefits of GPUs drop with low density so this could be the opposite case, bringing more benefit with less density.

As to being “post serialization”, we should see your entire job logs, but in most cases nothing is done on a cluster without first going through serialization; in other words, nothing can be done until the data is on the executors. There are lots of in-memory operations, of course, so in those cases there would be no serialization. Intuition says that if this does speed things up, it will be for CCO or, more generally, for certain types of matrices, and that in itself may be a very important case.

On Aug 21, 2017, at 11:25 AM, Scruggs, Matt <[EMAIL PROTECTED]> wrote:

Good question :D

For the dataset I mentioned in my first message, the entire run is almost 10x faster (I expect that speedup to be non-linear since it nearly eliminates a for loop...bigger gains for bigger datasets). It's possible there are other sections of the code I can't override (e.g. before serialization of the matrices) that could take advantage of the sparse matrix / vector APIs.

I didn't make these changes in Mahout itself. I created a custom Kryo deserializer within my app to deserialize SparseRowMatrix instances to a custom subclass of SparseRowMatrix; this new class overrides the AbstractMatrix implementation of apply(matrix, function). FWIW, all of my app's existing tests pass and it produces the same results as before these changes.
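
To make the idea concrete, here is a minimal sketch of the zero-skipping shortcut, written outside Mahout. Everything in it is an assumption for illustration: the class SparseAssignSketch, the methods assignDense and assignSkippingZeros, and the use of a Map as a stand-in for a sparse row are hypothetical and are not Mahout's SparseRowMatrix or AbstractMatrix API. It only shows why visiting the stored entries alone can give the same result with far fewer function calls when the function leaves a value unchanged for a zero argument (f(x, 0) == x, as with addition).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleBinaryOperator;

// Hypothetical stand-in for illustration only; not Mahout's matrix classes.
final class SparseAssignSketch {

    // Dense loop: calls f at every index, including those where `other` is zero.
    static int assignDense(double[] target, Map<Integer, Double> other,
                           DoubleBinaryOperator f) {
        int calls = 0;
        for (int i = 0; i < target.length; i++) {
            target[i] = f.applyAsDouble(target[i], other.getOrDefault(i, 0.0));
            calls++;
        }
        return calls;
    }

    // Sparse loop: calls f only for entries stored in `other`.
    // Assumption: f(x, 0) == x, otherwise the skipped positions would be wrong.
    static int assignSkippingZeros(double[] target, Map<Integer, Double> other,
                                   DoubleBinaryOperator f) {
        int calls = 0;
        for (Map.Entry<Integer, Double> e : other.entrySet()) {
            int i = e.getKey();
            target[i] = f.applyAsDouble(target[i], e.getValue());
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        Map<Integer, Double> other = new HashMap<>();
        other.put(2, 5.0);
        other.put(7, 1.5);

        double[] dense = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        double[] sparse = dense.clone();

        int denseCalls = assignDense(dense, other, Double::sum);
        int sparseCalls = assignSkippingZeros(sparse, other, Double::sum);

        System.out.println("same result: " + Arrays.equals(dense, sparse));
        System.out.println("dense calls: " + denseCalls + ", sparse calls: " + sparseCalls);
    }
}
```

Note that the shortcut is not safe for every function: with multiplication, f(x, 0) is 0 rather than x, so skipping the zero entries would leave stale values behind. Any real change would need to check that property of the function before taking the sparse path.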
On 8/21/17, 1:53 PM, "Pat Ferrel" <[EMAIL PROTECTED]> wrote: