+1 to an h2o profile. Do you want to target 0.13.1 for this? I would like to keep officially supporting h2o as long as we can, since it highlights the abstraction so well (using custom H2OMatrix classes rather than Mahout Matrices).
> Maybe it's time to drop H2O "official support" and move Flink Batch / H2O
into a "mahout/community/engines" folder.
Interesting idea re: a "mahout/community/engines" folder - not sure how much of a difference it would make, but I'm open to it.
> I'd put FlinkStreaming as another community engine.
> Speaking of Beam, I've heard rumblings here and there of people talking
about making a Beam engine- this might motivate people to get started (no
one person feels responsible for "boiling the ocean" and throwing down an
entire engine in one go, but instead can hack out the portions they need).
> If we did that, I'd say- by convention we need a Markdown document in
mahout/community/engines that has a table of what is implemented on what.
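For illustration, such a capability matrix might look something like the table below (the operations and statuses shown here are hypothetical placeholders, not an actual inventory):

```
| Operation        | Spark | Flink Batch | H2O |
|------------------|-------|-------------|-----|
| A %*% B          | yes   | yes         | yes |
| drmSampleRows    | yes   | no          | no  |
| sortByColumn     | yes   | partial     | no  |
```

A single table like this makes it immediately obvious where a contributor could pick up a missing cell.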
Agreed - at least a single doc. We have to be very careful about ending up in "what is mahout" territory.
I believe that at this juncture, though, we need to come up with a solid structure to avoid confusion. Yes, streaming engines will be very useful.
Maybe someone could start a GDoc to begin outlining a streaming engine plan. It would be good to structure streaming engines in the same way as we now have batch, IMO - something like a "streaming-math-scala" with a familiar DSL for, e.g., Flink and Spark Streaming. Though it's easy to see quickly how these two would differ, and in some ways it would not be as easy to "extend" modules as we do for batch.
I would think that this should target 0.14.x, which we'd discussed long ago as being mainly an algo series of releases; however, adding in streaming engines would be similar to adding in an algo which probes for MPI - something that we've also discussed for 0.14.x.
I would like to get JCuda into 0.13.1 or 0.13.2 (if we do a 0.13.2, and depending on the timeline of 0.13.1).
From: Andrew Palumbo <[EMAIL PROTECTED]>
Sent: Tuesday, September 5, 2017 5:04:40 PM
To: [EMAIL PROTECTED]
Subject: [DISCUSS] New feature - DRM and in-core matrix sort and required test suites for modules.
I've found a need for sorting a DRM as well as in-core matrices - something like, e.g., DrmLike.sortByColumn(...). I would like to implement this at the engine-neutral math-scala level, with pass-through functions to the underlying back ends.
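A minimal sketch of what the engine-neutral surface might look like (the name `sortByColumn` and the trait shape here are assumptions for discussion, not settled API):

```scala
// Hypothetical sketch only: an engine-neutral sorting op exposed at the
// math-scala level. Each engine's checkpointed DRM implementation would
// override this with its native sort (RDD.sortBy, DataSet.sortPartition, ...).
trait DrmSortOps[K] {
  /** Return a new DRM whose rows are ordered by the values in column `col`. */
  def sortByColumn(col: Int, ascending: Boolean = true): DrmLike[K]
}
```

The pass-through pattern keeps user code engine neutral while letting each back end pick the most efficient native sort.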
In-core would be engine neutral by current design (in-core matrices are all Mahout matrices, with the exception of h2o - which causes some concern).
For Spark, we can use RDD.sortBy(...).
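On Spark the pass-through could be a thin wrapper over RDD.sortBy; a rough sketch, assuming the DRM's underlying data is an `RDD[(K, Vector)]` of (row key, Mahout Vector) pairs (types simplified for illustration):

```scala
// Sketch only: sort DRM rows by the value in column `col` via RDD.sortBy.
// `rdd` is assumed to be the DRM's underlying RDD[(K, Vector)].
def sortByColumn[K](rdd: RDD[(K, Vector)], col: Int,
                    ascending: Boolean = true): RDD[(K, Vector)] =
  rdd.sortBy({ case (_, row) => row.get(col) }, ascending)
```

RDD.sortBy performs a shuffle, so this gives a total order across partitions without collapsing parallelism.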
For Flink, we can use DataSet.sortPartition(...).setParallelism(1). (There may be a better method; I will look deeper.)
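Since `sortPartition` only sorts within each partition, forcing parallelism 1 turns it into a global sort; a rough sketch of that approach (types simplified, names hypothetical):

```scala
// Sketch only: total sort of (key, row) pairs by the value in column `col`.
// With a single partition, the per-partition sort is effectively global.
def sortByColumn(ds: DataSet[(Int, Vector)], col: Int): DataSet[(Int, Vector)] =
  ds.map(t => (t._2.get(col), t))       // prepend the sort key
    .sortPartition(0, Order.ASCENDING)  // sort the single partition
    .setParallelism(1)
    .map(_._2)                          // drop the sort key
```

The obvious drawback is that the sort itself runs at parallelism 1; a range-partitioned sort would scale better if Flink's API allows it for this case.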
h2o has an implementation, I'm sure, but this brings me to a more important point: if we want to stub out a method in a back-end module, e.g. h2o, which test suites do we want to make requirements?
We've not set any specific rules for which test suites must pass for each module. We've had a soft requirement for inheriting and passing all test suites from math-scala.
Setting a rule for this is something that we need to do, IMO.
An easy option, I think, would be to set the current core math-scala suites as a requirement, and then allow for an optional suite covering methods which may be stubbed out.
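One way that split could be structured, sketched with hypothetical trait names (nothing here is existing Mahout code):

```scala
// Hypothetical sketch: divide the shared math-scala suites into a required
// core suite and an optional suite for ops a back end may legitimately stub.
trait RequiredDrmOpsSuite extends FunSuite {  // must pass on every engine
  test("A %*% B") { /* shared distributed-multiply checks ... */ }
}

trait OptionalDrmOpsSuite extends FunSuite {  // engines may omit stubbed ops
  test("sortByColumn") { /* shared sort checks ... */ }
}

// An engine module mixes in exactly what it implements, e.g.:
// class SparkDrmOpsSuite extends RequiredDrmOpsSuite with OptionalDrmOpsSuite
// class H2ODrmOpsSuite   extends RequiredDrmOpsSuite  // sort stubbed out
```

This keeps the rule mechanical: an engine compiles against the required trait unconditionally, and the capability table in the docs can be derived directly from which optional suites it mixes in.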