Saikat Kanjilal

2017-02-09, 20:59

Jim Jagielski

2017-02-17, 14:04

Trevor Grant

2017-02-17, 14:56

Saikat Kanjilal

2017-02-17, 16:15

Jim Jagielski

2017-02-17, 16:18

Saikat Kanjilal

2017-02-17, 17:23

Andrew Palumbo

2017-02-17, 18:11

Dmitriy Lyubimov

2017-02-17, 21:34

Dmitriy Lyubimov

2017-02-17, 21:45

Saikat Kanjilal

2017-02-25, 22:42

Jim Jagielski

2017-03-03, 12:09

Dmitriy Lyubimov

2017-03-03, 20:28

Dmitriy Lyubimov

2017-03-03, 20:31

Dmitriy Lyubimov

2017-03-03, 20:36

- Mahout
- mail # dev
- Contributing an algorithm for samsara

Trevor et al,

I'd like to contribute an algorithm or two to Samsara using Spark, since I'd like to compare and contrast Mahout with R Server for a data-science-pipeline / machine-learning repo that I'm working on. Looking at the list of algorithms (https://mahout.apache.org/users/basics/algorithms.html), is there an algorithm for Spark that would be beneficial for the community? My use cases would typically be around clustering or real-time machine learning for building recommendations on the fly. The algorithms I see that could potentially be useful are: 1) Matrix factorization with ALS 2) Logistic regression with SGD.


Any thoughts/guidance or recommendations would be very helpful.

Thanks in advance.


My own thoughts are that logistic regression seems a more "generalized" and hence more useful algo to be factored in... at least in the use cases that I've been toying with.

So I'd like to help out with that if wanted...



Jim is right, and I would take it one further and say it would be best to implement GLMs (https://en.wikipedia.org/wiki/Generalized_linear_model); from there, logistic regression is a trivial extension.

Buyer beware: GLMs will be a bit of work. Doable, but that would be jumping in neck first for both Jim and Saikat...

MAHOUT-1928 and MAHOUT-1929

https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

^^ Currently open JIRAs around Algorithms; you'll see logistic regression and GLMs are in there.

If you have an algorithm you are particularly intimate with, or explicitly need/want, feel free to open a JIRA and assign it to yourself.

There is also a case to be made for implementing ALS:

1) It's a much better 'beginner' project.

2) Mahout has some world-class recommenders; a toy ALS implementation might help us think through how the other recommenders (e.g. CCO) will 'fit' into the framework, i.e. ALS being the toy-prototype recommender that helps us think through building out that section of the framework.
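To make the "trivial extension" concrete, here is a toy sketch in plain NumPy (illustrative only; not the Samsara DSL or any Mahout API) of fitting logistic regression as a GLM via iteratively reweighted least squares. The only binomial-specific pieces are the sigmoid inverse link and the mu*(1-mu) variance function, so a generic GLM solver gets logistic regression almost for free:

```python
# Illustrative only: logistic regression as a GLM, fit by iteratively
# reweighted least squares (IRLS). Plain NumPy, not Mahout code; it shows
# why logistic regression "falls out" of a GLM solver once you plug in
# the binomial family's logit link and variance function.
import numpy as np

def irls_logistic(X, y, iters=25):
    """Newton's method for logistic regression, written as weighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # inverse link (logit -> sigmoid)
        w = mu * (1.0 - mu)                    # binomial variance function
        z = X @ beta + (y - mu) / np.maximum(w, 1e-12)   # working response
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, X.T @ (w * z))  # weighted LS solve
    return beta

# Toy, non-separable data: intercept column plus one feature.
X = np.column_stack([np.ones(6), np.arange(6.0)])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta = irls_logistic(X, y)
```

Swapping in a different link and variance function (identity/constant for Gaussian, log/mu for Poisson) gives the other GLM families with the same solver loop.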

Trevor Grant

Data Scientist

https://github.com/rawkintrevo

http://stackexchange.com/users/3002022/rawkintrevo

http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*



Jim,

What do you say we start with ALS and then tackle GLMs?




Sounds good to me. +1



To start this off, I figure we should spend some time understanding the current implementations and theory before we dig deep into implementing this in Mahout:

1) https://bugra.github.io/work/notes/2014-04-19/alternating-least-squares-method-for-collaborative-filtering/


2) https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala


3) https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/decompositions/ALS.scala


4) https://datasciencemadesimpler.wordpress.com/tag/alternating-least-squares/


Jim, I would suggest we spend some time researching and digging into these resources and circle back next week to get this off the ground; let me know if you want to meet offline as well. I would recommend that the next step is a design proposal to the dev list on how the implementation will fit into the current Samsara algorithms. What do you think?
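For reference, the alternating update those resources describe can be sketched in a few lines of plain NumPy. This is a hedged, dense toy under the assumption that every rating is observed, not the Spark MLlib or Mahout ALS.scala code linked above:

```python
# Illustrative only: a dense toy of alternating least squares (ALS)
# matrix factorization in plain NumPy. Because this toy treats every
# entry of R as observed, each sweep shares one ridge system; real
# implementations solve a small system per user/item over only the
# sparse observed ratings.
import numpy as np

def als(R, k=2, reg=0.1, iters=20, seed=0):
    """Factor R (m x n) as U @ V.T with U (m x k), V (n x k)."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.standard_normal((m, k))
    V = rng.standard_normal((n, k))
    I = reg * np.eye(k)
    for _ in range(iters):
        U = np.linalg.solve(V.T @ V + I, V.T @ R.T).T  # fix V, ridge-solve U
        V = np.linalg.solve(U.T @ U + I, U.T @ R).T    # fix U, ridge-solve V
    return U, V

# Toy ratings matrix: two "taste clusters" that a rank-2 factorization captures.
R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])
U, V = als(R, k=2)
```

Each half-step is a closed-form ridge regression, which is why ALS parallelizes so naturally across users and items.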

Regards

________________________________

From: Jim Jagielski <[EMAIL PROTECTED]>

Sent: Friday, February 17, 2017 8:18 AM

To: [EMAIL PROTECTED]

Subject: Re: Contributing an algorithm for samsara

Sounds good to me. +1

> On Feb 17, 2017, at 11:15 AM, Saikat Kanjilal <[EMAIL PROTECTED]> wrote:

>

> Jim,

> What do you say we start with ALS and then tackle glm?

>

>

> Sent from my iPhone

>

>> On Feb 17, 2017, at 6:56 AM, Trevor Grant <[EMAIL PROTECTED]> wrote:

>>

>> Jim is right, and I would take it one further and say, it would be best to

>> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,

[http://upload.wikimedia.org/wikipedia/commons/thumb/3/37/Biologist_and_statistician_Ronald_Fisher.jpg/200px-Biologist_and_statistician_Ronald_Fisher.jpg]<https://en.wikipedia.org/wiki/Generalized_linear_model>

Generalized linear model - Wikipedia<https://en.wikipedia.org/wiki/Generalized_linear_model>

en.wikipedia.org

Part of a series on Statistics: Regression analysis; Models; Linear regression; Simple regression; Ordinary least squares; Polynomial regression; General linear model

>> from there a Logistic regression is a trivial extension.

>>

>> Buyer beware- GLMs will be a bit of work- doable, but that would be jumping

>> in neck first for both Jim and Saikat...

>>

>> MAHOUT-1928 and MAHOUT-1929

>>

>> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

>>

>> ^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs are

>> in there.

>>

>> If you have an algorithm you are particularly intimate with, or explicitly

[https://avatars3.githubusercontent.com/u/5852441?v=3&s=400]<https://github.com/rawkintrevo>

rawkintrevo (Trevor Grant) · GitHub<https://github.com/rawkintrevo>

github.com

rawkintrevo has 22 repositories available. Follow their code on GitHub.

User rawkintrevo - Stack Exchange<http://stackexchange.com/users/3002022/rawkintrevo>

stackexchange.com

Fortuna Audaces Iuvat ~Chance Favors the Bold. top accounts reputation activity favorites subscriptions. Top Questions

[https://s0.wp.com/i/blank.jpg]<http://trevorgrant.org/>

The musings of rawkintrevo<http://trevorgrant.org/>

trevorgrant.org

Hot-rodder, opera enthusiast, mad data scientist; a man for all seasons.

1) https://bugra.github.io/work/notes/2014-04-19/alternating-least-squares-method-for-collaborative-filtering/

Alternating Least Squares Method for Collaborative ...<https://bugra.github.io/work/notes/2014-04-19/alternating-least-squares-method-for-collaborative-filtering/>

bugra.github.io

Alternating Least Square Formulation for Recommender Systems¶ We have users $u$ for items $i$ matrix as in the following: $$ Q_{ui} = \cases{ r & \text{if user u ...

2) https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala

[https://avatars1.githubusercontent.com/u/47359?v=3&s=400]<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala>

spark/ALS.scala at master · apache/spark · GitHub<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala>

github.com

spark - Mirror of Apache Spark ... * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements.

3) https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/decompositions/ALS.scala

mahout/ALS.scala at master · apache/mahout · GitHub<https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/decompositions/ALS.scala>

github.com

mahout - Mirror of Apache Mahout

4) https://datasciencemadesimpler.wordpress.com/tag/alternating-least-squares/

Alternating Least Squares – Data Science Made Simpler<https://datasciencemadesimpler.wordpress.com/tag/alternating-least-squares/>

datasciencemadesimpler.wordpress.com

Collaborative Filtering. Collaborative Filtering (CF) is a method of making automatic predictions about the interests of a user by learning its preferences (or taste ...

Jim I would suggest we spend some time researching and digging into these resources and circle back next week to get this off the ground, let me know if you want to meet offline as well, I would recommend the next steps is a design proposal to the dev list of how the implementation will fit into the current samsara algorithms, what do you think?

Regards

________________________________

From: Jim Jagielski <[EMAIL PROTECTED]>

Sent: Friday, February 17, 2017 8:18 AM

To: [EMAIL PROTECTED]

Subject: Re: Contributing an algorithm for samsara

Sounds good to me. +1

> On Feb 17, 2017, at 11:15 AM, Saikat Kanjilal <[EMAIL PROTECTED]> wrote:

>

> Jim,

> What do you say we start with ALS and then tackle glm?

>

>

> Sent from my iPhone

>

>> On Feb 17, 2017, at 6:56 AM, Trevor Grant <[EMAIL PROTECTED]> wrote:

>>

>> Jim is right, and I would take it one further and say, it would be best to

>> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,

Generalized linear model - Wikipedia<https://en.wikipedia.org/wiki/Generalized_linear_model>

en.wikipedia.org

Part of a series on Statistics: Regression analysis; Models; Linear regression; Simple regression; Ordinary least squares; Polynomial regression; General linear model

>> from there a Logistic regression is a trivial extension.

>>

>> Buyer beware- GLMs will be a bit of work- doable, but that would be jumping

>> in neck first for both Jim and Saikat...

>>

>> MAHOUT-1928 and MAHOUT-1929

>>

>> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

>>

>> ^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs are

>> in there.

>>

>> If you have an algorithm you are particularly intimate with, or explicitly


+1 to glms

Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------

From: Trevor Grant <[EMAIL PROTECTED]>

Date: 02/17/2017 6:56 AM (GMT-08:00)

To: [EMAIL PROTECTED]

Subject: Re: Contributing an algorithm for samsara

Jim is right, and I would take it one further and say, it would be best to

implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,

from there a Logistic regression is a trivial extension.

Buyer beware- GLMs will be a bit of work- doable, but that would be jumping

in neck first for both Jim and Saikat...

MAHOUT-1928 and MAHOUT-1929

https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs are

in there.

If you have an algorithm you are particularly intimate with, or explicitly

need/want- feel free to open a JIRA and assign to yourself.

There is also a case to be made for implementing the ALS...

1) It's a much better 'beginner' project.

2) Mahout has some world-class recommenders; a toy ALS implementation might help us think through how the other recommenders (e.g. CCO) will 'fit' into the framework. E.g., ALS could be the toy-prototype recommender that helps us think through building out that section of the framework.

Trevor Grant

Data Scientist

https://github.com/rawkintrevo

http://stackexchange.com/users/3002022/rawkintrevo

http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*

On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski <[EMAIL PROTECTED]> wrote:

> My own thoughts are that logistic regression seems a more "generalized"

> and hence more useful algo to be factored in... At least in the

> use cases that I've been toying with.

>

> So I'd like to help out with that if wanted...

>

> > On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal <[EMAIL PROTECTED]> wrote:

> >

> > Trevor et al,

> >

> > I'd like to contribute an algorithm or two in samsara using spark as I

> would like to do a compare and contrast with mahout with R server for a

> data science pipeline, machine learning repo that I'm working on, in

> looking at the list of algorithms (https://mahout.apache.org/

> users/basics/algorithms.html) is there an algorithm for spark that would

> be beneficial for the community, my use cases would typically be around

> clustering or real time machine learning for building recommendations on

> the fly. The algorithms I see that could potentially be useful are: 1)

> Matrix Factorization with ALS 2) Logistic regression with SVD.

> >

> > Apache Mahout: Scalable machine learning and data mining<

> https://mahout.apache.org/users/basics/algorithms.html>

> > mahout.apache.org

> > Mahout 0.12.0 Features by Engine¶ Single Machine MapReduce Spark H2O

> Flink; Mahout Math-Scala Core Library and Scala DSL

> >

> >

> >

> > Any thoughts/guidance or recommendations would be very helpful.

> > Thanks in advance.

>

>
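For concreteness, the GLM-to-logistic step Trevor describes above can be sketched as follows. This is an illustrative pure-Python sketch, not Mahout/Samsara code; the function names and toy data are made up for illustration. Logistic regression is the binomial GLM with the logit link, fit here by plain gradient descent on the log-loss:

```python
import math

def sigmoid(z):
    # Inverse link (logistic) function of the binomial GLM.
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit w, b for P(y=1|x) = sigmoid(w*x + b) by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # gradient of the log-loss
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Toy separable data: y = 1 when x > 0.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
```

Swapping the link function and loss gives the other members of the GLM family, which is why logistic regression falls out of a GLM framework almost for free.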


Jim,

If ALS is of interest, and as far as weighted ALS is concerned (we already have a trivial regularized ALS in the "decompositions" package), here's an uncommitted Samsara-compatible patch from a while back:

https://issues.apache.org/jira/browse/MAHOUT-1365

It combines weights on both the data points (a.k.a. "implicit feedback" ALS) and the regularization rates (paper references are given). We combine both approaches in one, which is novel, I guess, but still simple enough. The final solver can obviously also be used as purely reg-rate regularized if wanted, making it equivalent to one of the papers.
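The double-weighted objective (per-cell confidence weights plus a regularization rate) can be sketched at rank 1, where each alternating solve has a scalar closed form. This is an illustrative pure-Python sketch under that simplification, not the patch's Samsara implementation; `als_rank1` and the toy matrices are hypothetical:

```python
def als_rank1(P, C, lam=0.1, iters=20):
    """Rank-1 confidence-weighted ALS: P ~ outer(x, y).

    Minimizes sum_ij C[i][j]*(P[i][j] - x[i]*y[j])**2
              + lam*(sum x**2 + sum y**2).
    """
    m, n = len(P), len(P[0])
    x = [1.0] * m
    y = [1.0] * n
    for _ in range(iters):
        for i in range(m):  # solve for x with y fixed (scalar closed form)
            num = sum(C[i][j] * P[i][j] * y[j] for j in range(n))
            den = sum(C[i][j] * y[j] * y[j] for j in range(n)) + lam
            x[i] = num / den
        for j in range(n):  # solve for y with x fixed
            num = sum(C[i][j] * P[i][j] * x[i] for i in range(m))
            den = sum(C[i][j] * x[i] * x[i] for i in range(m)) + lam
            y[j] = num / den
    return x, y

def weighted_loss(P, C, x, y, lam):
    data = sum(C[i][j] * (P[i][j] - x[i] * y[j]) ** 2
               for i in range(len(P)) for j in range(len(P[0])))
    reg = lam * (sum(v * v for v in x) + sum(v * v for v in y))
    return data + reg

# Toy preference matrix with per-cell confidence weights.
P = [[1.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
C = [[5.0, 3.0, 1.0],
     [4.0, 1.0, 1.0],
     [1.0, 3.0, 5.0]]
x, y = als_rank1(P, C)
```

At rank k > 1 each update becomes a small k-by-k regularized least-squares solve per row, but the alternating structure is the same.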

You may know the implicit feedback paper from MLlib's implicit ALS, but unlike the way it was done over there (as a use-case-specific problem that takes input before features are even extracted), we split the problem into a pure algebraic solver (the double-weighted ALS math) and leave feature extraction outside of this issue per se (it can be added as a separate adapter).

The reason is that a use-case-oriented implementation does not necessarily leave room for feature extraction different from the paper's described use case of partially consumed streamed videos. (E.g., instead of videos one could count visits, clicks, or add-to-cart events, which may need additional hyperparameters found for them as part of extracting features and converting observations into "weights".)
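As a concrete example of such an adapter, the two confidence mappings from the Hu-Koren-Volinsky implicit-feedback paper (linear, c = 1 + alpha*r, and logarithmic, c = 1 + alpha*log(1 + r/eps)) can be sketched in a few lines. The helper names below are hypothetical and this is pure Python, not Samsara; alpha (and eps) are exactly the extra hyperparameters discussed in this thread:

```python
import math

def confidence_linear(r, alpha=40.0):
    # c = 1 + alpha*r  (linear confidence in the event count r)
    return 1.0 + alpha * r

def confidence_log(r, alpha=40.0, eps=1e-8):
    # c = 1 + alpha*log(1 + r/eps)  (dampens very heavy users)
    return 1.0 + alpha * math.log(1.0 + r / eps)

def to_preference_and_confidence(counts, conf=confidence_linear):
    """Split raw event counts into binary preferences and confidences."""
    P = [[1.0 if r > 0 else 0.0 for r in row] for row in counts]
    C = [[conf(r) for r in row] for row in counts]
    return P, C

# Raw counts (e.g. clicks) per user/item cell.
counts = [[3, 0], [0, 7]]
P, C = to_preference_and_confidence(counts)
```

Keeping this mapping outside the solver, as proposed above, lets clicks, visits, or add-to-cart events plug in through the same adapter interface.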

The biggest problem with these ALS methods, however, is that all the hyperparameters require multidimensional cross-validation and optimization. I think I mentioned this before in the list of desired solutions; as it stands, Mahout does not have a hyperparameter fitting routine.

In practice, when using these kinds of ALS, we have a case of multidimensional hyperparameter optimization. One hyperparameter comes from the fitter (the reg rate, or the base reg rate in the case of weighted regularization), and the others come from the feature extraction process. E.g., in the original paper they introduce (at least) two formulas to extract confidence weights from the streaming video observations, and each of them has one parameter, alpha, which in the context of the whole problem effectively becomes yet another hyperparameter to fit. In other use cases, when your confidence measurements come from different sources and observations, the confidence extraction may have even more hyperparameters to fit than just one. And once we have a multidimensional case, simple approaches (like grid or random search) become either cost-prohibitive or ineffective, due to the curse of dimensionality.
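For comparison with the Bayesian approach mentioned below, plain random search over a hyperparameter box can be sketched as follows. The objective here is a stand-in toy function; in the ALS setting each trial would refit the model and measure held-out validation error. All names are illustrative:

```python
import random

def random_search(objective, bounds, n_trials=200, seed=0):
    """Minimize objective over a box by uniform random sampling.

    bounds: dict name -> (lo, hi). Returns (best_params, best_value).
    Cost grows with n_trials rather than exponentially with the number
    of dimensions (unlike grid search), but sampling is blind: no trial
    informs the next one, which is what Bayesian optimization improves.
    """
    rng = random.Random(seed)
    best_params, best_val = None, float("inf")
    for _ in range(n_trials):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in bounds.items()}
        val = objective(params)
        if val < best_val:
            best_params, best_val = params, val
    return best_params, best_val

# Stand-in for "validation error as a function of (lam, alpha)".
def toy_validation_error(p):
    return (p["lam"] - 0.05) ** 2 + (p["alpha"] - 40.0) ** 2 / 1600.0

best, err = random_search(toy_validation_error,
                          {"lam": (0.0, 1.0), "alpha": (1.0, 100.0)})
```

A Bayesian optimizer would replace the uniform sampler with a surrogate model that proposes the next trial based on all previous ones, which matters once evaluations (full ALS refits) are expensive.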

At the time I was contributing that method, I was using it in conjunction with a multidimensional Bayesian optimizer, but the company I wrote it for had not approved that part for contribution (unlike the weighted ALS) at the time.

Anyhow, perhaps you could read the algebra in both ALS papers there and ask questions, and we could worry about hyperparameter optimization and performance a bit later.

On the feature extraction front (as in implicit feedback ALS per Koren etc.), this is an ideal use case for a more general R-like formula approach, which is also on the desired list of things to have.

So I guess we really have 3 problems here:

(1) double-weighted ALS

(2) Bayesian optimization and cross-validation in an n-dimensional hyperparameter space

(3) feature extraction per (preferably R-like) formula.

-d

On Fri, Feb 17, 2017 at 10:11 AM, Andrew Palumbo <[EMAIL PROTECTED]> wrote:

> +1 to glms

>

>

>

> Sent from my Verizon Wireless 4G LTE smartphone

>

>

> -------- Original message --------

> From: Trevor Grant <[EMAIL PROTECTED]>

> Date: 02/17/2017 6:56 AM (GMT-08:00)

> To: [EMAIL PROTECTED]

> Subject: Re: Contributing an algorithm for samsara

>

> Jim is right, and I would take it one further and say, it would be best to

> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,

> from there a Logistic regression is a trivial extension.

>

> Buyer beware- GLMs will be a bit of work- doable, but that would be jumping


in particular, this is the samsara implementation of double-weighed als :

https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626

On Fri, Feb 17, 2017 at 1:33 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:

> Jim,

>

> if ALS is of interest, and as far as weighed ALS is concerned (since we

> already have trivial regularized ALS in the "decompositions" package),

> here's uncommitted samsara-compatible patch from a while back:

> https://issues.apache.org/jira/browse/MAHOUT-1365

>

> it combines weights on both data points (a.k.a "implicit feedback" als)

> and regularization rates (paper references are given). We combine both

> approaches in one (which is novel, i guess, but yet simple enough).

> Obviously the final solver can also be used as pure reg rate regularized if

> wanted, making it equivalent to one of the papers.

>

> You may know implicit feedback paper from mllib's implicit als, but unlike

> it was done over there (as a use case sort problem that takes input before

> even features were extracted), we split the problem into pure algebraic

> solver (double-weighed ALS math) and leave the feature extraction outside

> of this issue per se (it can be added as a separate adapter).

>

> The reason for that is that the specific use-case oriented implementation

> does not necessarily leave the space for feature extraction that is

> different from described use case of partially consumed streamed videos in

> the paper. (e.g., instead of videos one could count visits or clicks or

> add-to-cart events which may need additional hyperparameter found for them

> as part of feature extraction and converting observations into "weights").

>

> The biggest problem with these ALS methods however is that all

> hyperparameters require multidimensional crossvalidation and optimization.

> I think i mentioned it before as list of desired solutions, as it stands,

> Mahout does not have hyperparameter fitting routine.

>

> In practice, when using these kind of ALS, we have a case of

> multidimensional hyperparameter optimization. One of them comes from the

> fitter (reg rate, or base reg rate in case of weighed regularization), and

> the others come from feature extraction process. E.g., in original paper

> they introduce (at least) 2 formulas to extract measure weighs from the

> streaming video observations, and each of them had one parameter, alpha,

> which in context of the whole problem becomes effectively yet another

> hyperparameter to fit. In other use cases when your confidence measurement

> may be coming from different sources and observations, the confidence

> extraction may actually have even more hyperparameters to fit than just

> one. And when we have a multidimensional case, simple approaches (like grid

> or random search) become either cost prohibitive or ineffective, due to the

> curse of dimensionality.

>

> At the time i was contributing that method, i was using it in conjunction

> with multidimensional bayesian optimizer, but the company that i wrote it

> for did not have it approved for contribution (unlike weighed als) at that

> time.

>

> Anyhow, perhaps you could read the algebra in both ALS papers there and

> ask questions, and we could worry about hyperparameter optimization a bit

> later and performance a bit later.

>

> On the feature extraction front (as in implicit feedback als per Koren

> etc.), this is an ideal use case for more general R-like formula approach,

> which is also on desired list of things to have.

>

> So i guess we have 3 problems really here:

> (1) double-weighed ALS

> (2) bayesian optimization and crossvalidation in an n-dimensional

> hyperparameter space

> (3) feature extraction per (preferably R-like) formula.

>

>

> -d

>

>

> On Fri, Feb 17, 2017 at 10:11 AM, Andrew Palumbo <[EMAIL PROTECTED]>

> wrote:

>

>> +1 to glms

>>

>>

>>

>> Sent from my Verizon Wireless 4G LTE smartphone

>>

>>

>> -------- Original message --------

>> From: Trevor Grant <[EMAIL PROTECTED]>

>> Date: 02/17/2017 6:56 AM (GMT-08:00)


Dmitry,

I have skimmed through the current Samsara implementation and your input below and have some initial questions. For starters, I would like to take advantage of the work you've already done and bring it to a production state. Given that, here are some thoughts/questions:

1) What work does the pull request below still need: unit tests, integration tests? The implementation seems complete from reading the code, but I'm coming into this new, so I'm not sure.

2) It seems to me that your points 2 and 3 could be written as generic Mahout modules that can be used by all algorithms as appropriate; what do you think?

3) On feature extraction per an R-like formula, can you elaborate here? Are you talking about feature extraction using R-like dataframes and operators?

More later as I read through the papers.

________________________________

From: Dmitriy Lyubimov <[EMAIL PROTECTED]>

Sent: Friday, February 17, 2017 1:45 PM

To: [EMAIL PROTECTED]

Subject: Re: Contributing an algorithm for samsara

in particular, this is the samsara implementation of double-weighed als :

https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626


On Fri, Feb 17, 2017 at 1:33 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:

> Jim,

>

> if ALS is of interest, and as far as weighed ALS is concerned (since we

> already have trivial regularized ALS in the "decompositions" package),

> here's uncommitted samsara-compatible patch from a while back:

> https://issues.apache.org/jira/browse/MAHOUT-1365


>

> it combines weights on both data points (a.k.a "implicit feedback" als)

> and regularization rates (paper references are given). We combine both

> approaches in one (which is novel, i guess, but yet simple enough).

> Obviously the final solver can also be used as pure reg rate regularized if

> wanted, making it equivalent to one of the papers.

>

> You may know implicit feedback paper from mllib's implicit als, but unlike

> it was done over there (as a use case sort problem that takes input before

> even features were extracted), we split the problem into pure algebraic

> solver (double-weighed ALS math) and leave the feature extraction outside

> of this issue per se (it can be added as a separate adapter).

>

> The reason for that is that the specific use-case oriented implementation

> does not necessarily leave the space for feature extraction that is

> different from described use case of partially consumed streamed videos in

> the paper. (e.g., instead of videos one could count visits or clicks or

> add-to-cart events which may need additional hyperparameter found for them

> as part of feature extraction and converting observations into "weights").

>

> The biggest problem with these ALS methods however is that all

> hyperparameters require multidimensional crossvalidation and optimization.

> I think i mentioned it before as list of desired solutions, as it stands,

> Mahout does not have hyperparameter fitting routine.

>

> In practice, when using these kind of ALS, we have a case of

> multidimensional hyperparameter optimization. One of them comes from the

> fitter (reg rate, or base reg rate in case of weighed regularization), and

> the others come from feature extraction process. E.g., in original paper

> they introduce (at least) 2 formulas to extract measure weighs from the

> streaming video observations, and each of them had one parameter, alpha,

> which in context of the whole problem becomes effectively yet another

[http://upload.wikimedia.org/wikipedia/commons/thumb/3/37/Biologist_and_statistician_Ronald_Fisher.jpg/200px-Biologist_and_statistician_Ronald_Fisher.jpg]<https://en.wikipedia.org/wiki/Generalized_linear_model>

Generalized linear model - Wikipedia<https://en.wikipedia.org/wiki/Generalized_linear_model>

en.wikipedia.org

Part of a series on Statistics: Regression analysis; Models; Linear regression; Simple regression; Ordinary least squares; Polynomial regression; General linear model

[https://avatars3.githubusercontent.com/u/5852441?v=3&s=400]<https://github.com/rawkintrevo>

rawkintrevo (Trevor Grant) · GitHub<https://github.com/rawkintrevo>

github.com

rawkintrevo has 22 repositories available. Follow their code on GitHub.

User rawkintrevo - Stack Exchange<http://stackexchange.com/users/3002022/rawkintrevo>

stackexchange.com

Fortuna Audaces Iuvat ~Chance Favors the Bold. top accounts reputation activity favorites subscriptions. Top Questions

[https://s0.wp.com/i/blank.jpg]<http://trevorgrant.org/>

The musings of rawkintrevo<http://trevorgrant.org/>

trevorgrant.org

Hot-rodder, opera enthusiast, mad data scientist; a man for all seasons.

I have skimmed through the current samsara implementation and your input below and have some initial questions, for starters I would like to take advantage of the work you've already done and bring that into production state, given that, here are some thoughts/questions:

1) What work does the pull request below still need done, unit tests, integration tests , seems like the implementation is complete from reading the code but I'm coming into this new so not sure here?

2) It seems to be that your points 2 and 3 could be written as generic mahout modules that can be used by all algorithms as appropriate, what do you think?

3) On the feature extraction per R like formula can you elaborate more here, are you talking about feature extraction using R like dataframes and operators?

More later as I read through the papers.

________________________________

From: Dmitriy Lyubimov <[EMAIL PROTECTED]>

Sent: Friday, February 17, 2017 1:45 PM

To: [EMAIL PROTECTED]

Subject: Re: Contributing an algorithm for samsara

in particular, this is the samsara implementation of double-weighed als :

https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626

MAHOUT-1365 Implicit feedback ALS-WR by dlyubimov · Pull Request #14 · apache/mahout · GitHub<https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626>

github.com

mahout - Mirror of Apache Mahout

On Fri, Feb 17, 2017 at 1:33 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:

> Jim,

>

> if ALS is of interest, and as far as weighed ALS is concerned (since we

> already have trivial regularized ALS in the "decompositions" package),

> here's uncommitted samsara-compatible patch from a while back:

> https://issues.apache.org/jira/browse/MAHOUT-1365

issues.apache.org

Given preference P and confidence C distributed sparse matrices, compute ALS-WR solution for implicit feedback (Spark Bagel version). Following Hu-Koren-Volynsky ...

>

> it combines weights on both data points (a.k.a "implicit feedback" als)

> and regularization rates (paper references are given). We combine both

> approaches in one (which is novel, i guess, but yet simple enough).

> Obviously the final solver can also be used as pure reg rate regularized if

> wanted, making it equivalent to one of the papers.

>

> You may know implicit feedback paper from mllib's implicit als, but unlike

> it was done over there (as a use case sort problem that takes input before

> even features were extracted), we split the problem into pure algebraic

> solver (double-weighed ALS math) and leave the feature extraction outside

> of this issue per se (it can be added as a separate adapter).

>

> The reason for that is that the specific use-case oriented implementation

> does not necessarily leave the space for feature extraction that is

> different from described use case of partially consumed streamed videos in

> the paper. (e.g., instead of videos one could count visits or clicks or

> add-to-cart events which may need additional hyperparameter found for them

> as part of feature extraction and converting observations into "weghts").

>

> The biggest problem with these ALS methods however is that all

> hyperparameters require multidimensional crossvalidation and optimization.

> I think i mentioned it before as list of desired solutions, as it stands,

> Mahout does not have hyperarameter fitting routine.

>

> In practice, when using these kinds of ALS, we have a case of multidimensional hyperparameter optimization. One parameter comes from the fitter (the reg rate, or the base reg rate in the case of weighted regularization), and the others come from the feature extraction process. E.g., in the original paper they introduce (at least) two formulas to extract measure weights from the streaming video observations, and each of them has one parameter, alpha, which in the context of the whole problem effectively becomes yet another hyperparameter.
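Absent a built-in hyperparameter fitting routine, the multidimensional search described above can be as plain as a grid over (lambda, alpha) with a held-out score. Everything in this sketch (the stand-in model, the scoring function) is a hypothetical illustration, not a Mahout API:

```python
import itertools
import numpy as np

def confidence(R, alpha):
    """One of the Hu-Koren-style weighting formulas: c_ui = 1 + alpha * r_ui."""
    return 1.0 + alpha * R

def grid_search(R_train, R_hold, fit, score, lambdas, alphas):
    """Exhaustive search over the (lambda, alpha) grid: fit on training
    observations, score on held-out data, keep the best-scoring pair."""
    best = None
    for lam, alpha in itertools.product(lambdas, alphas):
        model = fit(R_train, confidence(R_train, alpha), lam)
        s = score(model, R_hold)
        if best is None or s < best[0]:
            best = (s, lam, alpha)
    return best

def fit_mean(R, C, lam):
    """Stand-in 'model' for the sketch: a shrunk global mean."""
    return R.sum() / (C.sum() + lam)

def rmse(mu, R_hold):
    return float(np.sqrt(np.mean((R_hold - mu) ** 2)))
```

In the real setting `fit` would be the weighted ALS solver and `score` a held-out ranking or reconstruction metric; the point is only that the reg rate and the extraction parameters end up in one joint search space.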


> On Feb 25, 2017, at 5:41 PM, Saikat Kanjilal <[EMAIL PROTECTED]> wrote:

>

> Dmitry,

>

> I have skimmed through the current Samsara implementation and your input below and have some initial questions. For starters, I would like to take advantage of the work you've already done and bring it into a production state.

+1. It looks v. impressive.

> Given that, here are some thoughts/questions:

>

>

> 1) What work does the pull request below still need: unit tests, integration tests? The implementation seems complete from reading the code, but I'm coming into this new, so I'm not sure.

>

> 2) It seems to me that your points 2 and 3 could be written as generic Mahout modules that can be used by all algorithms as appropriate; what do you think?

Would it make sense to keep them as-is, and "pull them out", as

it were, should they prove to be wanted/needed by the other algo users?

>

> 3) On the feature extraction per R-like formula, can you elaborate more here? Are you talking about feature extraction using R-like dataframes and operators?

>

>

>

> More later as I read through the papers.

>

>

> ________________________________

> From: Dmitriy Lyubimov <[EMAIL PROTECTED]>

> Sent: Friday, February 17, 2017 1:45 PM

> To: [EMAIL PROTECTED]

> Subject: Re: Contributing an algorithm for samsara

>

> in particular, this is the Samsara implementation of double-weighted ALS:

> https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626

> MAHOUT-1365 Implicit feedback ALS-WR by dlyubimov · Pull Request #14 · apache/mahout · GitHub<https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626>

> github.com

> mahout - Mirror of Apache Mahout

>

>

>

>

>

> On Fri, Feb 17, 2017 at 1:33 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:

>

>> Jim,

>>

>> if ALS is of interest, and as far as weighted ALS is concerned (since we

>> already have trivial regularized ALS in the "decompositions" package),

>> here's uncommitted samsara-compatible patch from a while back:

>> https://issues.apache.org/jira/browse/MAHOUT-1365


>

>

>

>>

>> it combines weights on both data points (a.k.a "implicit feedback" als)

>> and regularization rates (paper references are given). We combine both

>> approaches in one (which is novel, i guess, but yet simple enough).

>> Obviously the final solver can also be used as pure reg rate regularized if

>> wanted, making it equivalent to one of the papers.

>>

>> You may know implicit feedback paper from mllib's implicit als, but unlike

>> it was done over there (as a use case sort problem that takes input before

>> even features were extracted), we split the problem into pure algebraic

>> solver (double-weighted ALS math) and leave the feature extraction outside

>> of this issue per se (it can be added as a separate adapter).

>>

>> The reason for that is that the specific use-case oriented implementation

>> does not necessarily leave the space for feature extraction that is

>> different from described use case of partially consumed streamed videos in

>> the paper. (e.g., instead of videos one could count visits or clicks or

>> add-to-cart events which may need additional hyperparameter found for them

>> as part of feature extraction and converting observations into "weights").

>>

>> The biggest problem with these ALS methods however is that all

>> hyperparameters require multidimensional crossvalidation and optimization.

>> I think i mentioned it before as list of desired solutions, as it stands,

>> Mahout does not have a hyperparameter fitting routine.

>>

>> In practice, when using these kind of ALS, we have a case of

>> multidimensional hyperparameter optimization. One of them comes from the

>


I am getting a little bit lost about who asked what here; replying inline.

On Fri, Mar 3, 2017 at 4:09 AM, Jim Jagielski <[EMAIL PROTECTED]> wrote:

>

>

> Would it make sense to keep them as-is, and "pull them out", as

> it were, should they prove to be wanted/needed by the other algo users?

>

I would hope it is of some help (especially the math and the in-memory prototype) as something to look back to. I would really try to plot it all anew; I find it usually helps my focus if I work with my own code from the ground up.

So no, I would not just take it as-is. Not without careful review.

Also, if you noticed, the distributed version is quasi-algebraic, i.e., it contains direct Spark dependencies and code that relies on Spark. As such, it cannot be put into our decompositions package in the mahout-math-scala module, where most of the other distributed decompositions sit.

I suspect it could be made 100% algebraic with the current primitives available in Samsara. This is a necessary condition for getting it into mahout-math-scala. If it can't be done, then it has to live in the mahout-spark module as one backend implementation only.

>

> >

> > 3) On the feature extraction per R like formula can you elaborate more

> here, are you talking about feature extraction using R like dataframes and

> operators?

>

> >

> >

> >

> > More later as I read through the papers.

>

I would really start there before anything else. (Moreover, this is the most fun part of all of it, as far as I am concerned :) ).

Also, my adapted formulas are attached to the issue, as I mentioned. I would look through the math to see if it is clear (for interpretation); if not, let's discuss any questions.

> >


On Fri, Mar 3, 2017 at 4:09 AM, Jim Jagielski <[EMAIL PROTECTED]> wrote:

>

>>

>>

>

>> >

>> > 3) On the feature extraction per R like formula can you elaborate more

>> here, are you talking about feature extraction using R like dataframes and

>> operators?

>>

>

>

Yes. I would start with a generic formula parser and then the specific part that works with backend-specific data frames. For Spark, I don't see any reason to write our own; we'd just have an adapter for the Spark native data frames.


And by formula, yes, I mean R syntax.

A possible use case would be to take a Spark DataFrame and a formula (say, `age ~ . -1`) and produce DrmLike[Int] outputs (a distributed matrix type) that convert into predictors and target.

In this particular case, the formula means that the predictor matrix (X) would have all original variables except `age` (for categorical variables, factor extraction is applied), with no bias column.

Some knowledge of R and SAS is required to pin down the compatibility nuances there.

Maybe we could have reasonable simplifications or omissions compared to the R stuff, if we can be reasonably convinced it is actually better that way than the vanilla R contract, but IMO it would be really useful to retain 100% compatibility there, since that is one of the ideas here -- retaining R-like-ness with these things.
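A toy parser for the tiny formula subset mentioned in the thread (`.`, `+`, `- name`, `-1`) might look like the following. This is a hypothetical sketch of the contract, not the proposed Mahout parser, and it ignores most of R's real formula semantics (interactions, nesting, transformations):

```python
def parse_formula(formula, columns):
    """Parse 'target ~ terms' against a known column list.
    Supported terms: '.' (all columns but the target), '+ name',
    '- name' (drop a column), '-1' (drop the intercept/bias column).
    Returns (target, predictor_columns, intercept_flag)."""
    lhs, rhs = (side.strip() for side in formula.split("~"))
    keep, intercept, drop = [], True, False
    # Pad operators so 'a+b-1' and 'a + b - 1' tokenize the same way.
    for tok in rhs.replace("-", " - ").replace("+", " + ").split():
        if tok == "+":
            drop = False
        elif tok == "-":
            drop = True                      # next token is being removed
        elif tok == "1":
            if drop:
                intercept = False            # '-1' means no bias column
            drop = False
        elif tok == ".":
            keep.extend(c for c in columns if c != lhs and c not in keep)
            drop = False
        else:
            if drop:
                keep = [c for c in keep if c != tok]
            elif tok not in keep:
                keep.append(tok)
            drop = False
    return lhs, keep, intercept
```

A backend adapter would then select `keep` from the DataFrame as the predictor matrix (expanding factors for categorical columns) and `lhs` as the target, which is exactly the split into X and y that the `age ~ . -1` example describes.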

On Fri, Mar 3, 2017 at 12:31 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:

>

>

> On Fri, Mar 3, 2017 at 4:09 AM, Jim Jagielski <[EMAIL PROTECTED]> wrote:

>>

>>>

>>>

>>

>>> >

>>> > 3) On the feature extraction per R like formula can you elaborate more

>>> here, are you talking about feature extraction using R like dataframes and

>>> operators?

>>>

>>

>>

> Yes. I would start doing generic formula parser and then specific part

> that works with backend-specific data frames. For spark, i don't see any

> reason to write our own; we'd just had an adapter for the Spark native data

> frames.

>
