Saikat Kanjilal

2017-02-09, 20:59

Jim Jagielski

2017-02-17, 14:04

Trevor Grant

2017-02-17, 14:56

Saikat Kanjilal

2017-02-17, 16:15

Jim Jagielski

2017-02-17, 16:18

Saikat Kanjilal

2017-02-17, 17:23

Andrew Palumbo

2017-02-17, 18:11

Dmitriy Lyubimov

2017-02-17, 21:34

Dmitriy Lyubimov

2017-02-17, 21:45

Saikat Kanjilal

2017-02-25, 22:42

- Mahout
- mail # dev
- Contributing an algorithm for samsara

Trevor et al,

I'd like to contribute an algorithm or two to Samsara using Spark, since I'd like to compare and contrast Mahout with R Server for a data science pipeline / machine learning repo I'm working on. Looking at the list of algorithms (https://mahout.apache.org/users/basics/algorithms.html), is there an algorithm for Spark that would be beneficial for the community? My use cases are typically clustering or real-time machine learning for building recommendations on the fly. The algorithms I see that could potentially be useful are: 1) matrix factorization with ALS, 2) logistic regression with SVD.


Any thoughts/guidance or recommendations would be very helpful.

Thanks in advance.


My own thoughts are that logistic regression seems a more "generalized", and hence more useful, algo to be factored in... at least in the use cases that I've been toying with.

So I'd like to help out with that if wanted...


Jim is right, and I would take it one further and say it would be best to implement GLMs (https://en.wikipedia.org/wiki/Generalized_linear_model); from there a logistic regression is a trivial extension.
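As an aside on why logistic regression falls out almost "for free" once GLMs exist: a generic GLM fitter built on iteratively reweighted least squares (IRLS) only needs the inverse link and variance functions swapped to cover the different families. A minimal single-machine sketch in plain NumPy (purely illustrative; this is not Mahout/Samsara code and all names are made up):

```python
import numpy as np

def fit_logistic_irls(X, y, iters=25):
    """Fit logistic regression via IRLS: the generic GLM loop with the
    logit link plugged in. Swapping the inverse link and variance
    function gives other GLM families (Gaussian, Poisson, ...)."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(iters):
        eta = X @ beta                     # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))    # inverse link (sigmoid)
        w = mu * (1.0 - mu)                # GLM variance function
        z = eta + (y - mu) / np.maximum(w, 1e-10)  # working response
        WX = X * w[:, None]
        # weighted least-squares step: solve (X' W X) beta = X' W z
        beta = np.linalg.solve(X.T @ WX, WX.T @ z)
    return beta

# toy data with known coefficients
rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=(200, 1))]
true_beta = np.array([-0.5, 2.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)
beta_hat = fit_logistic_irls(X, y)
```

The point of the sketch is the shape of the loop: only the two lines computing `mu` and `w` are family-specific, which is why a GLM framework subsumes logistic regression.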

Buyer beware: GLMs will be a bit of work. Doable, but that would be jumping in neck first for both Jim and Saikat...

MAHOUT-1928 and MAHOUT-1929

https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

^^ currently open JIRAs around Algorithms; you'll see logistic regression and GLMs are in there. If you have an algorithm you are particularly intimate with, or explicitly need/want, feel free to open a JIRA and assign it to yourself.

There is also a case to be made for implementing ALS:

1) It's a much better 'beginner' project.

2) Mahout has some world-class recommenders; a toy ALS implementation might help us think through how the other recommenders (e.g. CCO) will 'fit' into the framework, i.e. ALS being the toy prototype recommender that helps us think through building out that section of the framework.
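For readers new to the algorithm, the "toy" regularized ALS is small enough to sketch in a few lines, which is part of what makes it a good beginner project. Plain NumPy, purely illustrative (not the Mahout `decompositions` code; it treats every cell, including zeros, as observed, i.e. the dense case):

```python
import numpy as np

def als(R, rank=2, reg=0.1, iters=20):
    """Toy regularized ALS: factor R ~= U @ V.T by alternately solving
    a ridge-regression problem for U with V fixed, then for V with U
    fixed. Every cell of R is treated as observed (dense toy case)."""
    n, m = R.shape
    rng = np.random.default_rng(42)
    U = rng.normal(scale=0.1, size=(n, rank))
    V = rng.normal(scale=0.1, size=(m, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        U = R @ V @ np.linalg.inv(V.T @ V + ridge)    # solve for users
        V = R.T @ U @ np.linalg.inv(U.T @ U + ridge)  # solve for items
    return U, V

# tiny user-item rating matrix with two obvious taste clusters
R = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [1., 0., 5., 4.],
              [0., 1., 4., 5.]])
U, V = als(R)
R_hat = U @ V.T  # low-rank reconstruction used for scoring
```

Each half-step is a closed-form least-squares solve, which is also what makes ALS pleasant to express in a distributed algebra DSL.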

Trevor Grant

Data Scientist

https://github.com/rawkintrevo

http://stackexchange.com/users/3002022/rawkintrevo

http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*

On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski <[EMAIL PROTECTED]> wrote:


Jim,

What do you say we start with ALS and then tackle glm?

Sent from my iPhone


To start this off, I figure we should spend some time understanding the current implementations and the theory before we dig deep into implementing this in Mahout:

1) https://bugra.github.io/work/notes/2014-04-19/alternating-least-squares-method-for-collaborative-filtering/


2) https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala


3) https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/decompositions/ALS.scala


4) https://datasciencemadesimpler.wordpress.com/tag/alternating-least-squares/


Jim, I would suggest we spend some time researching and digging into these resources and circle back next week to get this off the ground. Let me know if you want to meet offline as well. I would recommend that the next step is a design proposal to the dev list describing how the implementation will fit into the current Samsara algorithms. What do you think?

Regards

________________________________

From: Jim Jagielski <[EMAIL PROTECTED]>

Sent: Friday, February 17, 2017 8:18 AM

To: [EMAIL PROTECTED]

Subject: Re: Contributing an algorithm for samsara

Sounds good to me. +1



+1 to glms

Sent from my Verizon Wireless 4G LTE smartphone



Jim,

if ALS is of interest, and as far as weighted ALS is concerned (since we already have trivial regularized ALS in the "decompositions" package), here's an uncommitted Samsara-compatible patch from a while back: https://issues.apache.org/jira/browse/MAHOUT-1365

It combines weights on both data points (a.k.a. "implicit feedback" ALS) and regularization rates (paper references are given). We combine both approaches in one (which is novel, I guess, but simple enough). Obviously the final solver can also be used as pure reg-rate regularized if wanted, making it equivalent to one of the papers.

You may know the implicit feedback paper from MLlib's implicit ALS, but unlike the way it was done over there (as a use-case-specific problem that takes input before features were even extracted), we split the problem into a pure algebraic solver (the double-weighted ALS math) and leave the feature extraction outside of this issue per se (it can be added as a separate adapter).

The reason for that is that the specific use-case-oriented implementation does not necessarily leave space for feature extraction that differs from the use case described in the paper, partially consumed streamed videos. (E.g., instead of videos one could count visits or clicks or add-to-cart events, which may need an additional hyperparameter found for them as part of feature extraction and converting observations into "weights".)

The biggest problem with these ALS methods, however, is that all the hyperparameters require multidimensional cross-validation and optimization. I think I mentioned it before in the list of desired solutions; as it stands, Mahout does not have a hyperparameter fitting routine.

In practice, when using these kinds of ALS, we have a case of multidimensional hyperparameter optimization. One parameter comes from the fitter (reg rate, or base reg rate in the case of weighted regularization), and the others come from the feature extraction process. E.g., in the original paper they introduce (at least) 2 formulas to extract measure weights from the streaming video observations, and each of them had one parameter, alpha, which in the context of the whole problem effectively becomes yet another hyperparameter to fit. In other use cases, when your confidence measurement may be coming from different sources and observations, the confidence extraction may actually have even more hyperparameters to fit than just one. And once we have a multidimensional case, simple approaches (like grid or random search) become either cost-prohibitive or ineffective, due to the curse of dimensionality.
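The cost argument is easy to see with back-of-envelope arithmetic (illustrative numbers only): a full grid needs candidate-values-per-axis raised to the number of hyperparameters, so every extra hyperparameter multiplies the number of cross-validated model fits.

```python
# Grid-search cost: with k candidate values per hyperparameter and
# d hyperparameters, a full grid requires k**d cross-validated fits.
k = 10  # hypothetical number of candidate values per axis
for d in (1, 2, 3, 5):
    print(f"{d} hyperparameter(s): {k ** d} fits")
```

At 5 hyperparameters that is already 100,000 fits of a distributed factorization, which is why smarter search (e.g. Bayesian optimization) starts to matter.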

At the time I was contributing that method, I was using it in conjunction with a multidimensional Bayesian optimizer, but the company that I wrote it for did not have the optimizer approved for contribution (unlike the weighted ALS) at that time.

Anyhow, perhaps you could read the algebra in both ALS papers there and ask questions, and we could worry about hyperparameter optimization and performance a bit later.

On the feature extraction front (as in implicit feedback ALS per Koren etc.), this is an ideal use case for a more general R-like formula approach, which is also on the desired list of things to have.

So I guess we really have 3 problems here:

(1) double-weighted ALS

(2) Bayesian optimization and cross-validation in an n-dimensional hyperparameter space

(3) feature extraction per (preferably R-like) formula

-d
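For readers following along, the confidence-weighted ("implicit feedback") ALS idea discussed above can be sketched on a toy scale: observations are binarized into preferences, confidence grows with observation strength, and each factor row is its own small weighted ridge solve. Plain NumPy, purely illustrative; this is not the MAHOUT-1365 patch, and the names and constants are hypothetical:

```python
import numpy as np

def weighted_als(R, rank=2, reg=0.1, alpha=10.0, iters=15):
    """Toy confidence-weighted ALS in the implicit-feedback style:
    preference p_ui = [r_ui > 0], confidence c_ui = 1 + alpha * r_ui,
    and each user/item factor row is a weighted ridge-regression solve
    with the other side held fixed."""
    n, m = R.shape
    P = (R > 0).astype(float)   # binary preference targets
    C = 1.0 + alpha * R         # confidence weights on each cell
    rng = np.random.default_rng(1)
    X = rng.normal(scale=0.1, size=(n, rank))
    Y = rng.normal(scale=0.1, size=(m, rank))
    lam = reg * np.eye(rank)
    for _ in range(iters):
        for u in range(n):      # user rows, items fixed
            W = np.diag(C[u])
            X[u] = np.linalg.solve(Y.T @ W @ Y + lam, Y.T @ W @ P[u])
        for i in range(m):      # item rows, users fixed
            W = np.diag(C[:, i])
            Y[i] = np.linalg.solve(X.T @ W @ X + lam, X.T @ W @ P[:, i])
    return X, Y

# toy interaction counts with two obvious user/item clusters
R = np.array([[3., 1., 0., 0.],
              [2., 2., 0., 0.],
              [0., 0., 4., 1.],
              [0., 0., 1., 3.]])
X, Y = weighted_als(R)
scores = X @ Y.T  # predicted preference scores
```

Note how the observation-to-confidence mapping (`1 + alpha * r` here) is exactly the feature extraction step with its own hyperparameter `alpha`, which is what splits cleanly away from the pure algebraic solver.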


In particular, this is the Samsara implementation of double-weighted ALS:

https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626

On Fri, Feb 17, 2017 at 1:33 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:


Dmitriy,

I have skimmed through the current Samsara implementation and your input below, and I have some initial questions. For starters, I would like to take advantage of the work you've already done and bring it into a production state. Given that, here are some thoughts/questions:

1) What work does the pull request below still need: unit tests, integration tests? The implementation seems complete from reading the code, but I'm coming into this new, so I'm not sure here.

2) It seems to me that your points 2 and 3 could be written as generic Mahout modules that can be used by all algorithms as appropriate. What do you think?

3) On the feature extraction per R-like formula, can you elaborate more here? Are you talking about feature extraction using R-like data frames and operators?

More later as I read through the papers.

________________________________

From: Dmitriy Lyubimov <[EMAIL PROTECTED]>

Sent: Friday, February 17, 2017 1:45 PM

To: [EMAIL PROTECTED]

Subject: Re: Contributing an algorithm for samsara

in particular, this is the samsara implementation of double-weighed als :

https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626


