-
Non-compatible mapper keys between LDADriver and CVB0Driver
John Conwell 2012-01-24, 21:27

I wanted to compare the two LDA implementations, and I noticed a difference in how they read the input corpus sequence file (key: doc_id, value: vector). LDADriver accepts any WritableComparable<?> as the key of its input file, but CVB0Driver requires IntWritable explicitly.

Is there some reason these two LDA implementations can't both use WritableComparable<?> for the key of the input sequence file? It would make integrating them into application workflows much easier and more consistent.

--
Thanks,
John C
-
Re: Non-compatible mapper keys between LDADriver and CVB0Driver
Jake Mannix 2012-01-24, 21:48

In general, workflows with matrices in Mahout handle SequenceFile<IntWritable, VectorWritable>, as this is the on-disk format of the class DistributedRowMatrix. The original Mahout LDA pre-dated this move to standardize closer to that format, and so it didn't have that requirement.

Now, as you say, it's true that in the new implementation the keys aren't actually used, so in principle we could just go with WritableComparable<?> for the keys of CVB0Driver's mappers and reducers. In fact, it would make certain integrations a little nicer, at the cost of pushing the incompatibility somewhere else. For example, the output p(document | topic) distributions go into a SequenceFile whose keys are the same as the input corpus keys (i.e. the doc_id values), and there may be workflows which take this matrix and transpose it to multiply it by another matrix, or something of that nature. If the keys are IntWritable, this all works just fine. If not, then transpose will fail horribly, as will matrix multiplication.

Standardizing on a common fixed format internally avoids some of these problems, while at the same time being a bit inflexible.

It's possible we could add a command-line option plus some internal switches to allow the user to explicitly force untyped keys, or just warn on non-integer keys, or something...

I can just imagine seeing the questions on this very list when someone takes the output of their Long-keyed corpus and tries to matrix multiply it by some other matrix...

-jake
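The point about transpose can be illustrated with a small plain-Java sketch. This is only a toy model, not Mahout's DistributedRowMatrix: the class name `IntKeyedTranspose` and the nested-map representation are assumptions made for illustration. It shows why row keys have to be ints: after a transpose, each old row key is used as a column index of the new rows.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy stand-in for a row-keyed sparse matrix; hypothetical, not a Mahout class. */
final class IntKeyedTranspose {

    /** Input: rowId -> (columnIndex -> value). Output: columnIndex -> (rowId -> value). */
    static Map<Integer, Map<Integer, Double>> transpose(Map<Integer, Map<Integer, Double>> rows) {
        Map<Integer, Map<Integer, Double>> result = new TreeMap<>();
        for (Map.Entry<Integer, Map<Integer, Double>> row : rows.entrySet()) {
            int rowId = row.getKey();
            for (Map.Entry<Integer, Double> cell : row.getValue().entrySet()) {
                // The old row key becomes a column index in the transposed row,
                // which only works if the row key is an int like the column indices.
                result.computeIfAbsent(cell.getKey(), k -> new TreeMap<>())
                      .put(rowId, cell.getValue());
            }
        }
        return result;
    }
}
```

With a Text or Long row key there is no int to use as the new column index, which is the kind of failure described above.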
-
Re: Non-compatible mapper keys between LDADriver and CVB0Driver
John Conwell 2012-01-24, 22:27

Hi Jake,

Thanks for the explanation. I actually prefer using ints as key identifiers globally, rather than strings. It can help keep memory and GC utilization way down, especially in algorithms that have high iteration counts.

I had gone through an example that used the original LDA algorithm, and the samples used the filename as the document key rather than some kind of integer identifier, so I just went with that. It does make things easier when looking at your output results, since you don't have to keep a separate store that maps integer doc ids to friendly string names, but I don't think that is really all that important. For the long run, in my opinion I would definitely standardize on IntWritable for vector keys.

Thanks for the great explanation!

JohnC
-
Re: Non-compatible mapper keys between LDADriver and CVB0Driver
Jake Mannix 2012-01-25, 00:35

On Tue, Jan 24, 2012 at 2:27 PM, John Conwell <[EMAIL PROTECTED]> wrote:
> For the long run, in my opinion I would definitely standardize on IntWritable for vector keys.

Yeah, avoiding a separate store or mapping for "docId -> documentName" or whatnot is a good reason not to normalize this field, but since we already have to do this for the terms, for efficiency's sake, keeping an extra mapping for docs is not much of a big deal, IMO. The only part in which this becomes annoying is that there aren't very many ints (an int key can index only about 2.1 billion rows). Longs might be better, sometimes. Then again, *forcing* everyone to use big 8-byte longs for stuff which easily fits in ints can be silly, and doing this for *both* row keys and column keys wastes lots of space, but it is necessary for the idea of "transpose" or matrix multiplication to make sense.

> Thanks for the great explanation!

No problem.

-jake
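The int-versus-long trade-off is easy to put numbers on. The sketch below is back-of-the-envelope arithmetic only; it assumes fixed-width 4-byte and 8-byte encodings for both row keys and column indices, whereas a real on-disk encoding may be more compact. The class and method names are hypothetical.

```java
/** Rough index-overhead arithmetic; hypothetical helper, not part of Mahout. */
final class KeyWidthSketch {

    /** Bytes spent on row keys plus column indices, assuming fixed-width encodings. */
    static long indexBytes(long numRows, long nonZeroEntries, int bytesPerIndex) {
        // One key per row, one column index per stored (non-zero) entry.
        return (numRows + nonZeroEntries) * bytesPerIndex;
    }

    /** Largest row id an int key can hold. */
    static int maxIntRowId() {
        return Integer.MAX_VALUE;
    }
}
```

Under these assumptions, a corpus of one million documents with one hundred million non-zero entries spends about 404 MB on indices with ints (`indexBytes(1_000_000L, 100_000_000L, Integer.BYTES)`) and about 808 MB with longs.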
-
Re: Non-compatible mapper keys between LDADriver and CVB0Driver
John Conwell 2012-01-25, 01:31

Just ran into a problem trying to use IntWritable as my key when creating vectors so I can use CVB0Driver. I'm using the helper class SparseVectorsFromSequenceFiles to create my document vectors, and I create my sequence file with IntWritable as the key. SparseVectorsFromSequenceFiles calls DocumentProcessor to tokenize the documents, but DocumentProcessor's output is key: Text, value: StringTuple. This in turn causes an exception.

So it looks like these helper classes that create sequence files of VectorWritable, which are the input to a lot of these algorithms, are not compatible with some of the newer algorithms, like CVB0Driver. Is that correct?

And coming back to CVB0Driver, if someone wants to use it, they'll have to hand code the creation of VectorWritables, or possibly run the ones that are created by SparseVectorsFromSequenceFiles through a transform that outputs IntWritable keys. Correct?

BTW, I'm not trying to sound critical, I'm just trying to understand the architecture. Is this something you guys want to make consistent at some point, so that all vector keys are IntWritables and all helper classes consume and output pairs that have IntWritable keys? I might be interested in helping with that effort.

Thanks,
JohnC
-
Re: Non-compatible mapper keys between LDADriver and CVB0Driver
Jake Mannix 2012-01-25, 01:48

On Tue, Jan 24, 2012 at 5:31 PM, John Conwell <[EMAIL PROTECTED]> wrote:
> So it looks like these helper classes that create sequence files of VectorWritable, which are the input to a lot of these algorithms, are not compatible with some of the newer algorithms, like CVB0Driver. Is that correct?

$MAHOUT_HOME/bin/mahout rowid --help

to the rescue. :)
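Conceptually, the re-keying step that `rowid` performs can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the job's actual code: the class `RowIdSketch`, the `Result` holder, and the in-memory maps are hypothetical stand-ins for sequence files, and the idea that an index from new ids back to the original keys is kept alongside the matrix is an assumption worth checking against the `--help` output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of turning Text-keyed rows into int-keyed rows; hypothetical, not Mahout code. */
final class RowIdSketch {

    /** Int-keyed rows plus an index from each new int id back to the original key. */
    static final class Result<V> {
        final Map<Integer, V> matrix = new LinkedHashMap<>();
        final Map<Integer, String> docIndex = new LinkedHashMap<>();
    }

    /** Assigns sequential ids 0, 1, 2, ... in the iteration order of the input. */
    static <V> Result<V> assignRowIds(Map<String, V> textKeyedRows) {
        Result<V> result = new Result<>();
        int nextId = 0;
        for (Map.Entry<String, V> row : textKeyedRows.entrySet()) {
            result.docIndex.put(nextId, row.getKey());   // remember the friendly name
            result.matrix.put(nextId, row.getValue());   // same vector, new int key
            nextId++;
        }
        return result;
    }
}
```

Keeping the index means results keyed by int can still be traced back to filenames afterwards.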
-
Re: Non-compatible mapper keys between LDADriver and CVB0Driver
John Conwell 2012-01-25, 02:47

Oh you guys are sneaky. You thought of everything.

Do you guys have future refactoring plans to standardize on vector id data types?

Thanks,
John C
-
Re: Non-compatible mapper keys between LDADriver and CVB0Driver
Jake Mannix 2012-01-25, 03:13

On Tue, Jan 24, 2012 at 6:47 PM, John Conwell <[EMAIL PROTECTED]> wrote:
> Do you guys have future refactoring plans to standardize on vector id data types?

Well, the idea is that early on in the processing chain, you may not have integer ids. So for example, SequenceFilesFromDirectory takes text files from a directory and makes a SequenceFile<Text, Text>, with the key being the original filename and the value being the contents.

SparseVectorsFromSequenceFiles takes a SequenceFile<Text, Text> and spits out a SequenceFile<Text, VectorWritable>, with the key remaining unchanged from the previous step and the value being the vectorized form of the text.

RowIdJob then does the final normalization step of turning the Text keys into IntWritable keys.

-jake
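The vectorization stage described above, where the Text key passes through unchanged while terms are mapped to integer indices, can be sketched in plain Java. This is a toy under stated assumptions: `VectorizeSketch` is a hypothetical class, tokenization is a bare whitespace split, and values are raw term counts, none of which is claimed to match what SparseVectorsFromSequenceFiles actually does.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy vectorizer: keeps the filename key, maps terms to int indices; not Mahout code. */
final class VectorizeSketch {

    /**
     * docs: filename -> text. Returns filename -> (termIndex -> count).
     * The dictionary (term -> termIndex) is filled in as new terms are seen.
     */
    static Map<String, Map<Integer, Integer>> vectorize(Map<String, String> docs,
                                                        Map<String, Integer> dictionary) {
        Map<String, Map<Integer, Integer>> vectors = new LinkedHashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            Map<Integer, Integer> counts = new TreeMap<>();
            for (String token : doc.getValue().toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                Integer termIndex = dictionary.get(token);
                if (termIndex == null) {
                    termIndex = dictionary.size();       // next unused term index
                    dictionary.put(token, termIndex);
                }
                counts.merge(termIndex, 1, Integer::sum);
            }
            vectors.put(doc.getKey(), counts);           // key is still the filename
        }
        return vectors;
    }
}
```

The terms already need a dictionary of this kind, which is why an extra id-to-name mapping for documents adds little burden.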