-
clustering hardware requirements
Ioan Eugen Stan 2011-11-17, 02:39

Hello,

I have to figure out how much hardware is required to do clustering for my company on about 10+ million user accounts, each with 100-5000 documents. The documents will be indexed, so vector creation will be done at indexing time.

Is there any formula to approximate the size of the vectors based on the index size? I'm looking for rough estimates (how much extra disk space should I plan for?).

Which are the most time-consuming tasks? In my experience with clustering, index/vector creation is the most time-consuming part, with the clustering itself second. Does anyone have data on how long a clustering job takes?

Thanks,
--
Ioan Eugen Stan
http://ieugen.blogspot.com/
-
Re: clustering hardware requirements
Grant Ingersoll 2011-11-18, 17:44

On Nov 16, 2011, at 9:39 PM, Ioan Eugen Stan wrote:

> Hello,
>
> I have to figure out how much hardware is required to do clustering
> for my company on about 10+ million user accounts, each with 100-5000
> documents. The documents will be indexed so vector creation will be
> done at indexing.
> Is there any formula to approximate the size of the vectors based on
> the index size? I'm looking for rough estimates (how much disk extra
> space should I consider?).

I'll try in the next few days to track down the numbers from running the stuff in my recent IBM article: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/. Or, you can go run them yourself!

Otherwise, I don't know that we have any formula just yet. I suspect that once you reach a certain number of documents, your dictionary will stop growing, more or less. Then it is just a question of how many vectors you have and how sparse they are. You could probably guess at this by looking at the average number of words per message in your email collection. Naturally, attachments may skew this if you are including them.

> Which are the most time consuming tasks? From my experience with
> clustering, the index/vector creation part is the most time consuming,
> while clustering being the second. Does anyone have some data on how
> much time a clustering job takes?

That has been my experience, too. Seq2Sparse is often the long part. I suspect one could get it done a lot faster in Lucene. SequenceFilesFromDirectory is also slow, but that is inherently sequential.

I haven't explored yet what it would mean to use Encoded vectors in Clustering, but perhaps I can call Ted to the front of the class and see if he has thoughts on whether that even makes sense, as that would give you a fixed-size Vector.

-Grant
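
The estimate described above (number of vectors times sparseness) can be sketched as a back-of-envelope calculation. This is only an illustration, not a Mahout formula: the function name, the 12 bytes per non-zero entry (a 4-byte index plus an 8-byte double), the 32-byte per-vector overhead, and the 150 distinct terms per document are all assumptions to be replaced with measured values.

```python
def estimate_vector_bytes(num_docs, avg_unique_terms_per_doc,
                          bytes_per_entry=12, per_vector_overhead=32):
    """Rough on-disk size of a collection of sparse term vectors.

    bytes_per_entry and per_vector_overhead are assumed values,
    not figures measured from any Mahout output.
    """
    per_vector = avg_unique_terms_per_doc * bytes_per_entry + per_vector_overhead
    return num_docs * per_vector


if __name__ == "__main__":
    accounts = 10_000_000            # "10+ million user accounts" from the question
    for docs_per_account in (100, 5000):   # low and high end from the question
        total_docs = accounts * docs_per_account
        # 150 distinct terms per document is a placeholder assumption.
        size = estimate_vector_bytes(total_docs, avg_unique_terms_per_doc=150)
        print(f"{docs_per_account} docs/account: ~{size / 1e12:.1f} TB uncompressed")
```

The dictionary itself is left out because, as suggested above, it should stop growing once enough documents have been seen; compression would also reduce the real figure.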
-
Re: clustering hardware requirements
Ted Dunning 2011-11-18, 18:04

It is a great idea except that the centroids become harder to interpret.

Not much harder. Just a bit harder.

On Fri, Nov 18, 2011 at 9:44 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> I haven't explored yet what it would mean to use Encoded vectors in
> Clustering, but perhaps I can call Ted to the front of the class and see if
> he has thoughts on whether that even makes sense, as that would give you a
> fixed size Vector.
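
To illustrate why a fixed-size encoded vector makes centroids a bit harder to interpret, here is a minimal sketch of the hashing trick in plain Python. It is not Mahout's encoder implementation; the function names are hypothetical and the MD5-based hash is chosen only because it is deterministic. Each term is hashed to a slot, so the vector length is fixed, but a slot no longer maps back to a single word unless a reverse table is kept.

```python
import hashlib
from collections import defaultdict


def hash_index(term, cardinality):
    """Map a term to a slot in [0, cardinality) with a deterministic hash."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cardinality


def encode(tokens, cardinality=5000):
    """Build a fixed-size term-count vector; colliding terms share a slot."""
    vec = [0.0] * cardinality
    for token in tokens:
        vec[hash_index(token, cardinality)] += 1.0
    return vec


def slot_to_terms(vocabulary, cardinality=5000):
    """Reverse table (slot -> terms) that helps when reading a centroid."""
    reverse = defaultdict(set)
    for term in vocabulary:
        reverse[hash_index(term, cardinality)].add(term)
    return reverse


if __name__ == "__main__":
    doc = "mahout clustering needs vectors and vectors need a dictionary".split()
    vec = encode(doc)
    table = slot_to_terms(set(doc))
    for slot, weight in enumerate(vec):
        if weight:
            print(slot, weight, sorted(table[slot]))
```

A heavy slot in a centroid then has to be looked up in the reverse table, and any terms that collided in that slot cannot be told apart, which is the "bit harder" part.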
-
Re: clustering hardware requirements
Ioan Eugen Stan 2011-11-21, 08:57

> I'll try in the next few days to track down the numbers from running the stuff in my recent IBM article: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/. Or, you can go run them yourself!

I think posting some reference data for the jobs would be great. I will have something to compare against once I have something done. In the meantime I will try to get a quick and dirty implementation working, see how things go, and post my findings. This could take a while, as I depend on some modifications.

> Otherwise, I don't know that we have any formula just yet. I suspect that once you reach a certain number of documents, your dictionary will stop growing, more or less. Then, it is just a question of how many vectors you have and the sparseness. This probably could be guessed at by looking at what the average number of words are in your email collection. Naturally, attachments may skew this if you are including them.

I also suspect that things will level off asymptotically after a certain number of documents; it remains to be seen where that threshold is.

> That has been my experience, too. Seq2Sparse is often the long part. I suspect one could get it done a lot faster in Lucene. SequenceFilesFromDirectory is also slow, but that is inherently sequential.

I will be able to use a map-reduce job to create the vectors, or just create them as an indexing step, so I hope this step will not count towards the effective clustering time.

> I haven't explored yet what it would mean to use Encoded vectors in Clustering, but perhaps I can call Ted to the front of the class and see if he has thoughts on whether that even makes sense, as that would give you a fixed size Vector.
>
> -Grant

I don't know about encoded vectors yet; I hope to get some more info on them from Mahout in Action. If they do what I think they do, I will definitely try them, and probably complain on the list (Ted) if I can't interpret them right :).

Thanks for the reply,
--
Ioan Eugen Stan
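
One way to find where the dictionary levels off is to measure it on a sample of the collection. The sketch below is only an illustration: the function name is hypothetical, and the whitespace tokenizer is a stand-in assumption for whatever analyzer the index actually uses.

```python
def vocabulary_growth(documents, checkpoint=1000):
    """Return (documents_seen, distinct_terms) pairs at regular checkpoints.

    Tokenization is a plain lowercase whitespace split, which is an
    assumption; a real index would use its own analyzer.
    """
    seen = set()
    growth = []
    count = 0
    for count, text in enumerate(documents, start=1):
        seen.update(text.lower().split())
        if count % checkpoint == 0:
            growth.append((count, len(seen)))
    # Record the final point if the last document was not on a checkpoint.
    if count % checkpoint != 0:
        growth.append((count, len(seen)))
    return growth


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "a quick dog", "the fox"]
    for docs_seen, terms in vocabulary_growth(sample, checkpoint=2):
        print(docs_seen, terms)
```

If the distinct-term count flattens between checkpoints, the dictionary size can be treated as roughly constant and only the vectors need to be scaled with the number of documents.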
-
Re: clustering hardware requirements
Lance Norskog 2011-11-22, 02:32

Ioan- when you understand them, please explain them here:

https://cwiki.apache.org/confluence/display/MAHOUT/Data+Formats

On Mon, Nov 21, 2011 at 12:57 AM, Ioan Eugen Stan <[EMAIL PROTECTED]> wrote:

> I don't know about encoded vectors yet, I hope to get some more info on
> them from Mahout in Action. If they do what I think they do, I will
> definitely try them, and probably complain on the list (Ted) if I can't
> interpret them right :).
>
> Thanks for the reply,
>
> --
> Ioan Eugen Stan

--
Lance Norskog
[EMAIL PROTECTED]
-
Re: clustering hardware requirements
Ioan Eugen Stan 2011-11-22, 09:30

On 22.11.2011 04:32, Lance Norskog wrote:

> Ioan- when you understand them, please explain them here:
>
> https://cwiki.apache.org/confluence/display/MAHOUT/Data+Formats

Ok, I'll remember that.
-
Re: clustering hardware requirements
Grant Ingersoll 2011-11-22, 12:23

Here are some numbers using http://aws.amazon.com/datasets/7791434387204566, running locally:

Raw content size: 9.2 GB, 48K "items" (note: most of the files are gzipped).

It took 15 minutes to convert all of these to sequence files on a single-CPU i7 with 4 cores and hyper-threading, a 3.4 GHz machine with 16 GB of RAM.

After converting to sequence files: 40 GB, 659 items.

Encoded Vectors (see build-asf-email.sh), cardinality = 5000: 11 GB for 1,300 items. This took 83 minutes to convert.

Splitting into test and train took 9 minutes for SGD. I had to kill the SGD job due to some issues I'm having on my machine with CPU temperature (SGD really cranks on the CPU and something is messed up on my machine) that I need to track down.

For clustering, converting to sequence files took about the same time. The job to convert to vectors took a while (it scrolled out of my window). The resulting tfidf-vecs were 7.8 GB.

Dictionary:
  82865442 2011-11-21 17:46 dictionary.file-0*
  83269191 2011-11-21 17:46 dictionary.file-1*
  10963133 2011-11-21 17:46 dictionary.file-2*

Freq files:
  37160153 2011-11-21 22:35 frequency.file-0*
  37160173 2011-11-21 22:35 frequency.file-1*
  37160173 2011-11-21 22:35 frequency.file-2*
  31407713 2011-11-21 22:35 frequency.file-3*

Total dir size for seq2sparse:
  du -s seq2sparse/
  30923564 seq2sparse/

More as they become available.

HTH,
Grant

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
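
The figures in this message can be turned into a few ratios for comparison with another data set. The sketch below only does arithmetic on the numbers quoted above; it assumes the first column of the file listings is a size in bytes, and it leaves out the `du -s` total because its block size is not stated.

```python
# File sizes as listed in the message (assumed to be bytes).
dictionary_bytes = [82865442, 83269191, 10963133]
frequency_bytes = [37160153, 37160173, 37160173, 31407713]

# Sizes and times reported in the message.
raw_gb = 9.2             # raw content, mostly gzipped
seqfile_gb = 40.0        # after conversion to sequence files
tfidf_gb = 7.8           # resulting tfidf-vecs
encoded_gb = 11.0        # encoded vectors, cardinality = 5000
seqfile_minutes = 15     # raw content -> sequence files
encoded_minutes = 83     # conversion to encoded vectors

dict_total = sum(dictionary_bytes)
freq_total = sum(frequency_bytes)

if __name__ == "__main__":
    print(f"dictionary files: {dict_total / 1e6:.1f} MB")
    print(f"frequency files:  {freq_total / 1e6:.1f} MB")
    print(f"tfidf-vecs / sequence files: {tfidf_gb / seqfile_gb:.3f}")
    print(f"encoded / sequence files:    {encoded_gb / seqfile_gb:.3f}")
    print(f"sequence-file conversion: {raw_gb / seqfile_minutes:.2f} GB of raw input per minute")
    print(f"encoded-vector conversion: {encoded_minutes / seqfile_minutes:.1f}x the sequence-file time")
```

These ratios describe one run on one machine and one data set, so they are a reference point rather than a sizing formula.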