-
Beginner's Question: What is a feature?
Em 2011-05-22, 12:56
Hi list,
I just read Mahout in Action and tried to understand the chapter about classifying data. While reimplementing one of the examples from the book, I got really confused and a little disappointed by the assumptions the author makes about the reader.
There are some lines of code where a variable is in use but you never see where and how it was defined.
So far, my question is: when using the OnlineLogisticRegression algorithm, what is meant by "feature"?
Let's say I have a bunch of data in CSV format. There are the following columns I want to consider for classification: "Key", "Value". Does that mean I have two features?
Thanks, Em
-- View this message in context: http://lucene.472066.n3.nabble.com/Beginner-s-Question-What-is-a-feature-tp2971745p2971745.html
Sent from the Mahout User List mailing list archive at Nabble.com.
-
Re: Beginner's Question: What is a feature?
Jeremy Lewi 2011-05-22, 16:15
Em,
Typically in machine learning, a feature vector is just a vector of numbers that describes the data.
For example, if you are trying to classify images, the features might be a vector of pixel intensities. Or you could process the image to extract higher-level features. For example, you might compute some basic statistics of the pixel intensities for each image (e.g., the mean, max, and min) and then use those summary statistics as the features for each image.
So in your case, if you use key and value as the features, then you have a 2-d feature vector.
Can you describe your data a little more?
J
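The image example above can be made concrete with a small sketch. It is not taken from the book or from Mahout; the function name `summary_features` and the tiny four-pixel "image" are assumptions made up for illustration.

```python
from statistics import mean


def summary_features(pixels):
    """Reduce a flat list of pixel intensities to a 3-element feature vector."""
    # Each summary statistic becomes one element (one feature) of the vector.
    return [mean(pixels), max(pixels), min(pixels)]


# Hypothetical 2x2 grayscale image, flattened row by row.
image = [0, 64, 128, 255]

raw_vector = list(image)                  # one feature per pixel
summary_vector = summary_features(image)  # one feature per statistic

print(len(raw_vector), len(summary_vector))
```

Either vector is a valid set of features for the same image; the choice only changes what the learning algorithm gets to see.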
-
Re: Beginner's Question: What is a feature?
Em 2011-05-22, 17:32
Hi Jeremy,
thank you for your answer. I have no data; I am just trying to understand and learn more about Mahout, since I am a beginner in machine learning.
Mahout in Action says that there are typically four types of features: categorical, word-like, text-like and continuous.
So, let's say I have a descriptive text of 100-200 words (text-like). Does this mean that I have one feature (the description) or 100-200 features (the words)?
The OnlineLogisticRegression class requires me to tell it how many categories there are and how many features I would like to provide. My question now is: if I have a categorical and a text-like feature, do I have to tell the class that I am going to add two features?
What happens if I encode 20 different features into the vector but misconfigure the algorithm by telling it there are only 10 features?
I miss a formula or something like that for the algorithms that are part of Mahout. This would make understanding the different parameters easier, I think. That's what I meant. Hopefully my explanation is better now?
Thank you, Em
-
Re: Beginner's Question: What is a feature?
Ted Dunning 2011-05-22, 17:33
And Em,
Can you describe your target variable in particular, and where you get it from?
On Sun, May 22, 2011 at 9:15 AM, Jeremy Lewi <[EMAIL PROTECTED]> wrote:
> Can you describe your data a little more?
-
Re: Beginner's Question: What is a feature?
Ted Dunning 2011-05-22, 17:43
On Sun, May 22, 2011 at 10:32 AM, Em <[EMAIL PROTECTED]> wrote:
> So, let's say I got a descriptive text of 100-200 words (text-like). Does this mean that I got one feature (the description) or does it mean that I got 100-200 features (the words)?
There is a bit of confusion because the term feature can be used at two points in the process.
At the raw data level, you have one feature that is text-like.
You have to encode this feature, however, as a numerical vector. You can do that in a number of ways, but you can't encode text-like data into a single numerical value. You need to use lots of numerical values to encode it. One way is to give every possible word its own numerical value. Another is the hashed encoding, where you pick the number of numerical values and the hashing encoder fits your data into that choice.
After you encode the data, you are left with a typically sparse Vector. The learning algorithm never sees your original data, just this Vector.
So, from the viewpoint of the learning algorithm, each element of this Vector is a feature.
Unfortunately, this dual use of the term is widespread when people describe supervised machine learning of the kind that the classifiers in Mahout do.
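The two meanings of "feature" can be seen side by side in a minimal sketch of a hashed encoding. This is not Mahout's encoder: the function `hash_encode_text`, the encoder name `"description"` and the use of `zlib.crc32` as the hash are all assumptions chosen so that the example runs with the standard library alone.

```python
import zlib
from collections import defaultdict


def hash_encode_text(text, num_features, name="description"):
    """Encode text as a sparse vector {index: value} with num_features slots."""
    vector = defaultdict(float)
    for word in text.lower().split():
        # Hash the encoder name plus the word to pick a slot in the vector.
        # Different words may collide in the same slot; their counts then add up.
        index = zlib.crc32(f"{name}:{word}".encode("utf-8")) % num_features
        vector[index] += 1.0
    return dict(vector)


# One raw, text-like feature ...
text = "the quick brown fox jumps over the lazy dog"

# ... becomes a sparse vector with 10,000 possible elements.
sparse = hash_encode_text(text, num_features=10_000)
print(len(sparse), "non-zero elements out of 10000")
```

At the raw level there is a single feature (the text). After encoding, the learner sees a vector of 10,000 elements, most of them zero, and treats each element as a feature.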
> The OnlineLogisticRegression class requires me to tell it how many categories there are and how many features I would like to provide.
Categories refer to the target variable. You have to say how many possible values the target can take.
The number of features given here is *after* encoding. Your text variable would probably be encoded into a Vector of size 10,000-1,000,000, so this size is what you should give the OnlineLogisticRegression.
> My question now is, if I got a categorical and a text-like feature, do I have to tell the class that I am going to add two features?
With the hashed encoding, you would create two encoders with different types and names. Pick an output vector size that is pretty big (100,000 should do). Then use each encoder with the corresponding data.
> What happens if I encode 20 different features into the vector but misconfigure the algorithm by telling it there were only 10 features?
You would have 20 different encoders and a Vector of some size.
If you give the learning algorithm a wrong-sized Vector, it should immediately complain. If it doesn't, or if it doesn't complain clearly with a good message, file a bug.
> I miss a formula or something like that for the algorithms that are part of Mahout. This would make understanding the different parameters easier, I think.
I think that this is genuinely confusing. Keep going in the book. The next chapters go into more detail on this process.
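The sizing advice in this reply can be sketched as follows. The classes `HashedEncoder` and `TinyLinearModel` are hypothetical stand-ins written for this example; they are not Mahout's encoders or its OnlineLogisticRegression, and the category and word values are invented. The point is only that the model is configured with the size of the encoded vector, not with the number of raw features or encoders.

```python
import zlib

NUM_FEATURES = 100_000   # size of the encoded vector, chosen up front
NUM_CATEGORIES = 3       # number of possible values of the target variable


class HashedEncoder:
    """Hypothetical named hashing encoder (a stand-in, not Mahout's class)."""

    def __init__(self, name):
        self.name = name

    def add_to_vector(self, token, vector, weight=1.0):
        # The encoder name is part of the hash key, so two encoders
        # place the same token in (usually) different slots.
        index = zlib.crc32(f"{self.name}:{token}".encode("utf-8")) % len(vector)
        vector[index] += weight


class TinyLinearModel:
    """Toy learner that only holds a weight table and checks input sizes."""

    def __init__(self, num_categories, num_features):
        self.num_features = num_features
        self.weights = [[0.0] * num_features for _ in range(num_categories)]

    def scores(self, vector):
        if len(vector) != self.num_features:
            raise ValueError(
                f"expected a vector of size {self.num_features}, got {len(vector)}"
            )
        return [sum(w * x for w, x in zip(row, vector)) for row in self.weights]


# Two raw features, two encoders, one shared vector.
category_encoder = HashedEncoder("category")
text_encoder = HashedEncoder("description")

vector = [0.0] * NUM_FEATURES
category_encoder.add_to_vector("electronics", vector)
for word in "small portable radio".split():
    text_encoder.add_to_vector(word, vector)

model = TinyLinearModel(NUM_CATEGORIES, NUM_FEATURES)
print(model.scores(vector))      # all zeros: the toy model is untrained

try:
    model.scores([0.0] * 10)     # wrong-sized vector
except ValueError as err:
    print("rejected:", err)
```

Adding a third or a twentieth encoder would not change `NUM_FEATURES`; every encoder writes into the same fixed-size vector.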
-
Re: Beginner's Question: What is a feature?
Em 2011-05-22, 19:04
Thank you Ted,
your explanations really helped.
Regards, Em
-
Re: Beginner's Question: What is a feature?
Lance Norskog 2011-05-23, 21:00
Wait. I thought a "feature" is an abstract concept for clumps of "meaning" that are found by analyzing the set of "feature vectors" described above.
-- Lance Norskog [EMAIL PROTECTED]
-
Re: Beginner's Question: What is a feature?
Daniel McEnnis 2011-05-24, 00:34
The traditional meaning of feature in machine learning, as I understand it, is an arbitrary piece of information about some object. These features are usually grouped by type into a feature vector, which provides a uniform way to describe any object of the same class.
Daniel.
-
Re: Beginner's Question: What is a feature?
Benson Margulies 2011-05-24, 00:43
On Mon, May 23, 2011 at 8:34 PM, Daniel McEnnis <[EMAIL PROTECTED]> wrote:
> The traditional meaning of feature in machine learning as I understand it is an arbitrary piece of information about some object. These features are usually grouped by type into a feature vector which provides a uniform way to describe any object of the same class.
Except when they aren't. Consider a sequence tagger. There are features, and no vectors.
-
Re: Beginner's Question: What is a feature?
Ted Dunning 2011-05-24, 01:26
On Mon, May 23, 2011 at 5:43 PM, Benson Margulies <[EMAIL PROTECTED]> wrote:
> On Mon, May 23, 2011 at 8:34 PM, Daniel McEnnis <[EMAIL PROTECTED]> wrote:
> > The traditional meaning of feature in machine learning as I understand it is an arbitrary piece of information about some object. These features are usually grouped by type into a feature vector which provides a uniform way to describe any object of the same class.
>
> Except when they aren't. Consider a sequence tagger. There are features, and no vectors.
The features are still conventionally grouped into an array of features called a feature vector. For a sequence tagger, there is a feature vector for each point in the sequence that needs to be tagged. It can include the neighboring sequence elements, neighboring model-generated tags and the phase of the moon. The raw feature vector is often reprocessed in various ways to get the vector of values that actually gets passed to the learning algorithm (for training) or the resulting model (for classification).
I thought Daniel's description was quite good.
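The idea of one feature vector per point in the sequence can be sketched like this. The function `position_features`, the three-word sentence and the tag values are assumptions invented for illustration; no particular tagger works exactly this way.

```python
def position_features(tokens, tags, i):
    """Raw features for position i: the token, its neighbours, the previous tag."""
    return {
        "word": tokens[i],
        "prev_word": tokens[i - 1] if i > 0 else "<START>",
        "next_word": tokens[i + 1] if i + 1 < len(tokens) else "<END>",
        "prev_tag": tags[i - 1] if i > 0 else "<START>",
    }


tokens = ["Ted", "answers", "questions"]
tags = ["NOUN", "VERB", "NOUN"]  # hypothetical model-generated tags

# One raw feature vector per position in the sequence.
rows = [position_features(tokens, tags, i) for i in range(len(tokens))]
for row in rows:
    print(row)
```

Each of these raw rows would still need to be encoded into numbers before a learning algorithm or a trained model could use it.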