Hello folks!
Recently I've become more interested in machine learning (mainly because I'm taking a machine learning course at my university), and I'll use this thread to post my questions, since the topic can be tough to understand. Let's head on to the first one!
Say you have a huge data file with feature variables and a single target variable. Some features consist of strings, like "b", "c", "t", "p", "g" etc. Since most frameworks can't handle strings directly, you have to encode them as numbers. The simplest option is to use a LabelEncoder, which maps each distinct string to an integer. This works fine, but only if the string values have a natural order. In my example, they are abbreviations for "baby", "child", "teen", "parent" and "grandparent", so you want to label them in that order: baby=0, child=1, teen=2 and so on. However, if the values have no such order, the learning model might still assume one because the integers are ordered, so you have to use something else. That's where the OneHotEncoder comes in: it splits a categorical feature into one 0/1 column per distinct value.
My question is: when should you use an OHE? I asked my teacher, and he replied with "First get the correlation between the feature and the target variable when only using LE. Then, try out OHE and see if the correlation improves. If so, use OHE, and if not, continue using LE". The problem is this: how do you get the correlation between the feature and the target variable after using OHE? OHE splits one feature into X columns, where X is the number of distinct values of the feature, and I'm not sure how to get a single correlation number from those X columns.
Sorry if this doesn't make sense, here is the original question:
Encode the features - there are a lot of categorical features with char values. You'll need to use LabelEncoder and OneHotEncoder for them. Also, see if you need to use OneHotEncoder for all of them - for example by checking the correlations between a LabelEncoded feature and the target variable before and after OneHot encoding. (OneHot encoding doesn't make any sense if your variable actually represents some incremental, hierarchical relationship - and we don't know it for our dataset).
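To make the problem concrete, here is a minimal sketch of what I've tried so far. The data is made up, and the encoders are hand-rolled in plain Python rather than scikit-learn's LabelEncoder/OneHotEncoder, just to keep it self-contained. For the single number after OHE, one candidate I've come across is the correlation ratio (eta), which as far as I understand equals the multiple correlation you get when regressing the target on all the one-hot columns together. I'm not sure this is what my teacher meant, so treat it as an assumption.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def correlation_ratio(categories, ys):
    """Correlation ratio (eta): sqrt(between-group SS / total SS).

    Each category is replaced by the mean target of its group, which is
    what a linear fit on the full set of one-hot columns would predict.
    """
    mean_y = sum(ys) / len(ys)
    groups = {}
    for c, y in zip(categories, ys):
        groups.setdefault(c, []).append(y)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - mean_y) ** 2 for g in groups.values()
    )
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return sqrt(ss_between / ss_total)

# Made-up example data: life stage codes and some numeric target.
stage = ["b", "c", "t", "p", "g", "b", "c", "t", "p", "g"]
target = [1.0, 2.0, 3.5, 4.0, 5.5, 1.5, 2.5, 3.0, 4.5, 5.0]

# "Label encoding" with an explicit, meaningful order (baby < child < ...).
order = {"b": 0, "c": 1, "t": 2, "p": 3, "g": 4}
label_encoded = [order[s] for s in stage]

# "One-hot encoding": one 0/1 column per distinct value.
levels = sorted(set(stage))
one_hot = [[1 if s == lvl else 0 for lvl in levels] for s in stage]

# One correlation number per encoding.
r_label = abs(pearson(label_encoded, target))
eta = correlation_ratio(stage, target)

# Per-column correlations are also possible, but that gives X numbers, not one.
per_column = {
    lvl: pearson([row[i] for row in one_hot], target)
    for i, lvl in enumerate(levels)
}

print("label-encoded |r|:", round(r_label, 3))
print("one-hot eta:      ", round(eta, 3))
print("per-column r:     ", {k: round(v, 3) for k, v in per_column.items()})
```

One thing that bothers me: eta can never come out lower than the label-encoded correlation, because a straight line through the label values is just one of the fits the one-hot columns can represent. So with this measure the correlation would always "improve" (or stay equal) after OHE, and I'm not sure the comparison is fair.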
Thanks in advance, and stay tuned for more questions!