Dealing with Categorical/Nominal Data
Before doing any data modeling, we need the data prepared. Most models require data to be numeric and properly scaled, so the data needs to be stripped of all irregularities that could lead to wrong results.
Data preparation includes operations such as replacing missing or incorrect values, transforming and/or encoding data, and sometimes even removing data altogether.
This brings us to the problem of having categorical/nominal data in your data set.
Categorical data are variables that contain label values rather than numeric values.[1]
Having such data in your data set means that some machine learning (ML) algorithms will not work optimally, or at all.
For example, natural language processing (NLP) algorithms do well with categorical data, but K-means clustering will work poorly because it relies on the Euclidean distance between points; the K-modes algorithm is used instead when clustering categorical data [2].
So what do we do with datasets that contain categorical data?
Well, there is no definitive answer; it all boils down to the ML algorithm used. But if we're to use an algorithm that doesn't work with categorical data, the solution to the problem is encoding the categorical/nominal data.
There are a few ways to encode categorical/nominal data, and there is no single best-for-everything way.
Here is the sample table of categorical values we'll be using in all examples:
Fruit |
---|
Apple |
Banana |
Pear |
Peach |
Plum |
Dummy Coding
As you can see from our sample table, we have one categorical/nominal column (Fruit), so we need to encode the Fruit column in some way that represents its values with numbers instead of textual data.
The dummy coding approach includes two steps:
- For all but one label (one label needs to be a reference, and in the example we've chosen "Apple" as the reference label), add another column (feature) and name it something like "IS_xxx", where xxx is one of the labels (see table below)
- Now, set the newly added column values to either 1 or 0 based on the "Fruit" label (see table below)
Fruit | IS_Banana | IS_Pear | IS_Peach | IS_Plum |
---|---|---|---|---|
Apple | 0 | 0 | 0 | 0 |
Banana | 1 | 0 | 0 | 0 |
Pear | 0 | 1 | 0 | 0 |
Peach | 0 | 0 | 1 | 0 |
Plum | 0 | 0 | 0 | 1 |
This type of coding is generally used for encoding mutually exclusive variables such as sex (male/female), race (Caucasian/African-American/Asian), religion (Christianity/Islam) and so on.
Having a row of all 0s isn't really a problem here: because of the mutual exclusivity, the n-th value can be determined from the n-1 encoded values.
In our example, if the variable is not Banana, Pear, Peach or Plum, then it has to be Apple, which is why n-1 additional features are enough.
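A minimal pandas sketch of the two steps above, assuming the data sits in a single "Fruit" column; with `drop_first=True`, `pd.get_dummies` drops the alphabetically first category (here "Apple"), which conveniently matches our choice of reference label:

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Pear", "Peach", "Plum"]})

# drop_first=True removes the first category ("Apple"), making it the
# implicit reference label, so we end up with n-1 IS_xxx columns.
dummies = pd.get_dummies(df["Fruit"], prefix="IS", drop_first=True).astype(int)
print(dummies)
```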
Now, as you might have noticed, this approach, though easy and elegant, is not good to use:
- If your data set has a large number of unique categorical values
- If you need contextual and/or semantic information about your categorical values. To humans, pear and apple are similar in concept, but to a machine they are just numerically represented values of a feature. So if you need that "human understanding" of the categorical values, this isn't the encoding you would use.
One-hot encoding
One-hot encoding is similar to dummy coding, the only difference being that a new column is added for each unique categorical value. So for n categorical values, n new features are added.
This is also the most commonly used method for encoding categorical values. Our encoded values would look like the following:
Fruit | IS_Apple | IS_Banana | IS_Pear | IS_Peach | IS_Plum |
---|---|---|---|---|---|
Apple | 1 | 0 | 0 | 0 | 0 |
Banana | 0 | 1 | 0 | 0 | 0 |
Pear | 0 | 0 | 1 | 0 | 0 |
Peach | 0 | 0 | 0 | 1 | 0 |
Plum | 0 | 0 | 0 | 0 | 1 |
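A sketch of the same transformation in pandas; without `drop_first` we keep one column per unique value:

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Pear", "Peach", "Plum"]})

# One IS_xxx column per unique value (n columns in total).
one_hot = pd.get_dummies(df["Fruit"], prefix="IS").astype(int)
print(one_hot)
```

scikit-learn's OneHotEncoder does the same job when you need a transformer that can be fitted once and reused on new data.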
Deviation Coding / Sum Contrast
This encoding is similar to dummy coding, with one difference: the categorical value not extracted as a feature (the reference value) has all values of its vector set to -1.
So if we exclude the IS_Peach feature, we'll get the following encoding table.
Fruit | IS_Apple | IS_Banana | IS_Pear | IS_Plum |
---|---|---|---|---|
Apple | 1 | 0 | 0 | 0 |
Banana | 0 | 1 | 0 | 0 |
Pear | 0 | 0 | 1 | 0 |
Peach | -1 | -1 | -1 | -1 |
Plum | 0 | 0 | 0 | 1 |
This way, in linear regression, we can avoid the multicollinearity problem, a state of high correlation between independent variables (regressors) which, again, can lead to wrong estimates.
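A small sketch building the table above by hand; the column names and the choice of "Peach" as the reference are just those of the example:

```python
import pandas as pd

levels = ["Apple", "Banana", "Pear", "Peach", "Plum"]
reference = "Peach"

# One column per non-reference level; the reference row is all -1s.
columns = [f"IS_{lvl}" for lvl in levels if lvl != reference]
deviation = pd.DataFrame(0.0, index=levels, columns=columns)
for lvl in levels:
    if lvl == reference:
        deviation.loc[lvl] = -1.0
    else:
        deviation.loc[lvl, f"IS_{lvl}"] = 1.0
print(deviation)
```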
Binary Encoding
The idea behind binary encoding is pretty simple:
- apply ordinal encoding,
- convert each label to its binary representation,
- split the binary string into individual bits,
- add as many new features as there are bits in the binary string and store the bits in those features
Dimensionality depends on the cardinality: for a set of 4294967296 categorical values, dimensionality is increased by 32 (a 32-bit binary string).
So, for the table above, encoding would look like this:
Fruit | Bit_2 | Bit_1 | Bit_0 |
---|---|---|---|
Apple | 0 | 0 | 1 |
Banana | 0 | 1 | 0 |
Pear | 0 | 1 | 1 |
Peach | 1 | 0 | 0 |
Plum | 1 | 0 | 1 |
Binary encoding can be used when dimensionality is a problem, in which case it offers accuracy comparable to one-hot encoding [19] but with far lower dimensionality.
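A minimal sketch of the four steps in plain pandas, with 1-based ordinal codes matching the table above (libraries such as category_encoders ship a ready-made BinaryEncoder):

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Pear", "Peach", "Plum"]})

# Step 1: ordinal-encode (1-based, matching the table above).
codes = {v: i + 1 for i, v in enumerate(df["Fruit"].unique())}
ordinal = df["Fruit"].map(codes)

# Steps 2-4: convert each code to a fixed-width binary string and split it
# into one 0/1 feature per bit.
width = max(codes.values()).bit_length()
bits = ordinal.apply(lambda c: [int(b) for b in format(c, f"0{width}b")])
binary = pd.DataFrame(bits.tolist(), index=df["Fruit"],
                      columns=[f"Bit_{width - 1 - i}" for i in range(width)])
print(binary)
```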
Effects/Sum Coding ([4,17])
Similar to one-hot encoding, in sum coding we introduce one additional feature per unique categorical value.
Unlike one-hot, the sum of all values of a feature must amount to 0, and we achieve this by replacing all 0 values of a feature with a negative number so that we get a 0 sum.
So let's start with the one-hot encoded values, which give us the same table as before. As mentioned, the sum of values for each feature must be 0, so we replace the 0 values of each feature with a number calculated using the following expression, where n is the number of unique categorical values:

-1 / (n - 1)

Now we replace the 0 values of each feature with the calculated value (-1/4 for n=5) and we get the final table:
Fruit | IS_Apple | IS_Banana | IS_Pear | IS_Peach | IS_Plum |
---|---|---|---|---|---|
Apple | +1 | -1/4 | -1/4 | -1/4 | -1/4 |
Banana | -1/4 | +1 | -1/4 | -1/4 | -1/4 |
Pear | -1/4 | -1/4 | +1 | -1/4 | -1/4 |
Peach | -1/4 | -1/4 | -1/4 | +1 | -1/4 |
Plum | -1/4 | -1/4 | -1/4 | -1/4 | +1 |
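A sketch of that construction: fill everything with -1/(n-1) and put +1 on the diagonal, so each column sums to exactly 0:

```python
import pandas as pd

levels = ["Apple", "Banana", "Pear", "Peach", "Plum"]
n = len(levels)

# Every 0 from the one-hot table becomes -1/(n-1); with a single +1 per
# column, each column then sums to exactly 0.
effects = pd.DataFrame(-1.0 / (n - 1), index=levels,
                       columns=[f"IS_{lvl}" for lvl in levels])
for lvl in levels:
    effects.loc[lvl, f"IS_{lvl}"] = 1.0
print(effects)
```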
Frequency encoding [5]
This approach encodes the categorical values of a feature to a value between 0 and 1 based on their relative frequency.
For each unique value of FeatureX, we calculate its relative frequency using the following expression:

EncodedFeatureX = NrOfOccurrencesOfFeatureX / TotalNrOfObservations

It's important to note that, unlike the other coding types mentioned here, with frequency encoding it makes no sense to extract the unique categorical values and encode them individually.
Doing that, we'd get the same frequency for each value, which is why we must make sure TotalNrOfObservations comes from the actual data set we're using.
This also means that when the data set changes, by adding or removing observations, the frequencies need to be recalculated.
So let's say we have the following data set with our already established categorical values:
Fruit | Amount |
---|---|
Apple | 4 |
Plum | 5 |
Pear | 12 |
Peach | 2 |
Plum | 8 |
Pear | 6 |
Banana | 7 |
Peach | 8 |
Plum | 4 |
Pear | 6 |
Plum | 7 |
Peach | 8 |
Plum | 4 |
Apple | 3 |
Peach | 3 |
Using the above expression for EncodedFeatureX, we have our encoded values:
Fruit | Frequency |
---|---|
Apple | 2/15 (0.133) |
Banana | 1/15 (0.067) |
Pear | 3/15 (0.2) |
Peach | 4/15 (0.267) |
Plum | 5/15 (0.333) |
A problem arises when we have two or more different categorical values with the same frequency: after encoding, they become indistinguishable.
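A sketch over the sample data set above; pandas' `value_counts(normalize=True)` computes exactly the NrOfOccurrencesOfFeatureX/TotalNrOfObservations ratio:

```python
import pandas as pd

fruits = ["Apple", "Plum", "Pear", "Peach", "Plum", "Pear", "Banana", "Peach",
          "Plum", "Pear", "Plum", "Peach", "Plum", "Apple", "Peach"]
df = pd.DataFrame({"Fruit": fruits})

# Relative frequency of each unique value over the whole data set...
freq = df["Fruit"].value_counts(normalize=True)
# ...mapped back onto every observation.
df["Frequency"] = df["Fruit"].map(freq)
print(freq)
```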
Simple coding
Similar to dummy coding, simple coding uses the principle of comparing each level to a reference level, but the intercepts are what differentiate the two.
The generalized rule for encoding can be seen in the table below (with "Peach" as the reference level).
Fruit | Apple | Banana | Pear | Plum |
---|---|---|---|---|
Apple | (n-1)/n | -1/n | -1/n | -1/n |
Banana | -1/n | (n-1)/n | -1/n | -1/n |
Pear | -1/n | -1/n | (n-1)/n | -1/n |
Peach | -1/n | -1/n | -1/n | -1/n |
Plum | -1/n | -1/n | -1/n | (n-1)/n |
Where n is the total number of unique categorical values (n=5 in our case).
Encoded categorical values:
Fruit | Apple | Banana | Pear | Plum |
---|---|---|---|---|
Apple | 4/5 | -1/5 | -1/5 | -1/5 |
Banana | -1/5 | 4/5 | -1/5 | -1/5 |
Pear | -1/5 | -1/5 | 4/5 | -1/5 |
Peach | -1/5 | -1/5 | -1/5 | -1/5 |
Plum | -1/5 | -1/5 | -1/5 | 4/5 |
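A sketch building the matrix from the rule above, again with "Peach" as the reference and n=5:

```python
import pandas as pd

levels = ["Apple", "Banana", "Pear", "Peach", "Plum"]
reference = "Peach"
n = len(levels)

# Every cell starts at -1/n; each non-reference level then gets (n-1)/n
# in its own column. The reference row stays at -1/n everywhere.
columns = [lvl for lvl in levels if lvl != reference]
simple = pd.DataFrame(-1.0 / n, index=levels, columns=columns)
for lvl in columns:
    simple.loc[lvl, lvl] = (n - 1) / n
print(simple)
```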
Hash Coding
This one is also simple to understand and implement: for each categorical value, calculate its hash and use that instead.
So, for our table we'll use classic MD5 (security is not the point here, and with this small set of categorical values it's unlikely there'll be any collisions) and get the following encodings:
Fruit | Hashed |
---|---|
Apple | 211859036441461519383052513295366044732 |
Banana | 307019281675310238147665428131133803038 |
Pear | 290913666918761818858019567647261110148 |
Peach | 111011883478669060863936919078330512428 |
Plum | 196555295208199796744724520899431816190 |
Note: MD5 outputs a hexadecimal string, so another conversion is needed from hexadecimal to decimal.
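A sketch using Python's standard hashlib; `int(..., 16)` performs the hexadecimal-to-decimal conversion mentioned in the note:

```python
import hashlib
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Pear", "Peach", "Plum"]})

# MD5 yields a hex digest; int(..., 16) converts it to a decimal integer
# like the ones shown in the table above.
df["Hashed"] = df["Fruit"].apply(
    lambda v: int(hashlib.md5(v.encode("utf-8")).hexdigest(), 16))
print(df)
```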
Helmert Coding
Helmert coding works by comparing each level of the categorical value to the mean of the subsequent levels.
Rule table for constructing the contrast matrix using Helmert coding:
Fruit | L_Apple | L_Banana | L_Pear | L_Peach |
---|---|---|---|---|
Apple | (total_levels-current_level)/(total_levels-current_level+1) | 0 | 0 | 0 |
Banana | -1/(total_levels-0) | (total_levels-current_level)/(total_levels-current_level+1) | 0 | 0 |
Pear | -1/(total_levels-0) | -1/(total_levels-1) | (total_levels-current_level)/(total_levels-current_level+1) | 0 |
Peach | -1/(total_levels-0) | -1/(total_levels-1) | -1/(total_levels-2) | (total_levels-current_level)/(total_levels-current_level+1) |
Plum | -1/(total_levels-0) | -1/(total_levels-1) | -1/(total_levels-2) | -1/(total_levels-3) |
So for our categorical values, we'll have:
Fruit | L_Apple | L_Banana | L_Pear | L_Peach |
---|---|---|---|---|
Apple | 4/5 | 0 | 0 | 0 |
Banana | -1/5 | 3/4 | 0 | 0 |
Pear | -1/5 | -1/4 | 2/3 | 0 |
Peach | -1/5 | -1/4 | -1/3 | 1/2 |
Plum | -1/5 | -1/4 | -1/3 | -1/2 |
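A numpy sketch that builds the 5x4 contrast matrix straight from the rule table, with the diagonal at (total_levels-current_level)/(total_levels-current_level+1) and the entries below it at -1/(total_levels-j):

```python
import numpy as np
import pandas as pd

levels = ["Apple", "Banana", "Pear", "Peach", "Plum"]
n = len(levels)

H = np.zeros((n, n - 1))
for j in range(n - 1):
    # Diagonal: (total_levels - current_level) / (total_levels - current_level + 1)
    H[j, j] = (n - (j + 1)) / (n - j)
    # Rows below the diagonal: -1 / (total_levels - j)
    H[j + 1:, j] = -1.0 / (n - j)

helmert = pd.DataFrame(H, index=levels,
                       columns=[f"L_{lvl}" for lvl in levels[:-1]])
print(helmert)
```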
Reverse Helmert Coding
As the name says, we reverse the Helmert coding explained above. Instead of comparing each level of the categorical value to the subsequent levels, the comparison is done against the previous levels, which means the comparison starts with the level-2 categorical value and goes down the list.
Rule table for constructing the contrast matrix using reverse Helmert coding:
Fruit | L_Apple | L_Banana | L_Pear | L_Peach |
---|---|---|---|---|
Apple | -1/(total_levels-3) | -1/(total_levels-2) | -1/(total_levels-1) | -1/(total_levels-0) |
Banana | (current_level-1)/current_level | -1/(total_levels-2) | -1/(total_levels-1) | -1/(total_levels-0) |
Pear | 0 | (current_level-1)/current_level | -1/(total_levels-1) | -1/(total_levels-0) |
Peach | 0 | 0 | (current_level-1)/current_level | -1/(total_levels-0) |
Plum | 0 | 0 | 0 | (current_level-1)/current_level |
Encoded categorical values:
Fruit | L_Apple | L_Banana | L_Pear | L_Peach |
---|---|---|---|---|
Apple | -1/2 | -1/3 | -1/4 | -1/5 |
Banana | 1/2 | -1/3 | -1/4 | -1/5 |
Pear | 0 | 2/3 | -1/4 | -1/5 |
Peach | 0 | 0 | 3/4 | -1/5 |
Plum | 0 | 0 | 0 | 4/5 |
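As the two rule tables suggest, the reverse Helmert matrix is the Helmert matrix flipped along both axes, so a sketch can reuse the construction from the previous snippet:

```python
import numpy as np
import pandas as pd

levels = ["Apple", "Banana", "Pear", "Peach", "Plum"]
n = len(levels)

# The Helmert matrix from the previous sketch...
H = np.zeros((n, n - 1))
for j in range(n - 1):
    H[j, j] = (n - (j + 1)) / (n - j)
    H[j + 1:, j] = -1.0 / (n - j)

# ...flipped along both axes gives the reverse Helmert matrix above.
reverse = pd.DataFrame(np.flip(H), index=levels,
                       columns=[f"L_{lvl}" for lvl in levels[:-1]])
print(reverse)
```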
Forward Difference Coding
In forward difference coding, each level (up to n-1 levels, to be exact) is compared to the mean of the next level. So instead of comparing a level with all subsequent levels as in Helmert coding, the comparison is done only with the next level.
For our table, the general rule looks like this (for n=5):
Fruit | L_Apple | L_Banana | L_Pear | L_Peach |
---|---|---|---|---|
Apple | (n-1)/n | (n-2)/n | (n-3)/n | (n-4)/n |
Banana | -1/n | (n-2)/n | (n-3)/n | (n-4)/n |
Pear | -1/n | -2/n | (n-3)/n | (n-4)/n |
Peach | -1/n | -2/n | -3/n | (n-4)/n |
Plum | -1/n | -2/n | -3/n | -4/n |
If we apply the rule, we'll get the following encoded values:
Fruit | L_Apple | L_Banana | L_Pear | L_Peach |
---|---|---|---|---|
Apple | 4/5 | 3/5 | 2/5 | 1/5 |
Banana | -1/5 | 3/5 | 2/5 | 1/5 |
Pear | -1/5 | -2/5 | 2/5 | 1/5 |
Peach | -1/5 | -2/5 | -3/5 | 1/5 |
Plum | -1/5 | -2/5 | -3/5 | -4/5 |
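A sketch of the general rule: in column j, the rows down to and including the diagonal get (n-j-1)/n and the rows below it get -(j+1)/n:

```python
import numpy as np
import pandas as pd

levels = ["Apple", "Banana", "Pear", "Peach", "Plum"]
n = len(levels)

F = np.zeros((n, n - 1))
for j in range(n - 1):
    F[: j + 1, j] = (n - (j + 1)) / n  # (n-1)/n, (n-2)/n, ... on and above the diagonal
    F[j + 1:, j] = -(j + 1) / n        # -1/n, -2/n, ... below the diagonal

forward = pd.DataFrame(F, index=levels,
                       columns=[f"L_{lvl}" for lvl in levels[:-1]])
print(forward)
```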
Backward Difference Coding
Unlike forward difference coding, here the mean of one level of the categorical value is compared to the mean of the previous level.
So again, the rule for our table is the following:
Fruit | L_Apple | L_Banana | L_Pear | L_Peach |
---|---|---|---|---|
Apple | -(n-1)/n | -(n-2)/n | -(n-3)/n | -(n-4)/n |
Banana | 1/n | -(n-2)/n | -(n-3)/n | -(n-4)/n |
Pear | 1/n | 2/n | -(n-3)/n | -(n-4)/n |
Peach | 1/n | 2/n | 3/n | -(n-4)/n |
Plum | 1/n | 2/n | 3/n | 4/n |
Now we can apply these rules to get our encoded values:
Fruit | L_Apple | L_Banana | L_Pear | L_Peach |
---|---|---|---|---|
Apple | -4/5 | -3/5 | -2/5 | -1/5 |
Banana | 1/5 | -3/5 | -2/5 | -1/5 |
Pear | 1/5 | 2/5 | -2/5 | -1/5 |
Peach | 1/5 | 2/5 | 3/5 | -1/5 |
Plum | 1/5 | 2/5 | 3/5 | 4/5 |
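Comparing the two rule tables shows that the backward-difference matrix is just the forward-difference matrix negated, so a sketch needs only one extra step:

```python
import numpy as np
import pandas as pd

levels = ["Apple", "Banana", "Pear", "Peach", "Plum"]
n = len(levels)

# The forward-difference matrix from the previous sketch...
F = np.zeros((n, n - 1))
for j in range(n - 1):
    F[: j + 1, j] = (n - (j + 1)) / n
    F[j + 1:, j] = -(j + 1) / n

# ...negated gives the backward-difference matrix above.
backward = pd.DataFrame(-F, index=levels,
                        columns=[f"L_{lvl}" for lvl in levels[:-1]])
print(backward)
```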
Which encoding scheme to use?
Short answer: it depends.
Long answer:
Given that a lot of the time we're dealing with two types of categorical data, nominal and ordinal, different types of encodings are recommended for each.
For nominal data (un-ordered groups) it's generally not recommended to use an encoding that gives the data some sort of order, unless you explicitly define that order yourself, in which case you treat it as an ordinal group of values.
So if you have, for example, a variable for shirt color with possible values:
- Red
- Green
- Black
- Yellow
- Orange
You can see that there is no apparent order in the list: no color is better than the rest, and none has any real numerical value; they're used for nothing more than labeling.
Knowing this, adequate encoding schemes would be:
- One-Hot
- Dummy
- Deviation
- Effects
- Simple
- Hashing
- Any other type of encoding, as long as it doesn't introduce some sort of order into an otherwise un-ordered group of values.
In contrast, ordinal data requires some sort of encoding that will retain the ordinality of the variable's values, again, if not explicitly defined otherwise.
Recommended encoding schemes would be:
- Ordinal (ordered integers)
- Binary
- One-Hot
- Frequency
These recommendations are not a must. You can't, and shouldn't, select one encoding and stick with it without knowing your data and what you expect to get as a result.
One example of why it is important to know your data before choosing an encoder is an ordinal variable whose values have different levels of "importance", for example:
- Very good
- Good
- Fine
- Bad
- Very bad
You can have a case where Good is better than Fine, and Fine is better than Bad, but we don't know how much better, whether the difference between neighbouring values is always the same, or whether going from Fine to Good has a smaller impact than going from Good to Very Good.
Knowing these details will help us choose the right encoder for the data.
References
[1] Why One-Hot Encode Data in Machine Learning?, Jason Brownlee, July 28, 2017
[2] http://www.cse.ust.hk/~qyang/537/Papers/huang98extensions.pdf
[3] https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f
[4] http://daljitdhadwal.com/general/an-alternative-to-dummy-variable-coding-in-regression-equations/
[5] https://www.slideshare.net/0xdata/feature-engineering-83511751
[7] http://www.cscu.cornell.edu/news/statnews/stnews72.pdf
[8] http://pbpython.com/categorical-encoding.html
[9] https://winvector.github.io/DataPrep/EN-CNTNT-Whitepaper-Data-Prep-Using-R.pdf
[10] https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12
[11] https://medium.com/airbnb-engineering/designing-machine-learning-models-7d0048249e69
[12] https://www.researchgate.net/publication/278403191_Numerical_Coding_of_Nominal_Data
[13] http://www.saedsayad.com/encoding.htm
[15] https://www.quora.com/What-are-the-main-issues-with-using-one-hot-encoding
[16] https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
[17] https://hlplab.files.wordpress.com/2011/02/codingtutorial.pdf
[18] http://faculty.nps.edu/sebuttre/home/R/contrasts.html
[19] A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers
[21] https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/