Dealing with Categorical/Nominal Data
Before doing any data modeling, we need the data prepared. Most models require data to be numeric and properly scaled, so the data needs to be stripped of all irregularities that could lead to wrong results.
Data preparation includes operations such as replacing missing or incorrect values, transforming and/or encoding data, and sometimes even removing data altogether.
This brings us to the problem of having categorical/nominal data in your data set.
Categorical data are variables that contain label values rather than numeric values.[1]
Having such data in your data set means that some machine learning (ML) algorithms will not work optimally, or at all.
For example, natural language processing (NLP) algorithms do well with categorical data, but K-means clustering will work poorly because it relies on the Euclidean distance between points; the K-modes algorithm is used instead when clustering categorical data [2].
So what do we do with datasets that contain categorical data?
Well, there is no definitive answer; it all boils down to the ML algorithm used. But if we're to use an algorithm that doesn't work with categorical data, the solution to the problem is encoding the categorical/nominal data.
There are a few ways to encode categorical/nominal data, and there is no single best-for-everything way.
Here is the sample table of categorical values we'll be using in all examples:
Fruit |
---|
Apple |
Banana |
Pear |
Peach |
Plum |
Dummy Coding
As you can see from our sample table, we have one categorical/nominal column (Fruit), so we need to encode the Fruit column in some way that represents its values with numbers instead of textual data.
The dummy coding approach includes two steps:
- For all but one label (one label needs to be a reference, and in the example we've chosen "Apple" as the reference label), add another column (feature) and name it something like "IS_xxx", where xxx is one of the labels (see table below)
- Now, set the newly added column values to either 1 or 0 based on the "Fruit" label (see table below)
Fruit | IS_Banana | IS_Pear | IS_Peach | IS_Plum |
---|---|---|---|---|
Apple | 0 | 0 | 0 | 0 |
Banana | 1 | 0 | 0 | 0 |
Pear | 0 | 1 | 0 | 0 |
Peach | 0 | 0 | 1 | 0 |
Plum | 0 | 0 | 0 | 1 |
This type of coding is generally used for encoding mutually exclusive variables such as sex (male/female), race (Caucasian/African-American/Asian), religion (Christianity/Islam) and so on.
Having a row of all 0s isn't really a problem here: because of the mutual exclusivity, the n-th value can be determined from the n-1 encoded values.
In our example, if the variable is not Banana, Pear, Peach or Plum, then it has to be Apple, which is why n-1 additional features are enough.
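A minimal pandas sketch of the two steps above, assuming the data sits in a single "Fruit" column; with `drop_first=True`, `pd.get_dummies` drops the alphabetically first category (here "Apple"), which conveniently matches our choice of reference label:

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Pear", "Peach", "Plum"]})

# drop_first=True removes the first category ("Apple"), making it the
# implicit reference label, so we end up with n-1 IS_xxx columns.
dummies = pd.get_dummies(df["Fruit"], prefix="IS", drop_first=True).astype(int)
print(dummies)
```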
Now, as you might have noticed, this approach, though easy and elegant, is not good to use:
- If your data set has a large number of unique categorical values
- If you need contextual and/or semantic information about your categorical values. To humans, pear and apple are similar in concept, but to a machine they are just numerically represented values of a feature. So if you need that "human understanding" of the categorical values, this isn't the encoding you would use.
One-hot encoding
One-hot encoding is similar to dummy coding, the only difference being that a new column is added for each unique categorical value. So for n categorical values, n new features are added.
This is also the most commonly used method for encoding categorical values. Our encoded values would look like the following:
Fruit | IS_Apple | IS_Banana | IS_Pear | IS_Peach | IS_Plum |
---|---|---|---|---|---|
Apple | 1 | 0 | 0 | 0 | 0 |
Banana | 0 | 1 | 0 | 0 | 0 |
Pear | 0 | 0 | 1 | 0 | 0 |
Peach | 0 | 0 | 0 | 1 | 0 |
Plum | 0 | 0 | 0 | 0 | 1 |
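A sketch of the same transformation in pandas; without `drop_first` we keep one column per unique value:

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Pear", "Peach", "Plum"]})

# One IS_xxx column per unique value (n columns in total).
one_hot = pd.get_dummies(df["Fruit"], prefix="IS").astype(int)
print(one_hot)
```

scikit-learn's OneHotEncoder does the same job when you need a transformer that can be fitted once and reused on new data.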
Deviation Coding / Sum Contrast
This encoding is similar to dummy coding, with one difference: the categorical value not extracted as a feature (the reference value) has all values of its vector set to -1.
So if we exclude the IS_Peach feature, we'll get the following encoding table.
Fruit | IS_Apple | IS_Banana | IS_Pear | IS_Plum |
---|---|---|---|---|
Apple | 1 | 0 | 0 | 0 |
Banana | 0 | 1 | 0 | 0 |
Pear | 0 | 0 | 1 | 0 |
Peach | -1 | -1 | -1 | -1 |
Plum | 0 | 0 | 0 | 1 |
This way, in linear regression, we can avoid the multicollinearity problem, a state of high correlation between independent variables (regressors) which, again, can lead to wrong estimates.
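A small sketch building the table above by hand; the column names and the choice of "Peach" as the reference are just those of the example:

```python
import pandas as pd

levels = ["Apple", "Banana", "Pear", "Peach", "Plum"]
reference = "Peach"

# One column per non-reference level; the reference row is all -1s.
columns = [f"IS_{lvl}" for lvl in levels if lvl != reference]
deviation = pd.DataFrame(0.0, index=levels, columns=columns)
for lvl in levels:
    if lvl == reference:
        deviation.loc[lvl] = -1.0
    else:
        deviation.loc[lvl, f"IS_{lvl}"] = 1.0
print(deviation)
```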
Binary Encoding
The idea behind binary encoding is pretty simple:
- apply ordinal encoding,
- convert each label to its binary representation,
- split the binary string into individual bits,
- add as many new features as there are bits in the binary string and store the bits in those features
Dimensionality depends on the cardinality: for a set of 4294967296 categorical values, dimensionality is increased by 32 (a 32-bit binary string).
So, for the table above, encoding would look like this:
Fruit | Bit_2 | Bit_1 | Bit_0 |
---|---|---|---|
Apple | 0 | 0 | 1 |
Banana | 0 | 1 | 0 |
Pear | 0 | 1 | 1 |
Peach | 1 | 0 | 0 |
Plum | 1 | 0 | 1 |
Binary encoding can be used when dimensionality is a problem, in which case it offers accuracy comparable to one-hot encoding [19] but with far lower dimensionality.
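A minimal sketch of the four steps in plain pandas, with 1-based ordinal codes matching the table above (libraries such as category_encoders ship a ready-made BinaryEncoder):

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Pear", "Peach", "Plum"]})

# Step 1: ordinal-encode (1-based, matching the table above).
codes = {v: i + 1 for i, v in enumerate(df["Fruit"].unique())}
ordinal = df["Fruit"].map(codes)

# Steps 2-4: convert each code to a fixed-width binary string and split it
# into one 0/1 feature per bit.
width = max(codes.values()).bit_length()
bits = ordinal.apply(lambda c: [int(b) for b in format(c, f"0{width}b")])
binary = pd.DataFrame(bits.tolist(), index=df["Fruit"],
                      columns=[f"Bit_{width - 1 - i}" for i in range(width)])
print(binary)
```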
Effects/Sum Coding ([4,17])
Similar to one-hot encoding, in sum coding we introduce one additional feature per unique categorical value.
Unlike one-hot, the sum of all values of a feature must amount to 0, and we achieve this by replacing all 0 values of a feature with a negative number so that we get a 0 sum.
So let's start with the one-hot encoded values, which give us the same table as before. As mentioned, the sum of values for each feature must be 0, so we replace the 0 values of each feature with a number calculated using the following expression, where n is the number of unique categorical values:

-1 / (n - 1)

Now we replace the 0 values of each feature with the calculated value (-1/4 for n=5) and we get the final table:
Fruit | IS_Apple | IS_Banana | IS_Pear | IS_Peach | IS_Plum |
---|---|---|---|---|---|
Apple | +1 | -1/4 | -1/4 | -1/4 | -1/4 |
Banana | -1/4 | +1 | -1/4 | -1/4 | -1/4 |
Pear | -1/4 | -1/4 | +1 | -1/4 | -1/4 |
Peach | -1/4 | -1/4 | -1/4 | +1 | -1/4 |
Plum | -1/4 | -1/4 | -1/4 | -1/4 | +1 |
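A sketch of that construction: fill everything with -1/(n-1) and put +1 on the diagonal, so each column sums to exactly 0:

```python
import pandas as pd

levels = ["Apple", "Banana", "Pear", "Peach", "Plum"]
n = len(levels)

# Every 0 from the one-hot table becomes -1/(n-1); with a single +1 per
# column, each column then sums to exactly 0.
effects = pd.DataFrame(-1.0 / (n - 1), index=levels,
                       columns=[f"IS_{lvl}" for lvl in levels])
for lvl in levels:
    effects.loc[lvl, f"IS_{lvl}"] = 1.0
print(effects)
```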
Frequency encoding [5]
This approach encodes the categorical values of a feature to a value between 0 and 1 based on their relative frequency.
For each unique value of FeatureX, we calculate its relative frequency using the following expression:

EncodedFeatureX = NrOfOccurrencesOfFeatureX / TotalNrOfObservations

It's important to note that, unlike the other coding types mentioned here, with frequency encoding it makes no sense to extract the unique categorical values and encode them individually.
Doing that, we'd get the same frequency for each value, which is why we must make sure TotalNrOfObservations comes from the actual data set we're using.
This also means that when the data set changes, by adding or removing observations, the frequencies need to be recalculated.
So let's say we have the following data set with our already established categorical values:
Fruit | Amount |
---|---|
Apple | 4 |
Plum | 5 |
Pear | 12 |
Peach | 2 |
Plum | 8 |
Pear | 6 |
Banana | 7 |
Peach | 8 |
Plum | 4 |
Pear | 6 |
Plum | 7 |
Peach | 8 |
Plum | 4 |
Apple | 3 |
Peach | 3 |
Using the above expression for EncodedFeatureX, we have our encoded values:
Fruit | Frequency |
---|---|
Apple | 2/15 (0.133) |
Banana | 1/15 (0.067) |
Pear | 3/15 (0.2) |
Peach | 4/15 (0.267) |
Plum | 5/15 (0.333) |
A problem arises when we have two or more different categorical values with the same frequency: after encoding, they become indistinguishable.
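A sketch over the sample data set above; pandas' `value_counts(normalize=True)` computes exactly the NrOfOccurrencesOfFeatureX/TotalNrOfObservations ratio:

```python
import pandas as pd

fruits = ["Apple", "Plum", "Pear", "Peach", "Plum", "Pear", "Banana", "Peach",
          "Plum", "Pear", "Plum", "Peach", "Plum", "Apple", "Peach"]
df = pd.DataFrame({"Fruit": fruits})

# Relative frequency of each unique value over the whole data set...
freq = df["Fruit"].value_counts(normalize=True)
# ...mapped back onto every observation.
df["Frequency"] = df["Fruit"].map(freq)
print(freq)
```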
Simple coding
Similar to dummy coding, simple coding uses the principle of comparing each level to a reference level, but the intercepts are what differentiate the two.
The generalized rule for encoding can be seen in the table below (with "Peach" as the reference level).
Fruit | Apple | Banana | Pear | Plum |
---|---|---|---|---|
Apple | (n-1)/n | -1/n | -1/n | -1/n |
Banana | -1/n | (n-1)/n | -1/n | -1/n |
Pear | -1/n | -1/n | (n-1)/n | -1/n |
Peach | -1/n | -1/n | -1/n | -1/n |
Plum | -1/n | -1/n | -1/n | (n-1)/n |
Where n is the total number of unique categorical values (n=5 in our case).
Encoded categorical values:
Fruit | Apple | Banana | Pear | Plum |
---|---|---|---|---|
Apple | 4/5 | -1/5 | -1/5 | -1/5 |
Banana | -1/5 | 4/5 | -1/5 | -1/5 |
Pear | -1/5 | -1/5 | 4/5 | -1/5 |
Peach | -1/5 | -1/5 | -1/5 | -1/5 |
Plum | -1/5 | -1/5 | -1/5 | 4/5 |
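A sketch building the matrix from the rule above, again with "Peach" as the reference and n=5:

```python
import pandas as pd

levels = ["Apple", "Banana", "Pear", "Peach", "Plum"]
reference = "Peach"
n = len(levels)

# Every cell starts at -1/n; each non-reference level then gets (n-1)/n
# in its own column. The reference row stays at -1/n everywhere.
columns = [lvl for lvl in levels if lvl != reference]
simple = pd.DataFrame(-1.0 / n, index=levels, columns=columns)
for lvl in columns:
    simple.loc[lvl, lvl] = (n - 1) / n
print(simple)
```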
Hash Coding
This one is also simple to understand and implement: for each categorical value, calculate its hash and use that instead.
So, for our table we'll use classic MD5 (security is not the point here, and with this small set of categorical values it's unlikely there'll be any collisions) and get the following encodings:
Fruit | Hashed |
---|---|
Apple | 211859036441461519383052513295366044732 |
Banana | 307019281675310238147665428131133803038 |
Pear | 290913666918761818858019567647261110148 |
Peach | 111011883478669060863936919078330512428 |
Plum | 196555295208199796744724520899431816190 |
Note: MD5 outputs a hexadecimal string, so another conversion is needed from hexadecimal to decimal.
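A sketch using Python's standard hashlib; `int(..., 16)` performs the hexadecimal-to-decimal conversion mentioned in the note:

```python
import hashlib
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Pear", "Peach", "Plum"]})

# MD5 yields a hex digest; int(..., 16) converts it to a decimal integer
# like the ones shown in the table above.
df["Hashed"] = df["Fruit"].apply(
    lambda v: int(hashlib.md5(v.encode("utf-8")).hexdigest(), 16))
print(df)
```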
Helmert Coding
Helmert coding works by comparing each level of the categorical value to the mean of the subsequent levels.
Rule table for constructing the contrast matrix using Helmert coding:
Fruit | L_Apple | L_Banana | L_Pear | L_Peach |
---|---|---|---|---|
Apple | (total_levels-current_level)/(total_levels-current_level+1) | 0 | 0 | 0 |
Banana | -1/(total_levels-0) | (total_levels-current_level)/(total_levels-current_level+1) | 0 | 0 |
Pear | -1/(total_levels-0) | -1/(total_levels-1) | (total_levels-current_level)/(total_levels-current_level+1) | 0 |
Peach | -1/(total_levels-0) | -1/(total_levels-1) | -1/(total_levels-2) | (total_levels-current_level)/(total_levels-current_level+1) |
Plum | -1/(total_levels-0) | -1/(total_levels-1) | -1/(total_levels-2) | -1/(total_levels-3) |
So for our categorical values, we'll have:
Fruit | L_Apple | L_Banana | L_Pear | L_Peach |
---|---|---|---|---|
Apple | 4/5 | 0 | 0 | 0 |
Banana | -1/5 | 3/4 | 0 | 0 |
Pear | -1/5 | -1/4 | 2/3 | 0 |
Peach | -1/5 | -1/4 | -1/3 | 1/2 |
Plum | -1/5 | -1/4 | -1/3 | -1/2 |
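A numpy sketch that builds the 5x4 contrast matrix straight from the rule table, with the diagonal at (total_levels-current_level)/(total_levels-current_level+1) and the entries below it at -1/(total_levels-j):

```python
import numpy as np
import pandas as pd

levels = ["Apple", "Banana", "Pear", "Peach", "Plum"]
n = len(levels)

H = np.zeros((n, n - 1))
for j in range(n - 1):
    # Diagonal: (total_levels - current_level) / (total_levels - current_level + 1)
    H[j, j] = (n - (j + 1)) / (n - j)
    # Rows below the diagonal: -1 / (total_levels - j)
    H[j + 1:, j] = -1.0 / (n - j)

helmert = pd.DataFrame(H, index=levels,
                       columns=[f"L_{lvl}" for lvl in levels[:-1]])
print(helmert)
```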
Reverse Helmert Coding
As the name says, we reverse the Helmert coding explained above. Instead of comparing each level of the categorical value to the subsequent levels, the comparison is done against the previous levels, which means the comparison starts with the level-2 categorical value and goes down the list.
Rule table for constructing the contrast matrix using reverse Helmert coding:
Fruit | L_Apple | L_Banana | L_Pear | L_Peach |
---|---|---|---|---|
Apple | -1/(total_levels-3) | -1/(total_levels-2) | -1/(total_levels-1) | -1/(total_levels-0) |
Banana | (current_level-1)/current_level | -1/(total_levels-2) | -1/(total_levels-1) | -1/(total_levels-0) |
Pear | 0 | (current_level-1)/current_level | -1/(total_levels-1) | -1/(total_levels-0) |
Peach | 0 | 0 | (current_level-1)/current_level | -1/(total_levels-0) |
Plum | 0 | 0 | 0 | (current_level-1)/current_level |
Encoded categorical values:
Fruit | L_Apple | L_Banana | L_Pear | L_Peach |
---|---|---|---|---|
Apple | -1/2 | -1/3 | -1/4 | -1/5 |
Banana | 1/2 | -1/3 | -1/4 | -1/5 |
Pear | 0 | 2/3 | -1/4 | -1/5 |
Peach | 0 | 0 | 3/4 | -1/5 |
Plum | 0 | 0 | 0 | 4/5 |
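As the two rule tables suggest, the reverse Helmert matrix is the Helmert matrix flipped along both axes, so a sketch can reuse the construction from the previous snippet:

```python
import numpy as np
import pandas as pd

levels = ["Apple", "Banana", "Pear", "Peach", "Plum"]
n = len(levels)

# The Helmert matrix from the previous sketch...
H = np.zeros((n, n - 1))
for j in range(n - 1):
    H[j, j] = (n - (j + 1)) / (n - j)
    H[j + 1:, j] = -1.0 / (n - j)

# ...flipped along both axes gives the reverse Helmert matrix above.
reverse = pd.DataFrame(np.flip(H), index=levels,
                       columns=[f"L_{lvl}" for lvl in levels[:-1]])
print(reverse)
```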
Forward Difference Coding
In forward difference coding, each level (up to n-1 levels, to be exact) is compared to the mean of the next level. So instead of comparing a level with all subsequent levels as in Helmert coding, the comparison is done only with the next level.
For our table, the general rule looks like this (for n=5):
Fruit | L_Apple | L_Banana | L_Pear | L_Peach |
---|---|---|---|---|
Apple | (n-1)/n | (n-2)/n | (n-3)/n | (n-4)/n |
Banana | -1/n | (n-2)/n | (n-3)/n | (n-4)/n |
Pear | -1/n | -2/n | (n-3)/n | (n-4)/n |
Peach | -1/n | -2/n | -3/n | (n-4)/n |
Plum | -1/n | -2/n | -3/n | -4/n |
If we apply the rule, we'll get the following encoded values:
Fruit | L_Apple | L_Banana | L_Pear | L_Peach |
---|---|---|---|---|
Apple | 4/5 | 3/5 | 2/5 | 1/5 |
Banana | -1/5 | 3/5 | 2/5 | 1/5 |
Pear | -1/5 | -2/5 | 2/5 | 1/5 |
Peach | -1/5 | -2/5 | -3/5 | 1/5 |
Plum | -1/5 | -2/5 | -3/5 | -4/5 |
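A sketch of the general rule: in column j, the rows down to and including the diagonal get (n-j-1)/n and the rows below it get -(j+1)/n:

```python
import numpy as np
import pandas as pd

levels = ["Apple", "Banana", "Pear", "Peach", "Plum"]
n = len(levels)

F = np.zeros((n, n - 1))
for j in range(n - 1):
    F[: j + 1, j] = (n - (j + 1)) / n  # (n-1)/n, (n-2)/n, ... on and above the diagonal
    F[j + 1:, j] = -(j + 1) / n        # -1/n, -2/n, ... below the diagonal

forward = pd.DataFrame(F, index=levels,
                       columns=[f"L_{lvl}" for lvl in levels[:-1]])
print(forward)
```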
Backward Difference Coding
Unlike forward difference coding, here the mean of one level of the categorical value is compared to the mean of the previous level.
So again, the rule for our table is the following:
Fruit | L_Apple | L_Banana | L_Pear | L_Peach |
---|---|---|---|---|
Apple | -(n-1)/n | -(n-2)/n | -(n-3)/n | -(n-4)/n |
Banana | 1/n | -(n-2)/n | -(n-3)/n | -(n-4)/n |
Pear | 1/n | 2/n | -(n-3)/n | -(n-4)/n |
Peach | 1/n | 2/n | 3/n | -(n-4)/n |
Plum | 1/n | 2/n | 3/n | 4/n |
Now we can apply these rules to get our encoded values:
Fruit | L_Apple | L_Banana | L_Pear | L_Peach |
---|---|---|---|---|
Apple | -4/5 | -3/5 | -2/5 | -1/5 |
Banana | 1/5 | -3/5 | -2/5 | -1/5 |
Pear | 1/5 | 2/5 | -2/5 | -1/5 |
Peach | 1/5 | 2/5 | 3/5 | -1/5 |
Plum | 1/5 | 2/5 | 3/5 | 4/5 |
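Comparing the two rule tables shows that the backward-difference matrix is just the forward-difference matrix negated, so a sketch needs only one extra step:

```python
import numpy as np
import pandas as pd

levels = ["Apple", "Banana", "Pear", "Peach", "Plum"]
n = len(levels)

# The forward-difference matrix from the previous sketch...
F = np.zeros((n, n - 1))
for j in range(n - 1):
    F[: j + 1, j] = (n - (j + 1)) / n
    F[j + 1:, j] = -(j + 1) / n

# ...negated gives the backward-difference matrix above.
backward = pd.DataFrame(-F, index=levels,
                        columns=[f"L_{lvl}" for lvl in levels[:-1]])
print(backward)
```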
Which encoding scheme to use?
Short answer: it depends.
Long answer:
Given that a lot of the time we're dealing with two types of categorical data, nominal and ordinal, different types of encodings are recommended for each.
For nominal data (un-ordered groups) it's generally not recommended to use an encoding that gives the data some sort of order, unless you explicitly define that order yourself, in which case you treat it as an ordinal group of values.
So if you have, for example, a variable for shirt color with possible values:
- Red
- Green
- Black
- Yellow
- Orange
You can see that there is no apparent order in the list: no color is better than the rest, and none has any real numerical value; they're used for nothing more than labeling.
Knowing this, adequate encoding schemes would be:
- One-Hot
- Dummy
- Deviation
- Effects
- Simple
- Hashing
- Any other type of encoding, as long as it doesn't introduce some sort of order into an otherwise un-ordered group of values.
In contrast, ordinal data requires some sort of encoding that will retain the ordinality of the variable's values, again, if not explicitly defined otherwise.
Recommended encoding schemes would be:
- Ordinal (ordered integers)
- Binary
- One-Hot
- Frequency
These recommendations are not a must. You can't, and shouldn't, select one encoding and stick with it without knowing your data and what you expect to get as a result.
One example of why it is important to know your data before choosing an encoder is an ordinal variable whose values have different levels of "importance", for example:
- Very good
- Good
- Fine
- Bad
- Very bad
You can have a case where Good is better than Fine, and Fine is better than Bad, but we don't know how much better, whether the difference between neighbouring values is always the same, or whether going from Fine to Good has a smaller impact than going from Good to Very Good.
Knowing these details will help us choose the right encoder for the data.
References
[1] Why One-Hot Encode Data in Machine Learning?, Jason Brownlee, July 28, 2017
[2] http://www.cse.ust.hk/~qyang/537/Papers/huang98extensions.pdf
[3] https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f
[4] http://daljitdhadwal.com/general/an-alternative-to-dummy-variable-coding-in-regression-equations/
[5] https://www.slideshare.net/0xdata/feature-engineering-83511751
[7] http://www.cscu.cornell.edu/news/statnews/stnews72.pdf
[8] http://pbpython.com/categorical-encoding.html
[9] https://winvector.github.io/DataPrep/EN-CNTNT-Whitepaper-Data-Prep-Using-R.pdf
[10] https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12
[11] https://medium.com/airbnb-engineering/designing-machine-learning-models-7d0048249e69
[12] https://www.researchgate.net/publication/278403191_Numerical_Coding_of_Nominal_Data
[13] http://www.saedsayad.com/encoding.htm
[15] https://www.quora.com/What-are-the-main-issues-with-using-one-hot-encoding
[16] https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
[17] https://hlplab.files.wordpress.com/2011/02/codingtutorial.pdf
[18] http://faculty.nps.edu/sebuttre/home/R/contrasts.html
[19] A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers
[21] https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/