Feature engineering in machine learning – part 3

In the first part of this series, we covered the different types of data in machine learning, their mathematical interpretation, and how to use them in an algorithm. In the second part, we covered some simple feature engineering techniques such as imputation and transformation. Let’s cover a few more in the following section.

Binning

Some algorithms, like Naive Bayes, work with classes. Naive Bayes calculates the probability of a certain event happening given certain input values. Both the inputs to a Naive Bayes classifier and the prediction it produces need to be nominal.

Naive Bayes estimates the probability of an outcome based on how often the combination of input value and prediction class was encountered during training. This, in turn, means that applying Naive Bayes to anything continuous can cause problems. Take the following example:

Trying to predict a person’s age based on height highlights the problem, because both “height” and “age” are continuous, so there is an infinite number of possible height and age combinations. To handle this issue, binning comes into play. When applying binning, we define bins into which we group ranges of values, resulting in a limited number of bins. In our example, “age” could be binned by year, so a person who is exactly 20 years old would be placed into the “20-year-old” bin, as would a person who is 20 years and 5 months old.
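As a minimal sketch (with hypothetical ages and purely illustrative bin edges), both a per-year binning and a coarser range-based binning might look like this:

```python
# Hypothetical ages; both binning schemes below are illustrative.
ages = [20.0, 20.42, 37.5]

# Fine-grained: one bin per whole year of age.
year_bins = [int(a) for a in ages]  # a 20-year-old and a person aged
                                    # 20 years 5 months share the "20" bin

# Coarser: bins spanning 25-year ranges like 0-24, 25-49, 50-74, ...
def range_bin(age, width=25):
    """Label the bin an age falls into, given a fixed bin width in years."""
    lower = int(age // width) * width
    return f"{lower}-{lower + width - 1}"

print(year_bins)                     # [20, 20, 37]
print([range_bin(a) for a in ages])  # ['0-24', '0-24', '25-49']
```

The bin width is exactly the kind of choice that business knowledge or experimentation should drive.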

Binning can be done with fine granularity, like a person’s exact age in years, but bins can also span 25 years each, i.e. 0 to 24 years, 25 to 49 years, 50 to 74 years and so on. How binning is done strongly depends on the values and the data one wants to bin over. Sometimes business knowledge suggests an effective way of binning, but it is still wise to try different binning schemes to find the most effective one. Well-chosen bins make the model more robust and help prevent overfitting.

Normalization

Some algorithms internally calculate the distance between data points. This distance is not the “physical” distance measured in miles or kilometers but a mathematical distance calculated in feature space.
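A quick sketch of such a mathematical distance, using plain Euclidean distance on two illustrative two-dimensional points:

```python
import math

# Two hypothetical data points in a two-dimensional feature space.
point_a = (3.0, 4.0)
point_b = (0.0, 0.0)

# Euclidean distance: sqrt((3-0)^2 + (4-0)^2)
distance = math.dist(point_a, point_b)
print(distance)  # 5.0
```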

Let’s take a look at a real estate example (Figure 1), where flat prices (in currency units) are shown on the X-axis in thousands (’000) and flat sizes (in sqft) are shown on the Y-axis in hundreds (’00). If the data points are not normalized, it is difficult to compare distances between data points because the two axes use different units and scales.

The points chosen in the figure are those with indexes 1, 3 and 36 (indexes start at 0, so the first element has index 0).

The distance between points 1 and 3 is represented as the blue line in the figure while the distance between points 1 and 36 is represented as the red line.

The numbers in Table 1 are the result of calculating the lengths of the red and blue lines. The length of the red line (the distance between points 36 and 1) is 17 times that of the blue line (the distance between points 3 and 1).

Table 1

This is not visible in the figure because the aspect ratio of the axes does not correspond to the absolute value ranges.

Aligning the aspect ratio to the values in the figure would result in Figure 2. The proportions in that figure are still not exact due to technical limitations, but they convey the idea: the flat price clearly dominates the flat size in the distance calculation. This creates a feature imbalance, since one feature has a much larger influence on the distance than the other.
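With hypothetical flat values (not the actual figure data), this dominance is easy to see: a modest price gap swamps a large size gap in the raw Euclidean distance.

```python
import math

# Hypothetical flats as (price in currency units, size in sqft).
# Prices are in the hundreds of thousands, sizes in the hundreds.
flat_a = (250_000, 900)
flat_b = (260_000, 400)

price_gap = abs(flat_a[0] - flat_b[0])  # 10_000
size_gap = abs(flat_a[1] - flat_b[1])   # 500

# sqrt(10_000^2 + 500^2) -- almost entirely driven by the price gap.
distance = math.dist(flat_a, flat_b)
print(distance)
```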

To circumvent this problem, normalization is used: the price and the size are each translated onto a scale from 0 to 1. Each feature is treated separately; 0 represents the minimal value and 1 the maximal value encountered. Every other value is mapped to a corresponding value between 0 and 1, preserving the relative distances between the values. That means if the smallest value is 0, the biggest value is 100, and there are two other values in between, namely 25 and 60, then the result would be the following:

Table 2
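This min-max scaling can be sketched in a few lines; the values 0, 25, 60 and 100 are taken from the example above:

```python
# Min-max normalization: map each value onto [0, 1], where 0 is the
# minimum and 1 the maximum encountered in the feature.
values = [0, 25, 60, 100]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)  # [0.0, 0.25, 0.6, 1.0]
```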

Applying this method to the flat size and price features and redrawing the figure results in Figure 3. Comparing the red and blue distances now shows the red distance to be the shorter one, because flat size and flat price now have equal influence on the distance. Table 3 shows the distances computed with the normalized values.

Table 3
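Putting the two steps together with hypothetical flat data (not the actual points from the figure), one can sketch how per-feature min-max scaling changes which distance comes out larger:

```python
import math

# Hypothetical flats as (price, size in sqft); illustrative values only.
flats = [(250_000, 500), (255_000, 2_000), (300_000, 550)]

# Raw distances: the price axis dominates completely.
raw_a = math.dist(flats[0], flats[1])  # small price gap, big size gap
raw_b = math.dist(flats[0], flats[2])  # big price gap, small size gap

# Min-max normalize each feature separately, then recompute.
prices = [p for p, _ in flats]
sizes = [s for _, s in flats]

def norm(v, col):
    """Scale a value onto [0, 1] relative to its feature column."""
    return (v - min(col)) / (max(col) - min(col))

scaled = [(norm(p, prices), norm(s, sizes)) for p, s in flats]
norm_a = math.dist(scaled[0], scaled[1])
norm_b = math.dist(scaled[0], scaled[2])

print(raw_a < raw_b)    # True: the price gap dominated the raw distance
print(norm_a > norm_b)  # True: after scaling, the size gap counts equally
```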

Such engineered features (rather than the original ones) can be used with any algorithm and can improve accuracy significantly.

The fundamental methods above can be beneficial in the feature engineering process. It is also important to keep in mind that these techniques are not magical tools: if your data set is small or of poor quality, feature engineering alone will not fix that. The rule of thumb remains “garbage in, garbage out”.