Feature engineering in machine learning - Part 2

This blog is part two of a three-part series and explains some simple feature engineering methods. In part 1 of this series, we covered the different data types in Machine Learning (ML), their mathematical interpretation, how to use them in an algorithm, and a brief introduction to Feature Engineering.

Feature Engineering

Feature engineering is the process of deriving new variables from raw data to improve the predictive power of a machine learning model. It is more than a simple translation of categories such as names or colors into numbers. The following sections cover a collection of engineering approaches that go beyond that translation and address needs such as transforming numbers into categories or filtering out data points because of missing or false data.

Imputation

Missing values are one of the most common problems we encounter while preparing data for a machine learning model. They can arise for various reasons, such as human error, privacy restrictions, or interruptions in the data flow. Whatever the reason, missing values affect the model’s performance. Moreover, most machine learning algorithms don’t support null values for model building.

Data scientists can use the following techniques to fix the missing values problem:

Drop rows/columns

The most common technique is to drop a row or an entire column if most of its data cells contain null values. There is no optimum threshold; choose one based on the dataset, and drop the rows and columns whose share of missing values exceeds it.
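As a minimal sketch of this idea, the snippet below drops columns and then rows whose share of missing values exceeds a cut-off. The toy dataset, the column names and the 70% threshold are all assumptions made for illustration; pick a threshold that suits your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
data = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan, np.nan],
    "income": [50000, 62000, np.nan, 58000, 61000, np.nan],
    "notes":  [np.nan, np.nan, np.nan, "ok", np.nan, np.nan],
})

threshold = 0.7  # assumed cut-off: drop anything with more than 70% missing values

# Share of missing values per column; keep columns at or below the threshold.
col_missing = data.isnull().mean()
data_reduced = data.loc[:, col_missing <= threshold]

# Share of missing values per row, computed on the remaining columns.
row_missing = data_reduced.isnull().mean(axis=1)
data_reduced = data_reduced.loc[row_missing <= threshold]

print(data_reduced)
```

Here the "notes" column and the last, completely empty row are removed, while rows with only one missing cell are kept for imputation.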

Numerical Imputation

If there are only a few null values, dropping the affected rows or columns does not make much sense, as it may lead to a loss of information. If the data column is numeric, the most common way to impute a missing value is to use the mean or median of the column, depending on whether the data is continuous or discrete; the median is also the more robust choice when the column contains outliers. In some cases, when a column only holds two values like 1 or NA, you can impute NA with 0.

For example, to impute NA or missing values with 0, we can use the sample Python code below.

data = data.fillna(0)

Another option is to impute missing values with the median of the same column, as shown below.

data = data.fillna(data.median(numeric_only=True))

Categorical Imputation

For categorical data, we can impute missing values with the mode of that column. The mode is simply the most frequent category in the column.
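A short sketch of mode imputation with pandas could look like this. The "color" column and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
data = pd.DataFrame({"color": ["Red", "Blue", np.nan, "Red", "Green", np.nan]})

# mode() ignores missing values and returns a Series (there can be ties),
# so we take its first entry as the fill value.
most_frequent = data["color"].mode()[0]
data["color"] = data["color"].fillna(most_frequent)

print(data)
```

If several categories are tied for most frequent, this picks the first one in sorted order, which may or may not be what you want.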

Transformations

Data transformation converts raw data into a more meaningful format that is ready for analysis. It improves data quality, which is imperative for accurate analysis. In the following sections, we explore the different kinds of encoding or transformation techniques needed for different data types.

Label Encoding

ML algorithms work well with numeric data, but in reality you often get a mix of categorical and numerical data. There are two types of categorical data:

Nominal – Data that contains a set of unique values that don’t have any ordered relationship, for example, “Sex”.
Ordinal – Data whose values are sorted in some meaningful order, for example, “Level of Education: Kindergarten, Undergraduate, Bachelor, Master, Doctoral”.

Label encoding maps categories to numbers. Although it can encode both nominal and ordinal data, it works best with ordinal data, because the assigned numbers can preserve the order of the categories. This technique is also useful when you are working with tree-based algorithms such as decision trees.

Here is an example of label encoding for “Color Names” in Table 1 and “Education Level” in Table 2.

Table 1

For nominal data, the numbers chosen to represent the actual values are assigned randomly or alphabetically and do not carry any order.

Table 2 is an example of label encoding of an ordinal feature. As “Education Level” has a meaningful order, UnderGrad < Bachelor < Masters < Doctoral, it should be encoded so that the values retain their ordinal sequence (e.g. 1 < 2 < 3).

Table 2
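Both cases can be sketched in pandas as follows. The sample rows are invented, and the integer codes 1 to 4 for “Education Level” are an assumed mapping; any increasing sequence would preserve the order equally well.

```python
import pandas as pd

# Hypothetical sample data for the two features discussed above.
data = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],
    "education": ["Bachelor", "UnderGrad", "Doctoral", "Masters"],
})

# Nominal feature: category codes are assigned alphabetically
# (Blue=0, Green=1, Red=2) and carry no meaningful order.
data["color_encoded"] = data["color"].astype("category").cat.codes

# Ordinal feature: an explicit mapping keeps
# UnderGrad < Bachelor < Masters < Doctoral.
education_order = {"UnderGrad": 1, "Bachelor": 2, "Masters": 3, "Doctoral": 4}
data["education_encoded"] = data["education"].map(education_order)

print(data)
```

Writing the ordinal mapping by hand is a deliberate choice here: an automatic encoder would sort the labels alphabetically and lose the intended order.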

Count Encoding

Count encoding replaces each unique category with the number of times it appears in the dataset. For example, suppose “Red” appears 35 times and “Blue” 17 times in the feature “Color Names”; then each “Red” is encoded as 35 and each “Blue” as 17.
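A minimal sketch of count encoding with pandas is shown below, using a much smaller made-up column than the 35/17 example above.

```python
import pandas as pd

# Hypothetical "color" feature: Red appears 3 times, Blue 2 times, Green once.
data = pd.DataFrame({"color": ["Red", "Blue", "Red", "Red", "Blue", "Green"]})

# value_counts() returns the number of occurrences per category,
# and map() replaces every category with its count.
counts = data["color"].value_counts()
data["color_count"] = data["color"].map(counts)

print(data)
```

Note that two categories with the same frequency receive the same code, so the model can no longer tell them apart from this feature alone.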

One Hot Encoding

If the data is nominal, the preferred way of transforming it into numerical data is one hot encoding rather than label encoding. The reason is that label encoding assigns an integer to each nominal value and thereby implies an order, whereas one hot encoding keeps the representation unordered. It creates a new column for each unique category in a categorical variable. Each observation receives a 1 in the column for its corresponding category and a 0 in all other new columns.
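One way to sketch this is with pandas' get_dummies function. The "color" column and the "color" prefix for the new columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical nominal feature.
data = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# One new 0/1 column per unique category: color_Blue, color_Green, color_Red.
one_hot = pd.get_dummies(data["color"], prefix="color", dtype=int)

# Attach the new columns to the original data.
data = pd.concat([data, one_hot], axis=1)

print(data)
```

Every row has exactly one 1 across the new columns, so no order between the colors is implied.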

Timestamp data

Sometimes a feature arrives as a timestamp, and it’s difficult to extract any meaningful information from it unless you transform it first. One approach is to extract the “day”, “month” and “year” information from the timestamp and use that for your analysis.
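The extraction can be sketched with pandas as follows. The column name "timestamp" and the two sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical timestamp feature stored as text.
data = pd.DataFrame({"timestamp": ["2020-01-15 08:30:00", "2021-07-04 17:45:10"]})

# Parse the strings into datetime values, then pull out the parts we need.
data["timestamp"] = pd.to_datetime(data["timestamp"])
data["day"] = data["timestamp"].dt.day
data["month"] = data["timestamp"].dt.month
data["year"] = data["timestamp"].dt.year

print(data)
```

The same .dt accessor also exposes other parts, such as the hour or the day of the week, if they are relevant to your analysis.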

In the next blog of this series, we will cover a few more feature engineering techniques.

See the OpenText™ Magellan™ website for more information about our AI-powered Analytics platform, Magellan, and check out the AI & Analytics Services pages for more details on what Professional Services can offer.

Author: Vikram Singh, Data Scientist, Professional Services – Center of Excellence
