This blog is part two of a three-part series and explains simple feature engineering methods. In part 1 of this series, we covered the different data types in Machine Learning (ML), their mathematical interpretation, how to use them in an algorithm, and a brief introduction to feature engineering.
Feature engineering is the process of extracting new variables by transforming raw data to improve the predictive power of a machine learning model. It is more than a simple translation of categories such as names or colors into numbers. The following sections cover a collection of engineering approaches that go beyond that translation and address needs such as transforming numbers into categories or filtering data points because of missing or incorrect data.
Missing values are one of the most common problems we encounter while preparing data for a machine learning model. They can arise for various reasons such as human error, privacy restrictions, or interruptions in the data flow. Whatever the reason, missing values affect the model’s performance. Moreover, most machine learning algorithms don’t support null values when building a model.
Data scientists can use the following techniques to fix the missing values problem:
The most common technique is to drop a row or an entire column if most of its data cells contain null values. There is no optimal threshold; choose one based on the dataset, then drop the rows and columns whose share of missing values exceeds it.
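The threshold rule can be sketched in pandas as follows. This is a minimal illustration, not a recommended setting: the toy DataFrame, the column names, and the 70% cut-off are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; np.nan / None mark missing values.
data = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "income": [np.nan, np.nan, np.nan, 52000],
    "city": ["Pune", None, None, "Mumbai"],
})

# Assumed cut-off: drop anything with more than 70% missing values.
threshold = 0.7

# Share of missing values per column; keep columns at or below the threshold.
col_missing = data.isnull().mean()
data = data.loc[:, col_missing <= threshold]

# Share of missing values per row (over the remaining columns); same rule.
row_missing = data.isnull().mean(axis=1)
data = data.loc[row_missing <= threshold]

print(data)
```

Here the "income" column (75% missing) and the second row (all remaining cells missing) are dropped, while rows with only one missing cell are kept for imputation.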
If there are only a few null values, dropping the affected rows or columns does not make much sense because it may lead to a loss of information. If the data column is numeric, the most common way to impute a missing value is with the mean or median, depending on whether the data is continuous or discrete. In some cases, when a column only has two values such as 1 or NA, you can impute NA with 0.
For example, to impute NA or missing values with 0, we can use the sample Python code below.
data = data.fillna(0)
Another option is to impute missing values with the median of the same column, as shown below.
# numeric_only=True skips non-numeric columns, which have no median
data = data.fillna(data.median(numeric_only=True))
For categorical data, we can impute missing values with the mode of the column. The mode is simply the most frequent category in the column, as seen in the example below where the value shows NA.
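A minimal sketch of mode imputation in pandas, assuming a hypothetical "color" column with one missing entry:

```python
import pandas as pd

# Hypothetical categorical column with one missing value.
data = pd.DataFrame({"color": ["Red", "Blue", None, "Red", "Green"]})

# mode() returns a Series because ties are possible; take the first entry.
most_frequent = data["color"].mode()[0]

# Replace the missing entries with the most frequent category.
data["color"] = data["color"].fillna(most_frequent)

print(data)
```

If several categories tie for most frequent, taking the first entry is an arbitrary choice, so it is worth checking the result of `mode()` before imputing.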
Data transformation converts raw data into a more meaningful format that is ready for analysis. It helps ensure data quality, which is essential for accurate analysis. In the following section, we explore the different kinds of encoding or transformation techniques needed for different data types.
ML algorithms work well with numeric data but in reality, you can get a mix of categorical and numerical data. There are two types of categorical data:
- Nominal – Data that contains a set of unique values that don’t have any ordered relationship, for example, “Sex”.
- Ordinal – This type of data is a special case of a categorical feature. Ordinal data values are sorted in some meaningful order, for example, “Level of Education: Kindergarten, Undergraduate, Bachelor, Master, Doctoral”.
Label encoding maps categories to numbers. Although it can encode both nominal and ordinal data, label encoding works best with ordinal data. It is also useful when you are working with tree-based algorithms such as decision trees.
Here is an example of label encoding for “Color Names” in Table 1 and “Education Level” in Table 2.
For nominal data, the numbers chosen to represent the actual values are arbitrary (random or alphabetical, for example) and do not carry any order.
Table 2 is an example of label encoding for an ordinal feature. Because “Education Level” has a meaningful order such as UnderGrad < Bachelor < Masters < Doctoral, it should be encoded so that the values retain their ordinal sequence (e.g. 1 < 2 < 3, etc.).
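Both cases can be sketched in pandas. The DataFrame and the integer values below are illustrative assumptions rather than the contents of the tables referenced above: the nominal column gets alphabetical codes, while the ordinal column uses an explicit mapping so the order is preserved.

```python
import pandas as pd

# Hypothetical data with one nominal and one ordinal feature.
data = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],
    "education": ["Masters", "UnderGrad", "Doctoral", "Bachelor"],
})

# Nominal: categories are sorted alphabetically (Blue=0, Green=1, Red=2).
# The codes are just identifiers and imply no ranking.
data["color_label"] = data["color"].astype("category").cat.codes

# Ordinal: spell out the order explicitly so that
# UnderGrad < Bachelor < Masters < Doctoral survives the encoding.
education_order = {"UnderGrad": 1, "Bachelor": 2, "Masters": 3, "Doctoral": 4}
data["education_label"] = data["education"].map(education_order)

print(data)
```

An explicit mapping is used for the ordinal column because an automatic alphabetical encoding would place "Bachelor" before "UnderGrad" and lose the intended order.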
Count encoding replaces each unique category with the number of times it appears in the dataset. For example, if “Red” appears 35 times and “Blue” 17 times in the feature “Color Names”, then every “Red” is encoded as 35 and every “Blue” as 17.
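A small sketch of count encoding in pandas, using a hypothetical "color" column with far fewer rows than the example above:

```python
import pandas as pd

# Hypothetical nominal feature.
data = pd.DataFrame({"color": ["Red", "Blue", "Red", "Green", "Red", "Blue"]})

# Number of occurrences of each category in the column.
counts = data["color"].value_counts()

# Replace every category with its frequency.
data["color_count"] = data["color"].map(counts)

print(data)
```

Note that two categories with the same frequency receive the same code, so they become indistinguishable after encoding.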
One Hot Encoding
If the data is nominal, the preferred way of transforming it into numbers is one hot encoding rather than label encoding. The reason is that label encoding assigns an integer to each nominal value and thereby implies an order, whereas one hot encoding keeps the representation unordered. It creates a new column for each unique category in a categorical variable. Each observation receives a 1 in the column for its corresponding category and a 0 in all other new columns, as shown below.
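One way to do this in pandas is `pd.get_dummies`. The sketch below assumes a hypothetical "color" column; the `color_` prefix for the new columns is just an illustrative naming choice.

```python
import pandas as pd

# Hypothetical nominal feature.
data = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# One new 0/1 column per unique category: color_Blue, color_Green, color_Red.
one_hot = pd.get_dummies(data["color"], prefix="color", dtype=int)

# Attach the indicator columns to the original data.
data = pd.concat([data, one_hot], axis=1)

print(data)
```

Each row has exactly one 1 across the new columns, so no category is treated as larger or smaller than another.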
Sometimes a feature arrives as a timestamp, and it’s difficult to extract meaningful information from it unless you transform it into a more usable form. One approach is to extract the “day”, “month” and “year” from the timestamp and use those for your analysis.
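As a minimal sketch, assuming the timestamps arrive as strings in a hypothetical "timestamp" column, the components can be extracted with the pandas `.dt` accessor:

```python
import pandas as pd

# Hypothetical raw timestamps stored as strings.
data = pd.DataFrame({"timestamp": ["2021-03-15 08:30:00", "2020-12-01 17:45:10"]})

# Parse the strings into datetime values first.
data["timestamp"] = pd.to_datetime(data["timestamp"])

# Extract the individual components as separate numeric features.
data["day"] = data["timestamp"].dt.day
data["month"] = data["timestamp"].dt.month
data["year"] = data["timestamp"].dt.year

print(data)
```

The same accessor also exposes other parts such as the hour or day of the week, should the analysis need them.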
In the next blog of this series, we will cover a few more feature engineering techniques.
See the OpenText™ Magellan™ website for more information about our AI-powered Analytics platform, Magellan, and check out the AI & Analytics Services pages for more details on what Professional Services can offer.
Author: Vikram Singh, Data Scientist, Professional Services – Center of Excellence