Feature engineering in machine learning - An overview

This post is the first in a three-part series that gives a glimpse into the feature engineering part of a data scientist’s daily work.

When comparing machine learning to traditional software development, the main difference is who defines the rules. In traditional software development, the developer decides manually which actions the software must take and under what conditions. In machine learning, the software “learns” rules from the available data. The algorithms that find those rules are called machine learning algorithms, and once an algorithm has been trained on data and has learned the necessary rules, the result is called a model. But before the algorithms can learn, the data they consume has to be put into numeric form.

Data types in machine learning

Once the data is formatted numerically, the algorithm starts learning by organizing, comparing, and ranking all the pieces before it goes on to look for more sophisticated patterns. How values can be compared or ranked depends on the scale they are measured on. There are four major scales (as described in 1946 by Harvard psychologist Stanley Smith Stevens): nominal, ordinal, interval, and ratio.

Nominal

Nominal data are values that cannot be ranked against each other. A good example is the names of cities. Ranking two city names is (strictly speaking) not possible since neither name is “more” or “better”; all you can say is whether two names are the same.

Ordinal

Ordinal data are data that can be ranked by order. An example is chili peppers whose degree of spiciness is labeled “mild,” “hot” or “volcano.” These values fall into an obvious order, but it is not possible to tell exactly how much hotter each one is than the one before.
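As a rough illustration of what "ranked by order" means in practice, the sketch below compares the spiciness labels by their position in an ordered list. The names SPICINESS_ORDER, rank and is_hotter are made up for this example.

```python
# Hypothetical ordering of the spiciness labels from the example above,
# from least to most spicy.
SPICINESS_ORDER = ["mild", "hot", "volcano"]


def rank(label):
    """Return the position of a label in the ordering (0 = least spicy)."""
    return SPICINESS_ORDER.index(label)


def is_hotter(a, b):
    """Return True if label a ranks above label b."""
    return rank(a) > rank(b)


print(is_hotter("volcano", "mild"))
print(sorted(["volcano", "mild", "hot"], key=rank))

# The ranks only encode order. rank("volcano") - rank("hot") equals 1, but
# that does not mean "volcano" is one unit hotter than "hot".
```

The ranks are only useful for sorting and comparing; doing arithmetic on them would claim more than the labels actually say.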

Interval

Interval data can be compared like ordinal data, but the distance between two data points is also measurable, on a scale with equally sized intervals. Examples include calendar dates and temperature in degrees Celsius. The Scoville scale, which rates the heat of chili peppers by their concentration of a chemical called capsaicin, has measurable intervals too. A mild Anaheim chile might have a rating of 500 Scoville units, an ancho chile might have a rating of 2,000, and a Thai pepper, 100,000. The differences are meaningful: the ancho is 1,500 units hotter than the Anaheim. Because the Scoville scale also starts at a true zero (no capsaicin at all), it supports even stronger statements: the ancho is four times as spicy as the Anaheim pepper, and the Thai pepper is 200 times as spicy! Strictly speaking, that makes it ratio data, described next.

Ratio

Ratio data is the same as interval data, with one difference: ratio data has a true zero point, a value that means none of the measured quantity is present. 0°C, for instance, is not a true zero (it does not mean there is no temperature); temperatures can go below freezing as well as above it. Length, on the other hand, is a true ratio-scaled value: something with a length of 0 has no length at all!
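To see why the zero point matters, here is a small sketch that compares two temperatures in degrees Celsius (no true zero) and in kelvins (a true zero at absolute zero). The kelvin scale and the two sample temperatures are not part of the discussion above; they are brought in only for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius (interval scale) to kelvins (ratio scale)."""
    return celsius + 273.15


warm_c, cool_c = 20.0, 10.0  # sample temperatures, chosen arbitrarily
warm_k = celsius_to_kelvin(warm_c)
cool_k = celsius_to_kelvin(cool_c)

# Differences agree on both scales, so subtraction is meaningful on either.
diff_c = warm_c - cool_c
diff_k = warm_k - cool_k

# Ratios disagree: 20 / 10 is 2 in Celsius, but roughly 1.035 in kelvins.
# "Twice as warm" is therefore not a meaningful statement about Celsius values.
ratio_c = warm_c / cool_c
ratio_k = warm_k / cool_k

print("difference:", diff_c, "vs", round(diff_k, 2))
print("ratio:", ratio_c, "vs", round(ratio_k, 3))
```

The difference survives the change of scale while the ratio does not, which is exactly the distinction between interval and ratio data.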

Mathematical interpretation of data types

These different data types or scales determine the kinds of mathematical operations algorithms can use on them. These mathematical operations are the next step towards a full machine learning analysis that will spot the sorts of patterns humans find useful.

Nominal data can only be compared to one another to see if they’re the same. In mathematical terms the “=” operation can be used.
Ordinal data can be compared not just for equality but also by size. In mathematical terms the “=” and the “<” as well as the “>” operators can be used on them.
Interval data can also be added or subtracted. In mathematical terms, the “=”, “<”, “>”, “+” and “-” operators can be used.
Ratio data can undergo all the previous operations plus multiplication and division. The “=”, “<”, “>”, “+”, “-”, “×” and “/” operators can be used.
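The list above can be captured in a small lookup table. This is only a sketch for illustration: the names ALLOWED_OPERATORS and is_meaningful are hypothetical, and "*" stands in for the multiplication sign.

```python
# Operators that are meaningful on each measurement scale, as listed above.
# Each scale keeps every operator of the scale before it and adds new ones.
ALLOWED_OPERATORS = {
    "nominal": {"="},
    "ordinal": {"=", "<", ">"},
    "interval": {"=", "<", ">", "+", "-"},
    "ratio": {"=", "<", ">", "+", "-", "*", "/"},
}


def is_meaningful(scale, operator):
    """Return True if the operator makes sense for data on the given scale."""
    return operator in ALLOWED_OPERATORS[scale]


print(is_meaningful("nominal", "<"))
print(is_meaningful("interval", "-"))
print(is_meaningful("interval", "/"))
print(is_meaningful("ratio", "/"))
```

A check like this could run before a pipeline applies an operation to a column, for instance to flag an attempt to average city codes.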

A sample algorithm and its data type requirements

Here’s an example of how the data types and operations we discussed above fit together to perform sophisticated analytics.

Linear Regression

The linear regression function tries to fit a straight line through multiple data points, in order to identify how an input variable and an output variable are related and how they’re trending. For that, it requires interval or ratio data, since the calculations that determine the position and angle of the line use sums, differences, multiplications and divisions, which are meaningless on nominal or ordinal values.

Let’s look at the graphic below. Each green point represents a house with a certain size and price in a given market. (House size is an example of an input variable data scientists choose for the algorithm; price is the value it learns to predict.)

graph showing linear regression

Linear regression takes all those points and tries to find the best straight line that “fits” them. (In this case, it shows that trying to get even a little more living space will cost a lot more money.)
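For readers who want to see the arithmetic, here is a minimal sketch of a straight-line fit using ordinary least squares, one common way to define the "best" line. The house sizes and prices are invented for this example, and fit_line is a hypothetical helper rather than a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means, and squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data: house sizes in square metres, prices in thousands.
sizes = [50, 80, 100, 120, 150]
prices = [150, 230, 310, 350, 460]

slope, intercept = fit_line(sizes, prices)

# Use the fitted line to estimate the price of a 110 square metre house.
estimate = slope * 110 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, estimate={estimate:.1f}")
```

Every step relies on subtraction, multiplication or division, which is why the inputs need to be numeric values with measurable distances.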

What is Feature Engineering?

Feature engineering is the process of creating features from raw data that make machine learning algorithms work. Done correctly, it increases the predictive power and accuracy of machine learning algorithms. Deriving better features requires subject matter expertise, which often adds more value than a fancy algorithm.

Features

In this context, “features” refers to mathematical representations of data, not the everyday sense of specific things a piece of software can do. Since machine learning algorithms are mathematical functions that cannot work directly on text, word-based variables have to be represented by numbers. City names are a good example. For instance, for a retail brand owning stores in New York, Paris and Sydney, the “City Name” data could be represented this way as a feature:

City Name	Numeric Value
New York	1
Paris	2
Sydney	3

Table 1

So when we give the algorithm the input “1” as the value for “City Name,” we really mean “New York.” The numbers assigned to the cities are just labels here – “features” in the machine learning sense – and are treated the same way as the underlying names. Because the originating data was nominal, the feature is also nominal; you can’t subtract 1 from 2 any more than you can subtract New York from Paris.
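The mapping in Table 1 can be sketched in a few lines. The names CITY_TO_CODE, encode and decode are hypothetical and used only for this illustration.

```python
# The mapping from Table 1: each city name gets an arbitrary numeric label.
CITY_TO_CODE = {"New York": 1, "Paris": 2, "Sydney": 3}
CODE_TO_CITY = {code: city for city, code in CITY_TO_CODE.items()}


def encode(cities):
    """Replace each city name with its numeric label."""
    return [CITY_TO_CODE[city] for city in cities]


def decode(codes):
    """Replace each numeric label with its city name."""
    return [CODE_TO_CITY[code] for code in codes]


stores = ["Paris", "New York", "Sydney", "Paris"]
codes = encode(stores)
print(codes)
print(decode(codes))

# Equality is the only meaningful comparison on these labels:
print(codes[0] == codes[3])  # both entries are Paris
# codes[2] - codes[1] would compute, but "Sydney minus New York" means nothing.
```

Nothing in the code stops someone from subtracting the labels, so it is up to the data scientist to remember that the feature is still nominal.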

These ground rules have implications for which algorithms can be used and how to use them with the data at hand. For example, when the value to predict is nominal (categorical), the task is generally a classification problem, which makes Naive Bayes classification, decision trees, random forests, kNN, or an ensemble possible options.

In the next blog of this series, we will cover some simple feature engineering techniques.

See the OpenText™ Magellan™ web pages for more information about our AI-powered Analytics platform, Magellan, and check out the AI & Analytics Services pages for more details on what Professional Services can offer.

OpenText

