Here are three reasons why choosing the right algorithm is crucial for the success of any Machine Learning project.
A brief introduction to Machine Learning
When a programmer needs to create a program that outputs the price of a house based on size, they typically would write an algorithm that, depending on the input (house size), would calculate the output (price).
Sometimes that is not possible because the programmer doesn’t have the information needed to specify how the program should react to certain values. They could take existing data and try to work out rules that react sensibly to all input values, but this is time-consuming and may need to be redone whenever the data changes.
To avoid this, a programmer can write an algorithm that “learns” from the data and delivers the best output it can for whatever input comes in. Writing this kind of learning program is called Machine Learning.
Linear regression is one of the simplest machine learning algorithms. Given a set of data, it tries to find the straight line that has the least total distance to the data points, typically measured as the sum of the squared vertical distances. An example can be seen in figure 2.
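As a minimal sketch of this idea, the following Python snippet fits a straight line with NumPy's `polyfit`. The flat areas and rents are made-up illustration values, not the data behind figure 2, and the helper name `predict_rent` is an assumption of this example.

```python
import numpy as np

# Hypothetical example data: flat area in square metres and monthly rent.
area = np.array([30.0, 45.0, 60.0, 75.0, 90.0])
rent = np.array([450.0, 600.0, 750.0, 900.0, 1050.0])

# A degree-1 polynomial fit is a least-squares straight line:
# rent is approximated by slope * area + intercept.
slope, intercept = np.polyfit(area, rent, deg=1)

def predict_rent(new_area):
    """Predict the rent for a flat of the given area using the fitted line."""
    return slope * new_area + intercept

print(predict_rent(50.0))
```

Because the made-up points lie exactly on a line, the fit recovers that line; with real, noisy data the line would only approximate the points.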
A more general version of linear regression is polynomial fitting. With polynomial fitting, one tries to find a polynomial of a predefined degree that has the least distance to all of the data points. Linear regression is a polynomial fitting of degree 1. An example of polynomial fittings from degree 0 to degree 9 is shown in figure 7.
Another widely used algorithm is the decision tree, which reaches a prediction by asking a sequence of questions about the input. In Germany, police used to buy white German cars. As a result, white cars were unwanted on the used-car market, because there was a high chance that a white car was a former police car. Someone who wanted to sell a used white car would have to answer some questions to figure out what price could be asked. This is shown in figure 1.
Figure 1: Decision Tree example
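To illustrate how a decision tree turns answers into a prediction, here is a small hand-written sketch in Python. The questions, the mileage threshold and the price categories are all hypothetical and are not taken from figure 1. In practice, the questions and thresholds are learned from training data rather than written by hand.

```python
def price_category(is_white, was_police_car, mileage_km):
    """Walk a tiny, hand-written decision tree and return a price category.

    All questions, thresholds and categories here are hypothetical.
    """
    if is_white:
        if was_police_car:
            return "low"
        # White, but known not to be a former police car.
        return "medium" if mileage_km < 100_000 else "low"
    # Cars of other colors are not suspected of being police cars.
    return "high" if mileage_km < 100_000 else "medium"

print(price_category(is_white=True, was_police_car=False, mileage_km=60_000))
```

Each `if` corresponds to one question in the tree, and each `return` corresponds to a leaf.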
K Nearest Neighbors (KNN) is an algorithm that tries to decide which class a data point belongs to. For a given data point, the algorithm calculates the distance to all training data points and selects the K data points that are closest. It then counts how many of those neighbors belong to each class, and the majority class is chosen as the prediction.
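The steps above can be sketched in a few lines of Python with NumPy. This is a minimal illustration that assumes Euclidean distance; the function name `knn_predict` and the sample points are made up for this example.

```python
from collections import Counter

import numpy as np

def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point.
    distances = np.linalg.norm(train_points - new_point, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: two well-separated groups in two dimensions.
train_points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                         [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, np.array([2.0, 2.0])))
```

Note that there is no separate training step: every prediction works directly on the stored training points.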
KMeans is an algorithm that tries to calculate an appropriate center point for each of a predefined number of clusters. It then uses those center points to determine which class should be chosen for a new data point, by selecting the center point that is closest to the new data point.
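A minimal sketch of this procedure in Python with NumPy might look as follows. It assumes Euclidean distance, a fixed number of iterations and random starting centers; the function names and the sample points are made up for this example.

```python
import numpy as np

def kmeans(points, k, iterations=20, seed=0):
    """Return k cluster centers found by repeated assign-and-average steps."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each point to its closest center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assignment = distances.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            members = points[assignment == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers

def assign_cluster(centers, new_point):
    """Return the index of the center closest to new_point."""
    return int(np.linalg.norm(centers - new_point, axis=1).argmin())

# Hypothetical data: two well-separated groups in two dimensions.
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
centers = kmeans(points, k=2)

print(assign_cluster(centers, np.array([1.2, 1.2])))
print(assign_cluster(centers, np.array([8.8, 8.8])))
```

Once the centers are known, classifying a new point only needs one distance calculation per center.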
1. Not every algorithm works with every kind of data
There are Machine Learning algorithms that only work on specific kinds of data.
Take linear regression, for instance. Linear regression tries to find a line that best represents all data points and requires the input data to be ratio data. Ratio means that the distance between 1 and 2 is the same as the distance between 4 and 5, and that 4 is twice as much as 2. Being able to divide and multiply values is crucial for ratio types.
Comparing price and size works well because both are ratio types. This is shown in figure 2, where we compare the rental cost of a flat to the area the flat has.
Figure 2: Linear regression
If we were to compare the color of the flat’s interior with its price using linear regression, it wouldn’t work, since color is not a ratio data type. It is impossible to determine whether red is bigger than green, and saying that green is twice as colorful as red is not a meaningful statement. For linear regression, operating with ratio data is a necessity.
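A short Python sketch shows what goes wrong if colors are forced into numbers anyway. The colors, rents and both numberings are made up for this example; the point is only that the fitted line depends on an arbitrary choice.

```python
import numpy as np

# Hypothetical data: interior color and monthly rent of six flats.
colors = ["red", "green", "blue", "red", "green", "blue"]
rents = np.array([700.0, 900.0, 800.0, 720.0, 880.0, 810.0])

def slope_for_encoding(encoding):
    """Fit a line to rent versus numerically encoded color and return its slope."""
    codes = np.array([encoding[color] for color in colors], dtype=float)
    slope, _intercept = np.polyfit(codes, rents, deg=1)
    return float(slope)

# Two equally arbitrary ways of turning colors into numbers.
slope_a = slope_for_encoding({"red": 1, "green": 2, "blue": 3})
slope_b = slope_for_encoding({"red": 3, "green": 2, "blue": 1})

print(slope_a, slope_b)
```

The two slopes have opposite signs although the data is the same, so the fitted line reflects the arbitrary numbering rather than anything about the colors.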
The decision tree algorithm, on the other hand, can work with any kind of data, so it might be used to relate the price to the color. The drawback is that the resulting prediction is not a linear function: a decision tree outputs one fixed value per branch rather than a value that changes smoothly with the input.
2. Run time complexity grows with the number of data points
Some prediction algorithms take longer to execute the more data points there are in the training data. Take the following fictitious data, which shows age and relative distance to the next big city. Red represents people who own a car, and green represents people who do not own a car.
Figure 3: Distance to City vs Age. Red: People owning a car. Green: People not owning a car
If we want to predict whether a new data point represents a car owner, we might choose the KNN (K Nearest Neighbors) algorithm. This algorithm places the new data point into the data set and searches for the K data points that are nearest to it. This search involves calculating the distance between the new data point and all training data points. The predictions from this method improve as more training data becomes available, but each additional data point in the training set makes a prediction take longer to calculate.
For example, if a prediction with 10 data points takes 1 second, then a prediction with 100 data points might take 10 seconds or more. Picking a different algorithm, like KMeans, might therefore result in faster calculations, especially for large data sets. On the other hand, KMeans works best when the clusters are roughly circular in shape, which is not the case in figure 3.
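The difference in work per prediction can be made visible by counting distance calculations instead of measuring time. The sketch below assumes the straightforward KNN search described above and two already trained KMeans centers; all points are random, made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
new_point = np.array([0.5, 0.5])
centers = rng.random((2, 2))  # two hypothetical, already trained cluster centers

counts = []
for n_training in (10, 100, 1000):
    training_points = rng.random((n_training, 2))  # made-up training data
    # KNN: one distance per training point.
    knn_distances = np.linalg.norm(training_points - new_point, axis=1)
    # KMeans: one distance per cluster center, independent of n_training.
    kmeans_distances = np.linalg.norm(centers - new_point, axis=1)
    counts.append((n_training, knn_distances.size, kmeans_distances.size))

print(counts)  # [(10, 10, 2), (100, 100, 2), (1000, 1000, 2)]
```

The number of distances for KNN grows with the training set, while the number for KMeans stays at the number of clusters.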
3. Low amounts of data bring poor results
Real-life data often follows an underlying mathematical model.
The task for data scientists and Machine Learning is to find that underlying model, and choosing the right algorithm strongly influences the accuracy of any planned prediction.
Take, for instance, the function y = -x² + 5x. Plotting that function produces figure 4.
Figure 4: Blue: y = -x² + 5x
Real-life data is full of noise, so measurements of real-life events always show some variation and deviation. If we apply some noise and allow multiple values for each data point, the resulting data could look like figure 5. The underlying function is still discernible.
Figure 5: Blue: y = -x² + 5x, Green: y = -x² + 5x + noise
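Noisy data of this kind can be simulated in a few lines of Python. The x range from 0 to 5, the number of samples and the normally distributed noise with a standard deviation of 1 are assumptions of this sketch; the article does not state how the data in figure 5 was generated.

```python
import numpy as np

def true_function(x):
    """The underlying function y = -x² + 5x."""
    return -x**2 + 5 * x

rng = np.random.default_rng(42)
x = np.linspace(0.0, 5.0, 200)  # assumed x range and sample count
noise = rng.normal(loc=0.0, scale=1.0, size=x.shape)  # assumed noise model
y_noisy = true_function(x) + noise

print(y_noisy[:5])
```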
In reality, there is often only a limited amount of data available, so the underlying model cannot be seen clearly. Say we reduce the number of data points to 9, as shown in figure 6.
Figure 6: Green: 9 data points selected from “-x² + 5x + noise”
The simplest algorithm for recovering the original function would be polynomial fitting. Polynomial fitting tries to find a polynomial function of a chosen order that comes as close to the data points as possible. Figure 7 shows 10 polynomial fittings for the orders 0 to 9.
Figure 7: Polynomial fittings of order 0 to 9
A comparison of our polynomial fittings with the actual underlying function is represented in figure 8.
Figure 8: Estimations vs actual
Comparing the estimations with the actual function in figure 8, it is easy to see that the fittings of degree two, three, four and five come close to the real underlying function. All other estimations are so far off from the real function that predictions based on them will be wrong.
The problem is that we cannot be sure which function is the original, underlying function. With so few data points, the underlying function is open to interpretation, so any one of the estimated functions could have been the right one.
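A comparison of this kind can be sketched in Python with NumPy's `polyfit`. The nine sample positions, the noise level and the selection of degrees are assumptions of this example, so the numbers will differ from figure 8. The error is measured as the root mean square deviation from the true function, which is only possible here because the true function is known.

```python
import numpy as np

def true_function(x):
    """The underlying function y = -x² + 5x."""
    return -x**2 + 5 * x

rng = np.random.default_rng(1)
# Nine noisy samples, mirroring the small data set in the text.
x_small = np.linspace(0.0, 5.0, 9)
y_small = true_function(x_small) + rng.normal(0.0, 1.0, size=9)

# Dense grid on which each estimation is compared with the true function.
x_dense = np.linspace(0.0, 5.0, 101)

errors = {}
for degree in (0, 1, 2, 3, 5, 8):  # a selection of degrees
    coefficients = np.polyfit(x_small, y_small, deg=degree)
    estimate = np.polyval(coefficients, x_dense)
    deviation = estimate - true_function(x_dense)
    errors[degree] = float(np.sqrt(np.mean(deviation**2)))

for degree, error in errors.items():
    print(degree, round(error, 3))
```

Degrees 0 and 1 cannot follow the curve at all, so their error stays large no matter how the noise falls.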
Increasing the number of data points from 9 to 1000 delivers a completely different picture, as figure 9 shows. Here, the line width of the actual function had to be changed; otherwise, it would not have been distinguishable from the polyfit estimations.
Figure 9: Estimation with 1000 noisy values vs actual
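The effect of the data volume can be checked with a small Python experiment. The helper `fit_error`, the x range and the noise level are assumptions of this sketch and are not taken from the figures.

```python
import numpy as np

def true_function(x):
    """The underlying function y = -x² + 5x."""
    return -x**2 + 5 * x

def fit_error(n_points, degree, seed=0):
    """Fit a polynomial to n_points noisy samples and return its
    root mean square deviation from the true function."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 5.0, n_points)
    y = true_function(x) + rng.normal(0.0, 1.0, size=n_points)
    coefficients = np.polyfit(x, y, deg=degree)
    x_dense = np.linspace(0.0, 5.0, 101)
    deviation = np.polyval(coefficients, x_dense) - true_function(x_dense)
    return float(np.sqrt(np.mean(deviation**2)))

for degree in (2, 5, 8):
    print(degree, fit_error(9, degree), fit_error(1000, degree))
```

With 1000 samples, even the high-degree fittings have little freedom left to chase the noise, so they stay close to the true function.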
This shows that choosing the wrong algorithm for small sets of training data can be harmful to the accuracy of predictions or decisions.
There is no “one-size-fits-all” algorithm. There are many different algorithms, and they must work with training data sets of different types, volumes and accuracy. The job of a data scientist is to choose the algorithm that fits the data and the underlying truths, drawing on experience and professional knowledge.
For more information on Machine Learning, you can contact the team at PortfolioAnalyticsPS@opentext.com.
See the OpenText™ Magellan™ web pages for more information about our AI-powered Analytics platform.