Machine Learning for Java Developers, Part 1
Objective Function Estimation
Recall that the target function hθ, also known as the prediction function, is the result of the preparation or training process. Mathematically, the challenge is to find a function that takes a variable x as input and returns the predicted value y.
In machine learning, a cost function (J(θ)) is used to calculate the error value or "cost" of a given objective function.
The cost function shows how well the model fits the training data. To determine the cost of the objective function shown above, it is necessary to calculate the squared error of each example house (i). The error is the distance between the calculated value of y and the real value of y for the example house i.
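Written out as a formula, the cost described here can be sketched as follows. This form is inferred from the Java implementation in Listing 3, where m is the number of training examples; the factor 1/2 is a common convention that simplifies the derivative and does not change where the minimum lies.

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\big(x^{(i)}\big) - y^{(i)} \right)^2
```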
For example, the real price of a house with an area of 1330 is €6,500,000, while the price predicted by the trained objective function is €7,032,478: the difference (or error) is €532,478. You can also see this difference in the graph above, where the error is shown as a vertical dashed red line for each price-area training pair. To calculate the cost of the trained objective function, you need to sum the squared errors for each house in the example and calculate the mean value. The smaller the cost value (J(θ)), the more accurate the predictions of our objective function will be. Listing 3 shows a simple Java implementation of a cost function that takes as input an objective function, a list of training data, and the labels associated with them. The predicted values are calculated in a loop, and each error is calculated by subtracting the real price value (taken from the label). The squared errors are then summed and the mean error is calculated. The cost is returned as a value of type double:
Listing-3
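public static double cost(Function<Double[], Double> targetFunction,
                          List<Double[]> dataset,
                          List<Double> labels) {
    int m = dataset.size();
    double sumSquaredErrors = 0;

    // calculate the squared error ("gap") for each training example and add it to the total
    for (int i = 0; i < m; i++) {
        // get the feature vector of the current example
        Double[] featureVector = dataset.get(i);
        // predict the value and compute the error against the real value (label)
        double predicted = targetFunction.apply(featureVector);
        double label = labels.get(i);
        double gap = predicted - label;
        sumSquaredErrors += Math.pow(gap, 2);
    }

    // return half the mean of the squared errors (the smaller, the better)
    return (1.0 / (2 * m)) * sumSquaredErrors;
}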
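To make the calculation concrete, here is a small, self-contained sketch that repeats the computation from Listing 3 and applies it to three made-up training examples. The class name CostExample, the target function h(x) = 1 + 2·x1 and the data values are illustrative assumptions and are not taken from the house price dataset.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class CostExample {

    // Same computation as Listing 3: half the mean of the squared errors.
    static double cost(Function<Double[], Double> targetFunction,
                       List<Double[]> dataset,
                       List<Double> labels) {
        int m = dataset.size();
        double sumSquaredErrors = 0;
        for (int i = 0; i < m; i++) {
            double gap = targetFunction.apply(dataset.get(i)) - labels.get(i);
            sumSquaredErrors += gap * gap;
        }
        return (1.0 / (2 * m)) * sumSquaredErrors;
    }

    public static void main(String[] args) {
        // Hypothetical target function h(x) = 1 + 2 * x1; x[0] is the constant 1.0.
        Function<Double[], Double> h = x -> 1.0 * x[0] + 2.0 * x[1];

        // Made-up training data: three examples with one real feature each.
        List<Double[]> dataset = Arrays.asList(
                new Double[] { 1.0, 1.0 },
                new Double[] { 1.0, 2.0 },
                new Double[] { 1.0, 3.0 });
        List<Double> labels = Arrays.asList(3.0, 5.0, 8.0);

        // Predictions are 3, 5 and 7, so only the last example has an error (-1).
        // The cost is therefore 1 / (2 * 3), about 0.1667.
        System.out.println(cost(h, dataset, labels));
    }
}
```

A cost of zero would mean that the function reproduces every label exactly; here a single error of 1 on one of three examples already shows up in the result.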
Learning the target function
Although the cost function helps evaluate the quality of the objective function and theta parameters, you still need to find the most suitable theta parameters. You can use the gradient descent algorithm for this.
Gradient Descent
Gradient descent minimizes the cost function. This means that it is used to find the theta parameters that have the minimum cost (J(θ)) based on the training data. Here's a simplified algorithm for calculating new, more appropriate theta values:
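The update rule is not reproduced in this text. A form that agrees term by term with the Java code in Listing 4 (an inference from that code, not a copy of the original figure) is the following, applied to every parameter θj using the theta values from the previous iteration:

```latex
\theta_j := \theta_j - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\big(x^{(i)}\big) - y^{(i)} \right) x_j^{(i)}
```

Here m is the number of training examples, α is the learning rate, and x_j^{(i)} is the j-th feature of the i-th example.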
With each iteration of the algorithm, the parameters of the theta vector improve. The learning rate α controls the size of the step taken at each iteration. These calculations are repeated until "good" theta values are found. For example, the linear regression function below has three theta parameters:
At each iteration, a new value is calculated for each of the theta parameters: θ0, θ1, and θ2. After each iteration, a new, better-fitting LinearRegressionFunction instance can be created using the new theta vector {θ0, θ1, θ2}. Listing 4 shows the Java code for the gradient descent algorithm. The thetas of the regression function are trained using the training data, the labels, and the learning rate (α). The result is an improved objective function that uses the new theta parameters. The train() method is called again and again, each time passing in the new objective function and the new theta parameters from the previous calculation. These calls are repeated until the tuned objective function reaches a minimum plateau:
Listing-4
public static LinearRegressionFunction train(LinearRegressionFunction targetFunction,
                                             List<Double[]> dataset,
                                             List<Double> labels,
                                             double alpha) {
    int m = dataset.size();
    double[] thetaVector = targetFunction.getThetas();
    double[] newThetaVector = new double[thetaVector.length];

    // compute the new theta for each element of the theta vector
    for (int j = 0; j < thetaVector.length; j++) {
        // sum up the error gap multiplied by feature j over all examples
        double sumErrors = 0;
        for (int i = 0; i < m; i++) {
            Double[] featureVector = dataset.get(i);
            double error = targetFunction.apply(featureVector) - labels.get(i);
            sumErrors += error * featureVector[j];
        }

        // compute the new theta value from the old thetas only
        double gradient = (1.0 / m) * sumErrors;
        newThetaVector[j] = thetaVector[j] - alpha * gradient;
    }

    return new LinearRegressionFunction(newThetaVector);
}
To ensure that the cost continually decreases, you can run the cost function J(θ) after each training step. After each iteration, the cost should decrease. If it does not, the learning rate is too large and the algorithm has overshot the minimum. In such a case, the gradient descent algorithm fails to converge. The plots below show the objective function using the new, calculated theta parameters, starting with the initial theta vector {1.0, 1.0}. The left column shows the plot of the prediction function after 50 iterations, the middle column after 200 iterations, and the right column after 1000 iterations. From these we can see that the cost decreases after each iteration, and the new objective function fits better and better. After 500-600 iterations, the theta parameters no longer change significantly, and the cost reaches a stable plateau. After this, the accuracy of the target function cannot be improved in this way.
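The check described above, computing the cost after each training step and watching for an increase, can be sketched as follows. This is a minimal, self-contained illustration: LinearFunction is a simplified stand-in for the article's LinearRegressionFunction (which is defined outside this section), trainAndTrace is a hypothetical helper, and the three training examples are made up.

```java
import java.util.Arrays;
import java.util.List;

final class TrainingLoopExample {

    // Simplified stand-in (assumption) for the article's LinearRegressionFunction:
    // h(x) = theta[0] * x[0] + ... + theta[n] * x[n].
    static final class LinearFunction {
        private final double[] thetas;
        LinearFunction(double[] thetas) { this.thetas = thetas.clone(); }
        double[] getThetas() { return thetas.clone(); }
        double apply(Double[] x) {
            double sum = 0;
            for (int j = 0; j < thetas.length; j++) sum += thetas[j] * x[j];
            return sum;
        }
    }

    // Half the mean of the squared errors, as in Listing 3.
    static double cost(LinearFunction h, List<Double[]> dataset, List<Double> labels) {
        double sum = 0;
        for (int i = 0; i < dataset.size(); i++) {
            double gap = h.apply(dataset.get(i)) - labels.get(i);
            sum += gap * gap;
        }
        return sum / (2 * dataset.size());
    }

    // One gradient descent step, following Listing 4.
    static LinearFunction train(LinearFunction h, List<Double[]> dataset,
                                List<Double> labels, double alpha) {
        int m = dataset.size();
        double[] thetas = h.getThetas();
        double[] newThetas = new double[thetas.length];
        for (int j = 0; j < thetas.length; j++) {
            double sumErrors = 0;
            for (int i = 0; i < m; i++) {
                Double[] x = dataset.get(i);
                sumErrors += (h.apply(x) - labels.get(i)) * x[j];
            }
            newThetas[j] = thetas[j] - alpha * sumErrors / m;
        }
        return new LinearFunction(newThetas);
    }

    // Runs up to maxSteps steps and returns the cost recorded after each one,
    // stopping early if the cost goes up (a sign that alpha is too large).
    static double[] trainAndTrace(List<Double[]> dataset, List<Double> labels,
                                  double alpha, int maxSteps) {
        LinearFunction h = new LinearFunction(new double[] { 1.0, 1.0 });
        double[] trace = new double[maxSteps + 1];
        trace[0] = cost(h, dataset, labels);
        for (int step = 1; step <= maxSteps; step++) {
            h = train(h, dataset, labels, alpha);
            trace[step] = cost(h, dataset, labels);
            if (trace[step] > trace[step - 1]) {
                return Arrays.copyOf(trace, step + 1);
            }
        }
        return trace;
    }

    public static void main(String[] args) {
        // Made-up data that follows y = 1 + 2 * x exactly.
        List<Double[]> dataset = Arrays.asList(
                new Double[] { 1.0, 1.0 }, new Double[] { 1.0, 2.0 }, new Double[] { 1.0, 3.0 });
        List<Double> labels = Arrays.asList(3.0, 5.0, 7.0);

        double[] trace = trainAndTrace(dataset, labels, 0.1, 100);
        System.out.println("initial cost: " + trace[0]);
        System.out.println("final cost:   " + trace[trace.length - 1]);
    }
}
```

With this data, a learning rate of 0.1 lets the cost shrink on every step, whereas a learning rate of 1.0 makes the cost jump on the very first step, so the loop stops immediately.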
In this case, even though the cost no longer decreases significantly after 500-600 iterations, the objective function is still not optimal. This indicates underfitting. In machine learning, the term "underfitting" means that the learning algorithm does not capture the underlying trends in the data. Based on real-life experience, you would expect the price per square meter to decrease for larger properties. From this we can conclude that the model used for the target function learning process does not fit the data well enough. Underfitting is often caused by oversimplification of the model. That is what happened in our case: the objective function is too simple, and it uses a single parameter for its analysis, the area of the house. This information is not enough to accurately predict the price of a house.
Adding features and scaling them
If you find that your objective function does not fit the problem you are trying to solve, it needs to be adjusted. A common way to correct underfitting is to add more features to the feature vector. In the house price example, you can add characteristics such as the number of rooms or the age of the house. That is, instead of using a vector with one feature value {size} to describe a house, you can use a vector with several values, for example {size, number-of-rooms, age}.
In some cases, the available training data does not contain enough features. Then it's worth trying polynomial features that are calculated from the existing ones. For example, you can extend the objective function for determining the price of a house so that it includes a calculated feature, the squared size (x²):
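A polynomial feature of this kind can be derived directly from an existing one. The sketch below uses a hypothetical helper, withSquaredSize, to build a feature vector of the form {1.0, size, size²}; this layout matches the feature vectors used in the scaling example later in this article, where the leading 1.0 is the constant feature multiplied by θ0.

```java
import java.util.Arrays;

final class PolynomialFeatureExample {

    // Builds the feature vector {1.0, size, size^2}. The leading 1.0 is the
    // constant feature; the last entry is the calculated polynomial feature.
    static Double[] withSquaredSize(double size) {
        return new Double[] { 1.0, size, size * size };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(withSquaredSize(90.0))); // [1.0, 90.0, 8100.0]
    }
}
```

With three entries in the feature vector, the objective function needs three theta parameters, one per entry.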
Using multiple features requires feature scaling, which is used to standardize the range across different features. For instance, the range of values for the size² attribute is significantly larger than the range of values for the size attribute. Without feature scaling, size² will unduly influence the cost function: the error introduced by the size² attribute will be significantly larger than the error introduced by the size attribute. A simple feature scaling algorithm is given below:
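The formula itself is not reproduced in this text. One common choice that matches the description of the FeaturesScaling class in the next paragraph (average, minimum and maximum values computed from the training data) is mean normalization; treat the exact form as an assumption:

```latex
x'_j = \frac{x_j - \operatorname{avg}(x_j)}{\max(x_j) - \min(x_j)}
```

Here avg, max and min are taken over the values of feature j in the training data, so each scaled feature ends up in a range of width 1 centered near zero.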
This algorithm is implemented in the FeaturesScaling class in the example code below. The FeaturesScaling class provides a factory method for creating a scaling function that is tuned to the training data. Internally, the training data instances are used to calculate the average, minimum and maximum values. The resulting function takes a feature vector and produces a new one with the scaled features. Feature scaling is necessary for both the learning process and the prediction process, as shown below:
// create the dataset: each vector is {1.0, size, size^2}
List<Double[]> dataset = new ArrayList<>();
dataset.add(new Double[] { 1.0, 90.0, 8100.0 });
dataset.add(new Double[] { 1.0, 101.0, 10201.0 });
dataset.add(new Double[] { 1.0, 103.0, 10609.0 });

// create the labels (prices)
List<Double> labels = new ArrayList<>();
labels.add(249.0);
labels.add(338.0);
labels.add(304.0);

// scale the features of the training data
Function<Double[], Double[]> scalingFunc = FeaturesScaling.createFunction(dataset);
List<Double[]> scaledDataset = dataset.stream().map(scalingFunc).collect(Collectors.toList());

// create the hypothesis function with initial thetas and train it with a learning rate of 0.1
LinearRegressionFunction targetFunction = new LinearRegressionFunction(new double[] { 1.0, 1.0, 1.0 });
for (int i = 0; i < 10000; i++) {
    targetFunction = Learner.train(targetFunction, scaledDataset, labels, 0.1);
}

// predict the price of a house with a size of 600: the same scaling must be applied first
Double[] scaledFeatureVector = scalingFunc.apply(new Double[] { 1.0, 600.0, 360000.0 });
double predictedPrice = targetFunction.apply(scaledFeatureVector);
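The FeaturesScaling class used above is not listed in this section. As a rough idea of what such a factory method might look like, here is a minimal sketch that assumes mean normalization, (x - avg) / (max - min), for each feature. The class name FeaturesScalingSketch and the handling of constant features are assumptions for illustration, not the article's actual implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Illustrative stand-in, not the article's actual FeaturesScaling class.
final class FeaturesScalingSketch {

    static Function<Double[], Double[]> createFunction(List<Double[]> dataset) {
        int n = dataset.get(0).length;
        double[] avg = new double[n];
        double[] min = new double[n];
        double[] max = new double[n];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);

        // Collect the sum, minimum and maximum of every feature over the training data.
        for (Double[] row : dataset) {
            for (int j = 0; j < n; j++) {
                avg[j] += row[j];
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }
        for (int j = 0; j < n; j++) {
            avg[j] /= dataset.size();
        }

        // The returned function applies the statistics gathered above to any feature vector.
        return features -> {
            Double[] scaled = new Double[n];
            for (int j = 0; j < n; j++) {
                double range = max[j] - min[j];
                // A constant feature (such as the leading 1.0) has no range; keep it as is.
                scaled[j] = (range == 0.0) ? features[j] : (features[j] - avg[j]) / range;
            }
            return scaled;
        };
    }

    public static void main(String[] args) {
        List<Double[]> dataset = Arrays.asList(
                new Double[] { 1.0, 90.0, 8100.0 },
                new Double[] { 1.0, 101.0, 10201.0 },
                new Double[] { 1.0, 103.0, 10609.0 });
        Function<Double[], Double[]> scalingFunc = createFunction(dataset);
        System.out.println(Arrays.toString(scalingFunc.apply(dataset.get(0))));
    }
}
```

Because the statistics come from the training data only, the same function can be reused unchanged at prediction time, which is why the example above applies scalingFunc to the new feature vector before calling the trained function.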
As more and more features are added, the objective function fits the training data better and better, but be careful. If you go too far and add too many features, you may end up learning an objective function that is overfit.
Overfitting and cross-validation
Overfitting occurs when the objective function or model fits the training data too well, so much so that it captures noise or random variations in the training data. An example of overfitting is shown in the rightmost graph below:
Although an overfit model performs very well on the training data, it will perform poorly on real, unknown data. There are several ways to avoid overfitting.
- Use a larger data set for training.
- Use fewer features as shown in the graphs above.
- Use an improved machine learning algorithm that takes regularization into account.
If a prediction algorithm overfits the training data, it is necessary to eliminate features that do not benefit its accuracy. The difficulty is to find the features that have a more significant effect on prediction accuracy than others. As shown in the graphs, overfitting can be identified visually. This works well for graphs with 2 or 3 coordinates, but it becomes difficult to plot and evaluate the graph if you use more than 2 features. In cross-validation, you retest the models after the training process is complete, using data unknown to the algorithm. The available labeled data should be divided into 3 sets:
- training data;
- validation data;
- test data.
In this case, 60 percent of the labeled records characterizing the houses should be used to train variants of the target algorithm. After the training process, half of the remaining data (not previously used) should be used to verify that the trained target algorithm performs well on unknown data. Typically, the algorithm that performs better than the others is selected for use. The remaining data is used to calculate the error value for the final selected model. There are other cross-validation techniques, such as k-fold. However, I will not describe them in this article.
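The 60/20/20 split described above can be sketched in a few lines. The class DataSplitExample, the Split holder and the seed parameter are illustrative assumptions; each record is expected to keep its features and label together (for example as one object) so that shuffling does not separate them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class DataSplitExample {

    // Holds the three subsets: training, validation and test data.
    static final class Split<T> {
        final List<T> training;
        final List<T> validation;
        final List<T> test;

        Split(List<T> training, List<T> validation, List<T> test) {
            this.training = training;
            this.validation = validation;
            this.test = test;
        }
    }

    // Shuffles a copy of the records, then cuts it into 60 percent training data
    // and two halves of the remainder for validation and test.
    static <T> Split<T> split(List<T> records, long seed) {
        List<T> shuffled = new ArrayList<T>(records);
        Collections.shuffle(shuffled, new Random(seed));

        int n = shuffled.size();
        int trainEnd = n * 60 / 100;
        int validationEnd = trainEnd + (n - trainEnd) / 2;

        return new Split<T>(
                new ArrayList<T>(shuffled.subList(0, trainEnd)),
                new ArrayList<T>(shuffled.subList(trainEnd, validationEnd)),
                new ArrayList<T>(shuffled.subList(validationEnd, n)));
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            records.add(i);
        }
        Split<Integer> split = split(records, 42L);
        System.out.println(split.training.size() + " / "
                + split.validation.size() + " / " + split.test.size()); // 6 / 2 / 2
    }
}
```

Shuffling before the cut matters when the records are stored in some order, such as by price or by date; a fixed seed keeps the split reproducible between runs.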
Machine learning tools and Weka framework
Most frameworks and libraries provide an extensive collection of machine learning algorithms. In addition, they provide a convenient high-level interface for training, testing and processing data models. Weka is one of the most popular frameworks for the JVM. It consists of a Java library for practical use and a graphical workbench for validating models. The example below uses the Weka library to create a training dataset that contains features and labels. The setClassIndex() method marks the label attribute. In Weka, a label is defined as a class:
// define the feature vector's attributes: size, squared size and price (the label)
ArrayList<Attribute> attributes = new ArrayList<>();
Attribute sizeAttribute = new Attribute("sizeFeature");
attributes.add(sizeAttribute);
Attribute squaredSizeAttribute = new Attribute("squaredSizeFeature");
attributes.add(squaredSizeAttribute);
Attribute priceAttribute = new Attribute("priceLabel");
attributes.add(priceAttribute);

// create the dataset and mark the last attribute (priceLabel) as the class
Instances trainingDataset = new Instances("trainData", attributes, 5000);
trainingDataset.setClassIndex(trainingDataset.numAttributes() - 1);

// add the training examples
Instance instance = new DenseInstance(3);
instance.setValue(sizeAttribute, 90.0);
instance.setValue(squaredSizeAttribute, Math.pow(90.0, 2));
instance.setValue(priceAttribute, 249.0);
trainingDataset.add(instance);
instance = new DenseInstance(3);
instance.setValue(sizeAttribute, 101.0);
...
The dataset and Instance objects can be saved to and loaded from a file. Weka uses ARFF (Attribute Relation File Format), which is supported by Weka's graphical workbench. This dataset is used to train an objective function, known in Weka as a classifier. First of all, you must define the objective function. The code below creates an instance of the LinearRegression classifier. This classifier is trained by calling buildClassifier(). The buildClassifier() method selects theta parameters based on the training data in search of the best target model. With Weka, you don't have to worry about setting the learning rate or the number of iterations. Weka also performs feature scaling on its own.
Classifier targetFunction = new LinearRegression();
targetFunction.buildClassifier(trainingDataset);
Once these settings are made, the objective function can be used to predict the price of the house, as shown below:
Instances unlabeledInstances = new Instances("predictionset", attributes, 1);
unlabeledInstances.setClassIndex(trainingDataset.numAttributes() - 1);
Instance unlabeled = new DenseInstance(3);
unlabeled.setValue(sizeAttribute, 1330.0);
unlabeled.setValue(squaredSizeAttribute, Math.pow(1330.0, 2));
unlabeledInstances.add(unlabeled);
double prediction = targetFunction.classifyInstance(unlabeledInstances.get(0));
Weka provides the Evaluation class to test a trained classifier or model. In the code below, a dedicated validation dataset is used to avoid biased results. The measurement results (the cost of the error) are displayed on the console. Typically, evaluation results are used to compare models that were trained using different machine learning algorithms, or variations of these:
Evaluation evaluation = new Evaluation(trainingDataset);
evaluation.evaluateModel(targetFunction, validationDataset);
System.out.println(evaluation.toSummaryString("Results", false));
The example above uses linear regression, which predicts continuous numerical values, such as the price of a house, based on input values. To predict binary values ("Yes" and "No"), you need to use other machine learning algorithms, for example a decision tree, a neural network, or logistic regression.
Classifier targetFunction = new Logistic();
targetFunction.buildClassifier(trainingSet);
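To give an intuition for why logistic regression suits binary predictions, the sketch below shows the general idea: a linear combination of the features is passed through the logistic (sigmoid) function, which yields a value between 0 and 1 that can be compared against a threshold. This is a generic illustration with hypothetical names and made-up theta values; it does not describe how Weka's Logistic class is implemented internally.

```java
final class LogisticSketch {

    // Logistic (sigmoid) function: maps any real number into the range (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Hypothetical helper: answers "Yes" (true) when the estimated probability
    // reaches 0.5, which happens exactly when the weighted sum is not negative.
    static boolean predict(double[] thetas, double[] features) {
        double z = 0;
        for (int j = 0; j < thetas.length; j++) {
            z += thetas[j] * features[j];
        }
        return sigmoid(z) >= 0.5;
    }

    public static void main(String[] args) {
        double[] thetas = { -1.0, 2.0 };  // made-up parameters
        System.out.println(sigmoid(0.0));                                // 0.5
        System.out.println(predict(thetas, new double[] { 1.0, 1.0 }));  // true
        System.out.println(predict(thetas, new double[] { 1.0, 0.0 }));  // false
    }
}
```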
You can use one of these algorithms, for example, to predict whether an email message is spam, to predict the weather, or to predict whether a house will sell well. If you want to teach your algorithm to predict the weather or how quickly a house will sell, you need a different dataset, e.g. topseller:
ArrayList<String> classVal = new ArrayList<>();
classVal.add("true");
classVal.add("false");
Attribute topsellerAttribute = new Attribute("topsellerLabel", classVal);
attributes.add(topsellerAttribute);
This dataset will be used to train a new topseller classifier. Once it has been trained, the prediction call should return the index of the predicted class, which can be used to obtain the predicted value.
int idx = (int) targetFunction.classifyInstance(unlabeledInstances.get(0));
String prediction = classVal.get(idx);
Conclusion
Although machine learning is closely related to statistics and uses many mathematical concepts, the machine learning toolkit allows you to start integrating machine learning into your programs without deep knowledge of mathematics. However, the better you understand the underlying machine learning algorithms, such as the linear regression algorithm we explored in this article, the more you will be able to choose the right algorithm and tune it for optimal performance.
Translation from English. Author: Gregor Roth, Software Architect, JavaWorld.