The Regression AI & ML Agent allows you to predict a target variable based on independent variables. The label can be any real value and is not from a finite set of values as in classification tasks. Regression algorithms model the dependency of the label on its related features to determine how the label will change as the values of the features are varied.
The input of a regression algorithm is a set of examples with labels of known values. The output of a regression algorithm is a function, which you can use to predict the label value for any new set of input features.
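The workflow above can be sketched in a few lines. This is an illustrative stand-in using scikit-learn, not the Agent itself (the Agent is configured through XMPro, not written in Python): training examples with known labels go in, and a fitted function comes out that predicts the label for new feature values.

```python
# Minimal sketch of the regression workflow: examples with known labels
# in, a prediction function out. scikit-learn is used here purely as an
# illustrative analogue of the Agent's training step.
import numpy as np
from sklearn.linear_model import LinearRegression

# Training examples: known feature values paired with known label values.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])  # label = 2 * feature

# Training produces a function (the fitted model)...
model = LinearRegression().fit(X_train, y_train)

# ...which predicts the label value for any new set of input features.
prediction = model.predict(np.array([[5.0]]))
print(round(float(prediction[0]), 2))  # → 10.0
```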
The Regression Agent currently supports the algorithms listed below:
Decision trees are non-parametric models that perform a sequence of simple tests on each input, mapping it to outputs found in the training dataset whose inputs were similar to the instance being processed. A decision is made at each node of the binary tree data structure based on a measure of similarity, which routes each instance recursively through the branches of the tree until the appropriate leaf node is reached and the output decision is returned.
Decision trees have several advantages:
- They are efficient in both computation and memory usage during training and prediction.
- They can represent non-linear decision boundaries.
- They perform integrated feature selection and prediction.
- They are resilient in the presence of noisy features.
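The node-by-node decision procedure can be made concrete with scikit-learn's `DecisionTreeRegressor`, an analogous implementation rather than the Agent's internal one. A depth-1 tree learns a single threshold test at its root and returns the leaf value of the training examples most similar to the input.

```python
# Hedged sketch of the recursive test-at-each-node behaviour, using
# scikit-learn's DecisionTreeRegressor as an analogous implementation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y = np.array([1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0])

# A depth-1 tree learns a single threshold test (x <= v) at its root.
tree = DecisionTreeRegressor(max_depth=1).fit(X, y)

# New inputs are routed through the test to the leaf whose training
# examples had similar inputs, and that leaf's value is returned.
preds = tree.predict(np.array([[2.5], [11.5]]))
print(preds)  # → [1. 5.]
```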
Fast forest is a random forest implementation. The model consists of an ensemble of decision trees, each of which outputs a Gaussian distribution as its prediction. Aggregation is performed over the ensemble of trees to find the Gaussian distribution closest to the combined distribution for all trees in the model.
Generally, ensemble models provide better coverage and accuracy than single decision trees.
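The ensemble idea can be illustrated with scikit-learn's `RandomForestRegressor`. Note the analogy is loose: scikit-learn averages per-tree leaf values rather than aggregating per-tree Gaussian distributions as described above, so treat this strictly as a sketch of tree-level prediction plus aggregation.

```python
# Sketch of the ensemble idea: each tree predicts individually, and the
# forest aggregates. scikit-learn averages leaf values (it does not fit
# per-tree Gaussians), so this is an analogy, not the Agent's algorithm.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 0.5, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

x_new = np.array([[5.0]])
per_tree = [t.predict(x_new)[0] for t in forest.estimators_]

# The forest's output is exactly the mean of the per-tree outputs.
print(float(np.mean(per_tree)) == float(forest.predict(x_new)[0]))
```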
For more information, see:
FastTree is an efficient implementation of the MART gradient boosting algorithm. Gradient boosting is a machine learning technique for regression problems: it builds each regression tree in a step-wise fashion, using a predefined loss function to measure the error at each step and correct for it in the next, so the resulting prediction model is actually an ensemble of weaker prediction models. In regression problems, boosting builds a series of such trees in a step-wise fashion and then selects the optimal tree using an arbitrary differentiable loss function.
MART learns an ensemble of regression trees, each of which is a decision tree with scalar values in its leaves. A decision (or regression) tree is a binary tree-like flow chart, where at each interior node one decides which of the two child nodes to continue to based on one of the feature values from the input. At each leaf node, a value is returned. In the interior nodes, the decision is based on the test x <= v, where x is the value of the feature in the input sample and v is one of the possible values of this feature. The functions a regression tree can produce are all the piece-wise constant functions.
The ensemble of trees is produced by computing, in each step, a regression tree that approximates the gradient of the loss function, and adding it to the previous tree with coefficients that minimize the loss of the new tree. The output of the ensemble produced by MART on a given instance is the sum of the tree outputs.
- In the case of a binary classification problem, the output is converted to a probability by using some form of calibration.
- In the case of a regression problem, the output is the predicted value of the function.
- In the case of a ranking problem, the instances are ordered by the output value of the ensemble.
For more information, see:
The Tweedie boosting model follows the mathematics established in Insurance Premium Prediction via Gradient Tree-Boosted Tweedie Compound Poisson Models from Yang, Quan, and Zou.
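The Tweedie objective itself can be illustrated with scikit-learn's `TweedieRegressor`. Note this is a generalized linear model, not the gradient tree-boosted variant of the cited paper; it only demonstrates the compound Poisson-gamma deviance (1 < power < 2) used for zero-inflated, non-negative targets such as insurance claim costs.

```python
# Illustrates the Tweedie compound Poisson-gamma objective (1 < power < 2)
# with scikit-learn's GLM TweedieRegressor. This shows the loss family
# only, not the tree-boosting machinery of the cited paper.
import numpy as np
from sklearn.linear_model import TweedieRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(500, 1))
# Zero-inflated, non-negative targets, typical of insurance claim costs:
# most observations are exactly zero, the rest are positive amounts.
y = np.where(rng.uniform(size=500) < 0.6,
             0.0,
             rng.gamma(2.0, 2.0, size=500))

model = TweedieRegressor(power=1.5, alpha=0.0, link="log").fit(X, y)

# The log link keeps every prediction strictly positive.
print(bool(np.all(model.predict(X) > 0)))  # → True
```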
For an introduction to Gradient Boosting, and more information, see:
Generalized Additive Models, or GAMs, model the data as a set of linearly independent features similar to a linear model. For each feature, the GAM trainer learns a non-linear function, called a "shape function", that computes the response as a function of the feature's value. (In contrast, a linear model fits a linear response (e.g. a line) to each feature.) To score an input, the outputs of all the shape functions are summed and the score is the total value.
This GAM trainer is implemented using shallow gradient boosted trees (e.g. tree stumps) to learn nonparametric shape functions, and is based on the method described in Lou, Caruana, and Gehrke. "Intelligible Models for Classification and Regression." KDD'12, Beijing, China. 2012. After training, an intercept is added to represent the average prediction over the training set, and the shape functions are normalized to represent the deviation from the average prediction. This results in models that are easily interpreted simply by inspecting the intercept and the shape functions.
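Scoring with a trained GAM reduces to summing shape-function outputs with the intercept, which is what makes the model easy to inspect. The sketch below uses hypothetical, hand-written shape functions as stand-ins for what the trainer would actually learn; only the scoring arithmetic is the point.

```python
# Plain-Python sketch of how a trained GAM scores an input: one learned
# shape function per feature, an intercept equal to the average training
# prediction, and a score that is simply their sum. The shape functions
# here are hypothetical stand-ins for what the trainer would learn.

def shape_temperature(t):
    # Piece-wise constant, as produced by boosted tree stumps.
    return -0.5 if t < 20.0 else 0.8

def shape_pressure(p):
    return 0.2 if p < 101.0 else -0.1

intercept = 3.0  # average prediction over the training set (assumed)

def score(temperature, pressure):
    # Deviations from the average, summed with the intercept.
    return intercept + shape_temperature(temperature) + shape_pressure(pressure)

print(score(25.0, 100.0))  # 3.0 + 0.8 + 0.2 → 4.0
```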
Stochastic gradient descent uses a simple yet efficient iterative technique to fit model coefficients using error gradients for convex loss functions. Online Gradient Descent (OGD) implements the standard (non-batch) stochastic gradient descent, with a choice of loss functions, and an option to update the weight vector using the average of the vectors seen over time (averaged argument is set to True by default).
Poisson regression is a parameterized regression method. It assumes that the log of the conditional mean of the dependent variable follows a linear function of the independent variables. Assuming that the dependent variable follows a Poisson distribution, the regression parameters can be estimated by maximizing the likelihood of the obtained observations.
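The same assumptions can be demonstrated with scikit-learn's `PoissonRegressor`, which likewise takes log E[y | x] to be linear in the features and fits the parameters by maximum likelihood. The data below is simulated from a Poisson whose log-mean is linear in the feature, so the fitted parameters should recover the true ones.

```python
# Sketch using scikit-learn's PoissonRegressor: log of the conditional
# mean linear in the features, parameters fitted by maximum likelihood.
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(1000, 1))
# Counts drawn from a Poisson whose log-mean is 0.5 + 1.5 * x.
y = rng.poisson(lam=np.exp(0.5 + 1.5 * X[:, 0]))

model = PoissonRegressor(alpha=0.0).fit(X, y)
print(model.intercept_, model.coef_[0])  # ≈ 0.5 and ≈ 1.5
```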
This trainer is based on the Stochastic Dual Coordinate Ascent (SDCA) method, a state-of-the-art optimization technique for convex objective functions. The algorithm scales well because it is a streaming training algorithm, as described in a KDD best paper.
Convergence is underwritten by periodically enforcing synchronization between primal and dual variables in a separate thread. Several choices of loss functions are also provided such as hinge-loss and logistic loss. Depending on the loss used, the trained model can be, for example, support vector machine or logistic regression. The SDCA method combines several of the best properties such as the ability to do streaming learning (without fitting the entire data set into your memory), reaching a reasonable result with a few scans of the whole data set (for example, see experiments in this paper), and spending no computation on zeros in sparse data sets.
More details about the algorithms can be found at https://docs.microsoft.com/en-gb/dotnet/machine-learning/resources/tasks#regression
The minimum XMPro Data Stream version required for this Agent is 4.1.