# Binary Classification

Binary Classification AI & ML Agent allows you to predict categorical variables where the output is only restricted to two classes. The input of a classification algorithm is a set of labeled examples, where each label is an integer of either 0 or 1. The output of a binary classification algorithm is a classifier, which you can use to predict the class of new unlabelled instances.
Details for an example and its configuration can be found in the How to Use section.
Binary Classification Agent currently supports the algorithms listed below:

### Average Perceptron

The perceptron is a classification algorithm that makes its predictions by finding a separating hyperplane. For instance, with feature values `f0,f1,...,fD−1f0,f1,...,fD−1`, the prediction is given by determining what side of the hyperplane the point falls into. That is the same as the sign of the features' weighted sum, i.e`. ∑D−1i=0(wi∗fi)+b∑i=0D−1(wi∗fi)+b`, where `w0,w1,...,wD−1w0,w1,...,wD−1 `are the weights computed by the algorithm, and `b` is the bias computed by the algorithm.
The perceptron is an online algorithm, which means it processes the instances in the training set one at a time. It starts with a set of initial weights (zero, random, or initialized from a previous learner). Then, for each example in the training set, the weighted sum of the features is computed. If this value has the same sign as the label of the current example, the weights remain the same. If they have opposite signs, the weights vector is updated by either adding or subtracting (if the label is positive or negative, respectively) the feature vector of the current example, multiplied by a factor 0 < a <= 1, called the learning rate. In a generalization of this algorithm, the weights are updated by adding the feature vector multiplied by the learning rate, and by the gradient of some loss function (in the specific case described above, the loss is hinge-loss, whose gradient is 1 when it is non-zero).
In Averaged Perceptron (aka voted-perceptron), for each iteration, i.e. pass through the training data, a weight vector is calculated as explained above. The final prediction is then calculated by averaging the weighted sum from each weight vector and looking at the sign of the result.

### Fast Forest

Fast Forest is a random forest implementation. The model consists of an ensemble of decision trees. Each tree in a decision forest outputs a Gaussian distribution by way of prediction. Aggregation is performed over the ensemble of trees to find a Gaussian distribution closest to the combined distribution for all trees in the model. This decision forest classifier consists of an ensemble of decision trees.
Generally, ensemble models provide better coverage and accuracy than single decision trees. Each tree in a decision forest outputs a Gaussian distribution.

### Fast Tree

FastTree is an efficient implementation of the MART gradient boosting algorithm. Gradient boosting is a machine learning technique for regression problems. It builds each regression tree in a step-wise fashion, using a predefined loss function to measure the error for each step and corrects for it in the next. So this prediction model is actually an ensemble of weaker prediction models. In regression problems, boosting builds a series of such trees in a step-wise fashion and then selects the optimal tree using an arbitrary differentiable loss function.
The ensemble of trees is produced by computing, in each step, a regression tree that approximates the gradient of the loss function and adds it to the previous tree with coefficients that minimize the loss of the new tree. The output of the ensemble produced by MART on a given instance is the sum of the tree outputs.

### Field Aware Factorization Matrix

The algorithm implemented is based on a stochastic gradient method. Algorithm details are described in Algorithm 3 in this online document. The minimized loss function is logistic loss, so the trained model can be viewed as a non-linear logistic regression.

### GAM Binary Trainer

Generalized Additive Models, or GAMs, model the data as a set of linearly independent features similar to a linear model. For each feature, the GAM trainer learns a non-linear function, called a "shape function", that computes the response as a function of the feature's value. (In contrast, a linear model fits a linear response (e.g. a line) to each feature.) To score an input, the outputs of all the shape functions are summed and the score is the total value.
This GAM trainer is implemented using shallow gradient boosted trees (e.g. tree stumps) to learn nonparametric shape functions, and is based on the method described in Lou, Caruana, and Gehrke. "Intelligible Models for Classification and Regression." KDD'12, Beijing, China. 2012. After training, an intercept is added to represent the average prediction over the training set, and the shape functions are normalized to represent the deviation from the average prediction. This results in models that are easily interpreted simply by inspecting the intercept and the shape functions. See the sample below for an example of how to train a GAM model and inspect and interpret the results.

### L-BFGS Logistic Regression

The optimization technique implemented is based on the limited memory Broyden-Fletcher-Goldfarb-Shanno method (L-BFGS). L-BFGS is a quasi-Newtonian method thatreplaces the expensive computation cost of the Hessian matrix with an approximation but still enjoys a fast convergence rate like the Newton method where the full Hessian matrix is computed. Since the L-BFGS approximation uses only a limited amount of historical states to compute the next step direction, it is especially suited for problems with the high-dimensional feature vector. The number of historical states is a user-specified parameter, using a larger number may lead to a better approximation to the Hessian matrix but also a higher computation cost per step.
An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables out of the model. Therefore, choosing the right regularization coefficients is important when applying logistic regression.

### LD-SVM

Local Deep SVM (LD-SVM) is a generalization of Localized Multiple Kernel Learning for non-linear SVM. Multiple kernel methods learn a different kernel, and hence a different classifier, for each point in the feature space. The prediction time cost for multiple kernel methods can be prohibitively expensive for large training sets because it is proportional to the number of support vectors, and these grow linearly with the size of the training set. LD-SVM reduces the prediction cost by learning a tree-based local feature embedding that is high dimensional and sparse, efficiently encoding non-linearities. Using LD-SVM, the prediction cost grows logarithmically with the size of the training set, rather than linearly, with a tolerable loss in classification accuracy.

### Linear SVM

Linear SVM implements an algorithm that finds a hyperplane in the feature space for binary classification, by solving an SVM problem. For instance, with feature values, `f0,f1,...,fD−1f0,f1,...,fD−1,` the prediction is given by determining what side of the hyperplane the point falls into. That is the same as the sign of the feautures' weighted sum, i.e. `∑D−1i=0(wi∗fi)+b∑i=0D−1(wi∗fi)+b,` where `w0,w1,...,wD−1w0,w1,...,wD−1` are the weights computed by the algorithm, and bb is the bias computed by the algorithm.
Linear SVM implements the PEGASOS method, which alternates between stochastic gradient descent steps and projection steps, introduced in this paper by Shalev-Shwartz, Singer and Srebro.

### SDCA & SDCA Logistic Regression

This trainer is based on the Stochastic Dual Coordinate Ascent (SDCA) method, a state-of-the-art optimization technique for convex objective functions. The algorithm can be scaled because it's a streaming training algorithm as described in a KDD test paper.
Convergence is underwritten by periodically enforcing synchronization between primal and dual variables in a separate thread. Several choices of loss functions are also provided such as hinge-loss and logistic loss. Depending on the loss used, the trained model can be, for example, a support vector machine or logistic regression. The SDCA method combines several of the best properties such as the ability to do streaming learning (without fitting the entire data set into your memory), reaching a reasonable result with a few scans of the whole data set (for example, see experiments in this paper), and spending no computation on zeros in sparse data sets.

### SGD Calibrated & SGD Non-Calibrated

The Stochastic Gradient Descent (SGD) is one of the popular stochastic optimization procedures that can be integrated into several machine learning tasks to achieve state-of-the-art performance. This trainer implements the Hogwild Stochastic Gradient Descent for binary classification that supports multi-threading without any locking. If the associated optimization problem is sparse, Hogwild Stochastic Gradient Descent achieves a nearly optimal rate of convergence.

### Pre-requisites

The minimum XMPro Data Stream version required for this Agent is 4.1.