XMPro Binary Classification ML Agent - Algorithms, Integration & Setup Guide

Learn how to integrate and configure the XMPro Binary Classification ML Agent for binary outcome prediction.

Overview

Integrate the XMPro Binary Classification Machine Learning Agent into your data streams to predict outcomes with two classes. This guide covers supported algorithms, configuration, and a step-by-step integration example.

The input of a classification algorithm is a set of labeled examples, where each label is an integer of either 0 or 1. The output of a binary classification algorithm is a classifier, which you can use to predict the class of new unlabeled instances.

For a simple use case, see the Binary Classification ML Agent Example.

For a detailed configuration guide, see the Binary Classification ML Agent Configuration.

Read on for more details of the algorithms supported by the XMPro Binary Classification Agent, or see the Microsoft machine learning library documentation at https://docs.microsoft.com/en-gb/dotnet/machine-learning/resources/tasks#binary-classification.

Current Version

  • Download the Binary Classification AI & ML Agent v1.09 (Last Updated 03 Feb 2026)

  • Compatibility: XMPro Data Stream Designer v4.1+

  • Supported OS: Windows (v1.00+), Linux (v1.08+)

  • Additional Prerequisites: None

Please contact XMPro if you're looking for an older version of this Agent.

Release Notes

1.09 (03 Feb 2026)
Fixed missing DLL error when using the Average Perceptron algorithm.

1.08 (12 May 2025)
Added support for Linux Stream Hosts. Upgraded the Microsoft.ML NuGet package from 1.6.0 to 4.0.2.

1.07 (09 Sep 2022)
Repackaged to include the Agent's Category.

1.06 (25 Oct 2021)
Fixed output attributes to also show parent outputs. Fixed the issue with Input Mapping.

1.05 (14 Sep 2021)
Added hints to the Algorithm Parameters and Model Options fields.

1.04 (25 Aug 2021)
Added Input Map to map variables to the training dataset. Removed Input Map from the grid in the Agent properties. Added a dropdown to the grid to select each variable as a feature or class variable, or to exclude it. Changed the Training File caption to Dataset. Implemented IMapAndReceiveAgent to forward all inputs. Added the Average Perceptron, Fast Forest, Field-Aware Factorization Machine, GAM, L-BFGS Logistic Regression, LD-SVM, Linear SVM, SGD Non-Calibrated, and SGD Calibrated algorithms.

1.03 (09 Aug 2021)
Updated the Agent and ML Shared DLL file to resolve the Assembly Load issue.

1.02 (20 May 2019)
Training file field names are automatically generated if there is no header.

1.01 (20 May 2019)
Input Map changed to config-based mapping. Added validation related to the mapping change.

1.00 (15 May 2019)
Initial Release.

Binary Classification Algorithms Supported in XMPro

Average Perceptron Algorithm

The Perceptron algorithm is a fast, linear classifier ideal for simple binary classification tasks in XMPro. Use it when you need quick predictions with minimal computational overhead.

This algorithm makes its predictions by finding a separating hyperplane. For instance, with feature values f_0, f_1, ..., f_(D-1), the prediction is given by determining which side of the hyperplane the point falls on. That is the same as the sign of the features' weighted sum, i.e. Σ_{i=0}^{D-1} (w_i * f_i) + b, where w_0, w_1, ..., w_(D-1) are the weights computed by the algorithm, and b is the bias computed by the algorithm.

The perceptron is an online algorithm, which means it processes the instances in the training set one at a time. It starts with a set of initial weights (zero, random, or initialized from a previous learner). Then, for each example in the training set, the weighted sum of the features is computed. If this value has the same sign as the label of the current example, the weights remain the same. If they have opposite signs, the weights vector is updated by either adding or subtracting (if the label is positive or negative, respectively) the feature vector of the current example, multiplied by a factor 0 < a <= 1, called the learning rate. In a generalization of this algorithm, the weights are updated by adding the feature vector multiplied by the learning rate, and by the gradient of some loss function (in the specific case described above, the loss is hinge-loss, whose gradient is 1 when it is non-zero).

In Averaged Perceptron (aka voted-perceptron), for each iteration, i.e. pass through the training data, a weight vector is calculated as explained above. The final prediction is then calculated by averaging the weighted sum from each weight vector and looking at the sign of the result.
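
The online update and final averaging described above can be sketched in a few lines. This is a toy illustration, not XMPro's or ML.NET's implementation; the data, learning rate, and epoch count are invented for the example, and labels are taken in {-1, +1} for the sign-based update.

```python
# Minimal sketch of the averaged perceptron update described above.

def train_averaged_perceptron(examples, epochs=10, lr=1.0):
    """examples: list of (features, label) pairs with label in {-1, +1}."""
    dim = len(examples[0][0])
    w = [0.0] * dim          # current weight vector
    b = 0.0                  # current bias
    w_sum = [0.0] * dim      # running sum of weight vectors for averaging
    b_sum = 0.0
    steps = 0
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # prediction disagrees with the label: update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
            w_sum = [ws + wi for ws, wi in zip(w_sum, w)]
            b_sum += b
            steps += 1
    # the final classifier uses the averaged weights and bias
    return [ws / steps for ws in w_sum], b_sum / steps

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Linearly separable toy data: positive roughly when x0 + x1 > 1
data = [([0.0, 0.0], -1), ([1.0, 1.0], 1), ([0.2, 0.1], -1), ([0.9, 0.8], 1)]
w, b = train_averaged_perceptron(data)
```

Averaging the per-step weight vectors, rather than keeping only the last one, is what makes the classifier less sensitive to the order in which the online updates happened.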

For more information, see the Wikipedia entry for Perceptron or Large Margin Classification Using the Perceptron Algorithm.

Fast Forest Algorithm

Fast Forest is a random forest implementation. The model consists of an ensemble of decision trees, where each tree outputs a Gaussian distribution as its prediction. Aggregation is performed over the ensemble of trees to find the Gaussian distribution closest to the combined distribution for all trees in the model.

Generally, ensemble models provide better coverage and accuracy than single decision trees.
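
As a toy illustration of ensemble aggregation (not the actual Fast Forest trainer, which aggregates Gaussian distributions), the sketch below averages the scores of a few hand-written decision stumps; the stumps and thresholds are invented for the example.

```python
# Each "tree" is a one-split stump returning a score; the forest
# aggregates by averaging the per-tree outputs.

def stump(feature_index, threshold, left, right):
    def tree(x):
        return left if x[feature_index] <= threshold else right
    return tree

forest = [
    stump(0, 0.5, -1.0, 1.0),
    stump(1, 0.3, -0.5, 0.8),
    stump(0, 0.7, -0.8, 1.0),
]

def forest_score(forest, x):
    # aggregate over the ensemble: mean of the individual tree outputs
    return sum(tree(x) for tree in forest) / len(forest)

def classify(forest, x):
    return 1 if forest_score(forest, x) >= 0 else 0
```

Because each stump errs in a different region of the feature space, the averaged score is typically more stable than any single tree's output.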

Fast Tree Algorithm

Fast Tree is an efficient implementation of the MART gradient boosting algorithm. Gradient boosting is a machine learning technique for regression problems. It builds each regression tree in a step-wise fashion, using a predefined loss function to measure the error at each step and correct for it in the next, so the prediction model is actually an ensemble of weaker prediction models. In regression problems, boosting builds a series of such trees in a step-wise fashion and then selects the optimal tree using an arbitrary differentiable loss function.

The ensemble of trees is produced by computing, in each step, a regression tree that approximates the gradient of the loss function and adds it to the previous tree with coefficients that minimize the loss of the new tree. The output of the ensemble produced by MART on a given instance is the sum of the tree outputs.
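
The step-wise construction can be illustrated with a toy boosting loop over regression stumps using squared loss, where the negative gradient of the loss is simply the residual. This is a sketch of the general gradient-boosting technique, not the Fast Tree/MART implementation; the shrinkage value, round count, and data are invented.

```python
# Toy gradient boosting: each round fits a stump to the residuals and
# adds it (shrunken) to the running ensemble prediction.

def fit_stump(xs, residuals):
    """Find the 1-D split that best fits the residuals (least squares)."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left) if left else 0.0
        rv = sum(right) / len(right) if right else 0.0
        err = sum((r - (lv if x <= t else rv)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def boost(xs, ys, rounds=20, shrinkage=0.5):
    trees = []
    pred = [0.0] * len(xs)
    for _ in range(rounds):
        # for squared loss, the negative gradient is the residual y - pred
        residuals = [y - p for y, p in zip(ys, pred)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        pred = [p + shrinkage * tree(x) for p, x in zip(pred, xs)]
    # the ensemble output is the (shrunken) sum of the tree outputs
    return lambda x: sum(shrinkage * t(x) for t in trees)

model = boost([0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 1.0, 1.0])
```

Each round corrects what the previous rounds got wrong, so the ensemble's error on the training data shrinks geometrically here.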

Field-Aware Factorization Machine Algorithm

The algorithm implemented is based on a stochastic gradient method. Algorithm details are described in Algorithm 3 in this online document. The minimized loss function is logistic loss, so the trained model can be viewed as a non-linear logistic regression.
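
The form of the model can be shown with a small scoring sketch: each feature belongs to a field and keeps one latent vector per field, and the score sums the pairwise interactions. Everything below (the field assignments, latent dimension, and latent values) is invented for illustration; a real trainer learns the latent vectors by stochastic gradient descent on the logistic loss.

```python
import math

# feature -> field assignment (features 0 and 1 in field 0; feature 2 in field 1)
field_of = [0, 0, 1]

# latents[j][f] = latent vector of feature j used when paired with field f
latents = [
    [[0.1, 0.2], [0.3, -0.1]],
    [[0.0, 0.4], [0.2, 0.2]],
    [[-0.3, 0.1], [0.5, 0.0]],
]

def ffm_score(x):
    s = 0.0
    for j1 in range(len(x)):
        for j2 in range(j1 + 1, len(x)):
            v1 = latents[j1][field_of[j2]]   # j1's vector for j2's field
            v2 = latents[j2][field_of[j1]]   # j2's vector for j1's field
            s += sum(a * b for a, b in zip(v1, v2)) * x[j1] * x[j2]
    return s

def ffm_probability(x):
    # training against logistic loss means the score maps to a
    # probability through the sigmoid
    return 1.0 / (1.0 + math.exp(-ffm_score(x)))
```

Because the interaction between two features uses field-specific latent vectors, the model captures non-linear feature interactions that a plain logistic regression cannot.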

GAM Binary Trainer Algorithm

Generalized Additive Models, or GAMs, model the data as a set of linearly independent features similar to a linear model. For each feature, the GAM trainer learns a non-linear function, called a "shape function", that computes the response as a function of the feature's value. (In contrast, a linear model fits a linear response (e.g. a line) to each feature.) To score an input, the outputs of all the shape functions are summed and the score is the total value.

This GAM trainer is implemented using shallow gradient boosted trees (e.g. tree stumps) to learn nonparametric shape functions, and is based on the method described in Lou, Caruana, and Gehrke, "Intelligible Models for Classification and Regression," KDD'12, Beijing, China, 2012. After training, an intercept is added to represent the average prediction over the training set, and the shape functions are normalized to represent the deviation from the average prediction. This results in models that are easily interpreted simply by inspecting the intercept and the shape functions.
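
The scoring rule described above (intercept plus summed per-feature shape functions) can be sketched as follows. The intercept and shape functions here are made up for illustration; a real GAM trainer learns them from data as shallow boosted trees.

```python
import math

intercept = -0.2  # average prediction over the training set (illustrative)

# one learned "shape function" per feature (hand-written here)
shape_functions = [
    lambda v: 0.8 * v,               # roughly linear response to feature 0
    lambda v: math.sin(v),           # non-linear response to feature 1
    lambda v: -0.5 if v > 2 else 0,  # step response to feature 2
]

def gam_score(x):
    # total score = intercept + sum of each feature's shape-function output
    return intercept + sum(f(v) for f, v in zip(shape_functions, x))

def gam_predict(x):
    return 1 if gam_score(x) >= 0 else 0
```

Interpretation is direct: plotting each shape function shows exactly how that feature moves the score away from the average prediction.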

L-BFGS Logistic Regression Algorithm

The optimization technique implemented is based on the limited-memory Broyden-Fletcher-Goldfarb-Shanno method (L-BFGS). L-BFGS is a quasi-Newton method that replaces the expensive computation of the Hessian matrix with an approximation, but still enjoys a fast convergence rate like Newton's method, where the full Hessian matrix is computed. Since the L-BFGS approximation uses only a limited amount of historical states to compute the next step direction, it is especially suited to problems with high-dimensional feature vectors. The number of historical states is a user-specified parameter; a larger number may lead to a better approximation of the Hessian matrix, but also a higher computation cost per step.

Aggressive regularization (that is, assigning large coefficients to the L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables from the model. Therefore, choosing the right regularization coefficients is important when applying logistic regression.

LD-SVM Algorithm

Local Deep SVM (LD-SVM) is a generalization of Localized Multiple Kernel Learning for non-linear SVM. Multiple kernel methods learn a different kernel, and hence a different classifier, for each point in the feature space. The prediction time cost for multiple kernel methods can be prohibitively expensive for large training sets because it is proportional to the number of support vectors, and these grow linearly with the size of the training set. LD-SVM reduces the prediction cost by learning a tree-based local feature embedding that is high dimensional and sparse, efficiently encoding non-linearities. Using LD-SVM, the prediction cost grows logarithmically with the size of the training set, rather than linearly, with a tolerable loss in classification accuracy.

Local Deep SVM is an implementation of the algorithm described in C. Jose, P. Goyal, P. Aggarwal, and M. Varma, Local Deep Kernel Learning for Efficient Non-linear SVM Prediction, ICML 2013.

Linear SVM Algorithm

Linear SVM implements an algorithm that finds a hyperplane in the feature space for binary classification, by solving an SVM problem. For instance, with feature values f_0, f_1, ..., f_(D-1), the prediction is given by determining which side of the hyperplane the point falls on. That is the same as the sign of the features' weighted sum, i.e. Σ_{i=0}^{D-1} (w_i * f_i) + b, where w_0, w_1, ..., w_(D-1) are the weights computed by the algorithm, and b is the bias computed by the algorithm.

Linear SVM implements the PEGASOS method, which alternates between stochastic gradient descent steps and projection steps, introduced in this paper by Shalev-Shwartz, Singer, and Srebro.
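
A rough sketch of the PEGASOS-style update: a stochastic subgradient step on the hinge loss followed by a projection onto the L2 ball of radius 1/sqrt(lambda). The regularization constant, iteration count, and toy data below are invented for illustration; this is not the Linear SVM trainer's actual code.

```python
import math
import random

def pegasos(examples, lam=0.1, iters=2000, seed=0):
    """examples: list of (features, label) pairs with label in {-1, +1}."""
    rng = random.Random(seed)
    dim = len(examples[0][0])
    w = [0.0] * dim
    for t in range(1, iters + 1):
        x, y = rng.choice(examples)              # stochastic: one example
        eta = 1.0 / (lam * t)                    # decreasing step size
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        # subgradient step: shrink w, then add the example if it
        # violates the margin (hinge loss is non-zero)
        w = [(1 - eta * lam) * wi for wi in w]
        if margin < 1:
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
        # projection step onto the ball of radius 1/sqrt(lam)
        norm = math.sqrt(sum(wi * wi for wi in w))
        radius = 1.0 / math.sqrt(lam)
        if norm > radius:
            w = [wi * radius / norm for wi in w]
    return w

data = [([1.0, 2.0], 1), ([2.0, 1.5], 1),
        ([-1.0, -1.0], -1), ([-2.0, -0.5], -1)]
w = pegasos(data)
```

The projection keeps the weight vector inside the region where the regularized optimum is known to lie, which is what gives PEGASOS its strong convergence guarantees.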

SDCA & SDCA Logistic Regression Algorithm

This trainer is based on the Stochastic Dual Coordinate Ascent (SDCA) method, a state-of-the-art optimization technique for convex objective functions. The algorithm scales because it is a streaming training algorithm, as described in a KDD best paper.

Convergence is underwritten by periodically enforcing synchronization between primal and dual variables in a separate thread. Several choices of loss functions are also provided, such as hinge loss and logistic loss. Depending on the loss used, the trained model can be, for example, a support vector machine or logistic regression. The SDCA method combines several of the best properties, such as the ability to do streaming learning (without fitting the entire data set into memory), reaching a reasonable result with a few scans of the whole data set (for example, see the experiments in this paper), and spending no computation on zeros in sparse data sets.

SGD Calibrated & SGD Non-Calibrated Algorithm

Stochastic Gradient Descent (SGD) is one of the most popular stochastic optimization procedures and can be integrated into several machine learning tasks to achieve state-of-the-art performance. This trainer implements Hogwild Stochastic Gradient Descent for binary classification, which supports multi-threading without any locking. If the associated optimization problem is sparse, Hogwild Stochastic Gradient Descent achieves a nearly optimal rate of convergence.
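
The per-example update that Hogwild runs lock-free across threads can be sketched single-threaded as plain SGD on the logistic loss. The step size, epoch count, and toy data are invented for illustration; this is the general technique, not the trainer's actual code.

```python
import math
import random

def sgd_logistic(examples, lr=0.5, epochs=200, seed=0):
    """examples: list of (features, label) pairs with label in {0, 1}."""
    rng = random.Random(seed)
    dim = len(examples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        # visit the examples in a random order each epoch
        for x, y in rng.sample(examples, len(examples)):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))     # sigmoid
            g = p - y                          # gradient of log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def prob(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

data = [([0.0, 0.1], 0), ([0.2, 0.0], 0), ([1.0, 0.9], 1), ([0.8, 1.0], 1)]
w, b = sgd_logistic(data)
```

In the Hogwild variant, multiple threads run this same inner update concurrently on shared weights without locks; when each example touches only a few features, the resulting overwrites are rare enough that convergence is barely affected.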
