# Multi-Class Classification

The Multi-Class Classification AI & ML Agent classifies elements into one of several classes. Unlike Binary Classification, the output is not restricted to two classes.

Details for an example and its configuration can be found in the *How to Use* section.

The Multi-Class Classification Agent currently supports the algorithms listed below:

### Naive Bayes

Naive Bayes is a probabilistic classifier that can be used for multiclass problems. Using Bayes' theorem, the conditional probability for a sample belonging to a class can be calculated based on the sample count for each feature combination group. However, the Naive Bayes classifier is feasible only if the number of features and the number of values each feature can take are relatively small. It assumes that the features are independent of one another within a class, even though they may in fact depend on each other. This multi-class trainer accepts "binary" feature values of type float: feature values that are greater than zero are treated as `true` and feature values that are less than or equal to zero are treated as `false`.
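The binarization rule and the counting-based probability estimate described above can be illustrated with a minimal Python sketch. It is a textbook Bernoulli Naive Bayes and not the trainer's actual implementation; the function names, the sample data, and the add-one (Laplace) smoothing are all assumptions made for the example.

```python
import math

def binarize(row):
    """Treat values greater than zero as True and all others as False."""
    return [value > 0 for value in row]

def train_naive_bayes(rows, labels):
    """Count, per class, how many samples have each binarized feature set to True."""
    class_counts = {}
    true_counts = {}
    for row, label in zip(rows, labels):
        bits = binarize(row)
        class_counts[label] = class_counts.get(label, 0) + 1
        counts = true_counts.setdefault(label, [0] * len(bits))
        for i, bit in enumerate(bits):
            counts[i] += bit
    return class_counts, true_counts

def predict(model, row):
    """Return the class with the highest log-posterior under the independence assumption."""
    class_counts, true_counts = model
    total = sum(class_counts.values())
    bits = binarize(row)
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior of the class
        for i, bit in enumerate(bits):
            # Add-one smoothing (an assumption of this sketch) avoids log(0).
            p_true = (true_counts[label][i] + 1) / (n + 2)
            score += math.log(p_true if bit else 1 - p_true)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: three float features, three classes.
rows = [[1.0, 0.0, 0.0], [2.5, -1.0, 0.0],
        [0.0, 3.0, 0.0], [-0.5, 1.0, 0.0],
        [0.0, 0.0, 4.0], [0.0, -2.0, 0.5]]
labels = ["a", "a", "b", "b", "c", "c"]
model = train_naive_bayes(rows, labels)
predicted = predict(model, [0.7, -3.0, 0.0])
```

Because only the sign of each value matters, `[0.7, -3.0, 0.0]` is seen by the classifier as `[True, False, False]`, the same pattern as the class `"a"` training samples.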

### Sdca Maximum Entropy & Sdca Non-Calibrated

Both trainers use stochastic dual coordinate ascent (SDCA), an optimization algorithm that extends a coordinate descent method along the lines proposed in an earlier paper. It is usually much faster than L-BFGS and truncated Newton methods for large-scale and sparse data sets.

This class uses empirical risk minimization (ERM) to formulate the optimization problem built upon collected data. Note that empirical risk is usually measured by applying a loss function to the model's predictions on collected data points. If the training data does not contain enough data points (for example, to train a linear model in $n$-dimensional space, we need at least $n$ data points), overfitting may happen, so that the model produced by ERM is good at describing training data but may fail to predict correct results in unseen events. Regularization is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measured by a norm function) of model parameters. This trainer supports elastic net regularization, which penalizes a linear combination of L1-norm (LASSO), $\|w_c\|_1$, and L2-norm (ridge), $\|w_c\|_2^2$, regularizations for $c = 1, \dots, m$. L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
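As a sketch, the regularized ERM problem described above can be written as follows. The symbols $N$ (number of training points), $(x_i, y_i)$ (a training point and its label), $\ell$ (the loss function), and $\lambda_1$, $\lambda_2$ (the L1-norm and L2-norm coefficients) are notation assumed here for illustration; the exact scaling of each term in the trainer may differ.

```latex
\min_{w_1, \dots, w_m} \;
\frac{1}{N} \sum_{i=1}^{N} \ell(w_1, \dots, w_m; x_i, y_i)
+ \lambda_1 \sum_{c=1}^{m} \|w_c\|_1
+ \lambda_2 \sum_{c=1}^{m} \|w_c\|_2^2
```

The first term is the empirical risk, and the last two terms form the elastic net penalty. Setting $\lambda_2 = 0$ gives a pure LASSO penalty, and setting $\lambda_1 = 0$ gives a pure ridge penalty.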

Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $w_1, \dots, w_m$. For high-dimensional and sparse data sets, if users carefully select the coefficient of the L1-norm, it is possible to achieve good prediction quality with a model that has only a few non-zero weights (e.g., 1% of total model weights). In contrast, the L2-norm cannot increase the sparsity of the trained model but can still prevent overfitting by avoiding large parameter values. Sometimes, using the L2-norm leads to better prediction quality, so users may still want to try it and fine-tune the coefficients of the L1-norm and L2-norm. Note that conceptually, using the L1-norm implies a Laplace prior distribution over all model parameters, while the L2-norm implies a Gaussian prior distribution for them.

An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables from the model. For example, a very large L1-norm coefficient may force all parameters to be zeros and lead to a trivial model. Therefore, choosing the right regularization coefficients is important in practice.
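The contrast between the two penalties can be seen in a small Python sketch that uses their standard proximal (shrinkage) steps: soft-thresholding for the L1-norm and uniform scaling for the L2-norm. This illustrates the general effect only and is not a description of how the SDCA trainers apply regularization internally; the function names and the weight values are hypothetical.

```python
def soft_threshold(weights, l1):
    """L1 shrinkage: move each weight toward zero by l1; weights within [-l1, l1] become 0."""
    shrunk = []
    for w in weights:
        if w > l1:
            shrunk.append(w - l1)
        elif w < -l1:
            shrunk.append(w + l1)
        else:
            shrunk.append(0.0)
    return shrunk

def l2_shrink(weights, l2):
    """L2 shrinkage: scale every weight down; non-zero weights stay non-zero."""
    return [w / (1.0 + l2) for w in weights]

# Hypothetical model weights.
weights = [0.9, -0.05, 0.02, -1.4, 0.3]

mild = soft_threshold(weights, 0.1)   # small weights become exactly zero (sparser model)
harsh = soft_threshold(weights, 2.0)  # every weight becomes zero (trivial model)
ridge = l2_shrink(weights, 2.0)       # all weights shrink, none becomes exactly zero
```

With the mild L1 coefficient, only the two smallest weights are removed. With the large coefficient, every weight is zeroed, which is the trivial model mentioned above.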

### One Versus All

In the one-versus-all (OVA) strategy, a binary classification algorithm is used to train one classifier for each class, which distinguishes that class from all other classes. Prediction is then performed by running these binary classifiers and choosing the prediction with the highest confidence score. This algorithm can be used with any of the binary classifiers in ML.NET. A few binary classifiers already have an implementation for multi-class problems, so users can choose either version depending on the context. The OVA version of a binary classifier, such as one wrapping a `LightGbmBinaryTrainer`, can be different from `LightGbmMulticlassTrainer`, which develops a multi-class classifier directly. Note that even if the classifier indicates that it does not need caching, `OneVersusAll` will always request caching, as it will be performing multiple passes over the data set. This trainer will request normalization from the data pipeline if the classifier indicates it would benefit from it.

The OVA strategy lets you exploit trainers that do not naturally have a multiclass option, for example, using the `FastTreeBinaryTrainer` to solve a multiclass problem. Alternatively, it can allow ML.NET to solve a "simpler" problem even in cases where the trainer has a multiclass option but using it directly is not practical, usually because of memory constraints. For example, while a multiclass logistic regression is a more principled way to solve a multiclass problem, it requires that the trainer store a lot more intermediate state in the form of L-BFGS history for all classes *simultaneously*, rather than just one-by-one as would be needed for a one-versus-all classification model.
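The one-versus-all procedure can be sketched in a few lines of Python. This is an illustration of the strategy only, not ML.NET code: `train_centroid_scorer` is a hypothetical stand-in for a real binary trainer, and the data and class names are made up.

```python
def train_centroid_scorer(rows, is_positive):
    """Hypothetical binary trainer: the returned scorer gives a higher value the
    closer a row is to the positive centroid than to the negative centroid."""
    def centroid(selected):
        return [sum(column) / len(selected) for column in zip(*selected)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    pos = centroid([r for r, p in zip(rows, is_positive) if p])
    neg = centroid([r for r, p in zip(rows, is_positive) if not p])
    return lambda row: sq_dist(row, neg) - sq_dist(row, pos)

def train_one_versus_all(rows, labels, train_binary):
    """Train one binary scorer per class: that class versus all other classes."""
    return {
        cls: train_binary(rows, [label == cls for label in labels])
        for cls in sorted(set(labels))
    }

def predict_one_versus_all(scorers, row):
    """Run every binary scorer and return the class with the highest score."""
    return max(scorers, key=lambda cls: scorers[cls](row))

# Hypothetical two-feature data set with three classes.
rows = [[0.0, 0.0], [0.2, 0.1],
        [5.0, 5.0], [5.1, 4.8],
        [0.0, 5.0], [0.3, 5.2]]
labels = ["low", "low", "high", "high", "mixed", "mixed"]

scorers = train_one_versus_all(rows, labels, train_centroid_scorer)
prediction = predict_one_versus_all(scorers, [4.9, 5.3])
```

Each binary scorer is trained in its own pass over the full data set, which is why an OVA wrapper benefits from caching, and only one binary model needs to be in training at a time.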

### Reference

More details about the algorithms can be found at https://docs.microsoft.com/en-gb/dotnet/machine-learning/resources/tasks#multiclass-classification

### Pre-requisites

The minimum XMPro Data Stream version required for this Agent is 4.1.

### Download

Please contact XMPro if you're looking for an older version of this Agent.
