Appendix B - Understanding the Model Metrics

In the Ushur intelligent workflow automation platform, it is easy and intuitive to build various machine-learning models for Natural Language Processing (NLP) and plug them into business automation workflows. These range from simple sentence-similarity detectors to complex classification and sentiment-analysis models that employ state-of-the-art deep-learning techniques. The following sections illustrate the methods that can be used to effectively assess model performance and how tools within the Ushur ecosystem enable this.

Accuracy is often used as the defining metric when evaluating the performance of machine-learning models for real-world business use cases. Accuracy can be thought of as the proportion of correct results the model has achieved. But relying on it alone is a naïve approach that can lead us to gauge the model’s predictive power incorrectly. Therefore, in statistical model testing, additional metrics are used to estimate the overall performance of ML models.

Let us take a quick tour of these metrics.

For instance, let us consider a simple e-mail classification model that classifies 100 e-mail samples as either “spam” (positive class) or “not-spam” (negative class).

The spam prediction model can be summarized into a 2 x 2 “confusion matrix”, which lists all the possible outcomes:

The confusion matrix captures the following outcomes:

  • A “True Positive” (or TP) is an outcome where the model correctly predicts the “positive” class.

  • A “True Negative” (or TN) is an outcome where the model correctly predicts the “negative” class.

  • A “False Positive” (or FP) is an outcome where the model incorrectly predicts the “positive” class.

  • A “False Negative” (or FN) is an outcome where the model incorrectly predicts the “negative” class.
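
To make these four outcomes concrete, here is a minimal Python sketch that counts TP, TN, FP, and FN for a binary spam classifier. The label values and sample lists are illustrative only and are not tied to the Ushur platform.

```python
# A minimal sketch of counting the four confusion-matrix outcomes for a
# binary spam classifier. The "spam"/"not-spam" labels are illustrative.

def confusion_counts(actual, predicted, positive="spam"):
    """Count TP, TN, FP, FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1          # correctly flagged spam
            else:
                fp += 1          # not-spam flagged as spam
        else:
            if a == positive:
                fn += 1          # spam that slipped through
            else:
                tn += 1          # correctly identified not-spam
    return tp, tn, fp, fn

# Tiny example:
actual    = ["spam", "not-spam", "spam", "not-spam"]
predicted = ["spam", "not-spam", "not-spam", "spam"]
print(confusion_counts(actual, predicted))   # (TP, TN, FP, FN) = (1, 1, 1, 1)
```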

The following metrics describe the different ways to interpret these outcomes using the example:

Accuracy

  • Definition: The fraction of predictions the model got right.
  • Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Outcome: (1 + 90) / (1 + 90 + 1 + 8) = 0.91 or 91%
  • Analysis: In the test sample of 100 emails, there are 91 actual not-spam emails. The model correctly identifies 90 of them as not-spam. But out of the nine spam emails, the model correctly identifies only one as spam. This implies that 8 out of 9 spam emails go undetected, which is not the ideal outcome.

Precision

  • Definition: How many predictions were correct among all positive predictions of the model.
  • Formula: Precision = TP / (TP + FP)
  • Outcome: Precision = 1 / (1 + 1) = 0.5 or 50%
  • Analysis: This implies that when the model predicts an email as spam, it is correct 50% (half) of the time. To improve the precision of a model, we need to minimize false positives.

Recall

  • Definition: What proportion of actual positives were correctly predicted (in other words, how many we missed).
  • Formula: Recall = TP / (TP + FN)
  • Outcome: Recall = 1 / (1 + 8) = 0.11 or 11%
  • Analysis: This indicates that the model identifies only 11% of spam emails correctly. To improve the recall of a model, we need to minimize false negatives.

F1 Score

  • Definition: The harmonic mean of precision and recall.
  • Formula: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
  • Outcome: F1 Score = 2 * (0.5 * 0.11) / (0.5 + 0.11) ≈ 0.18 or 18%
  • Analysis: This metric can be thought of as an alternative to the overall accuracy. It is very useful when seeking a balance between precision and recall.
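
As a quick sanity check, the numbers above can be reproduced in a few lines of Python from the example counts (TP = 1, TN = 90, FP = 1, FN = 8):

```python
# Reproducing the metrics for the 100-email example
# (TP = 1, TN = 90, FP = 1, FN = 8).

TP, TN, FP, FN = 1, 90, 1, 8

accuracy  = (TP + TN) / (TP + TN + FP + FN)                 # 0.91
precision = TP / (TP + FP)                                  # 0.50
recall    = TP / (TP + FN)                                  # 0.11
f1        = 2 * precision * recall / (precision + recall)   # ~0.18

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.91 precision=0.50 recall=0.11 f1=0.18
```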

Testing The Ushur ML Models

After a model is trained in the Ushur platform, you can evaluate its performance using the UCV (Ushur Classification Verifier) tool. It is a simple, Python-based, on-premise model-testing tool that uses the Ushur platform’s REST API to submit inference requests and retrieve the classification results. All the relevant metrics (per-category precision, recall, F1 score, confusion matrix, and so on) are output by this tool.

The UCV tool outputs the precision and recall (and hence the accuracy and misclassification rate) of each category in the model. It also generates a confusion matrix as a PNG file that visually depicts the spread of classifications across all categories. Categories that exhibit sub-optimal performance can be easily identified and earmarked for further improvement.
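
The UCV tool itself is not reproduced here, but the kind of report it produces can be approximated with standard libraries once the true and predicted category labels for a test set have been collected. The sketch below is illustrative only; the label names are hypothetical and scikit-learn/matplotlib are assumptions, not part of the Ushur tooling.

```python
# An illustrative, UCV-style metrics report using scikit-learn
# (not the actual UCV implementation). y_true/y_pred are assumed to be
# category labels gathered from a test set and its inference results.
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

y_true = ["claim inquiry", "general inquiry", "claim inquiry", "general inquiry"]
y_pred = ["claim inquiry", "general inquiry", "general inquiry", "general inquiry"]
labels = ["claim inquiry", "general inquiry"]

# Per-category precision, recall, and F1 score.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))

# Confusion matrix saved as a PNG, similar in spirit to the UCV output.
cm = confusion_matrix(y_true, y_pred, labels=labels)
ConfusionMatrixDisplay(cm, display_labels=labels).plot()
plt.savefig("confusion_matrix.png")
```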

The general rules of thumb for improving the performance of sub-optimal categories are as follows:

  • If the number of training samples is small, try adding more data.

  • Check for labeling errors. Potential mislabels identified by the Ushur platform can also be made available for this exercise.

  • Double-check the data collection process for potential “sampling bias”. A biased sample is not representative of the entire population. Data collection should be a purely random exercise, wherein each data point has an equal chance of being chosen (for example, using an entire database/backup/PST archive, collecting samples across a wide time range, and more).

  • In some scenarios, two or more categories can have overlap. It might be hard to distinguish between categories based on the textual content of samples. In those cases, consider using the Ushur platform’s intelligent data extraction capability. It helps identify KBIs (Key Business Indicators) which can then be coupled with the Metadata feature to override the model’s predicted classifications.

  • If the number of categories is large, it might be helpful to start with fewer categories and progressively add more. The initial set of categories can be chosen based on the volume/availability of data, business value, and more.

  • Experiment with other model types. The Ushur platform supports a wide variety of NLP models, from simple TF-IDF, Doc2Vec, or word2vec-based SVM models up to neural-network-based deep-learning models such as fastText, BiLSTM, ULMFiT, and more. A minimal offline baseline sketch follows this list.
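
For offline experimentation outside the platform, a simple baseline of the TF-IDF + SVM kind mentioned above can be put together with scikit-learn. This is only a sketch with hypothetical sample texts and labels; it is not how models are trained inside Ushur.

```python
# A minimal offline "TF-IDF + SVM" baseline using scikit-learn
# (illustrative only; not the Ushur training pipeline).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training samples and category labels.
texts  = ["please check my claim status", "what are your office hours"]
labels = ["claim inquiry", "general inquiry"]

# TF-IDF features on unigrams and bigrams, fed into a linear SVM.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)

print(model.predict(["claim payment is delayed"]))  # e.g. ['claim inquiry']
```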

Assessing The Model Performance

As illustrated in earlier sections, the overall accuracy should not be the be-all and end-all model metric. In most real-world business use cases, data is inherently imbalanced, and as a result, a large share of the overall accuracy comes from the abundant true negatives/positives. The false positives and false negatives are more interesting, as they have associated business costs (both tangible and intangible).

The overall effectiveness of the model can be assessed by considering both the precision and recall metrics, or the F1 score. In practice, however, improving recall often hampers precision and vice versa. We need to arrive at a suitable trade-off by assessing the business opportunity costs involved.

Consider the example of classifying emails in an insurance company into a “claim inquiry” vs. a “general inquiry” category. If we “miss” detecting claim-related emails (low recall), the response to the user might be delayed inordinately. This might lead to a bad customer experience and also have an impact on business SLAs. On the other hand, if we incorrectly classify “general inquiry” emails into the “claim inquiry” category, we might end up sending them to the wrong business unit/personnel, generating additional internal work in the process. The “cost” in each of these scenarios needs to be evaluated to arrive at the right set of acceptable values for the model.
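
One way to make this trade-off concrete is to attach an assumed cost to each false positive and false negative and then pick the decision threshold that minimizes the total cost. The sketch below does this for a hypothetical binary “claim inquiry” detector; the scores, labels, and cost values are all illustrative assumptions, not Ushur defaults.

```python
# A hedged sketch of choosing an operating threshold from assumed business
# costs. The scores, labels, and cost values below are hypothetical.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # 1 = "claim inquiry"
scores = np.array([0.9, 0.4, 0.7, 0.2, 0.6, 0.1, 0.8, 0.3])   # model confidence

COST_FN = 10.0   # missing a claim email (delayed response, SLA impact)
COST_FP = 1.0    # routing a general inquiry to the claims team

best = None
for threshold in np.unique(scores):
    y_pred = (scores >= threshold).astype(int)
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    cost = COST_FP * fp + COST_FN * fn
    if best is None or cost < best[0]:
        best = (cost, threshold)

print(f"lowest total cost {best[0]} at threshold {best[1]}")
```

The chosen threshold is only as good as the assumed costs, so they should be revisited whenever the business context changes.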

