In a previous article, Artificial Intelligence for Records Management, we talked about different types of automation in the context of AI.

Since then we have released our Classification Intelligence (CI) module for Records365. In a nutshell, CI uses supervised machine learning algorithms to look at the contents of a record and suggest its classification.

There are many algorithms that can be used to *train* statistical models for such classification problems. As a Data Scientist or Machine Learning Engineer, you need to feel confident in your model selection – a big part of this is being able to easily compare the performance of the various models against each other.

Put another way, how can we tell how good a model is? That is, how well does it make correct predictions based on a set of known examples (called the *training set*)?

## Model Accuracy

If you’ve read anything at all about machine learning, you may have heard of the accuracy score, which is simply the ratio of correct predictions to total predictions. The accuracy score can be misleading, however.

For instance, if you have a *heavily skewed* data set of food pictures where about 10% of them are images of hotdogs, you can achieve ~90% accuracy by using a dumb model that always predicts “not-hotdog”.
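Here is a minimal sketch of that failure mode in plain Python. The labels are made up to match the example: 10 hotdogs out of 100 images.

```python
# A skewed data set: only 10 of the 100 images below are hotdogs.
labels = ["hotdog"] * 10 + ["not-hotdog"] * 90

# A "dumb" model that always predicts the majority class.
predictions = ["not-hotdog"] * len(labels)

correct = sum(1 for p, y in zip(predictions, labels) if p == y)
accuracy = correct / len(labels)
print(accuracy)  # 0.9 -- 90% accuracy without detecting a single hotdog
```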

That is not ideal, so we need a better way of evaluating our classifiers.

## Confusion Matrix

In a nutshell, the confusion matrix scores the model on 4 different dimensions:

- **True positives (TP)** – “hotdog” images correctly predicted as “hotdog”
- **False positives (FP)** – “not-hotdog” images incorrectly predicted as “hotdog”
- **True negatives (TN)** – “not-hotdog” images correctly predicted as “not-hotdog”
- **False negatives (FN)** – “hotdog” images incorrectly predicted as “not-hotdog”

This information can now be combined into more meaningful scores such as *precision* and *recall*.
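A hand-rolled sketch of those four counts for a tiny, made-up set of labels and predictions:

```python
# Five made-up examples for a binary "hotdog" classifier.
labels      = ["hotdog", "hotdog", "not-hotdog", "not-hotdog", "hotdog"]
predictions = ["hotdog", "not-hotdog", "not-hotdog", "hotdog", "hotdog"]

pairs = list(zip(predictions, labels))
tp = sum(1 for p, y in pairs if p == "hotdog" and y == "hotdog")
fp = sum(1 for p, y in pairs if p == "hotdog" and y == "not-hotdog")
fn = sum(1 for p, y in pairs if p == "not-hotdog" and y == "hotdog")
tn = sum(1 for p, y in pairs if p == "not-hotdog" and y == "not-hotdog")
print(tp, fp, fn, tn)  # 2 1 1 1
```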

### Precision

Precision describes the accuracy of positive predictions: how often does the model predict “hotdog” correctly? The precision score can be calculated as such:

`precision = TP / (TP + FP)`

We wouldn’t want to use precision alone though. It turns out that you can get a perfect precision score by simply making a single correct positive prediction: with one true positive and no false positives, precision is 1 / (1 + 0) = 100%, no matter how many hotdogs the model missed.
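To make that degenerate case concrete, a tiny sketch with made-up counts:

```python
# One correct "hotdog" prediction and nothing else: TP = 1, FP = 0.
tp, fp = 1, 0
precision = tp / (tp + fp)
print(precision)  # 1.0 -- perfect precision, however many hotdogs were missed
```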

We need recall to complete this picture.

### Recall

Recall describes the ratio of positive instances detected by the model – e.g. of all “hotdog” images in the training set, how many of them were predicted as such by the classifier? Recall is defined as:

`recall = TP / (TP + FN)`

Precision and recall expose our dumb model immediately: since it never predicts “hotdog”, it has no true positives, so its recall for the “hotdog” class is 0% (and its precision is undefined) – despite its ~90% accuracy. Suppose instead we train a real classifier on the same skewed data set and it scores:

- ~70% precision – percentage of correct hotdog predictions
- ~75% recall – percentage of hotdogs correctly identified as such across the data set
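One hypothetical set of confusion-matrix counts that produces those example scores (the counts are illustrative, not from a real model):

```python
# Illustrative counts: 21 hotdogs found, 9 false alarms, 7 hotdogs missed.
tp, fp, fn = 21, 9, 7
precision = tp / (tp + fp)  # 21 / 30 = 0.70
recall    = tp / (tp + fn)  # 21 / 28 = 0.75
print(precision, recall)  # 0.7 0.75
```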


### F1 score

Precision and Recall can further be combined into a single number: the F1 score, which conveys the balance between precision and recall. It is the harmonic mean of the two and is calculated as such:

`F1 = 2 × (precision × recall) / (precision + recall)`

This formula gives us an F1 score of ~72% for our previous example (precision ~70%, recall ~75%). It’s a very convenient way to evaluate how good a classifier is.
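The calculation for the example scores above (the numbers are illustrative):

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.70, 0.75
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.724 -- roughly 72%
```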

## If we have F1 scores, why should we care about Precision and Recall?

We might be tempted to simply calculate the F1 score for a classifier and be done with it. That can be a mistake: we need to think about what we’re trying to achieve. There are times when precision is a lot more important than recall, and vice versa.

Let’s look at services such as Facebook and YouTube. As long as you have an account, you can post anything: pictures and videos of your dog, holidays and, unfortunately, *inappropriate content*. What’s more important here?

In this case it’s much better to have a classifier with **high precision** but **low recall** – only allowing truly safe content to be posted, while occasionally blocking videos that are safe. Not a big deal and everyone is better for it.

Conversely, if you have a classifier that attempts to predict whether a patient has a type of cancer, you might decide that a **lower precision** but **higher recall** model is more appropriate: it’s better to have a false positive result that can be easily disproved with further, high precision tests, than to have false negatives robbing patients of the opportunity to get early treatment and potentially maximise their survival rate.

This is what is known in industry as the *precision / recall trade-off*.
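The trade-off can be sketched with a toy probabilistic classifier: raising the decision threshold trades recall for precision, and lowering it does the opposite. All scores and labels below are made up.

```python
# Hypothetical classifier scores and true labels (1 = "hotdog").
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

def precision_recall(threshold):
    """Score the classifier when positives are scores >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return tp / (tp + fp), tp / (tp + fn)

print(precision_recall(0.85))  # (1.0, 0.5)  -- strict: high precision, low recall
print(precision_recall(0.25))  # (~0.57, 1.0) -- lenient: low precision, high recall
```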

## Conclusion

In this article we looked at how we evaluate classifiers. We used a simple binary classifier as the example and left out multi-label classifiers for simplicity. Although brief and high-level, we hope this article gives you some insight into how we evaluate and choose our own models at RecordPoint.