How Well Do You Classify?

In a previous article, Artificial Intelligence for Records Management, we talked about different types of automation in the context of AI.

Since then we have released our Classification Intelligence (CI) module for Records365. In a nutshell, CI uses supervised machine learning algorithms to look at the contents of a record and suggest its classification.

There are many algorithms that can be used to train statistical models for such classification problems. As a Data Scientist or Machine Learning Engineer, you need to feel confident in your model selection – a big part of this is being able to easily compare the performance of the various models against each other.

Put another way, how can we tell how good a model is? That is, how well does it make correct predictions based on a set of known examples (called the training set)?

Model Accuracy

If you’ve read anything at all about machine learning, you may have heard of the accuracy score, which is simply the ratio of correct predictions to total predictions. The accuracy score can be misleading, however.

For instance, if you have a heavily skewed data set of food pictures where about 10% of them are images of hotdogs, you can achieve ~90% accuracy by using a dumb model that always predicts “not-hotdog”.
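A quick sketch makes this concrete. The labels below are illustrative – 1 stands for “hotdog”, 0 for “not-hotdog”, with 10% hotdogs as in the example:

```python
# Illustrative data set: 10% hotdogs (1), 90% not-hotdogs (0).
labels = [1] * 10 + [0] * 90

# The "dumb" model: always predict "not-hotdog".
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(accuracy)  # 0.9 -- 90% accuracy, yet the model never finds a single hotdog
```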

That is not ideal, so we need a better way of evaluating our classifiers.

Confusion Matrix

In a nutshell, the confusion matrix scores the model on 4 different dimensions:

  • True positives (TP) – hotdog images correctly predicted as “hotdog”
  • False positives (FP) – not-hotdog images incorrectly predicted as “hotdog”
  • True negatives (TN) – not-hotdog images correctly predicted as “not-hotdog”
  • False negatives (FN) – hotdog images incorrectly predicted as “not-hotdog”
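Tallying those four counts is straightforward. Here is a minimal sketch on a made-up set of ten labels and predictions (1 = hotdog, 0 = not-hotdog):

```python
# Made-up labels and predictions for a binary hotdog classifier.
labels      = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predictions = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(predictions, labels))
tp = sum(p == 1 and y == 1 for p, y in pairs)  # true positives
fp = sum(p == 1 and y == 0 for p, y in pairs)  # false positives
tn = sum(p == 0 and y == 0 for p, y in pairs)  # true negatives
fn = sum(p == 0 and y == 1 for p, y in pairs)  # false negatives
print(tp, fp, tn, fn)  # 2 1 6 1
```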

This information can now be combined into more meaningful scores such as precision and recall.


Precision describes the accuracy of positive predictions: how often does the model predict “hotdog” correctly? The precision score can be calculated as such:

Precision = TP / (TP + FP)

We wouldn’t want to use precision alone, though. It turns out that you can get a perfect precision score by simply making a single correct positive prediction: with 1 true positive and 0 false positives, precision is 1 / (1 + 0) = 100%, no matter how many hotdogs the model missed.

We need recall to complete this picture.


Recall describes the ratio of positive instances detected by the model – e.g. of all “hotdog” images in the training set, how many of them were predicted as such by the classifier? Recall is defined as:

Recall = TP / (TP + FN)
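With the formulas in hand, both scores fall out of the confusion-matrix counts. The counts below are hypothetical, chosen so the results match the ~70% precision and ~75% recall figures used later in this article:

```python
# Hypothetical confusion-matrix counts (chosen to give round numbers).
tp, fp, fn = 21, 9, 7

precision = tp / (tp + fp)  # of all "hotdog" predictions, how many were right
recall    = tp / (tp + fn)  # of all real hotdogs, how many we found
print(precision, recall)    # 0.7 0.75
```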

Computed on our example training set where 10% of the images are hotdogs, precision and recall paint a much different picture of the dumb model: it never predicts “hotdog”, so its recall is 0% (and its precision is undefined – there are no positive predictions to score). A more realistic classifier on the same data set might instead achieve:

  • ~70% precision – percentage of correct hotdog predictions
  • ~75% recall – percentage of hotdogs correctly identified as such across the dataset


F1 score

Precision and Recall can further be combined into a single number: the F1 score, which conveys the balance between precision and recall and is calculated as their harmonic mean:

F1 = 2 × (precision × recall) / (precision + recall)

This formula gives us an F1 score of ~72% for our previous example (70% precision, 75% recall). It’s a very convenient way to evaluate how good a classifier you have.
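As a one-line check of that arithmetic, using the precision and recall figures from above:

```python
precision, recall = 0.70, 0.75

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.72
```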


If we have F1 scores, why should we care about Precision and Recall?

We might be tempted to simply calculate the F1 score for a classifier and be done with it. That can be a mistake, and we need to think about what we’re trying to achieve: there are times when precision is a lot more important than recall, and vice versa.

Let’s look at services such as Facebook and YouTube. As long as you have an account, you can post anything: pictures and videos of your dog, holidays and, unfortunately, inappropriate content. What’s more important here?

In this case it’s much better to have a classifier with high precision – when it marks content as safe, that content really is safe – even if low recall means occasionally blocking videos that are perfectly fine. That’s not a big deal, and everyone is better off for it.

Conversely, if you have a classifier that attempts to predict whether a patient has a type of cancer, you might decide that a lower-precision model with higher recall is more appropriate: it’s better to have a false positive result that can be easily disproved with further, higher-precision tests than to have false negatives robbing patients of the opportunity to get early treatment and maximise their survival rate.

This is what is known in the industry as the precision/recall trade-off.
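The trade-off becomes tangible once you remember that most classifiers output a score, and the decision threshold is ours to choose. The scores and labels below are made up for illustration:

```python
# Made-up (score, label) pairs from a hypothetical hotdog classifier.
examples = [(0.95, 1), (0.80, 1), (0.70, 0), (0.55, 1), (0.30, 0), (0.10, 0)]

def precision_recall(threshold):
    """Precision and recall when predicting "hotdog" for scores >= threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in examples)
    fp = sum(s >= threshold and y == 0 for s, y in examples)
    fn = sum(s < threshold and y == 1 for s, y in examples)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall(0.9))  # strict threshold: perfect precision, low recall
print(precision_recall(0.5))  # looser threshold: lower precision, full recall
```

Raising the threshold makes positive predictions rarer but safer (precision up, recall down); lowering it does the opposite.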


In this article we looked at how we evaluate classifiers. Specifically, we used a simple binary classifier as the example and left out multi-label classifiers for simplicity. Although brief and high-level, we hope this article gives you some insight into how we evaluate and choose our own models at RecordPoint.

