Understanding Annotation and its Importance for Machine Learning
  • 08 Jul 2024

Introduction

Annotation is the process of assigning labels to a document and its content. Its primary purpose is to ensure that the relevant parts are identified and given as input to Machine Learning models (ML models). It also covers identifying the important parts of emails, documents, PDFs, and other forms of unstructured information. Broadly, it includes image and text annotation.

Documents contain a mixture of content. Some of it is of business relevance to a customer, while other parts are simply descriptions or flowing text of lesser relevance to a specific business use case.

Annotation therefore helps ML models understand what is important in a document and what can be safely ignored in a specific context.

Annotation can also be described as manual supervision: people make sure that ML models are fed the right kind of information for training purposes.

Why Annotate?

Customer documentation may contain multiple categories of information. It is imperative to identify which information within the documents is relevant, so that it can be extracted to support the enterprise's critical functions. Key information could include the following:

  • Names of people

  • Names of companies

  • Financial information (Limits, Deductibles, Premiums, etc.)

  • Other relevant information

It is also crucial to classify this important information as keys and their associated values, and to establish a link between them.

For example, a driving license card could contain the license number, the cardholder's name, and the expiry date. Each of these labels is classified as a key, and the information that goes with it is identified as the value. Each key is linked to its value, so the relevant information can be classified easily.
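The key, value, and link structure described above can be sketched as a small data model. This is only an illustration, not the format of any particular annotation tool: the names `Span`, `Link`, and `linked_pairs`, and the sample license values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A labeled piece of text; role is either 'key' or 'value'."""
    span_id: str
    role: str
    text: str


@dataclass(frozen=True)
class Link:
    """Connects one key span to the value span that belongs to it."""
    key_id: str
    value_id: str


def linked_pairs(spans, links):
    """Resolve links into a {key text: value text} mapping."""
    by_id = {span.span_id: span for span in spans}
    pairs = {}
    for link in links:
        key, value = by_id[link.key_id], by_id[link.value_id]
        if key.role != "key" or value.role != "value":
            raise ValueError(f"link {link} does not join a key to a value")
        pairs[key.text] = value.text
    return pairs


# Hypothetical annotation of a driving license card (made-up values).
spans = [
    Span("k1", "key", "License Number"),
    Span("v1", "value", "D1234567"),
    Span("k2", "key", "Name"),
    Span("v2", "value", "Jane Doe"),
    Span("k3", "key", "Expiry Date"),
    Span("v3", "value", "2027-05-31"),
]
links = [Link("k1", "v1"), Link("k2", "v2"), Link("k3", "v3")]

pairs = linked_pairs(spans, links)
```

Storing links separately from the spans keeps the three annotation steps (mark keys, mark values, link them) independent, which mirrors how the labeling is described here.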

Annotating ID Cards - Exploring a Use-Case Scenario

This is a sample ID card provided by a customer.

The customer may expect to obtain information such as Member Name, Member ID, Health Plan Number, and so on. In the screen above, these are marked as keys, and the corresponding information is marked as values. To annotate, each key and each value is marked, and the two are linked together so that the relevant information is gathered correctly. The sample labelling is displayed below.

Annotation with Layout Information

An important step is labeling the document layout so that geometric and structural information is captured. As in the previous use case, this captures the keys and the values together with where they sit, geometrically and structurally, within the document.

For example, one can identify whether the name generally appears at the top left or at the bottom right, and so on.

Annotating the layout information is useful for single-sheet images and documents. This method is also useful for high-volume documents that each contain few data points.
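As a rough sketch of what layout information can look like, each labeled field can carry a bounding box, and the box can be reduced to a coarse page position such as "top left". The `Box` class, the `page_region` helper, the pixel coordinates, and the convention that the origin is the top-left corner of the page are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box in pixels; origin assumed at the page's top-left corner."""
    left: float
    top: float
    right: float
    bottom: float

    def center(self):
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)


def page_region(box, page_width, page_height):
    """Name the page quadrant that contains the center of the box."""
    cx, cy = box.center()
    vertical = "top" if cy < page_height / 2 else "bottom"
    horizontal = "left" if cx < page_width / 2 else "right"
    return f"{vertical} {horizontal}"


# Hypothetical boxes on a 1000 x 600 pixel card image.
name_box = Box(left=40, top=30, right=260, bottom=60)
expiry_box = Box(left=700, top=520, right=960, bottom=560)

name_region = page_region(name_box, 1000, 600)      # "top left"
expiry_region = page_region(expiry_box, 1000, 600)  # "bottom right"
```

Quadrants are a deliberately coarse summary; the raw box coordinates carry the finer geometric detail.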

There are also other ways to annotate data, using different mechanisms:

  • Providing an Excel sheet with keys and values

  • Providing a dump of the customer’s database with the keys and values

These mechanisms are needed for extracting information from emails and PDFs. They are critical in use cases where multipage PDFs or emails contain many extractable values.
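A sheet of keys and values can be turned into per-document annotations with very little code. The sketch below assumes the sheet has been exported to CSV and has three columns named `document`, `key`, and `value`; the column names, file names, and figures are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export of a key-value sheet; real column names may differ.
SHEET = """document,key,value
policy_001.pdf,Limit,1000000
policy_001.pdf,Deductible,5000
email_017.eml,Premium,1250
"""


def load_annotations(csv_text):
    """Group the key-value rows of the sheet by the document they describe."""
    per_document = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row["key"].strip()
        per_document[row["document"].strip()][key] = row["value"].strip()
    return dict(per_document)


annotations = load_annotations(SHEET)
```

Values are kept as strings here so that nothing is lost before a later step decides how to interpret amounts or dates.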

Why is Annotation Important?

Annotating helps separate the wheat from the chaff: it distinguishes the areas of a document that are important and of interest from those that are not. It also creates relationships between entities in a document, such as a key, a value, and the link between them, and so helps map a specific key to a particular value.

However, our customers are the best judges. They know what they want from their documents and which information is critical to extract for their business processes. They also know which areas of flowing text are irrelevant, and so they are well placed to annotate samples and provide the “seed” for Machine Learning. This reinforces the principle that good annotation leads to better predictions.

Caveats of Annotation

Although annotation has its advantages, it is also important to understand that:

  • Good annotations do not mean 100% prediction accuracy

  • Predictions are based on various other factors like document and image quality

  • Various OCR errors must be factored in while reading the document/image

  • Alignment issues of keys to values are high possibilities

  • Variation in input documents leads to lower prediction accuracies

