Azure vs AWS vs GCP (Part 2: Form Recognizers)

  • Form recognizers use artificial intelligence to extract data from digital or handwritten custom forms, invoices, tables and receipts.
  • We compared the form recognizer solutions from the Amazon, Google and Microsoft clouds.
  • Azure Form Recognizer does a fantastic job of creating a viable solution with just five sample documents. It performs end-to-end Optical Character Recognition (OCR) on handwritten as well as digital documents with an impressive accuracy score, and it trains in just three seconds.

In part 1, we compared handwriting recognition solutions on Azure, AWS and GCP. In this post, we will be comparing form recognizer capabilities.

Data is now more valuable than oil. Companies have a lot of data, but not all of it is digitized. Recently, while consulting for a client, we realized that they had been doing big data analysis manually. It was shocking: they had more than a hundred years' worth of handwritten documents. Imagine doing analysis on that amount of data. We built a solution for them that not only recognizes standard forms like invoices and receipts, but also converts handwritten text into digitized text. Once the data is in the right form, analysis becomes easy; managing that data, analysing it and making predictions and decisions with it is a daunting task, especially if the data is not collected using the right process. To understand the high-level process, please read the following article. Even though these solutions can run on-premises, in this article we will compare the available cloud offerings for recognizing form data.

There are lots of ways to reduce that pile of documents on your shelf or in your cupboard and store them digitally on a hard disk or in cloud storage. Services are available that can read the information in your forms, then recognize and extract it while preserving the relationships within the data. These services use artificial intelligence and machine learning algorithms to extract text from different document formats. They allow you to easily search through your information, apply data science and make predictions based on the data and, last but not least, reduce paper use.

In this blog, we will examine:

  • Custom labeling model comparison
  • Invoice recognition
  • Receipt recognition
  • Application form recognition
  • Table data recognition

Custom labeling model comparison

Each business case requires unique modifications specific to its purpose. Behind the scenes, each service inspects the training data stored in the cloud, selects the right machine learning algorithm, trains it, and provides a custom model along with evaluation metrics. This new model can then be used through an API and integrated into applications. All three services use bounding boxes to tag the specific areas from which information should be extracted.
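
As an example of what that workflow looks like in practice, here is a minimal sketch of training and calling a custom model with Microsoft's azure-ai-formrecognizer Python SDK. This is not the exact code used for the tests below; the endpoint, key, SAS URL and file names are placeholders, and exact method names may differ between SDK versions.

```python
# Minimal sketch: train a custom Form Recognizer model from labeled samples,
# then run it on a new form. Endpoint, key, SAS URL and file name are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import FormTrainingClient, FormRecognizerClient

endpoint = "https://<your-resource>.cognitiveservices.azure.com/"
credential = AzureKeyCredential("<your-key>")

# Train a custom model from labeled sample forms stored in a blob container.
training_client = FormTrainingClient(endpoint, credential)
poller = training_client.begin_training(
    training_files_url="<container-sas-url>",  # where the sample W2s live
    use_training_labels=True,                  # supervised training with labeled tags
)
model = poller.result()

# Use the trained model to extract the labeled fields from a new form.
form_client = FormRecognizerClient(endpoint, credential)
with open("w2_test.png", "rb") as f:           # hypothetical test image
    result_poller = form_client.begin_recognize_custom_forms(model.model_id, f)

for form in result_poller.result():
    for name, field in form.fields.items():
        print(name, "->", field.value, f"(confidence {field.confidence})")
```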

What is a bounding box?

A bounding box is a rectangular box with x and y coordinates of the upper left and lower right corners and is used to find the exact location of an element on a page. Bounding boxes are typically used for multiple object recognition by forming a rectangular box around the object to be retrieved.

Why do we need bounding boxes?

In order to replicate the structure of a document, invoice, form or receipt, we need to know the exact coordinates of all the elements of interest. For example, in a W-4 employment form, an OCR program is capable of finding all the digitized text. A form recognizer, however, uses OCR to retrieve the digitized text and bounding boxes to determine where each particular piece of text is located. The x and y coordinates of the bounding boxes of fields like name, social security number and address provide the relative locations of these fields. This helps us reconstruct the document on a custom user interface that looks exactly like the scanned form we used as input. The results are kept as a set of key-value pairs, which makes them perfect for storing in a data store. For example, the fields above are the keys; all keys are part of the standard form, while the values are either typed or handwritten.
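
To make the idea concrete, here is an illustrative, provider-agnostic way such a recognized field could be represented. The class and field names below are our own and not any cloud API's schema:

```python
# Illustrative representation of a recognized field: the key, its value, and the
# bounding boxes that anchor both to their positions on the page.
from dataclasses import dataclass

@dataclass
class BoundingBox:
    x_min: float  # upper-left x
    y_min: float  # upper-left y
    x_max: float  # lower-right x
    y_max: float  # lower-right y

@dataclass
class RecognizedField:
    key: str                # printed label on the form, e.g. "Employee SSN"
    value: str              # typed or handwritten content
    key_box: BoundingBox    # where the label sits on the page
    value_box: BoundingBox  # where the value sits on the page

ssn = RecognizedField(
    key="Employee SSN",
    value="123-45-6789",
    key_box=BoundingBox(0.06, 0.04, 0.18, 0.07),
    value_box=BoundingBox(0.20, 0.04, 0.38, 0.07),
)
print(ssn.key, "->", ssn.value)
```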

We are using the following tools and their UI implementations (if available) on their corresponding websites:

  • Amazon Rekognition Custom Label: It can be used to identify objects and scenes in images that are specific to business needs. For example, it can identify logos, identify products on store shelves, identify animated characters in videos, etc. Bounding boxes here are specified using all four vertices of the rectangular box along with the width and height.
  • Google Cloud Auto ML Vision Object Detection: It uses custom machine learning models that can detect individual objects in each image along with its bounding box and label. Bounding boxes are specified using the top-left and bottom-right vertices.
  • Microsoft Azure Form Recognizer: It uses unsupervised machine learning algorithms to identify and extract text, key/value pairs and table data from form documents. It uses pre-trained models to output structured data such as the time and date of transactions, merchant information, taxes and totals, which helps preserve the relationships in the original form document. It can also be applied to custom forms by using bounding boxes and labeling to train and test on different documents. Bounding boxes here are specified using all four vertices of the rectangular box (see the conversion sketch after this list).
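
Since the three services describe bounding boxes differently (all four vertices versus only the top-left and bottom-right corners), converting between the two conventions is straightforward for axis-aligned rectangles. The sketch below is purely illustrative and not tied to any provider's API:

```python
# Convert between the two bounding-box conventions mentioned above,
# assuming axis-aligned rectangles.

def vertices_to_corners(vertices):
    """[(x, y), ...] for all four corners -> ((x_min, y_min), (x_max, y_max))."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys)), (max(xs), max(ys))

def corners_to_vertices(top_left, bottom_right):
    """((x_min, y_min), (x_max, y_max)) -> four corner vertices, clockwise."""
    (x_min, y_min), (x_max, y_max) = top_left, bottom_right
    return [(x_min, y_min), (x_max, y_min), (x_max, y_max), (x_min, y_max)]

print(vertices_to_corners([(10, 20), (110, 20), (110, 60), (10, 60)]))
# ((10, 20), (110, 60))
```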

We are testing these models based on how well they can find the bounding boxes specific to a business use case. The use case here is extracting information from W2 form images. We compare the custom labeling implementations of Amazon, Google and Microsoft for sample sizes of five and ten W2 forms, keeping one form separate from the training set for testing purposes. Tags considered (in their approximate positions on the W2 form):

1. Employee SSN
2. Employer Identification Number (EIN)
3. Employer’s name, address, zip
4. Employee’s name, address, zip
5. Section 1-14
6. Section 10-15
7. Form Name
8. Form Year

Visual representation of tags considered:

Figure 1: Visual Representation of W2 form labels

We also ran the test image through each provider's built-in text detection (OCR) features:

1. Amazon Textract: The dark shaded regions are recognized as the key-value pairs.

2. Google Document AI (PDF only): The red rectangles are the detected key-value pairs.

3. Microsoft Azure Form Recognizer: Key-value pairs detected, with labels shown from the Form Recognizer Analyze API. (There is no public UI to test the prebuilt form implementation.)

Since we wanted to detect individual custom labels, we used the custom labeling tools provided on each of their websites. The following is a representation of the process to label, train and test the forms (shown for the 10-image sample size):

(Screenshots of the labeling, training and results steps for Amazon Rekognition Custom Label, Google Cloud Auto ML Vision and Microsoft Azure Form Recognizer.)

From the results we received, we compare the bounding boxes covering our required information. The purple ovals mark the information that the custom model was not able to pick up for extraction.

Amazon:

Figure 2: Amazon Missed Data (10 samples)
Figure 3: Amazon Missed Data (10 samples)

Microsoft:

Figure 4: Microsoft Missed Data (5 samples)
Figure 5: Microsoft Missed Data (5 samples)

Google: It requires a minimum of 10 samples.

Figure 6: Google Missed Data (10 samples)

Microsoft goes the extra step and provides OCR for the text within the bounding boxes in its Form Recognizer feature.

Google and Amazon do not directly provide a feature to perform OCR within a bounding box; however, once the custom models have returned the bounding box information, we can use Google Cloud Vision and Amazon Textract/Amazon Rekognition Text Detection through their APIs to perform OCR on those regions.
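
As a sketch of that two-step approach, the snippet below crops a region returned by a custom model and sends the crop to Amazon Textract's DetectDocumentText for OCR. It assumes Pillow and boto3 are installed and that the custom model returned pixel coordinates; the file name and box coordinates are hypothetical.

```python
# Minimal sketch: crop a detected bounding box and OCR the crop with Textract.
import io
import boto3
from PIL import Image

textract = boto3.client("textract")

# Pixel coordinates of a detected bounding box (left, top, right, bottom).
box = (120, 45, 480, 95)

with Image.open("w2_test.png") as page:
    crop = page.crop(box)
    buffer = io.BytesIO()
    crop.save(buffer, format="PNG")

response = textract.detect_document_text(Document={"Bytes": buffer.getvalue()})
lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
print(" ".join(lines))  # the digitized text inside the bounding box
```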

Service      Bounding Box with OCR   5 Samples: Training Time / Accuracy       10 Samples: Training Time / Accuracy
Amazon       No                      52 minutes / 3 regions unidentified       1 hour 3 minutes / 3 regions unidentified
Google       No                      Not tested (requires at least 10 samples) 1 hour 14 minutes / Full
Microsoft    Yes                     3 seconds / 1.5 regions unidentified      5 seconds / 1.5 regions unidentified

Conclusion: Azure Form Recognizer happens to be the only tool that provides the digitized text output along with the bounding box coordinates and it does so in an impressive way. In just three seconds, we were able to train the model with a very high degree of accuracy. When it comes to accuracy, Google is 100% accurate. However, training takes a little more than an hour and it requires at least ten sample images, compared to Microsoft or Amazon, both of which only need a minimum of five images. Overall, Microsoft is the clear winner.

Invoice Recognition

Every day we use a lot of services: we go shopping in stores to purchase products, or visit a tailor or other custom repair shop. For each of these we might get an invoice or bill representing the purchased product or service. A company may want to digitize all of its employees' purchase invoices, or a frugal person may want to store all of their family expenses digitally for budget calculations. For all such use cases, we test the various cloud services and mark, side by side, the errors each one of them makes.

We are using the following services:

  • Amazon Textract
  • Google Cloud Vision (Note: Google Document AI is also used, but as it only supports PDF files, it is not compared)
  • Microsoft Azure Form Recognizer

Amazon: Example 1

Figure 7: Amazon Invoice Recognition Sample Input
Figure 8: Amazon Invoice Recognition Sample Output

Amazon also displays information in key-value pairs and in table format. Additionally, Amazon offers a Human Review feature where you can build human workflows to review predictions from Amazon Rekognition and Amazon Textract, for example for content moderation.
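
For reference, here is a minimal sketch (not the exact pipeline used for this comparison) of requesting key-value pairs and tables from Textract's AnalyzeDocument API and printing the detected pairs. The file name is hypothetical and boto3 must be configured with valid AWS credentials.

```python
# Minimal sketch: ask Textract for FORMS and TABLES, then link KEY blocks
# to their VALUE blocks in the response.
import boto3

textract = boto3.client("textract")

with open("invoice.png", "rb") as f:          # hypothetical local sample
    image_bytes = f.read()

response = textract.analyze_document(
    Document={"Bytes": image_bytes},
    FeatureTypes=["FORMS", "TABLES"],         # request key-value pairs and tables
)

# Index blocks by Id so key blocks can be linked to their value blocks.
blocks = {b["Id"]: b for b in response["Blocks"]}

def block_text(block):
    """Concatenate the WORD children of a block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)

for block in response["Blocks"]:
    if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(block_text(blocks[vid]) for vid in rel["Ids"])
        print(f"{block_text(block)!r} -> {value_text!r}")
```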

Google: Example 1

Figure 9: Google Invoice Recognition Sample Input
Figure 10: Google Invoice Recognition Sample Output

Google Document AI can also be used to perform OCR on documents.

Microsoft: Example 1

Figure 11: Microsoft Invoice Recognition Sample Input
Figure 12: Microsoft Invoice Recognition Sample Output

Observations:

Invoice Recognition Service        Error Count
Amazon Textract                    1
Google Cloud Vision                2
Microsoft Azure Form Recognizer    3

Conclusion: From the above test results we can see that Amazon Textract had only one error: it could not recognize the word "Qty" and displayed it as "Oty." Google Cloud Vision (GCV) and Microsoft Azure Form Recognizer (AFR) made the same mistake. In addition, GCV and AFR could not recognize the number 1: GCV completely missed it in the result, and AFR in one place output the text "Interdum" instead of 1. Amazon Textract is the clear winner here, as it made only one error and recognized most of the content better than the other two services.

Receipt Recognition

Like invoice recognition, cloud services also offer solutions to recognize receipts and POS transaction slips. Here too, we mark the errors between the actual receipt and the digitized text.
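
As a reference for how such a prebuilt receipt model is invoked, here is a minimal sketch using Microsoft's azure-ai-formrecognizer Python SDK. The endpoint, key and file name are placeholders, and exact method names may differ between SDK versions.

```python
# Minimal sketch: run Azure Form Recognizer's prebuilt receipt model on an image.
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import FormRecognizerClient

client = FormRecognizerClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

with open("receipt.png", "rb") as f:            # hypothetical receipt image
    poller = client.begin_recognize_receipts(f)

for receipt in poller.result():
    for name, field in receipt.fields.items():  # e.g. merchant, total, date
        print(name, "->", field.value)
```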

We are using the following services:

  • Amazon Textract
  • Google Cloud Vision (Note: Google Document AI is also used, but since it only takes PDF files as input, it is not compared)
  • Microsoft Azure Form Recognizer

Amazon:

Figure 13: Amazon Receipt Recognition Sample Input
Figure 14: Amazon Receipt Recognition Sample Output

Google:

Figure 15: Google Receipt Recognition Sample Input
Figure 16: Google Receipt Recognition Sample Output

When using the same receipt in .pdf format in Google Document AI, we get the following result, and the order seems to have been preserved.

Figure 17: Google Document AI Receipt Recognition Sample Output

Microsoft:

Figure 18: Microsoft Receipt Recognition Sample Input
Figure 19: Microsoft Receipt Recognition Sample Output

Observations:

We have accounted for two kinds of errors in this case:

  • Text Prediction Error Count: This is the count of errors in text prediction. We assign it a weight of one.
  • Text Structure Error Count: This is the count of errors in the order of the text output from the service API. Since preserving the structure of the text coming from the JSON output is important, we give it more weight than the first and assign it a weight of five.

Receipt Recognition Service        Text Prediction Error Count   Text Structure Error Count   Error Score
Microsoft Azure Form Recognizer    2                             1                            (2 * 1) + (1 * 5) = 7
Amazon Textract                    8                             1                            (8 * 1) + (1 * 5) = 13
Google Cloud Vision                2                             3                            (2 * 1) + (3 * 5) = 17
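
To make the weighting explicit, the error score in the table reduces to this small helper:

```python
# Weighted error score: prediction errors count once, structure errors count five times.
def error_score(prediction_errors: int, structure_errors: int) -> int:
    return prediction_errors * 1 + structure_errors * 5

print(error_score(2, 1))  # Microsoft Azure Form Recognizer -> 7
print(error_score(8, 1))  # Amazon Textract -> 13
print(error_score(2, 3))  # Google Cloud Vision -> 17
```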

Conclusion: From the above test results, we can see that the number of predictive text errors is the same for Microsoft Azure Form Recognizer (AFR) and Google Cloud Vision (GCV) whereas Amazon Textract (AT) has made the maximum number of mistakes in recognizing the text correctly. With regards to structural errors, AFR and AT made one mistake whereas GCV made three.

AFR made two text prediction errors, where it could not recognize the characters “S” and “lb” and returned “5” and “1b” respectively, and it had only one structure recognition error. AT was way off in recognizing the text correctly but made only one structure recognition error. GCV could not recognize the hyphen (-) in two places after the price and replaced the word “WT” with “H”, and it made the maximum number of structure recognition errors.

Nevertheless, considering the error score (the lower the better), we can conclude that Microsoft AFR performed better than the other services.

Application Form Recognition

This area deals with handwritten application forms, which we typically fill out in a bank or for an indemnification claim. We will not be comparing the Amazon cloud service, since Amazon Textract does not yet support handwriting recognition.
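
For the Google side, handwritten text can be read with Cloud Vision's document text detection. Below is a minimal sketch assuming the google-cloud-vision client library and application-default credentials; the file name is hypothetical, and older library versions use vision.types.Image instead of vision.Image.

```python
# Minimal sketch: full-document (dense/handwritten) text detection with Cloud Vision.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("application_form.png", "rb") as f:   # hypothetical scanned form
    content = f.read()

image = vision.Image(content=content)           # vision.types.Image in older versions
response = client.document_text_detection(image=image)

# The full text of the page, in the order the API returned it.
print(response.full_text_annotation.text)
```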

We are using the following services:

  • Google Cloud Vision (Note: Google Document AI is also used, but since it only takes PDF files as input, it is not compared)
  • Microsoft Azure Form Recognizer

Google: Sample 1

Figure 20: Google Application Form Recognition Sample Input 1
Figure 21: Google Application Form Recognition Sample Output 1

Notice how the format is mixed up in the key-value pairs extracted from the JSON output. However, when we tested this on Google Document AI, we got the following result; the structure is not preserved here either.

Google Document AI Application Form Recognition

Microsoft: Sample 1

Figure 22: Microsoft Application Form Recognition Sample Input 1
Figure 23: Microsoft Application Form Recognition Sample Output 1

Google: Sample 2

Figure 24: Google Application Form Recognition Sample Input 2
Figure 25: Google Application Form Recognition Sample Output 2

However, if we test the same input in Google Document AI, we get the following result:

Google Document AI Application Form Recognition

Microsoft: Sample 2

Figure 26: Microsoft Application Form Recognition Sample Input 2
Figure 27: Microsoft Application Form Recognition Sample Output 2

Observations:

We have accounted for two kinds of errors in this case:

  • Text Prediction Error Count: This is the count of errors in text prediction. We assign it a weight of one.
  • Text Structure Error Count: This is the count of errors in the order of the text output from the service API. Since preserving the structure of the text coming from the JSON output is important, we give it more weight than the first and assign it a weight of five.

Application Form Recognition Services   Sample 1: Prediction / Structure Errors   Sample 2: Prediction / Structure Errors   Error Score
Microsoft Azure Form Recognizer         6 / 1                                     8 / 0                                     (6 * 1 + 1 * 5) + (8 * 1 + 0) = 19
Google Cloud Vision                     9 / 1                                     4 / 1                                     (9 * 1 + 1 * 5) + (4 * 1 + 1 * 5) = 23

Conclusion: If we look closely at the text prediction errors, for Sample 1 AFR made six mistakes compared to nine by Google Cloud Vision (GCV). For Sample 2, GCV made only four mistakes whereas AFR made eight. If we sum up the mistakes across both samples, AFR made fourteen text recognition mistakes and GCV made thirteen. However, AFR takes the lead when it comes to structural errors: for Sample 1, AFR made only one mistake and none for Sample 2, whereas GCV made one mistake in each sample. So, based on the total error score, Microsoft Azure Form Recognizer (AFR) performed better.

Table Data Recognition

This category deals with extracting tabular information from forms or documents in image formats (PNG/JPG) or other formats like PDF. We are only showing the tabular data from the JSON output, not the text data as key-value pairs, in this category. Google Cloud Vision does not specifically extract tables from images, so we used Google Document AI for the Google comparison here.
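
For Amazon, table extraction comes from Textract's AnalyzeDocument with the TABLES feature. The following is a minimal sketch (not the exact code used here) of how table cells can be pulled out of the response; the file name is hypothetical.

```python
# Minimal sketch: reconstruct table rows from Textract's TABLE and CELL blocks.
import boto3

textract = boto3.client("textract")

with open("index_page.png", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()}, FeatureTypes=["TABLES"]
    )

blocks = {b["Id"]: b for b in response["Blocks"]}

def cell_text(cell):
    """Concatenate the WORD children of a CELL block."""
    words = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] == "CHILD":
            words += [blocks[i]["Text"] for i in rel["Ids"]
                      if blocks[i]["BlockType"] == "WORD"]
    return " ".join(words)

for block in response["Blocks"]:
    if block["BlockType"] == "TABLE":
        rows = {}
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for cid in rel["Ids"]:
                    cell = blocks[cid]
                    if cell["BlockType"] == "CELL":
                        rows.setdefault(cell["RowIndex"], {})[cell["ColumnIndex"]] = cell_text(cell)
        for r in sorted(rows):
            print([rows[r][c] for c in sorted(rows[r])])
```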

We are using the following services:

  • Amazon Textract
  • Google Cloud Vision (Note: Google Document AI is also used, but since it only takes PDF files as input, it is not compared)
  • Microsoft Azure Form Recognizer

Amazon: Sample 1

Figure 32: Amazon Table Data Recognition Sample Input 1
Figure 33: Amazon Table Data Recognition Sample Output 1

Google: Sample 1

Figure 28: Google Table Data Recognition Sample Input 1
Figure 29: Google Table Data Recognition Sample Output 1

Microsoft: Sample 1

Figure 30: Microsoft Table Data Recognition Sample Input 1
Figure 31: Microsoft Table Data Recognition Sample Output 1

All three services could recognize some part of the indices as tables.

Table Data - Sample 2

Figure 34: Table Data Recognition Input 2

Amazon: Output for Sample 2

Figure 37: Amazon Table Data Recognition Output Sample 2

Google: Output for Sample 2

Figure 35: Google Table Data Recognition Output Sample 2

Microsoft: Output for Sample 2

Figure 36: Microsoft Table Data Recognition Output Sample 2

Observations:

Table Data Recognition Services    Sample 1 Accuracy   Sample 2 Accuracy
Amazon Textract                    1/8                 Full Text Accuracy
Google Document AI                 7/8                 Full Text Accuracy
Microsoft Azure Form Recognizer    5/8                 Full Text Accuracy

Conclusion: If we look at Sample 1, there are eight different sections in the index. Comparing all three services, Amazon Textract recognizes only one section and creates a table, whereas Google Document AI and Microsoft Azure Form Recognizer create seven and five tables respectively. For Sample 2, all three services could fully recognize the text and gave correct output. So, comparing the performance across both samples, Google performs the best here.

Overall, we looked at the following services:

  • Amazon: Amazon Textract, Amazon Rekognition
  • Google: Google Cloud Vision, Google Document AI, Google Cloud Auto ML Vision Object Detection (Note: OCR for Google Cloud Vision supports PNG/JPG files, while Google Document AI supports PDF files only.)
  • Microsoft: Microsoft Azure Form Recognizer

The data required for testing these services was hosted on each provider's own cloud storage:

  • Amazon: Amazon S3
  • Google: Google Cloud Platform
  • Microsoft: Azure Container (Block Blob Data)

All the input files to the different services were .png files (unless stated otherwise).

Disclaimer:

These results are based on the state of the AI models on May 20, 2020, and have been examined using a limited dataset. The results can change over time with newer models or better training of existing models.

In our custom labeling comparison, we are looking for key-value pairs. Our use case has two main parts:

  • Identify the custom labels on the test image from the trained model.
  • Get the digitized content (key value pairs) in each custom label of the test image.

Microsoft Azure Form Recognizer (AFR) performs Optical Character Recognition (OCR) and offers a solution for both parts of the use case, but Amazon Rekognition Custom Label and Google Cloud Auto ML Vision Object Detection do not perform OCR with custom labels. We could have used Google Cloud Vision/Google Document AI and Amazon Textract/Amazon Rekognition Text Detection to perform OCR on the bounding boxes through their APIs once the custom label models had returned the bounding box information. However, we were looking for a complete solution for our use case, which they did not provide. Finally, instead of using AFR, we could have used Microsoft Custom Vision for the first part of the use case and then performed OCR with AFR for the second part. Since AFR provides both in one network call, we decided to proceed with the AFR option for the specified use case.

Final Conclusion:

This article compares different cloud services on common customer scenarios: recognizing data in application forms, receipts, invoices, tables and custom labels.

  • W2 form Custom Label:
    • Part 1: Correctly identify custom labels: Google Cloud Auto ML Vision Object Detection had the best accuracy.
    • Part 2: Perform OCR on the identified custom labels: Only Microsoft Azure Form Recognizer provided OCR, so there was no comparison.
  • Invoice Recognition: Amazon Textract performed the best.
  • Receipt Recognition: Microsoft Azure Form Recognizer performed the best.
  • Application Form Recognition: Microsoft Azure Form Recognizer performed the best.
  • Table Data Recognition: Google Document AI performed the best.

Overall, Microsoft is the winner for many reasons. Microsoft requires only five documents to train the model and does so in a record time of three seconds while maintaining high accuracy; other competitive solutions take close to an hour or more, and in our research the closest competitor took 52 minutes. Microsoft is also the only tool that performs OCR along with bounding boxes. It is a holistic solution that focuses on practical enterprise use cases.

If your company is building applications using Big Data and Artificial Intelligence, or creating web or mobile applications in the cloud, with microservices or on-premises, please check out our services to find out why Fortune 500 companies choose Cazton for building multi-billion-dollar revenue-generating applications.

Cazton is composed of technical professionals with expertise gained all over the world and in all fields of the tech industry, and we put this expertise to work for you. We serve all industries, including banking, finance, legal services, life sciences & healthcare, technology, media, and the public sector.

Cazton has expanded into a global company, servicing clients not only across the United States, but also in Oslo, Norway; Stockholm, Sweden; London, England; Berlin, Germany; Frankfurt, Germany; Paris, France; Amsterdam, Netherlands; Brussels, Belgium; Rome, Italy; as well as Quebec City, Toronto, Vancouver, Montreal, Ottawa, Calgary, Edmonton, Victoria, and Winnipeg. In the United States, we provide our consulting and training services across various cities like Austin, Dallas, Houston, New York, New Jersey, Irvine, Los Angeles, Denver, Boulder, Charlotte, Atlanta, Orlando, Miami, San Antonio, San Diego, Stamford and others. Contact us today to learn more about what our experts can do for you.

