Is Textract aware of document language and character set used within the language?


I've run into problems extracting text from a PDF written in Czech: some characters, especially those with carons, are recognized badly or not returned at all. If Textract were aware of the source language or region, it should also know the character set that language uses (the source text inside the PDF is UTF-8, but the national characters actually used are usually limited to the Central European subset, as specified in the older ISO-8859-2 (Latin-2) and cp1250 code pages).

The PDF documents were uploaded to a bucket in the Central European region (eu-central-1) and processed in batch with client::StartDocumentTextDetection(options), with the client's region also set to eu-central-1.

Is there an option to specify the source document's language or a preferred character set, or some other way to improve the detection results?

1 Answer

Hi, as noted here in the developer guide, Czech is not in the list of languages currently officially supported by the service for text extraction. Also, as of now there's no option in the APIs (for example StartDocumentTextDetection or StartDocumentAnalysis) to explicitly specify which language(s) your content is in.
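To make that concrete, here is a minimal Python sketch of the asynchronous call, assuming a boto3-style Textract client is passed in. The helper names (build_request, start_job, collect_lines) are made up for illustration. Note that the request only identifies where the document lives; there is no field for a language or character set.

```python
def build_request(bucket, key):
    """Request body for StartDocumentTextDetection.

    It only locates the document in S3; there is no language or
    character-set field to set.
    """
    return {"DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}}}


def start_job(client, bucket, key):
    """Start an asynchronous text-detection job and return its JobId."""
    response = client.start_document_text_detection(**build_request(bucket, key))
    return response["JobId"]


def collect_lines(client, job_id):
    """Gather LINE texts from a finished job, following NextToken pagination."""
    lines, token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = client.get_document_text_detection(**kwargs)
        if page["JobStatus"] != "SUCCEEDED":
            raise RuntimeError("job not finished: " + page["JobStatus"])
        lines += [b["Text"] for b in page.get("Blocks", [])
                  if b["BlockType"] == "LINE"]
        token = page.get("NextToken")
        if not token:
            return lines


# Hypothetical usage (bucket and key are placeholders):
#   import boto3
#   client = boto3.client("textract", region_name="eu-central-1")
#   job_id = start_job(client, "my-bucket", "czech-document.pdf")
#   ... wait for the job to complete, then:
#   print("\n".join(collect_lines(client, job_id)))
```

Polling or an SNS notification channel is still needed between starting the job and collecting the lines; that part is left out here.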

In my experience, Amazon Textract can still work well for other Latin-character languages outside the list (for example Indonesian / Malay), but other locales with almost-but-not-quite supported character sets (such as Vietnamese) can be a challenge.

One option you could explore for some locales is to run a spell-checker (for example an open-source one) on the output to try to reconstruct the missing characters / accents. How well this works depends on how much meaning the unsupported characters carry: if the substitution is usually unambiguous, then great, but if not, a simple dictionary- and rule-based spell-checker may not be sufficient. Apologies, I don't have experience with Czech in particular.
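As a rough illustration of the dictionary-based idea, here is a small Python sketch that restores diacritics only when a word maps to exactly one dictionary entry. The function names and the tiny word list are made up for the example; a real setup would load a full Czech word list and would likely need something smarter for ambiguous words. It also does not help where characters were dropped entirely rather than returned without their accents.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, e.g. 'kůň' -> 'kun'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(words):
    """Map each accent-free form to the set of dictionary words it can stand for."""
    index = {}
    for word in words:
        lowered = word.lower()
        index.setdefault(strip_diacritics(lowered), set()).add(lowered)
    return index


def restore_word(word, index):
    """Replace a word only if its accent-free form has a single candidate."""
    candidates = index.get(strip_diacritics(word.lower()), set())
    if len(candidates) != 1:
        return word  # unknown or ambiguous: leave the OCR output untouched
    restored = next(iter(candidates))
    if len(word) > 1 and word.isupper():
        return restored.upper()
    if word[0].isupper():
        return restored.capitalize()
    return restored


def restore_text(text, index):
    """Apply restore_word to every word in the text, keeping punctuation as is."""
    return re.sub(r"\w+", lambda m: restore_word(m.group(0), index), text)


# Tiny illustrative word list; 'byt' and 'být' are both valid Czech words,
# so the accent-free form 'byt' is ambiguous and is left alone.
index = build_index(["příliš", "žluťoučký", "kůň", "byt", "být"])
print(restore_text("Prilis zlutoucky kun", index))
```

Leaving ambiguous words untouched keeps the step conservative: it never makes the OCR output worse, but it also means pairs that differ only by an accent need context to resolve.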

If post-processing the Amazon Textract output isn't viable in your particular case, you could perhaps explore:

  • Other 3rd-party OCR offerings available on the AWS Marketplace
  • Open-source tools with existing AWS deployment patterns. For example:
    • This document processing pipeline sample for layout-aware entity recognition can tackle some advanced structure extraction use cases similar to Textract and Comprehend, and uses Textract for OCR by default - but has integration options for multi-lingual models and open-source Tesseract OCR
    • A range of 3rd-party authors have released samples and blogs about deploying Tesseract OCR serverlessly on AWS Lambda
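For comparison, open-source Tesseract does let you name the language: its command line takes a -l option, and Czech uses the code ces. Below is a minimal Python sketch that shells out to the tesseract binary, assuming it is installed together with the Czech trained data. The function names are made up for illustration, and since Tesseract reads images rather than PDFs, each PDF page would have to be rasterized first.

```python
import subprocess


def tesseract_command(image_path, lang="ces"):
    """Build the CLI call: 'stdout' as output base prints the text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="ces"):
    """Run Tesseract on one page image and return the recognized text.

    Assumes the tesseract binary and the trained data for `lang`
    are installed on the machine (or in the Lambda layer/container).
    """
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


# Hypothetical usage, with page.png rendered from the PDF beforehand:
#   print(ocr_image("page.png"))
```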
answered a year ago
