Textract Layout Response Not In Document Order

I am using the Java Textract SDK to evaluate Textract, and the documentation states this:

Amazon Textract can be used to detect the layout of a document by finding the locations of different elements and their associated lines of text. These elements are paragraphs, lists, headers, footers, page numbers, figures, tables, titles, and section headers. When analyzing the layout of a document, Amazon Textract returns a bounding box location of the layout elements as well as the text in those elements. This information is returned in the implied reading order of the document, listing elements from top to bottom, left to right.

When I submit a PDF through the web UI, I get a layout.csv that returns the layout and contents in document order, which is what I want. When I make the call using AnalyzeDocumentRequest in Java with the same PDF, I get a response like this:

PAGE LINE LINE LINE LINE LINE LINE LINE LINE LINE WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD LAYOUT_HEADER LAYOUT_HEADER LAYOUT_HEADER LAYOUT_HEADER LAYOUT_TEXT LAYOUT_TEXT LAYOUT_TEXT LAYOUT_TEXT LAYOUT_TEXT LAYOUT_TEXT LAYOUT_TEXT LAYOUT_TEXT LAYOUT_TEXT LAYOUT_TEXT LAYOUT_TEXT LAYOUT_TEXT LAYOUT_SECTION_HEADER LAYOUT_TEXT LAYOUT_LIST LAYOUT_TEXT LAYOUT_TEXT LAYOUT_TEXT LAYOUT_PAGE_NUMBER LAYOUT_PAGE_NUMBER

It's not only the wrong order but a different order than what I get manually in the CSV. Any thoughts on why? Thanks.

asked 2 months ago, 66 views
2 Answers
Accepted Answer

As discussed in more detail in the "Interpreting Layout response objects" section of the doc and the launch blog post, Layout is an additional analysis that links to the underlying/raw WORD and LINE results. There's no equivalent API to directly generate the layout.csv as of now.

If you're flexible on programming language, I'd recommend trying out Amazon Textract Textractor for Python (as mentioned in the above blog post) or Amazon Textract Response Parser for JavaScript/TypeScript. Both provide pre-built parsers for navigating this complex information, and can even serialize documents to in-reading-order HTML/XML if that's an end goal for you. At a high level, you should find that the LAYOUT blocks are returned in reading order, while the underlying WORD/LINE blocks follow a more naive left-to-right, top-to-bottom order.

Unfortunately, I'm not aware of an equivalent of these libraries for Java today, so you'd need to write your own logic (perhaps using the existing ones as a guide) if you're tied to that language.
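
As a rough starting point for that custom logic, here is a minimal Java sketch. It does not call the Textract SDK: `SimpleBlock` and `LayoutReader` are hypothetical stand-ins that carry only an ID, a block type, optional text, and the IDs from CHILD relationships. In the AWS SDK for Java v2 the corresponding accessors should be `Block.id()`, `Block.blockTypeAsString()`, `Block.text()`, and `Relationship.ids()`, but treat that mapping as an assumption to check against your SDK version. The sketch keeps the LAYOUT blocks in the order they were returned, resolves each one to its LINE text, and assumes that a layout block referenced by another layout block (such as an item of a LAYOUT_LIST) should be emitted with its parent rather than on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the SDK's Block: only the fields this sketch needs.
final class SimpleBlock {
    final String id;
    final String type;           // e.g. "PAGE", "LINE", "WORD", "LAYOUT_TEXT"
    final String text;           // null for blocks that carry no text of their own
    final List<String> childIds; // IDs taken from the block's CHILD relationships

    SimpleBlock(String id, String type, String text, List<String> childIds) {
        this.id = id;
        this.type = type;
        this.text = text;
        this.childIds = childIds;
    }
}

final class LayoutReader {
    /** Returns one "TYPE: text" entry per top-level LAYOUT block, in returned order. */
    static List<String> layoutInReturnedOrder(List<SimpleBlock> blocks) {
        // Index every block by ID so relationships can be resolved.
        Map<String, SimpleBlock> byId = new HashMap<>();
        for (SimpleBlock b : blocks) {
            byId.put(b.id, b);
        }

        // Layout blocks referenced by another layout block are emitted with their parent
        // (assumption: this is how list items hang off a LAYOUT_LIST).
        Set<String> nested = new HashSet<>();
        for (SimpleBlock b : blocks) {
            if (!b.type.startsWith("LAYOUT_")) {
                continue;
            }
            for (String childId : b.childIds) {
                SimpleBlock child = byId.get(childId);
                if (child != null && child.type.startsWith("LAYOUT_")) {
                    nested.add(childId);
                }
            }
        }

        List<String> out = new ArrayList<>();
        for (SimpleBlock b : blocks) {
            if (b.type.startsWith("LAYOUT_") && !nested.contains(b.id)) {
                List<String> lines = new ArrayList<>();
                collectLines(b, byId, lines);
                out.add(b.type + ": " + String.join(" ", lines));
            }
        }
        return out;
    }

    // Gathers LINE text under a layout block, descending into nested layout blocks.
    private static void collectLines(SimpleBlock block, Map<String, SimpleBlock> byId, List<String> lines) {
        for (String childId : block.childIds) {
            SimpleBlock child = byId.get(childId);
            if (child == null) {
                continue;
            }
            if ("LINE".equals(child.type)) {
                lines.add(child.text);
            } else if (child.type.startsWith("LAYOUT_")) {
                collectLines(child, byId, lines);
            }
        }
    }
}
```

To use this against a real response, you would first copy each SDK block into a `SimpleBlock` (or replace the field reads with the SDK accessors) and then call `LayoutReader.layoutInReturnedOrder(...)` once per page.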

AWS EXPERT, answered 2 months ago
EXPERT, reviewed a month ago
  • That is the conclusion I came to as well although I didn't know about the TypeScript option. Thanks.

The discrepancy you're experiencing between the web UI output and the Java SDK response is likely due to how the data is processed and presented in different contexts.

When you use the AnalyzeDocumentRequest in Java, you're getting the raw output from the Textract service. This raw output includes all the detected elements (WORD, LINE, LAYOUT elements, etc.) in a single list, which may not appear to be in any particular order at first glance.

However, the order of elements in this raw output is meaningful. The list itself is flat; the hierarchy is expressed through each block's Relationships, and, as your output shows, the higher-level LAYOUT elements appear after the lower-level LINE and WORD elements they reference. This structure allows the document layout to be reconstructed programmatically.

The CSV output you get from the web UI, on the other hand, is likely a processed version of this raw data, where the information has been reorganized into a more human-readable format, presenting the layout elements in the document's reading order.

To get the layout elements in document order from the Java SDK response, you would need to process the raw output. This involves:

  1. Filtering the Blocks to focus on the LAYOUT elements.
  2. Using the Geometry information (bounding boxes) of these LAYOUT elements to sort them in top-to-bottom, left-to-right order.
  3. Associating the text content with each LAYOUT element using the Relationships data.

This processing step is what the web UI is likely doing behind the scenes to generate the CSV in document order.

If you need the output in document order for your Java application, you'll need to implement this sorting and processing logic yourself using the geometric and relationship data provided in the raw API response.
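
The filtering and sorting steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not what the web UI actually does: `LayoutBox`, `GeometrySort`, and the `ROW_TOLERANCE` value are hypothetical, the bounding box is reduced to its top and left ratios, and the text of each element is assumed to have been resolved already from the Relationships data. Tops are bucketed into rows so that two elements at almost the same height are ordered left to right.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical, simplified view of a block: type, top-left corner of its bounding box, resolved text.
final class LayoutBox {
    final String type;
    final double top;  // ratio of page height, 0.0 is the top edge
    final double left; // ratio of page width, 0.0 is the left edge
    final String text;

    LayoutBox(String type, double top, double left, String text) {
        this.type = type;
        this.top = top;
        this.left = left;
        this.text = text;
    }
}

final class GeometrySort {
    // Tops are rounded to multiples of this value to form rows (assumed value, tune per document).
    static final double ROW_TOLERANCE = 0.01;

    /** Keeps only LAYOUT_* blocks and orders them top-to-bottom, then left-to-right. */
    static List<LayoutBox> sortTopToBottomLeftToRight(List<LayoutBox> blocks) {
        List<LayoutBox> layout = new ArrayList<>();
        for (LayoutBox b : blocks) {
            if (b.type.startsWith("LAYOUT_")) {
                layout.add(b);
            }
        }
        // Bucketing by row keeps the comparator consistent, unlike a pairwise tolerance check.
        layout.sort(Comparator
                .comparingLong((LayoutBox b) -> Math.round(b.top / ROW_TOLERANCE))
                .thenComparingDouble(b -> b.left));
        return layout;
    }
}
```

A purely geometric sort like this interleaves columns on multi-column pages, so it will not always match the reading order. The accepted answer notes that the LAYOUT blocks are already returned in reading order, in which case preserving the returned order may be preferable to re-sorting.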

answered 2 months ago
  • Thank you, AI. I'm already aware I could have some fun with the coding challenge of doing it myself. I was hoping I wouldn't have to and that the answers from the web and the API call would be deterministic. Or that this is common enough a problem that there is some established way already of doing that processing.
