As discussed in more detail in the "Interpreting Layout response objects" section of the doc and the launch blog post, Layout is an additional analysis that links to the underlying/raw WORD and LINE results. There's no equivalent API to directly generate the layout.csv as of now.
If you're flexible on programming language, I'd recommend trying out Amazon Textract Textractor for Python (as mentioned in the above blog post), or Amazon Textract Response Parser for JavaScript/TypeScript - both of which provide pre-built parsers for navigating this complex information, and even serializing documents to in-reading-order HTML/XML if that's an end goal for you. At a high level, you should find that the LAYOUT blocks are returned in reading order, but the underlying WORD/LINE blocks follow a more naive left-to-right, top-to-bottom order.
Unfortunately, I'm not aware of an equivalent for these libraries in Java today - so you'd need to write your own logic (perhaps using the existing ones as a guide) if you're tied to that.
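As a rough starting point for that custom logic, here is a minimal sketch of walking the LAYOUT blocks in the order they are returned and joining the text of their child LINE blocks. It assumes the reading-order behaviour described above holds for your documents. SimpleBlock is a hypothetical, simplified stand-in for the SDK's Block type (ID, block type, text, and the IDs from the CHILD relationship), not the real AWS class, so you would need to map the SDK response onto it yourself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LayoutReadingOrder {

    // Hypothetical simplified block: NOT the AWS SDK Block class.
    static final class SimpleBlock {
        final String id;
        final String blockType;      // e.g. "LAYOUT_TITLE", "LAYOUT_TEXT", "LINE", "WORD"
        final String text;           // may be null for blocks that carry no text
        final List<String> childIds; // IDs taken from the block's CHILD relationship

        SimpleBlock(String id, String blockType, String text, List<String> childIds) {
            this.id = id;
            this.blockType = blockType;
            this.text = text;
            this.childIds = childIds;
        }
    }

    // Returns one "TYPE: text" entry per LAYOUT_* block, in response order.
    static List<String> layoutTextInResponseOrder(List<SimpleBlock> blocks) {
        // Index every block by ID so children can be looked up quickly.
        Map<String, SimpleBlock> byId = new HashMap<>();
        for (SimpleBlock b : blocks) {
            byId.put(b.id, b);
        }

        List<String> result = new ArrayList<>();
        for (SimpleBlock b : blocks) {
            if (!b.blockType.startsWith("LAYOUT_")) {
                continue; // skip PAGE, LINE, WORD and other non-layout blocks
            }
            // Only direct LINE children are collected here; a LAYOUT block whose
            // children are other LAYOUT blocks would produce an empty entry.
            List<String> lines = new ArrayList<>();
            for (String childId : b.childIds) {
                SimpleBlock child = byId.get(childId);
                if (child != null && "LINE".equals(child.blockType) && child.text != null) {
                    lines.add(child.text);
                }
            }
            result.add(b.blockType + ": " + String.join(" ", lines));
        }
        return result;
    }
}
```

Calling LayoutReadingOrder.layoutTextInResponseOrder(blocks) on the mapped block list gives one string per layout element, which you could then write out as rows if you want something resembling the console's layout.csv.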
The discrepancy you're experiencing between the web UI output and the Java SDK response is likely due to how the data is processed and presented in different contexts.
When you use the AnalyzeDocumentRequest in Java, you're getting the raw output from the Textract service. This raw output includes all the detected elements (WORD, LINE, LAYOUT elements, etc.) in a single list, which may not appear to be in any particular order at first glance.
However, the order of elements in this raw output is actually meaningful. The elements are generally returned in a hierarchical structure, with higher-level elements (like LAYOUT elements) appearing after the lower-level elements they contain (like WORD and LINE elements). This structure allows for efficient processing and reconstruction of the document layout programmatically.
The CSV output you get from the web UI, on the other hand, is likely a processed version of this raw data, where the information has been reorganized into a more human-readable format, presenting the layout elements in the document's reading order.
To get the layout elements in document order from the Java SDK response, you would need to process the raw output. This involves:
- Filtering the Blocks to focus on the LAYOUT elements.
- Using the Geometry information (bounding boxes) of these LAYOUT elements to sort them in top-to-bottom, left-to-right order.
- Associating the text content with each LAYOUT element using the Relationships data.
This processing step is what the web UI is likely doing behind the scenes to generate the CSV in document order.
If you need the output in document order for your Java application, you'll need to implement this sorting and processing logic yourself using the geometric and relationship data provided in the raw API response.
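The geometric sort described in the steps above could be sketched roughly as follows. LayoutBox is a hypothetical holder for the values you would copy out of each LAYOUT block (its ID, page number, and the Top and Left of its bounding box); it is not an SDK type. This is a naive top-then-left ordering: it will interleave columns on multi-column pages, and it may be unnecessary if the LAYOUT blocks already arrive in reading order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LayoutGeometrySort {

    // Hypothetical holder for the fields needed for sorting: NOT an AWS SDK class.
    static final class LayoutBox {
        final String id;
        final int page;    // page number of the block
        final double top;  // bounding box top, as a fraction of page height
        final double left; // bounding box left, as a fraction of page width

        LayoutBox(String id, int page, double top, double left) {
            this.id = id;
            this.page = page;
            this.top = top;
            this.left = left;
        }
    }

    // Returns a new list ordered by page, then top edge, then left edge.
    static List<LayoutBox> sortTopToBottom(List<LayoutBox> boxes) {
        List<LayoutBox> sorted = new ArrayList<>(boxes); // leave the input untouched
        sorted.sort(Comparator.comparingInt((LayoutBox b) -> b.page)
                .thenComparingDouble(b -> b.top)
                .thenComparingDouble(b -> b.left));
        return sorted;
    }
}
```

Because the comparison is exact, two elements on the same visual row whose top values differ slightly are ordered by top rather than by left, so real documents may need some tolerance or column detection on top of this.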
Sources
Analyzing Documents - Amazon Textract
Detecting Text - Amazon Textract
DetectDocumentText - Amazon Textract
Thank you, AI. I'm already aware I could have some fun with the coding challenge of doing it myself. I was hoping I wouldn't have to and that the answers from the web and the API call would be deterministic. Or that this is common enough a problem that there is some established way already of doing that processing.
That is the conclusion I came to as well although I didn't know about the TypeScript option. Thanks.