A couple of important suggestions here:
- If it's your first time using Amazon Textract and you're able to work in Python or JS/TS, I'd suggest using the open-source helper libraries amazon-textract-textractor (Py) or amazon-textract-response-parser (JS) which can greatly simplify your code to navigate the content returned by the Textract API.
- This applies especially if your goal is to extract the document into formats like HTML or Markdown, as these libraries already have tools for this.
- If you're specifically trying to detect page numbers written on the document, check out the Layout analysis feature, which costs extra when enabled but can detect regions like LAYOUT_FOOTER and LAYOUT_PAGE_NUMBER.
As the other answer mentioned, you can also determine the number of pages in the document from the returned DocumentMetadata and refer to the parent PAGE block sequencing for the current page index of a particular piece of content.
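To make the distinction between the printed page number and the page index concrete, here is a minimal sketch in plain Python. It assumes you already have a Textract response as a dict (for example from an analysis call with the LAYOUT feature enabled) and that each LAYOUT_PAGE_NUMBER block points to its LINE children through a CHILD relationship. The function name printed_page_numbers and the sample response are hypothetical, made up for illustration only.

```python
from typing import Any


def printed_page_numbers(response: dict[str, Any]) -> dict[int, str]:
    """Map page index -> page number text printed on that page.

    Assumes Layout analysis was enabled, so the response may contain
    LAYOUT_PAGE_NUMBER blocks whose CHILD relationships reference the
    LINE blocks holding the actual text.
    """
    blocks = response.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}
    found: dict[int, str] = {}

    for block in blocks:
        if block.get("BlockType") != "LAYOUT_PAGE_NUMBER":
            continue
        # Collect the IDs of all child blocks of this layout region.
        child_ids = [
            child_id
            for rel in block.get("Relationships", [])
            if rel.get("Type") == "CHILD"
            for child_id in rel.get("Ids", [])
        ]
        text = " ".join(
            by_id[child_id].get("Text", "")
            for child_id in child_ids
            if child_id in by_id
        ).strip()
        # "Page" is the index of the page within the document; fall back
        # to 1 when the attribute is absent.
        found[block.get("Page", 1)] = text

    return found


# Hypothetical, hand-written response fragment (not real Textract output).
sample = {
    "DocumentMetadata": {"Pages": 2},
    "Blocks": [
        {"Id": "p1", "BlockType": "PAGE", "Page": 1},
        {"Id": "l1", "BlockType": "LINE", "Page": 1, "Text": "7"},
        {"Id": "n1", "BlockType": "LAYOUT_PAGE_NUMBER", "Page": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"Id": "p2", "BlockType": "PAGE", "Page": 2},
        {"Id": "l2", "BlockType": "LINE", "Page": 2, "Text": "8"},
        {"Id": "n2", "BlockType": "LAYOUT_PAGE_NUMBER", "Page": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["l2"]}]},
    ],
}

print(printed_page_numbers(sample))
```

In this made-up sample the first page of the file carries the printed number 7, which is why the two values are worth keeping apart. The helper libraries mentioned above wrap this kind of traversal for you, so hand-rolled code like this is mainly useful for understanding the response shape.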
Hi,
Textract returns various blocks as a result of its analysis. Some of them are of type PAGE: see https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html
You can locate them by using the attribute BlockType = "PAGE".
You can also get the total number of pages from the attribute DocumentMetadata, which contains the sub-attribute Pages.
Finally, if you are using the lending analysis workflow, you get the page number in PageClassification under PageNumber: see https://docs.aws.amazon.com/textract/latest/dg/API_PageClassification.html
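As a rough illustration of the points above, the sketch below reads the total page count from DocumentMetadata, counts the PAGE blocks, and tallies the remaining blocks by their Page attribute. It works on a plain dict shaped like a Textract response; the function name summarize_pages and the example data are hypothetical and only meant to show the idea.

```python
from collections import Counter
from typing import Any


def summarize_pages(response: dict[str, Any]) -> dict[str, Any]:
    """Summarize page information found in a Textract-style response dict."""
    blocks = response.get("Blocks", [])

    # Total number of pages as reported in DocumentMetadata.Pages.
    total_pages = response.get("DocumentMetadata", {}).get("Pages", 0)

    # One PAGE block is expected per page present in this response.
    page_blocks = [b for b in blocks if b.get("BlockType") == "PAGE"]

    # Count non-PAGE blocks per page index; default to page 1 when the
    # Page attribute is missing.
    per_page = Counter(
        b.get("Page", 1) for b in blocks if b.get("BlockType") != "PAGE"
    )

    return {
        "total_pages": total_pages,
        "page_blocks": len(page_blocks),
        "blocks_per_page": dict(per_page),
    }


# Hypothetical, hand-written response fragment (not real Textract output).
example_response = {
    "DocumentMetadata": {"Pages": 2},
    "Blocks": [
        {"Id": "p1", "BlockType": "PAGE", "Page": 1},
        {"Id": "a", "BlockType": "LINE", "Page": 1, "Text": "Title"},
        {"Id": "b", "BlockType": "LINE", "Page": 1, "Text": "Body"},
        {"Id": "p2", "BlockType": "PAGE", "Page": 2},
        {"Id": "c", "BlockType": "LINE", "Page": 2, "Text": "More"},
    ],
}

print(summarize_pages(example_response))
```

If the two counts disagree, one possible reason is that the response you are looking at holds only part of the results, so it is worth checking that before trusting either number.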
Best,
Didier
