Automatic bookmarking of PDF document, AWS Textract service

I was wondering whether your service can automatically recognise the individual reports in a searchable PDF document, bookmark them by date and title, and then sort them in chronological order.

Dass
asked 9 months ago, 203 views
1 Answer

AWS Textract is a service that extracts text and data from scanned documents. It can extract the information from a PDF, but it has no built-in functionality for recognizing specific report formats, bookmarking them, or sorting them. However, you can build this functionality by combining Textract with additional services and libraries:

  • Use AWS Textract to extract all the text from the PDF. Textract can identify and extract text from scanned documents, and it provides the results in a structured format.
  • Analyze the extracted text to find the dates and titles of the reports. You can use regular expressions or similar pattern matching for this, and the logic could run in AWS Lambda.
  • Create the PDF bookmarks programmatically. AWS doesn't offer a specific service for this, so you would need a library or tool that supports bookmarks, such as PyPDF2 or PDFBox, within a Lambda function.
  • Sort the reports in chronological order. This could also be done in the Lambda function.
  • Finally, save the bookmarked and sorted PDF to a storage service such as Amazon S3.
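
As a rough illustration of the analysis and sorting steps above, here is a minimal Python sketch using only the standard library. It is a sketch under assumptions, not a definitive implementation: it assumes the text has already been extracted page by page, that the first page of each report starts with its title on the first line, and that the page contains a date written as "Date: YYYY-MM-DD". The names `Report`, `DATE_RE`, and `find_reports` are hypothetical and chosen only for this example; real reports will likely need a different pattern.

```python
import re
from datetime import datetime
from typing import NamedTuple


class Report(NamedTuple):
    """One detected report: its date, its title, and the page where it starts."""
    date: datetime
    title: str
    page_index: int  # zero-based index into the original page list


# Assumed date format for this sketch: "Date: 2023-04-17".
DATE_RE = re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})")


def find_reports(pages: list[str]) -> list[Report]:
    """Detect report start pages and return them sorted by date."""
    reports = []
    for index, text in enumerate(pages):
        match = DATE_RE.search(text)
        if match is None:
            continue  # no date found: treat as a continuation page
        # Assumption: the first non-empty line of the page is the title.
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        date = datetime.strptime(match.group(1), "%Y-%m-%d")
        reports.append(Report(date, lines[0], index))
    # Chronological order, oldest report first.
    return sorted(reports, key=lambda report: report.date)


if __name__ == "__main__":
    sample_pages = [
        "Quarterly Review\nDate: 2023-06-30\nBody text",
        "continued text without a date",
        "Incident Summary\nDate: 2023-01-15\nBody text",
    ]
    for report in find_reports(sample_pages):
        print(report.date.date(), report.title, "starts on page", report.page_index)
```

Each returned entry carries what a bookmark needs: a label (date and title) and the page where the report starts. Reordering the pages and writing the bookmarks themselves would still be done with a PDF library such as the ones mentioned above.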
EXPERT
answered 9 months ago
