1 Answer
Amazon Textract is a service that extracts text and data from scanned documents. It can extract the text from a PDF, but it has no built-in functionality for recognizing specific report formats, bookmarking them, or sorting them. However, you can build this functionality by combining Textract with other services and libraries:
- Use Amazon Textract to extract all the text from the PDF. Textract returns the results in a structured format: a list of blocks (pages, lines, words), each with its page number and position. For a multi-page PDF, the file must be in Amazon S3 and processed with the asynchronous operations.
- Analyze the extracted text to find the dates and titles of the reports. You can use regular expressions or similar pattern matching for this, running in an AWS Lambda function.
- PDF bookmarks can be created programmatically, but AWS doesn't offer a specific service for this. You would need to use a library or tool that supports this feature, such as PyPDF2 (now maintained as pypdf) or Apache PDFBox, within a Lambda function.
- Sort the reports in chronological order. This could also be done in the Lambda function.
- Finally, save the bookmarked and sorted PDF to a storage service such as Amazon S3.
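The parsing, sorting, and bookmarking steps could be sketched roughly as below. This is only an illustration under several assumptions: each report is assumed to start on a page whose first line looks like "Title - Month D, YYYY" (a made-up layout, so adjust the pattern to your real reports), the function names are hypothetical, and the bookmark step uses pypdf (the successor to PyPDF2), which would have to be packaged with the Lambda function. Pages that come before the first detected header are left out of the output.

```python
import re
from datetime import datetime

# Assumed header layout (hypothetical): "<Title> - <Month D, YYYY>"
HEADER_RE = re.compile(
    r"^(?P<title>.+?)\s+-\s+(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})$"
)


def group_lines_by_page(blocks):
    """Collect Textract LINE blocks into {page_number: [text, ...]}."""
    pages = {}
    for block in blocks:
        if block.get("BlockType") == "LINE":
            pages.setdefault(block.get("Page", 1), []).append(block["Text"])
    return pages


def find_reports(pages):
    """Return (date, title, first_page) for each page that opens with a header."""
    reports = []
    for page_number in sorted(pages):
        lines = pages[page_number]
        match = HEADER_RE.match(lines[0].strip()) if lines else None
        if match:
            report_date = datetime.strptime(match["date"], "%B %d, %Y").date()
            reports.append((report_date, match["title"], page_number))
    return reports


def plan_output(reports, total_pages):
    """Sort reports by date; return [(title, date, [source page numbers])]."""
    starts = sorted(first_page for _, _, first_page in reports)
    plan = []
    for report_date, title, first_page in sorted(reports):
        # A report runs until the page before the next detected header.
        later = [s for s in starts if s > first_page]
        last_page = later[0] - 1 if later else total_pages
        plan.append((title, report_date, list(range(first_page, last_page + 1))))
    return plan


def write_bookmarked_pdf(source_path, output_path, plan):
    """Copy pages in sorted order and add one bookmark per report."""
    from pypdf import PdfReader, PdfWriter  # third-party dependency

    reader = PdfReader(source_path)
    writer = PdfWriter()
    for title, report_date, source_pages in plan:
        first_output_page = len(writer.pages)  # 0-based index in the new file
        for page_number in source_pages:
            writer.add_page(reader.pages[page_number - 1])
        writer.add_outline_item(f"{report_date.isoformat()} {title}", first_output_page)
    with open(output_path, "wb") as handle:
        writer.write(handle)
```

Here `blocks` is meant to be the combined `Blocks` list from Textract's `get_document_text_detection` responses, and `total_pages` is the page count of the source PDF. In Lambda you would typically download the PDF from S3 to `/tmp`, call `write_bookmarked_pdf`, and upload the result back to S3.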