How to overcome the document (PDF) size limit with Comprehend


Hello community, I am attempting to implement the tutorial found in the Kendra documentation (p. 245) to create an intelligent search tool. The first step after storing the data in S3 is running Amazon Comprehend entity analysis. I'm using my own data instead of the tutorial's, to test a real-world use case, and I'm finding the file size limit to be quite ridiculous* (a 1 MB cap) on PDF or Word documents, according to both the documentation and the error I got on my first attempt: "SINGLE_FILE_SIZE_LIMIT_EXCEEDED, etc. 1048576 bytes allowed for ONE_DOC_PER_FILE format".

I put an asterisk next to ridiculous because I suppose this is relative, but I would tend to believe that most real-world applications deal with documents that are larger, not to mention the limits that apply to other operations, such as most of the other asynchronous APIs. I have some practical programming experience with ML in Python, so when looking at possible workarounds or solutions, a couple of things came to mind:

  • Use the CreateEntityRecognizer API along with the Python/boto3 SDK - I'm not sure this would work or be any different; according to the documentation, this appears to fall under custom entity recognition
  • Do my own portion of the solution in Python and use something like a tokenizer - but if I'm doing that, I might as well do most of my work outside of any AWS ML platform...
  • The KISS approach: simply "chunk" up my PDF documents so that they are all under the 1 MB cap, making sure to keep context intact while doing so (see the sketch after this list)
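For the KISS approach, here is a minimal sketch of what I have in mind, assuming the text has already been extracted from the PDF; the paragraph-boundary splitting and the names here are my own assumptions, not anything prescribed by AWS:

```python
# Hypothetical sketch: split extracted text into pieces under Comprehend's
# 1,048,576-byte ONE_DOC_PER_FILE limit, breaking on paragraph boundaries
# so context stays intact. A single paragraph larger than the limit would
# still need a further split, which this sketch does not handle.
MAX_BYTES = 1_048_576


def chunk_text(text: str, max_bytes: int = MAX_BYTES) -> list[str]:
    chunks, current, size = [], [], 0
    for para in text.split("\n\n"):
        para_bytes = len(para.encode("utf-8")) + 2  # +2 for the "\n\n" separator
        if current and size + para_bytes > max_bytes:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += para_bytes
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```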

Any thoughts, comments, or suggestions are appreciated.

Thanks!

asked 2 years ago · 1034 views
2 Answers

Amazon Comprehend limits are documented here: https://docs.aws.amazon.com/comprehend/latest/dg/guidelines-and-limits.html, and indeed for entity detection documents cannot be bigger than 1 MB. I would suggest you take your last approach, that is, splitting documents into chunks of at most 1 MB and performing entity detection on those. When building the Kendra index, you can then aggregate the entities for each chunk and associate them with the original document.
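One way the aggregation step could look, as a rough sketch: the chunk naming scheme ("<original>.chunkN.txt") and the assumption that per-chunk entity lists have already been parsed from the Comprehend job output are illustrative choices, not anything the service requires.

```python
# Hypothetical sketch: merge per-chunk Comprehend entity results back onto
# the original document, e.g. to feed a Kendra custom attribute.
import re
from collections import defaultdict

CHUNK_RE = re.compile(r"^(?P<doc>.+)\.chunk\d+\.txt$")


def aggregate_entities(chunk_results: dict[str, list[dict]]) -> dict[str, list[str]]:
    """chunk_results maps a chunk file name to its list of entity dicts,
    each carrying "Text" and "Type" keys as in Comprehend's output."""
    by_doc = defaultdict(set)
    for chunk_name, entities in chunk_results.items():
        match = CHUNK_RE.match(chunk_name)
        doc = match.group("doc") if match else chunk_name
        for ent in entities:
            by_doc[doc].add((ent["Type"], ent["Text"]))
    # Flatten to "TYPE:text" strings, deduplicated per original document
    return {doc: sorted(f"{t}:{x}" for t, x in ents) for doc, ents in by_doc.items()}
```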

AWS
EXPERT
answered 2 years ago
  • That last bit about associating all the "bits" of the broken-up documents together when using Kendra was something I was concerned about, so I think that answers it - with Comprehend, it seems the ends shall justify the means, if you will. I'll continue testing and see what happens. Thanks for your answer!

    *Just wanted to update: the next roadblock is that although Comprehend will accept PDF files, it won't actually produce any metadata for them because they aren't UTF-8 formatted, which I found out the hard way after looking at the "output" file. So now I have to add an extra step and convert every PDF to UTF-8 text.
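    For that conversion step, this is roughly what I'm planning, assuming pypdf for text extraction (any extractor would do, and scanned PDFs would need OCR such as Amazon Textract instead, since they contain no embedded text):

```python
# Hypothetical sketch: extract text from a PDF and write it out as a UTF-8
# plain-text file before uploading to S3 for Comprehend.
from pathlib import Path

from pypdf import PdfReader


def pdf_to_utf8(pdf_path: str, out_path: str) -> None:
    reader = PdfReader(pdf_path)
    text = "\n\n".join(page.extract_text() or "" for page in reader.pages)
    Path(out_path).write_text(text, encoding="utf-8")


pdf_to_utf8("report.pdf", "report.txt")
```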


According to https://docs.aws.amazon.com/comprehend/latest/dg/guidelines-and-limits.html#limits-custom-entity-recognition, the maximum document size for PDF and Word documents is 50 MB and 5 MB, respectively, and the maximum size for UTF-8 encoded plain-text documents is 1 MB.
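If it helps, a small local pre-flight check against those documented limits could look like the following; the helper and its mapping of extensions to limits are just an illustration, not an AWS API.

```python
# Hypothetical helper: reject files exceeding the per-format size limits
# cited above for custom entity recognition (50 MB PDF, 5 MB Word,
# 1 MB UTF-8 plain text). Purely a local sanity check before submission.
import os

LIMITS_BYTES = {
    ".pdf": 50 * 1024 * 1024,
    ".docx": 5 * 1024 * 1024,
    ".txt": 1 * 1024 * 1024,
}


def within_limit(path: str) -> bool:
    ext = os.path.splitext(path)[1].lower()
    limit = LIMITS_BYTES.get(ext)
    return limit is not None and os.path.getsize(path) <= limit
```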

AWS
answered 2 years ago
