Hello community,
I am working through the tutorial in the Kendra documentation (p. 245) to create an intelligent search tool, and the first step after storing the data in S3 is running AWS Comprehend entity analysis. I'm using my own data instead of the tutorial's, to test a real-world use case, and I'm finding the file size limit quite ridiculous* (a 1 MB cap on PDF or Word docs), both per the documentation and per the error I got on my first attempt: "SINGLE_FILE_SIZE_LIMIT_EXCEEDED, etc. 1048576 bytes allowed for ONE_DOC_PER_FILE format".
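For anyone hitting the same error, a quick pre-flight check of object sizes before starting the async job can save a failed run. A minimal sketch, where the bucket and prefix are placeholders and the 1048576-byte limit comes straight from the error message above:

```python
# Flag S3 objects that exceed Comprehend's per-file size limit before
# submitting an async entities-detection job in ONE_DOC_PER_FILE format.

SINGLE_FILE_LIMIT = 1048576  # bytes, per the SINGLE_FILE_SIZE_LIMIT_EXCEEDED error


def oversized(objects, limit=SINGLE_FILE_LIMIT):
    """Return the keys whose size exceeds the limit.

    `objects` is an iterable of (key, size_in_bytes) pairs, e.g. built
    from the 'Key'/'Size' fields of s3.list_objects_v2 responses.
    """
    return [key for key, size in objects if size > limit]


def check_bucket(bucket, prefix=""):
    """List objects under a prefix and report any that are too large.

    Requires boto3 and AWS credentials; bucket/prefix are placeholders.
    """
    import boto3  # imported here so the helper above stays dependency-free

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    objects = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            objects.append((obj["Key"], obj["Size"]))
    return oversized(objects)


if __name__ == "__main__":
    # Example with local data; swap in check_bucket("my-bucket") for S3.
    sample = [("a.pdf", 900_000), ("b.pdf", 2_500_000)]
    print(oversized(sample))  # only b.pdf exceeds the 1 MB cap
```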
I put an asterisk next to "ridiculous" because I suppose this is relative, but I would tend to believe that most real-world applications have larger documents. Not to mention the limits on most other asynchronous operations. I have some practical programming experience with ML in Python, so when looking at possible workarounds or solutions, a couple of things came to mind:
- Use the CreateEntityRecognizer API with the Python/boto3 SDK - I'm not sure this would work or be any different; according to the documentation it falls under custom entity recognition
- Do my own solutioning in Python with something like a tokenizer - but if I'm doing that, I might as well do most of my work outside of any AWS ML platform...
- The KISS approach: simply "chunk" up my PDF docs so that each piece is under the 1 MB cap, taking care to keep context intact
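If the KISS route wins out, the trick is splitting on natural boundaries rather than raw bytes. A rough sketch of option 3, assuming the PDF text has already been extracted (splitting at paragraph boundaries is my own choice for keeping context intact, not anything from the Comprehend docs):

```python
# Split extracted document text into chunks that each stay under a
# byte cap when UTF-8 encoded, breaking only at paragraph boundaries.

def chunk_text(text, max_bytes=1048576):
    """Greedily pack paragraphs into chunks of at most max_bytes."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = current + "\n\n" + para if current else para
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para  # assumes no single paragraph exceeds the cap
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be uploaded to S3 as its own object; a predictable key scheme like `doc1_part0.txt`, `doc1_part1.txt` keeps the pieces easy to reassociate later.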
Any thoughts, comments, or suggestions are appreciated.
Thanks!
That last bit about associating all the "bits" of the broken-up documents together when using Kendra was something I was concerned about, so I think that answers that - with Comprehend, it seems the ends shall justify the means, if you will. I'll continue testing and see what happens. Thanks for your answer!
*Just wanted to update: the next roadblock is that although Comprehend will accept PDF files, it won't actually produce any metadata for them because they aren't UTF-8 formatted, which I found out the hard way after looking at the "output" file. So now I have to add an extra step to all of this and convert each PDF to UTF-8 text.
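In case it helps anyone else, that conversion step can be done locally before upload. A sketch using pypdf for the extraction (pypdf is just my choice here; any PDF text extractor would do, and extraction quality varies a lot by PDF):

```python
# Extract text from a PDF and write it out as a UTF-8 .txt file,
# so the input Comprehend sees is plain UTF-8 rather than a PDF.

def write_utf8(text, path):
    """Write text as UTF-8, replacing any characters that can't encode."""
    with open(path, "w", encoding="utf-8", errors="replace") as f:
        f.write(text)


def pdf_to_utf8(pdf_path, txt_path):
    """Convert one PDF into a UTF-8 text file (requires pypdf)."""
    from pypdf import PdfReader  # pip install pypdf

    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    write_utf8(text, txt_path)
```

The resulting .txt files can then go through the same chunking step if any of them are still over the 1 MB cap.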