
Questions tagged with Amazon Comprehend



Can Amazon Comprehend extract data from documents?

Hi! My team and I have the following scenario: we want to extract certain fields from several PDF documents that may or may not follow the same pattern. For example, say we want to extract these three fields from the documents: ![Enter image description here](/media/postImages/original/IMRcmS97dmRTm4ZhZJRzLkbQ)

So we have a Name, a Code (called CNPJ) for this person, and their Address. Naturally, these fields vary between documents, but the CNPJ always keeps its format; only the sequence of numbers changes.

During our research into this challenge, we came across Amazon Comprehend and its Custom Named Entity Recognition. Our idea was to create these three entities - Name, CNPJ, and Address - using a Ground Truth labeling job. To do this, we ran some of our PDFs through Textract, generating a .txt file for each one, and uploaded those files to an S3 bucket. We then created the labeling job, using Automated data setup to generate the input manifest file so the labeling could start.

What happened was that, because I input many .txt files, each line in each file was recognized as a separate object, resulting in more than 7,700 objects to label. Approximately 90% of those objects had nothing to label, so I had to keep skipping lines until I reached an object that actually needed a label, and the sheer number of objects also drove the cost up considerably.

So, I have a few questions. For starters, was Amazon Comprehend a good choice for this job? If it wasn't, what would be the best solution? If it was, what could I have done to optimize the labeling job? Were the "useless" objects really necessary?
1 answer · 0 votes · 23 views · asked a month ago
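One way to avoid the one-object-per-line behavior described in the question above (offered as a sketch of a workaround, not an official recipe) is to skip Automated data setup and build the Ground Truth input manifest yourself, collapsing each Textract .txt into a single `source` entry so that each document becomes one labeling object. The bucket, prefix, and key names below are placeholders:

```python
# Sketch: build a Ground Truth input manifest with one labeling object
# per document instead of one per line. Names are placeholders.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-textract-output"   # placeholder bucket name
PREFIX = "txt/"                 # placeholder prefix for the Textract .txt files

manifest_lines = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if not obj["Key"].endswith(".txt"):
            continue
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        # Collapse line breaks so the whole document becomes a single
        # labeling object instead of one object per line.
        text = " ".join(body.decode("utf-8").split())
        manifest_lines.append(json.dumps({"source": text}))

# One JSON line per document; point the labeling job at this manifest.
s3.put_object(
    Bucket=BUCKET,
    Key="manifests/input.manifest",
    Body="\n".join(manifest_lines).encode("utf-8"),
)
```

Individual labeling tasks have their own size limits, so very long documents may still need to be split, but at paragraph or page granularity rather than line by line.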

How to overcome document (pdf) size limit with Comprehend

Hello community, I am attempting to implement the tutorial found in the Kendra documentation (p. 245) to create an intelligent search tool, and the first step after storing the data in S3 is leveraging AWS Comprehend entities analysis. I'm using my own data instead of the tutorial's, to test a real-world use case, and I'm finding the file size limit to be quite ridiculous* (a 1 MB cap on PDF or Word docs), according to both the documentation and the error I got on my first attempt: "SINGLE_FILE_SIZE_LIMIT_EXCEEDED, etc. 1048576 bytes allowed for ONE_DOC_PER_FILE format". I put an asterisk next to "ridiculous" because I suppose this is relative, but I would tend to believe that most real-world applications have larger documents, to say nothing of the limits on most other asynchronous operations.

I'm someone with some practical ML programming experience in Python, so when looking at possible workarounds or solutions a couple of things came to mind:

* Use the CreateEntityRecognizer API along with the Python/boto3 SDK - I'm not sure this would work or be any different; according to the documentation it appears this falls under custom entity recognition.
* Do my own portion of the solution in Python with something like a tokenizer - but if I'm doing that, I might as well do most of my work outside of any AWS ML platform...
* The KISS approach: simply "chunk" up my PDF docs so that they are all under the 1 MB cap, ensuring to keep context intact while doing so.

Any thoughts, comments, or suggestions are appreciated. Thanks!
2 answers · 0 votes · 55 views · asked 3 months ago
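On the third (KISS) option from the question above, here is a minimal chunking sketch, assuming the PDF text has already been extracted to a string (e.g., with Textract) and that no single paragraph exceeds the cap; `chunk_text` is a hypothetical helper, not an AWS API:

```python
# Sketch: split extracted text into pieces that stay under Comprehend's
# 1 MB (1,048,576-byte) per-document limit, breaking at paragraph
# boundaries to keep context intact.
MAX_BYTES = 1_048_576

def chunk_text(text: str, max_bytes: int = MAX_BYTES) -> list[str]:
    chunks, current, size = [], [], 0
    for para in text.split("\n\n"):
        para_bytes = len(para.encode("utf-8")) + 2  # account for separator
        # Assumes no single paragraph is itself larger than max_bytes.
        if size + para_bytes > max_bytes and current:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += para_bytes
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on blank lines keeps paragraphs intact; splitting on page boundaries instead would preserve even more context if the extraction step records them.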

Scaling AWS Step Functions and Comprehend jobs under the concurrent active asynchronous jobs quota

1. I am trying to implement a solution that integrates AWS Comprehend targeted sentiment with Step Functions, and then make it public for people to use as an API.
2. I need to wait until the job is complete before moving forward with the workflow. Since the Comprehend job is asynchronous, **I created a wait-time poller to periodically check the job status** using describe_targeted_sentiment_detection_job, following an integration pattern similar to https://docs.aws.amazon.com/step-functions/latest/dg/sample-project-job-poller.html.
3. However, there seems to be a **concurrent active asynchronous jobs quota of 10 jobs** according to https://docs.aws.amazon.com/comprehend/latest/dg/guidelines-and-limits.html#limits-active-jobs. If this is the case, I was thinking of **creating another poller to check whether Comprehend is free to run a targeted sentiment job before starting another one**.
4. Given that Step Functions charges for each polling cycle, and that there is a concurrent job limit of 10, I am worried about the backlog and the costs that could build up if many executions were started. For example, if 1,000 workflows are started, workflow number 1,000 will have to poll for an available Comprehend job slot for a long time.

Does anyone know of a solution to get around the concurrent active asynchronous jobs quota, or to reduce the cost of Step Functions continually polling for a long time?
1 answer · 0 votes · 51 views · asked 3 months ago
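For the polling step described in the question above, here is a minimal sketch of the Lambda a Task state might invoke between Wait states; the handler name and event shape are assumptions, and only describe_targeted_sentiment_detection_job is a real Comprehend API:

```python
# Sketch: a Lambda handler that checks one targeted-sentiment job's
# status so the state machine can loop on a Wait state until it finishes.
import boto3

comprehend = boto3.client("comprehend")

def lambda_handler(event, context):
    job_id = event["JobId"]  # assumed to be passed in by the state machine
    props = comprehend.describe_targeted_sentiment_detection_job(
        JobId=job_id
    )["TargetedSentimentDetectionJobProperties"]
    # JobStatus is one of SUBMITTED | IN_PROGRESS | COMPLETED | FAILED |
    # STOP_REQUESTED | STOPPED; a Choice state branches on this value.
    return {"JobId": job_id, "JobStatus": props["JobStatus"]}
```

Since Standard Workflows bill per state transition, lengthening the Wait interval between polls is the simplest lever for reducing the cost of long-running polling loops.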