Custom entity recognition supports a maximum of 120000 samples.



When trying to create a custom entity recognizer I get the following message when attempting to train:

Custom entity recognition supports a maximum of 120000 samples.

The "samples" call out is vague as the data points of interest are documents (3800) and annotations (35539) neither of which are close to 120000. I had prior errors that I was able to fix so I know it's looking at the correct S3 locations.

Any direction about how to resolve this error would be appreciated, thanks!

asked 4 years ago · 72 views
4 Answers

The number of samples includes both positive and negative samples.

For example, consider a document 'doc1.txt':

Bob lives in Seattle, Washington.
George Washington (February 22, 1732 – December 14, 1799) was an American political leader, military general, statesman, and Founding Father who served as the first president of the United States (1789–1797).

Corresponding annotation:

File, Line, Begin Offset, End Offset, Type
doc1.txt, 0, 22, 32, STATE

So the example above contains 2 samples, not 1: the annotated first sentence is a positive sample and the unannotated second sentence is a negative sample.

Comprehend treats every sentence in the document that is not annotated as a negative sample, which helps improve model precision.
In the above example, "Washington" in the first sentence refers to a STATE, while in the second sentence it refers to a PERSON.

So, in your training corpus, there are 35539 annotated positive samples. All other sentences in the 3800 documents are taken as negative samples, and the combined total exceeds our limit of 120,000.
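To get a rough idea of how far over the limit a corpus is, the counting rule described above can be sketched in Python. This is only an estimate under stated assumptions: each non-empty line is treated as one sentence (one sample), the "Line" column is taken as a 0-based line index, and count_samples is a hypothetical helper name, not part of any AWS SDK. How Comprehend actually segments sentences internally is not stated here.

```python
import csv
import io


def count_samples(documents, annotations_csv):
    """Estimate positive/negative sample counts.

    documents: dict mapping file name -> document text.
    annotations_csv: annotation file contents, with the header
        "File, Line, Begin Offset, End Offset, Type".
    Assumption: one sentence per line, "Line" is a 0-based line index.
    """
    reader = csv.DictReader(io.StringIO(annotations_csv), skipinitialspace=True)
    # A line counts as positive if at least one annotation points at it.
    annotated = {(row["File"], int(row["Line"])) for row in reader}

    positive = negative = 0
    for name, text in documents.items():
        for index, line in enumerate(text.splitlines()):
            if not line.strip():
                continue  # skip blank lines (assumption: not counted as samples)
            if (name, index) in annotated:
                positive += 1
            else:
                negative += 1
    return {"positive": positive, "negative": negative, "total": positive + negative}


documents = {
    "doc1.txt": "Bob lives in Seattle, Washington.\n"
                "George Washington was an American political leader.\n"
}
annotations_csv = (
    "File, Line, Begin Offset, End Offset, Type\n"
    "doc1.txt, 0, 22, 32, STATE\n"
)
counts = count_samples(documents, annotations_csv)
print(counts)
```

For the doc1.txt example this reports 1 positive and 1 negative sample, 2 in total.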

Please remove negative samples that are not helpful as training data.
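One possible way to do this is sketched below, assuming one sentence per line and a 0-based "Line" index in the annotations. It keeps every annotated line, keeps a random subset of the unannotated lines, and renumbers the annotation line indexes to match the trimmed document. The function name downsample_negatives and the max_negatives parameter are hypothetical; random selection is only a placeholder, since you know best which negative sentences are worth keeping.

```python
import random


def downsample_negatives(lines, annotations, max_negatives, seed=0):
    """Trim unannotated lines from one document.

    lines: list of str, the document split into lines.
    annotations: list of dicts for this document, each with an int "Line".
    max_negatives: how many unannotated lines to keep (hypothetical knob).
    Returns (new_lines, new_annotations) with "Line" renumbered.
    """
    annotated = {a["Line"] for a in annotations}
    negatives = [i for i in range(len(lines)) if i not in annotated]

    # Seeded RNG so the same subset is chosen on every run.
    rng = random.Random(seed)
    kept_negatives = set(rng.sample(negatives, min(max_negatives, len(negatives))))

    # Preserve original order of the lines that survive.
    keep = [i for i in range(len(lines)) if i in annotated or i in kept_negatives]
    new_index = {old: new for new, old in enumerate(keep)}

    new_lines = [lines[i] for i in keep]
    new_annotations = [{**a, "Line": new_index[a["Line"]]} for a in annotations]
    return new_lines, new_annotations


lines = [
    "This sentence has no entity.",
    "Bob lives in Seattle, Washington.",
    "Neither does this one.",
    "Nor this one.",
]
annotations = [
    {"File": "doc1.txt", "Line": 1, "Begin Offset": 22, "End Offset": 32, "Type": "STATE"}
]
new_lines, new_annotations = downsample_negatives(lines, annotations, max_negatives=1)
print(len(new_lines), new_annotations)
```

Character offsets within a line are unchanged because lines are dropped whole; only the "Line" column needs renumbering.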

Please also note that Comprehend reads all input files found under the input-data-config location, including those in subdirectories, as training data.

aws comprehend start-entities-detection-job \
     --entity-recognizer-arn "entity recognizer arn" \
     --job-name "job name" \
     --data-access-role-arn "data access role arn" \
     --language-code en \
     --input-data-config "S3Uri=s3://my_training_data/" \
     --output-data-config "S3Uri=s3://my_model_output/" \
     --region region

aws s3 ls s3://my_training_data

2013-07-11 17:08:50 train.txt
2013-07-24 14:55:44 folder2

So my_training_data has a subdirectory called "folder2", which may or may not contain training data.
All files under folder2 will be treated as training data along with train.txt.

Please let us know if you have more questions.

answered 4 years ago

Thanks that explains the error!

answered 4 years ago

It would be so much better if the recognizer just stopped processing more samples at 120,000 without crashing the job. It's really hard to remove all negative samples from production data.

answered 3 years ago

We did consider this option and decided it is best to leave discarding negative samples to the user, who is more familiar with the data, rather than have us select which negative samples to keep. We are working on downsampling the negative samples and will keep you posted once this improvement is added.

Ravindra M

answered 3 years ago
