Custom entity recognition supports a maximum of 120000 samples.


Hello,

When trying to create a custom entity recognizer I get the following message when attempting to train:

Custom entity recognition supports a maximum of 120000 samples.

The "samples" call out is vague as the data points of interest are documents (3800) and annotations (35539) neither of which are close to 120000. I had prior errors that I was able to fix so I know it's looking at the correct S3 locations.

Any direction about how to resolve this error would be appreciated, thanks!

Asked 5 years ago, 352 views

4 Answers

Hi,
The number of samples includes both positive and negative samples.

For example, consider a document 'doc1.txt':

Bob lives in Seattle, Washington.
George Washington (February 22, 1732 – December 14, 1799) was an American political leader, military general, statesman, and Founding Father who served as the first president of the United States (1789–1797).

Corresponding annotation:

File, Line, Begin Offset, End Offset, Type
doc1.txt, 0, 22, 32, STATE

So, in the above example we have 2 samples, not 1.

Comprehend treats every sentence in the document that is not annotated as a negative sample, which helps improve the model's precision.
In the above example, "Washington" in the first sentence refers to a STATE, while "Washington" in the second sentence refers to a PERSON.

So, in your training corpus there are 35,539 positive samples, which are annotated. All other sentences across the 3,800 documents are taken as negative samples, and the total exceeds our limit of 120,000.
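If you want a rough local estimate of the total, a minimal sketch (assuming the documents are plain-text files with one sentence per line, as in the example above; the bucket name and local paths below are placeholders) is to count lines across all documents, since every line counts as one sample whether or not it is annotated:

aws s3 sync s3://my_training_data/ ./training_docs/
find ./training_docs -name '*.txt' -print0 | xargs -0 cat | wc -l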

Please remove negative samples that are not helpful as training data.
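If it helps, here is a minimal sketch of one way to do that (the bucket, folder, and file names are placeholders, and it assumes your annotations file uses the "File, Line, Begin Offset, End Offset, Type" header shown above): keep only the documents that carry at least one annotation, since documents with no annotations contribute nothing but negative samples. Trimming un-annotated lines inside a document would reduce the count further, but then the Line column of the annotations file would have to be remapped as well.

aws s3 sync s3://my_training_data/ ./training_docs/
# Collect the distinct file names that appear in the annotations file (skip the header)
cut -d',' -f1 annotations.csv | tail -n +2 | sort -u > annotated_files.txt
# Copy only those documents into a new folder and upload it as the new training prefix
mkdir -p filtered_docs
while read -r f; do cp "./training_docs/$f" ./filtered_docs/; done < annotated_files.txt
aws s3 sync ./filtered_docs/ s3://my_filtered_training_data/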

Please also note that Comprehend reads all the input files present in the input-data-config, including those in subdirectories, as training data. For example:

aws comprehend start-entities-detection-job \
     --entity-recognizer-arn "entity recognizer arn" \
     --job-name "job name" \
     --data-access-role-arn "data access role arn" \
     --language-code en \
     --input-data-config "S3Uri=s3://my_training_data/" \
     --output-data-config "S3Uri=s3://my_model_output/" \
     --region region

aws s3 ls s3://my_training_data/

2013-07-11 17:08:50 train.txt
2013-07-24 14:55:44 folder2

So, my_training_data has a subdirectory called "folder2", which may or may not contain training data.
All the files under folder2 will be considered part of the training data along with train.txt.
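If you are not sure what is inside the prefix, you can list it recursively to see every object that will be picked up (bucket name from the example above):

aws s3 ls s3://my_training_data/ --recursive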

Please let us know if you have more questions.

Answered 5 years ago

Thanks, that explains the error!

Answered 5 years ago

It would be so much better if the recognizer just stopped processing additional samples at 120,000 instead of failing the job. It's really hard to remove all negative samples from production data.

Ced
Answered 5 years ago

Hi,
We did consider this option and decided it is best to leave discarding the negative samples to the user, who is more familiar with the data, rather than have us select them. We are working on downsampling the negative samples and will keep you posted once this improvement is added.

Thanks,
Ravindra M

Answered 5 years ago
