Hi,
The number of samples includes both positive and negative samples.
For example, consider a document 'doc1.txt':
Bob lives in Seattle, Washington.
George Washington (February 22, 1732 – December 14, 1799) was an American political leader, military general, statesman, and Founding Father who served as the first president of the United States (1789–1797).
Corresponding annotation:
File, Line, Begin Offset, End Offset, Type
doc1.txt, 0, 22, 32, STATE
So, in the above example, we have 2 samples rather than 1.
Comprehend treats every sentence in the document that is not annotated as a negative sample, which helps to improve the model's precision.
In the above example, "Washington" in the first sentence refers to a STATE, while in the second sentence it refers to a PERSON.
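To make the counting concrete, here is a minimal sketch, not Comprehend's actual implementation. It assumes one sentence per line and an annotations CSV with the header shown above; the function name `count_samples` is hypothetical, and the second sentence is abbreviated.

```python
import csv
import io


def count_samples(documents, annotations_csv):
    """Return (positive, negative) sample counts.

    documents: dict mapping file name -> document text (one sentence per line).
    annotations_csv: CSV text with columns File, Line, Begin Offset, End Offset, Type.
    Assumption: every annotation points at an existing, non-blank line.
    """
    # A line counts as positive if it carries at least one annotation.
    annotated = set()
    reader = csv.DictReader(io.StringIO(annotations_csv), skipinitialspace=True)
    for row in reader:
        annotated.add((row["File"], int(row["Line"])))

    # Every non-blank line is a sample; unannotated ones are negative.
    total = sum(
        1
        for text in documents.values()
        for line in text.splitlines()
        if line.strip()
    )
    positive = len(annotated)
    return positive, total - positive


docs = {
    "doc1.txt": "Bob lives in Seattle, Washington.\n"
                "George Washington was the first president of the United States.\n"
}
annotations = (
    "File, Line, Begin Offset, End Offset, Type\n"
    "doc1.txt, 0, 22, 32, STATE\n"
)
print(count_samples(docs, annotations))  # (1, 1): 2 samples in total
```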
So, in your training corpus, there are 35,539 positive samples, i.e. annotated sentences. All other sentences in the 3,800 documents are taken as negative samples, which exceeds our limit of 120,000.
Please remove the negative samples that are not helpful as training data.
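One possible way to do this offline is sketched below, assuming one sentence per line and annotations held as `(line, begin, end, type)` tuples for a single file. The helper `prune_negatives` and its `keep` parameter are hypothetical names for illustration. Because the annotation file refers to line numbers, these must be renumbered after lines are dropped.

```python
def prune_negatives(lines, annotations, keep=frozenset()):
    """Drop unannotated lines from one file and renumber its annotations.

    lines: list of sentences (one per line) for a single file.
    annotations: list of (line, begin_offset, end_offset, entity_type) tuples.
    keep: indices of unannotated lines to retain as negative samples.
    Assumption: every annotation points at an existing line.
    """
    annotated = {line for (line, _begin, _end, _type) in annotations}
    kept = [i for i in range(len(lines)) if i in annotated or i in keep]

    # Map old line numbers to their position in the pruned file.
    new_index = {old: new for new, old in enumerate(kept)}

    new_lines = [lines[i] for i in kept]
    new_annotations = [
        (new_index[line], begin, end, entity_type)
        for (line, begin, end, entity_type) in annotations
    ]
    return new_lines, new_annotations


lines = [
    "Unrelated introduction.",
    "Bob lives in Seattle, Washington.",
    "George Washington was the first president of the United States.",
]
annotations = [(1, 22, 32, "STATE")]

# Keep the annotated line plus line 2 as a deliberately chosen negative sample.
pruned_lines, pruned_annotations = prune_negatives(lines, annotations, keep={2})
print(pruned_annotations)  # [(0, 22, 32, 'STATE')]
```

Character offsets within a line are unchanged, so only the line column needs updating.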
Please also note that Comprehend reads all the input files under the S3 location given in input-data-config, including those in subdirectories, as training data.
Example:
aws comprehend start-entities-detection-job \
--entity-recognizer-arn "entity recognizer arn" \
--job-name "job name" \
--data-access-role-arn "data access role arn" \
--language-code en \
--input-data-config "S3Uri=s3://my_training_data/" \
--output-data-config "S3Uri=s3://my_model_output/" \
--region region
aws s3 ls s3://my_training_data
2013-07-11 17:08:50 train.txt
2013-07-24 14:55:44 folder2
So, my_training_data has a subdirectory called "folder2", which may or may not contain training data.
All the files under folder2 will be considered part of the training data, along with train.txt.
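To check up front what sits under the input location, a short sketch like the following can list every object key below a prefix. It assumes a boto3 S3 client is passed in; the function name `list_training_keys` is hypothetical. It only lists keys and does not reproduce how Comprehend itself selects files.

```python
def list_training_keys(s3_client, bucket, prefix=""):
    """Return every object key under the prefix, including nested 'folders'.

    s3_client: a boto3 S3 client, e.g. boto3.client("s3").
    """
    keys = []
    # list_objects_v2 returns at most 1000 keys per call, so paginate.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   for key in list_training_keys(boto3.client("s3"), "my_training_data"):
#       print(key)
```

Any key that shows up here but is not meant as training data, such as files under folder2, can then be moved to a different location before starting the job.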
Please let us know if you have more questions.
It would be so much better if the recognizer just stopped processing more samples at 120,000 without crashing the job. It's really hard to remove all negative samples from production data.
Hi,
We did consider this option, but decided it is best to leave discarding negative samples to the user, who is more familiar with the data, rather than have us select the negative samples. We are working on downsampling the negative samples and will keep you posted once this improvement is added.
Thanks,
Ravindra M