Multi-label - Insufficient data.

0

When I upload a CSV for training with a multi-label classifier, AWS reports:

Insufficient data. Required: At least 10 examples for each label. Consider adding more training data.

I do have at least 10 examples, and in fact many more than that.

natted
asked 4 years ago, 484 views
9 Answers
0

I have had no luck here.

Does multi-label mode require at least ten examples for each "combination" of labels?

For example, my training data has ten categories with many thousands of training examples.

LABEL1,"document"
LABEL2,"document"
LABEL3,"document"
LABEL2|LABEL3,"document"

When labels are combined, some combinations may have fewer than ten examples. Is this what would cause the training to fail?

I can add more training data, but that seems a waste of time when the requirement is not even clear. Amazon should provide better documentation on the training files.

natted
answered 4 years ago
0

Hi,

When I ran into this issue, I resolved it by making sure my CSV had at least 10 examples of each label:

Label1, Label2, Label3, etc. should each have 10 or more occurrences in the CSV.

If any one of your labels has fewer than 10 occurrences, the training fails with the Insufficient Data message. In the example below, Label1 occurs 10 times but Label2 only 4 times:

Fail:
Label1|Label2
Label1|Label2
Label1|Label2
Label1|Label2
Label1
Label1
Label1
Label1
Label1
Label1
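
A quick way to check this before uploading is to count how often each individual label occurs. The sketch below is only illustrative: it assumes the two-column layout shown earlier in this thread (labels joined by `|` in the first column, document text in the second), and the function names are made up for this example.

```python
import csv
import io
from collections import Counter

MIN_EXAMPLES = 10  # threshold quoted in the "Insufficient data" message


def count_labels(rows, delimiter="|"):
    """Count how many rows mention each individual label."""
    counts = Counter()
    for row in rows:
        if not row:
            continue  # skip blank lines
        for label in row[0].split(delimiter):
            label = label.strip()
            if label:
                counts[label] += 1
    return counts


def labels_below_minimum(counts, minimum=MIN_EXAMPLES):
    """Return the labels that occur fewer than `minimum` times."""
    return sorted(label for label, n in counts.items() if n < minimum)


# The failing example from this answer: 4 combined rows, 6 Label1-only rows.
sample = "\n".join(['Label1|Label2,"document"'] * 4 + ['Label1,"document"'] * 6)
counts = count_labels(csv.reader(io.StringIO(sample)))
short = labels_below_minimum(counts)
print(dict(counts), short)
```

For a real file, pass `csv.reader(open(path, newline="", encoding="utf-8"))` instead of the in-memory sample, where `path` is your own training file.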

prav1
answered 4 years ago
0

Thanks. I finally found a rogue character in my file. Now resolved.
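
The thread does not say which character was the culprit, so the following is only a rough sketch of how one might hunt for candidates. It assumes the file is meant to be UTF-8 and treats invalid bytes, control characters, and invisible format characters (such as a byte order mark or zero-width space) as suspicious; whether any of these actually breaks training is not confirmed here. The function name is hypothetical.

```python
import unicodedata


def find_suspicious_characters(data: bytes):
    """Return (line, column, description) tuples for characters worth inspecting."""
    findings = []
    for line_no, raw in enumerate(data.split(b"\n"), start=1):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            # For undecodable lines the column is a byte offset, not a character offset.
            findings.append((line_no, exc.start + 1, "invalid UTF-8 byte"))
            continue
        for col, ch in enumerate(text, start=1):
            category = unicodedata.category(ch)
            # Cc = control characters, Cf = invisible format characters.
            if category in ("Cc", "Cf") and ch not in ("\t", "\r"):
                findings.append((line_no, col, f"U+{ord(ch):04X} ({category})"))
    return findings


# Made-up sample: a NUL byte, a byte order mark, and a Latin-1 byte in a UTF-8 file.
sample = (
    b'LABEL1,"ok"\n'
    b'LABEL2,"bad\x00text"\n'
    b'\xef\xbb\xbfLABEL3,"doc"\n'
    b'LABEL1,"caf\xe9"\n'
)
findings = find_suspicious_characters(sample)
for finding in findings:
    print(finding)
```

To scan a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the function.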

natted
answered 4 years ago
0

Hi,

I am from the Comprehend Engineering team. Can you please PM your accountID, region, and the name of the classifier that encountered the issue? We would like to improve the customer experience by detecting characters that trip up our training and telling the user where to look.

Thanks!
Seema Suresh

answered 4 years ago
0

I'm having a similar issue - I have enough labels, but the classifier training fails. What character was it that was causing an issue in the end, so I can check for that and remove it?

JDBaker
answered 3 years ago
0

Hello,
I started creating my first labelling job (crowd classifier, multi-select) using the SageMaker console, with the workforce already set up. The input data is a CSV file of free text (chats from a Twitter data set), and I added my own new labels. When I open the labelling tool for a preview before creating the job, it shows no error. But after I create the job and then open the labelling tool UI (crowd source),
I get the following error:
(Element type CROWD-CLASSIFIER-MULTI-SELECT): attributes should have required property 'categories'

Details:
My CSV input file has two columns (text_id, text):
text_id  text
1001     <text to be labelled by labelling job>

I added my own categories (labels) after creating the manifest file.

Any help on this issue is appreciated.
It looks like I am missing something basic here.

answered 3 years ago
0

Hi,

I am from the Comprehend engineering team.

It sounds like your issue is with SageMaker Ground Truth and not with Comprehend. Is that right?

Thanks,
Seema Suresh

answered 3 years ago
0

Hey,

As the question hasn't been answered in the forum yet: do we need 10 samples per combination for multi-label classification, or just 10 samples per label? For example, is it enough to have 10 samples for CLASS 1 and 10 for CLASS 2, or do I also need 10 samples for the combination of CLASS 1 and CLASS 2?

Thanks in advance

answered 3 years ago
