
AWS Comprehend training data csv decoding error : Bad data: b'\x96'., exit code: 1


Error message: The file webform-training-list-utf-csv-note.csv could not be decoded as valid utf-8 at position 6357 to 6358. Bad data: b'\x96'., exit code: 1

My training data contains special characters and line breaks. I have already removed "," and "|". Is there anything else I have to watch out for when preparing the data? Are there other characters I should remove, or any other requirements?

asked 2 years ago · 268 views
2 Answers
Accepted Answer

Hi,

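b'\x96' is not a valid UTF-8 byte sequence on its own. Hence the error message: you specified that your file is UTF-8 encoded, but it contains this byte.

b'\x96' is an en dash ('–') in Windows-1252 (often loosely called latin1), which is a different character from the plain hyphen ('-'). So your file is most likely Windows-1252 rather than UTF-8, and you will want to treat it as such, for example by converting it to UTF-8 before giving it to Comprehend.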
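A minimal sketch of such a conversion in Python, assuming the file really is Windows-1252 (`cp1252` in Python). The function name and the default encoding are illustrative assumptions, not anything Comprehend prescribes:

```python
from pathlib import Path


def reencode_to_utf8(src: str, dst: str, source_encoding: str = "cp1252") -> None:
    """Read src as source_encoding and write the same text to dst as UTF-8.

    source_encoding is an assumption about the input file; decoding raises
    UnicodeDecodeError if the file contains bytes that encoding does not define.
    """
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

Writing to a separate output file keeps the original intact, so you can compare the two if the result looks corrupted.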

Best,

Didier

EXPERT
answered 2 years ago
EXPERT
reviewed 2 years ago
  • The required format for AWS Comprehend is CSV UTF-8. I tried (1) removing all '-' characters, but I still get the same error message, and (2) saving the file as UTF-8, but that corrupts parts of the file. Any other advice on how to deal with this?

    I'm analyzing comments left on the enquiry form. I'm trying to train a model and then run an asynchronous analysis of a larger dataset.

    • which is another large CSV with possibly more non-UTF-8 data
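For a larger file with an unknown number of bad bytes, it can help to list every offending position before deciding how to fix them. The following is a rough sketch; `find_invalid_utf8` is a hypothetical helper name, and it works on raw bytes read with `open(path, "rb")`:

```python
def find_invalid_utf8(data: bytes) -> list[tuple[int, bytes]]:
    """Return (byte offset, offending bytes) for every span that is not valid UTF-8."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            # Decoding the remainder succeeds only if no bad bytes are left.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice, so shift by pos.
            start, end = pos + err.start, pos + err.end
            bad.append((start, data[start:end]))
            pos = end  # resume scanning just past the bad span
    return bad
```

An empty result means the bytes decode cleanly as UTF-8. The slicing makes this slow when a big file has many bad bytes, so treat it as a diagnostic, not a production tool.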

Thanks for the quick response, awesome! Are there any formatting guidelines for CSV that we can follow like removing these symbols?
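Not an official guideline, but one common approach is to let a CSV library quote fields so that commas do not have to be stripped by hand. The sketch below assumes a two-column layout (label, then text) and assumes each record should stay on one physical line, so it collapses line breaks into spaces. Both assumptions should be checked against the Comprehend documentation for your job type, and `write_training_csv` is a hypothetical name:

```python
import csv
import io


def write_training_csv(rows: list[tuple[str, str]]) -> bytes:
    """Serialize (label, text) pairs as UTF-8 CSV bytes, one record per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for label, text in rows:
        # Collapse line breaks and repeated whitespace into single spaces.
        one_line = " ".join(text.split())
        # The writer quotes any field that contains a comma or a quote.
        writer.writerow([label, one_line])
    return buf.getvalue().encode("utf-8")
```

For example, the pair `("ENQUIRY", "Hello, world\nsecond line")` becomes the single line `ENQUIRY,"Hello, world second line"`.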

answered 2 years ago
