UTF-8 Characters not well supported in S3 Copy Operation

0

Hello community.

We are utilizing the COPY operation for loading files from S3 to Redshift, using the Airflow's S3ToRedshiftOperator. We've been observing issues when loading a few characters which are UTF-8 but are not getting recognized by Redshift. For example, for the character é present in our file, the copy operation is erroring out and writing this record to the stl_load_errors table, saying it's not a valid UTF-8 character. We've already put the 'ENCODING AS UTF8' parameter in the 'Copy Options' section. Can someone please take a look into this? Thanks in advance!

asked a year ago1129 views
1 Answer
0

Columns with a CHAR data type only accept single-byte UTF-8 characters, up to byte value 127, or 7F hex, which is also the ASCII character set. VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes. For more information, see Character types.

I tested loading both the characters you provided as examples é using copy command and am successfully able to load them into varchar column. You cannot load them into a column with char datatype.

If a line in your load data contains a character that is not valid for the column data type, COPY returns an error and logs a row in the STL_LOAD_ERRORS system log table with error number 1220. The ERR_REASON field includes the byte sequence, in hex, for the invalid character.

An alternative to fixing not valid characters in your load data is to replace the not valid characters during the load process. To replace not valid UTF-8 characters, specify the ACCEPTINVCHARS option with the COPY command. If the ACCEPTINVCHARS option is set, the character you specify replaces the code point. If the ACCEPTINVCHARS option isn't set, Amazon Redshift accepts the characters as valid UTF-8. For more information, see ACCEPTINVCHARS.

The following list of code points are valid UTF-8, COPY operations don't return an error if the ACCEPTINVCHARS option is not set. However, these code points are not valid characters. You can use the ACCEPTINVCHARS option to replace a code point with a character that you specify. These code points include the range of values from 0xFDD0 to 0xFDEF and values up to 0x10FFFF, ending with FFFE or FFFF:

0xFFFE, 0x1FFFE, 0x2FFFE, …, 0xFFFFE, 0x10FFFE

0xFFFF, 0x1FFFF, 0x2FFFF, …, 0xFFFFF, 0x10FFFF

The following example shows the error reason when COPY attempts to load UTF-8 character e0 a1 c7a4 into a CHAR column.

Multibyte character not supported for CHAR (Hint: Try using VARCHAR). Invalid char: e0 a1 c7a4

If the error is related to a VARCHAR data type, the error reason includes an error code as well as the not valid UTF-8 hex sequence. The following example shows the error reason when COPY attempts to load UTF-8 a4 into a VARCHAR field.

String contains invalid or unsupported UTF-8 codepoints. Bad UTF-8 hex sequence: a4 (error 3)

The following lists the descriptions and suggested workarounds for VARCHAR load errors. If one of these errors occurs, replace the character with a valid UTF-8 code sequence or remove the character.

Error codeDescription
1The UTF-8 byte sequence exceeds the four-byte maximum supported by VARCHAR.
2The UTF-8 byte sequence is incomplete. COPY did not find the expected number of continuation bytes for a multibyte character before the end of the string.
3The UTF-8 single-byte character is out of range. The starting byte must not be 254, 255 or any character between 128 and 191 (inclusive).
4The value of the trailing byte in the byte sequence is out of range. The continuation byte must be between 128 and 191 (inclusive).
5The UTF-8 character is reserved as a surrogate. Surrogate code points (U+D800 through U+DFFF) are not valid.
8The byte sequence exceeds the maximum UTF-8 code point.
9The UTF-8 byte sequence does not have a matching code point.
AWS
answered 10 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions