glueContext handling java.io.CharConversionException with db2 driver

We are receiving a java.io.CharConversionException when trying to read data from a DB2 database that contains characters that are not valid UTF-8. I tried adding an option("encoding", "ISO-8859-1") and option("charset", "ISO-8859-1"), but neither seems to have any effect at all. Is it possible to ask the glueContext to use a specific type of encoding?

If not, what options do we have for handling the characters that throw the CharConversionException? We have been excluding the affected rows through SQL, but this is not a tenable solution.
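
To illustrate, the options were added to a JDBC-style read along these lines (simplified; the connection details below are placeholders, not our real values):

# 'spark' is the SparkSession obtained from glueContext.spark_session in the job script.
# url, dbtable and credentials are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:db2://<host>:<port>/<database>")
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("dbtable", "<schema>.<table>")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("encoding", "ISO-8859-1")   # appears to have no effect
    .option("charset", "ISO-8859-1")    # likewise ignored
    .load()
)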

asked a year ago
4 Answers

Please check if the format below works for you. I have set other options in a similar way in Glue 3.0:

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Set the DB2 JCC system property on both the driver and executor JVMs
# before the SparkContext is created
sconf = SparkConf()
sconf.setAll([
    ('spark.executor.extraJavaOptions', '-Ddb2.jcc.charsetDecoderEncoder=3'),
    ('spark.driver.extraJavaOptions', '-Ddb2.jcc.charsetDecoderEncoder=3')
])

sc = SparkContext(conf=sconf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
answered a year ago
  • We have this in our code already, but unfortunately it does not have any effect. We have been able to set 'spark.sql.adaptive.enabled' to false, and that worked, but the code above does not.

    I believe the difference might be that the Spark SQL setting acts on the already-running cluster, whereas the Java-level JVM options need to be applied before the Spark cluster is actually instantiated. But this is conjecture, as I do not know how Glue 2/3 initializes its cluster. All I know is that it starts up faster than Glue 1, which makes me think it is pulled from a "warm" image.

See the IBM Support article for more details on this CharConversionException error: https://www.ibm.com/support/pages/sqlexception-message-caught-javaiocharconversionexception-and-errorcode-4220

After setting the parameters

--conf 'spark.executor.extraJavaOptions=-Ddb2.jcc.charsetDecoderEncoder=3' 
--conf 'spark.driver.extraJavaOptions=-Ddb2.jcc.charsetDecoderEncoder=3'

an exception will no longer be thrown when a non-UTF-8 character is encountered; instead, the character will be substituted with the Unicode replacement character.
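
As a rough illustration of that substitution behaviour (plain Python, not the DB2 driver itself, so treat it only as an analogy):

# Analogy only: Python's "replace" error handler does roughly what the JCC driver
# does with db2.jcc.charsetDecoderEncoder=3, turning unmappable bytes into U+FFFD.
raw = b"Caf\xe9"                                  # Latin-1 bytes, not valid UTF-8
print(raw.decode("utf-8", errors="replace"))      # prints 'Caf\ufffd' instead of raising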

answered a year ago

Hi @ananthtm, thanks for the reply. We have previously come across this solution; however, the challenge we found was that unless we used Glue v1.0 we were unable to set these specific parameters. We tried with both 2.0 and 3.0 to set these values in the configuration, but they were ignored. I believe the issue may be that the VMs for Glue 2.0/3.0 are kept in some kind of warm state and are not loaded from a "cold state", so there is no point at which these configuration parameters are pulled in and applied to the Spark cluster.

If you know of a way that we can set these configuration values in our Glue 3.0 (or, in future, 4.0) settings, then we would be able to solve multiple issues we have been having with illegal characters.

Lastly, the reason we are using v3.0 is that v1.0 is significantly slower to start up. Is there a possibility that these Spark configuration options can be configured at the AWS account or job level?

answered a year ago

I have just worked on this issue, and adding the following job parameter helped read the data without any charset error:

--java-options -Ddb2.jcc.charsetDecoderEncoder=3
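
If you would rather pass it per run than edit the job definition, a minimal sketch with boto3 (the job name is a placeholder, and passing it as a run argument rather than a default argument is an assumption on my part):

import boto3

glue = boto3.client("glue")
# Pass the same job parameter as a run argument for a single execution
glue.start_job_run(
    JobName="my-db2-job",
    Arguments={"--java-options": "-Ddb2.jcc.charsetDecoderEncoder=3"},
)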

Thank you!

Aravind
answered a month ago
