AWS Glue + Redshift: Issue reading "timestamp with time zone" values from Redshift in AWS Glue jobs

I have an AWS Glue job set up to read data from AWS Redshift over a JDBC connection. Here is what one value of the problematic column looks like in two different clients:

  • Column value from DBeaver: 2020-05-08 12:36:53.000 +0530
  • Column value from Redshift Query Editor: 2020-05-08 07:06:53+00

  • Data type in Redshift: timestamp with time zone
  • Data type in the AWS Glue catalog table: timestamp

I have a crawler that maps this column to timestamp, but when I run the AWS Glue job I get the exception below. The same code works fine when I read the values from a CSV file instead of the Redshift table; in that case the crawler maps the column to string. I also tried changing the type in the Glue catalog table to string, but that does not work either. The workarounds I have tried so far are listed below.
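For context, the job reads the table through the Glue Data Catalog and then converts it to a DataFrame, roughly like this (a minimal sketch; the database, table, and S3 path names are placeholders for my actual setup):

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the crawled Redshift table from the Glue Data Catalog (names are placeholders)
    medicare_dyf = glue_context.create_dynamic_frame.from_catalog(
        database="my_database",
        table_name="my_redshift_table",
        redshift_tmp_dir="s3://my-temp-bucket/temp/"
    )

    # Convert to a Spark DataFrame for the transformations below
    medicare_df = medicare_dyf.toDF()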

Someone else ran into a similar issue, but that post is quite old and has no working solution: https://github.com/databricks/spark-redshift/issues/391

Option 1: Cast the columns to string with resolveChoice:

    medicare_res_cast = medicare_dyf.resolveChoice(specs=[('updated_at', 'cast:string'), ('created_at', 'cast:string')])

Option 2: to_timestamp:

    medicare_df.withColumn("updated_at", to_timestamp("updated_at")).show(truncate=False)

Option 3: Split off the UTC offset:

    medicare_df.withColumn("updated_at", split(col("updated_at"), "+").getItem(0)).show()
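As an aside, PySpark's split() treats the separator as a regular expression, so splitting on a literal plus sign would need an escaped pattern. A variant of Option 3 that at least splits correctly (assuming updated_at could be read as a string in the first place):

    from pyspark.sql.functions import split, col

    # "+" is a regex metacharacter, so escape it to split on the literal plus sign
    medicare_df.withColumn("updated_at", split(col("updated_at"), "\\+").getItem(0)).show()

None of these attempts get that far, though: the stack trace below shows the NumberFormatException being thrown inside com.databricks.spark.redshift.Conversions while the rows are still being read from Redshift, i.e. before any of my transformations run.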


ERROR ProcessLauncher: Error from Python:Traceback (most recent call last):
  File "/tmp/etl-custom-redshift", line 36, in <module>
    medicare_df.withColumn("updated_at", split(col("updated_at"), "+").getItem(0)).show()
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 378, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o92.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.31.27.144, executor 1): java.lang.NumberFormatException: For input string: "53+00"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:580)
	at java.lang.Integer.parseInt(Integer.java:615)
	at java.sql.Timestamp.valueOf(Timestamp.java:259)
	at com.databricks.spark.redshift.Conversions$$anonfun$1$$anonfun$apply$11.apply(Conversions.scala:108)
	at com.databricks.spark.redshift.Conversions$$anonfun$1$$anonfun$apply$11.apply(Conversions.scala:108)
	at com.databricks.spark.redshift.Conversions$$anonfun$createRowConverter$1.apply(Conversions.scala:120)
	at com.databricks.spark.redshift.Conversions$$anonfun$createRowConverter$1.apply(Conversions.scala:116)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:216)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:295)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:266)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	... 1 more
asked 2 years ago · 1428 views
1 Answer

Apparently there is a limitation with how Glue and Spark read this type, and the workaround is to use a temporary table. I have posted a complete solution for this here: https://www.youtube.com/watch?v=bfXy1fTtrH0
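In short, the idea is to stage the data in Redshift with the timestamptz columns cast to plain timestamps, and point the Glue job at that staging table instead of the original one. A rough sketch of how that can look (this is my own paraphrase, not code lifted from the video; the table names, connection details, and IAM role below are placeholders):

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # SQL to run in Redshift beforehand (e.g. from the query editor); shown here only
    # for reference. It stages the rows with the timestamptz columns cast to plain timestamps.
    staging_sql = """
        CREATE TABLE public.medicare_staging AS
        SELECT id,
               created_at::timestamp AS created_at,
               updated_at::timestamp AS updated_at
        FROM public.medicare;
    """

    # Once the staging table exists, read it instead of the original table.
    medicare_dyf = glue_context.create_dynamic_frame.from_options(
        connection_type="redshift",
        connection_options={
            "url": "jdbc:redshift://your-cluster:5439/your_db",
            "dbtable": "public.medicare_staging",
            "user": "your_user",
            "password": "your_password",
            "redshiftTmpDir": "s3://your-temp-bucket/temp/",
            "aws_iam_role": "arn:aws:iam::123456789012:role/your-redshift-role",
        },
    )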

answered 2 years ago
