Global outage event
If you're experiencing issues with your AWS services, then please refer to the AWS Health Dashboard. You can find the overall status of ongoing outages, the health of AWS services, and the latest updates from AWS engineers.
AWS Glue Job Redshift : Issue Reading Timestamp with timezone from Redshift in AWS Glue Jobs
I have an AWS Glue Job Setup which will read data from AWS Redshift using JDBC Connection.
- Column Value From DBeaver : 2020-05-08 12:36:53.000 +0530
- Column Value From RedShift Query Editor : 2020-05-08 07:06:53+00
Data Type in Redshift : timestamp with time zone Data Type in AWS Glue Catalog Table : timestamp
I have written a crawler job which is mapping that value to timestamp but when I am trying to process the AWS Glue Job I am getting below exception. I tried various. The code works fine I read the value from CSV instead of Redshift table. while reading from CSV the datatype being mapped by crawler job is string. I tried changed the glue catalog table to string but thats not working too.
This Guy also faced the similar issue but thats post is quite old without working solution https://github.com/databricks/spark-redshift/issues/391
Option 1 : Converting to string medicare_res_cast = medicare_dyf.resolveChoice(specs = [('updated_at','cast:string'),('created_at','cast:string')])
Option 2 : to_timestamp: df.withColumn("updated_at",to_timestamp("updated_at")).show(truncate=False)
Option 2 : split : df.withColumn("updated_at", split(col("updated_at"), "+").getItem(0)).show()
ERROR ProcessLauncher: Error from Python:Traceback (most recent call last):
File "/tmp/etl-custom-redshift", line 36, in <module>
medicare_df.withColumn("updated_at", split(col("updated_at"), "+").getItem(0)).show()
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 378, in show
print(self._jdf.showString(n, 20, vertical))
File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o92.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.31.27.144, executor 1): java.lang.NumberFormatException: For input string: "53+00"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at java.sql.Timestamp.valueOf(Timestamp.java:259)
at com.databricks.spark.redshift.Conversions$$anonfun$1$$anonfun$apply$11.apply(Conversions.scala:108)
at com.databricks.spark.redshift.Conversions$$anonfun$1$$anonfun$apply$11.apply(Conversions.scala:108)
at com.databricks.spark.redshift.Conversions$$anonfun$createRowConverter$1.apply(Conversions.scala:120)
at com.databricks.spark.redshift.Conversions$$anonfun$createRowConverter$1.apply(Conversions.scala:116)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:216)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:295)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:266)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
... 1 more
- Argomenti
- Analisi
- Lingua
- English
- Più recenti
- Maggior numero di voti
- Maggior numero di commenti
Apparently, there is a limitation with Glue and Spark and the workaround is to use a temporary table. I have posted a complete solution for the same https://www.youtube.com/watch?v=bfXy1fTtrH0
Contenuto pertinente
AWS UFFICIALEAggiornata un anno fa