Unable to read ORC file using Glue context or Spark

0

Hi,
I am trying to read an ORC file from an S3 bucket. The Glue Data Catalog reports the same schema shown below, and I can query the data via Athena without any issues. However, when I read or manipulate the ORC file with a Glue DynamicFrame (from the catalog) or with a Spark DataFrame, I get the error below.
Using Spark 2.4.
Glue Context code -
# Read the table from the Glue Data Catalog as a DynamicFrame
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "<database>", table_name = "<table>", transformation_ctx = "datasource0")
# Convert to a Spark DataFrame and drop the Hive ACID metadata columns
datasource0_df = datasource0.toDF()
datasource0_df = datasource0_df.drop("operation", "originaltransaction", "bucket", "rowid", "currenttransaction")

Spark code -
# Read the ORC file directly from S3 (path is a placeholder)
df = spark.read.format('orc').load('<S3 file path>')
df.printSchema()
df.show()

printSchema() shows the following schema -
root
|-- operation: integer (nullable = true)
|-- originalTransaction: long (nullable = true)
|-- bucket: integer (nullable = true)
|-- rowId: long (nullable = true)
|-- currentTransaction: long (nullable = true)
|-- row: struct (nullable = true)
| |-- _col0: string (nullable = true)
| |-- _col1: string (nullable = true)
| |-- _col2: string (nullable = true)
| |-- _col3: timestamp (nullable = true)
| |-- _col4: timestamp (nullable = true)
| |-- _col5: string (nullable = true)
| |-- _col6: string (nullable = true)
| |-- _col7: boolean (nullable = true)
| |-- _col8: timestamp (nullable = true)
| |-- _col9: timestamp (nullable = true)
| |-- _col10: string (nullable = true)
| |-- _col11: string (nullable = true)
| |-- _col12: timestamp (nullable = true)
| |-- _col13: string (nullable = true)
| |-- _col14: timestamp (nullable = true)
| |-- _col15: string (nullable = true)
| |-- _col16: string (nullable = true)
| |-- _col17: string (nullable = true)
| |-- _col18: string (nullable = true)
| |-- _col19: string (nullable = true)
| |-- _col20: string (nullable = true)
| |-- _col21: timestamp (nullable = true)
| |-- _col22: timestamp (nullable = true)
| |-- _col23: string (nullable = true)
| |-- _col24: string (nullable = true)
| |-- _col25: string (nullable = true)
| |-- _col26: string (nullable = true)
| |-- _col27: string (nullable = true)
| |-- _col28: string (nullable = true)
| |-- _col29: string (nullable = true)
| |-- _col30: string (nullable = true)
| |-- _col31: string (nullable = true)
| |-- _col32: integer (nullable = true)
| |-- _col33: integer (nullable = true)
| |-- _col34: integer (nullable = true)
| |-- _col35: integer (nullable = true)
| |-- _col36: integer (nullable = true)
| |-- _col37: integer (nullable = true)

However, when I perform any action on the DataFrame, it throws the error below.

ERROR [Executor task launch worker for task 0] executor.Executor (Logging.scala:logError(91)) - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: Include vector the wrong length: {"category": "struct", "id": 0, "max": 44, "fields": [
"operation": {"category": "int", "id": 1, "max": 1},
"originalTransaction": {"category": "bigint", "id": 2, "max": 2},
"bucket": {"category": "int", "id": 3, "max": 3},
"rowId": {"category": "bigint", "id": 4, "max": 4},
"currentTransaction": {"category": "bigint", "id": 5, "max": 5},
"row": {"category": "struct", "id": 6, "max": 44, "fields": [
"_col0": {"category": "string", "id": 7, "max": 7},
"_col1": {"category": "string", "id": 8, "max": 8},
"_col2": {"category": "string", "id": 9, "max": 9},
"_col3": {"category": "timestamp", "id": 10, "max": 10},
"_col4": {"category": "timestamp", "id": 11, "max": 11},
"_col5": {"category": "string", "id": 12, "max": 12},
"_col6": {"category": "string", "id": 13, "max": 13},
"_col7": {"category": "boolean", "id": 14, "max": 14},
"_col8": {"category": "timestamp", "id": 15, "max": 15},
"_col9": {"category": "timestamp", "id": 16, "max": 16},
"_col10": {"category": "string", "id": 17, "max": 17},
"_col11": {"category": "string", "id": 18, "max": 18},
"_col12": {"category": "timestamp", "id": 19, "max": 19},
"_col13": {"category": "string", "id": 20, "max": 20},
"_col14": {"category": "timestamp", "id": 21, "max": 21},
"_col15": {"category": "string", "id": 22, "max": 22},
"_col16": {"category": "string", "id": 23, "max": 23},
"_col17": {"category": "string", "id": 24, "max": 24},
"_col18": {"category": "string", "id": 25, "max": 25},
"_col19": {"category": "string", "id": 26, "max": 26},
"_col20": {"category": "string", "id": 27, "max": 27},
"_col21": {"category": "timestamp", "id": 28, "max": 28},
"_col22": {"category": "timestamp", "id": 29, "max": 29},
"_col23": {"category": "string", "id": 30, "max": 30},
"_col24": {"category": "string", "id": 31, "max": 31},
"_col25": {"category": "string", "id": 32, "max": 32},
"_col26": {"category": "string", "id": 33, "max": 33},
"_col27": {"category": "string", "id": 34, "max": 34},
"_col28": {"category": "string", "id": 35, "max": 35},
"_col29": {"category": "string", "id": 36, "max": 36},
"_col30": {"category": "string", "id": 37, "max": 37},
"_col31": {"category": "string", "id": 38, "max": 38},
"_col32": {"category": "int", "id": 39, "max": 39},
"_col33": {"category": "int", "id": 40, "max": 40},
"_col34": {"category": "int", "id": 41, "max": 41},
"_col35": {"category": "int", "id": 42, "max": 42},
"_col36": {"category": "int", "id": 43, "max": 43},
"_col37": {"category": "int", "id": 44, "max": 44}]}]} with include length 45

Please let me know what needs to be done to fix this.


Asked 5 years ago • Viewed 648 times
1 Answer
0

Found the issue: Spark currently doesn't support reading Hive ACID transactional tables, and the table I was trying to read is a Hive ACID table, which is why the read failed. Athena uses the Presto ORC reader, which can read Hive ACID tables, so it was able to query the same files. As a workaround I am using a two-hop approach: first convert the Hive ACID ORC data to plain ORC using an Athena CTAS query, then process the converted files with Glue. Hope this helps others.
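For anyone automating the first hop, here is a minimal sketch of the Athena CTAS step using boto3; the database, table, and bucket names are hypothetical placeholders, not the ones from my setup -

import boto3

athena = boto3.client('athena')

# CTAS copies the Hive ACID table into a new table backed by plain (non-ACID) ORC files.
# 'my_database', 'my_acid_table', and the S3 paths below are placeholders.
ctas_query = """
CREATE TABLE my_database.my_table_plain_orc
WITH (
    format = 'ORC',
    external_location = 's3://my-bucket/plain-orc/my_table/'
)
AS SELECT * FROM my_database.my_acid_table
"""

athena.start_query_execution(
    QueryString=ctas_query,
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-query-results/'}
)

Once the CTAS query finishes, the plain ORC files under external_location load without the ACID wrapper columns, e.g. spark.read.format('orc').load('s3://my-bucket/plain-orc/my_table/'), and can then be processed with Glue as usual.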

Answered 4 years ago
