Unable to read ORC file using Glue context or Spark

0

Hi,
I am trying to read an ORC file from an S3 bucket. The Glue Data Catalog reports the same schema shown below, and I can query the data via Athena without any issues. However, when I read or manipulate the ORC file with a Glue DynamicFrame (from the catalog) or with a Spark DataFrame, I get the error below.
Using Spark 2.4.
Glue Context code -
# Read the table from the Glue Data Catalog as a DynamicFrame
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "<database>", table_name = "<table>", transformation_ctx = "datasource0")
# Convert to a Spark DataFrame and drop the Hive ACID metadata columns
datasource0_df = datasource0.toDF()
datasource0_df = datasource0_df.drop("operation", "originaltransaction", "bucket", "rowid", "currenttransaction")

Spark code -
# Read the ORC file directly from S3 (path is a placeholder)
df = spark.read.format('orc').load('<S3 file path>')
df.printSchema()
df.show()

printSchema() shows the following schema -
root
|-- operation: integer (nullable = true)
|-- originalTransaction: long (nullable = true)
|-- bucket: integer (nullable = true)
|-- rowId: long (nullable = true)
|-- currentTransaction: long (nullable = true)
|-- row: struct (nullable = true)
| |-- _col0: string (nullable = true)
| |-- _col1: string (nullable = true)
| |-- _col2: string (nullable = true)
| |-- _col3: timestamp (nullable = true)
| |-- _col4: timestamp (nullable = true)
| |-- _col5: string (nullable = true)
| |-- _col6: string (nullable = true)
| |-- _col7: boolean (nullable = true)
| |-- _col8: timestamp (nullable = true)
| |-- _col9: timestamp (nullable = true)
| |-- _col10: string (nullable = true)
| |-- _col11: string (nullable = true)
| |-- _col12: timestamp (nullable = true)
| |-- _col13: string (nullable = true)
| |-- _col14: timestamp (nullable = true)
| |-- _col15: string (nullable = true)
| |-- _col16: string (nullable = true)
| |-- _col17: string (nullable = true)
| |-- _col18: string (nullable = true)
| |-- _col19: string (nullable = true)
| |-- _col20: string (nullable = true)
| |-- _col21: timestamp (nullable = true)
| |-- _col22: timestamp (nullable = true)
| |-- _col23: string (nullable = true)
| |-- _col24: string (nullable = true)
| |-- _col25: string (nullable = true)
| |-- _col26: string (nullable = true)
| |-- _col27: string (nullable = true)
| |-- _col28: string (nullable = true)
| |-- _col29: string (nullable = true)
| |-- _col30: string (nullable = true)
| |-- _col31: string (nullable = true)
| |-- _col32: integer (nullable = true)
| |-- _col33: integer (nullable = true)
| |-- _col34: integer (nullable = true)
| |-- _col35: integer (nullable = true)
| |-- _col36: integer (nullable = true)
| |-- _col37: integer (nullable = true)

However, when I perform any action on the DataFrame, it throws the error below.

ERROR [Executor task launch worker for task 0] executor.Executor (Logging.scala:logError(91)) - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: Include vector the wrong length: {"category": "struct", "id": 0, "max": 44, "fields": [
"operation": {"category": "int", "id": 1, "max": 1},
"originalTransaction": {"category": "bigint", "id": 2, "max": 2},
"bucket": {"category": "int", "id": 3, "max": 3},
"rowId": {"category": "bigint", "id": 4, "max": 4},
"currentTransaction": {"category": "bigint", "id": 5, "max": 5},
"row": {"category": "struct", "id": 6, "max": 44, "fields": [
"_col0": {"category": "string", "id": 7, "max": 7},
"_col1": {"category": "string", "id": 8, "max": 8},
"_col2": {"category": "string", "id": 9, "max": 9},
"_col3": {"category": "timestamp", "id": 10, "max": 10},
"_col4": {"category": "timestamp", "id": 11, "max": 11},
"_col5": {"category": "string", "id": 12, "max": 12},
"_col6": {"category": "string", "id": 13, "max": 13},
"_col7": {"category": "boolean", "id": 14, "max": 14},
"_col8": {"category": "timestamp", "id": 15, "max": 15},
"_col9": {"category": "timestamp", "id": 16, "max": 16},
"_col10": {"category": "string", "id": 17, "max": 17},
"_col11": {"category": "string", "id": 18, "max": 18},
"_col12": {"category": "timestamp", "id": 19, "max": 19},
"_col13": {"category": "string", "id": 20, "max": 20},
"_col14": {"category": "timestamp", "id": 21, "max": 21},
"_col15": {"category": "string", "id": 22, "max": 22},
"_col16": {"category": "string", "id": 23, "max": 23},
"_col17": {"category": "string", "id": 24, "max": 24},
"_col18": {"category": "string", "id": 25, "max": 25},
"_col19": {"category": "string", "id": 26, "max": 26},
"_col20": {"category": "string", "id": 27, "max": 27},
"_col21": {"category": "timestamp", "id": 28, "max": 28},
"_col22": {"category": "timestamp", "id": 29, "max": 29},
"_col23": {"category": "string", "id": 30, "max": 30},
"_col24": {"category": "string", "id": 31, "max": 31},
"_col25": {"category": "string", "id": 32, "max": 32},
"_col26": {"category": "string", "id": 33, "max": 33},
"_col27": {"category": "string", "id": 34, "max": 34},
"_col28": {"category": "string", "id": 35, "max": 35},
"_col29": {"category": "string", "id": 36, "max": 36},
"_col30": {"category": "string", "id": 37, "max": 37},
"_col31": {"category": "string", "id": 38, "max": 38},
"_col32": {"category": "int", "id": 39, "max": 39},
"_col33": {"category": "int", "id": 40, "max": 40},
"_col34": {"category": "int", "id": 41, "max": 41},
"_col35": {"category": "int", "id": 42, "max": 42},
"_col36": {"category": "int", "id": 43, "max": 43},
"_col37": {"category": "int", "id": 44, "max": 44}]}]} with include length 45

Please let me know what needs to be done to fix this.


Asked 5 years ago • Viewed 648 times
1 Answer
0

Found the issue: Spark currently doesn't support reading Hive ACID transactional tables, and the table I was trying to read is a Hive ACID table, which is why the read failed. Athena uses the Presto ORC reader, which can read Hive ACID tables, so it was able to query the same files. As a workaround I am using a two-hop approach: first convert the Hive ACID ORC data to plain ORC using an Athena CTAS query, then process the converted files with Glue. Hope this helps others.
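For anyone automating the first hop, here is a minimal sketch of the Athena CTAS step using boto3; the database, table, and bucket names are hypothetical placeholders, not the ones from my setup -

import boto3

athena = boto3.client('athena')

# CTAS copies the Hive ACID table into a new table backed by plain (non-ACID) ORC files.
# 'my_database', 'my_acid_table', and the S3 paths below are placeholders.
ctas_query = """
CREATE TABLE my_database.my_table_plain_orc
WITH (
    format = 'ORC',
    external_location = 's3://my-bucket/plain-orc/my_table/'
)
AS SELECT * FROM my_database.my_acid_table
"""

athena.start_query_execution(
    QueryString=ctas_query,
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-query-results/'}
)

Once the CTAS query finishes, the plain ORC files under external_location load without the ACID wrapper columns, e.g. spark.read.format('orc').load('s3://my-bucket/plain-orc/my_table/'), and can then be processed with Glue as usual.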

Answered 4 years ago
