Unable to read ORC file using Glue context or Spark


Hi,
I am trying to read an ORC file from an S3 bucket. The Glue Data Catalog shows the same schema as below, and I can query the data through Athena without any issues. However, when I read or manipulate the ORC file with a Glue DynamicFrame (from the catalog) or with a Spark DataFrame, I get the error below.
Using Spark 2.4
Glue Context code -
from awsglue.context import GlueContext
from pyspark.context import SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "<database>", table_name = "<table>", transformation_ctx = "datasource0")
datasource0_df = datasource0.toDF()
datasource0_df = datasource0_df.drop("operation", "originaltransaction", "bucket", "rowid", "currenttransaction")

Spark code -
df = spark.read.format('orc').load('<S3 file path>')
df.printSchema()
df.show()

printSchema() reports the following schema -
root
|-- operation: integer (nullable = true)
|-- originalTransaction: long (nullable = true)
|-- bucket: integer (nullable = true)
|-- rowId: long (nullable = true)
|-- currentTransaction: long (nullable = true)
|-- row: struct (nullable = true)
| |-- _col0: string (nullable = true)
| |-- _col1: string (nullable = true)
| |-- _col2: string (nullable = true)
| |-- _col3: timestamp (nullable = true)
| |-- _col4: timestamp (nullable = true)
| |-- _col5: string (nullable = true)
| |-- _col6: string (nullable = true)
| |-- _col7: boolean (nullable = true)
| |-- _col8: timestamp (nullable = true)
| |-- _col9: timestamp (nullable = true)
| |-- _col10: string (nullable = true)
| |-- _col11: string (nullable = true)
| |-- _col12: timestamp (nullable = true)
| |-- _col13: string (nullable = true)
| |-- _col14: timestamp (nullable = true)
| |-- _col15: string (nullable = true)
| |-- _col16: string (nullable = true)
| |-- _col17: string (nullable = true)
| |-- _col18: string (nullable = true)
| |-- _col19: string (nullable = true)
| |-- _col20: string (nullable = true)
| |-- _col21: timestamp (nullable = true)
| |-- _col22: timestamp (nullable = true)
| |-- _col23: string (nullable = true)
| |-- _col24: string (nullable = true)
| |-- _col25: string (nullable = true)
| |-- _col26: string (nullable = true)
| |-- _col27: string (nullable = true)
| |-- _col28: string (nullable = true)
| |-- _col29: string (nullable = true)
| |-- _col30: string (nullable = true)
| |-- _col31: string (nullable = true)
| |-- _col32: integer (nullable = true)
| |-- _col33: integer (nullable = true)
| |-- _col34: integer (nullable = true)
| |-- _col35: integer (nullable = true)
| |-- _col36: integer (nullable = true)
| |-- _col37: integer (nullable = true)

However, any operation on the DataFrame fails with the error below.

ERROR [Executor task launch worker for task 0] executor.Executor (Logging.scala:logError(91)) - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: Include vector the wrong length: {"category": "struct", "id": 0, "max": 44, "fields": [
"operation": {"category": "int", "id": 1, "max": 1},
"originalTransaction": {"category": "bigint", "id": 2, "max": 2},
"bucket": {"category": "int", "id": 3, "max": 3},
"rowId": {"category": "bigint", "id": 4, "max": 4},
"currentTransaction": {"category": "bigint", "id": 5, "max": 5},
"row": {"category": "struct", "id": 6, "max": 44, "fields": [
"_col0": {"category": "string", "id": 7, "max": 7},
"_col1": {"category": "string", "id": 8, "max": 8},
"_col2": {"category": "string", "id": 9, "max": 9},
"_col3": {"category": "timestamp", "id": 10, "max": 10},
"_col4": {"category": "timestamp", "id": 11, "max": 11},
"_col5": {"category": "string", "id": 12, "max": 12},
"_col6": {"category": "string", "id": 13, "max": 13},
"_col7": {"category": "boolean", "id": 14, "max": 14},
"_col8": {"category": "timestamp", "id": 15, "max": 15},
"_col9": {"category": "timestamp", "id": 16, "max": 16},
"_col10": {"category": "string", "id": 17, "max": 17},
"_col11": {"category": "string", "id": 18, "max": 18},
"_col12": {"category": "timestamp", "id": 19, "max": 19},
"_col13": {"category": "string", "id": 20, "max": 20},
"_col14": {"category": "timestamp", "id": 21, "max": 21},
"_col15": {"category": "string", "id": 22, "max": 22},
"_col16": {"category": "string", "id": 23, "max": 23},
"_col17": {"category": "string", "id": 24, "max": 24},
"_col18": {"category": "string", "id": 25, "max": 25},
"_col19": {"category": "string", "id": 26, "max": 26},
"_col20": {"category": "string", "id": 27, "max": 27},
"_col21": {"category": "timestamp", "id": 28, "max": 28},
"_col22": {"category": "timestamp", "id": 29, "max": 29},
"_col23": {"category": "string", "id": 30, "max": 30},
"_col24": {"category": "string", "id": 31, "max": 31},
"_col25": {"category": "string", "id": 32, "max": 32},
"_col26": {"category": "string", "id": 33, "max": 33},
"_col27": {"category": "string", "id": 34, "max": 34},
"_col28": {"category": "string", "id": 35, "max": 35},
"_col29": {"category": "string", "id": 36, "max": 36},
"_col30": {"category": "string", "id": 37, "max": 37},
"_col31": {"category": "string", "id": 38, "max": 38},
"_col32": {"category": "int", "id": 39, "max": 39},
"_col33": {"category": "int", "id": 40, "max": 40},
"_col34": {"category": "int", "id": 41, "max": 41},
"_col35": {"category": "int", "id": 42, "max": 42},
"_col36": {"category": "int", "id": 43, "max": 43},
"_col37": {"category": "int", "id": 44, "max": 44}]}]} with include length 45

Please let me know what needs to be done to fix this.

1 Answer

Found the issue: Spark currently does not support reading Hive ACID transactional tables, and that is exactly what I was trying to read (the operation/originalTransaction/bucket/rowId/currentTransaction columns and the nested "row" struct in the schema above are the ACID layout). Athena, on the other hand, uses Presto's ORC reader, which can handle Hive ACID tables, so it was able to read the same file. As a workaround I am using a two-hop approach: first converting the Hive ACID ORC data to plain ORC with an Athena CTAS query, and then processing the converted files with Glue. Hope this helps others.
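
For anyone who wants to script the same workaround, here is a minimal sketch of the two hops. The database, table, and bucket names (my_database, my_acid_table, my-bucket) are placeholders made up for illustration; in practice you would also poll get_query_execution until the CTAS query finishes before starting the second hop.

import boto3

# Hop 1: an Athena CTAS query rewrites the Hive ACID table as plain
# (non-transactional) ORC files at a location of your choosing.
athena = boto3.client('athena')
ctas = """
CREATE TABLE my_database.my_plain_table
WITH (format = 'ORC', external_location = 's3://my-bucket/plain-orc/')
AS SELECT * FROM my_database.my_acid_table
"""
athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-results/'}
)

# Hop 2: once the CTAS query succeeds, Spark (or a Glue job) can read the
# plain ORC output directly; the ACID metadata columns and the nested
# "row" struct are gone, so the data columns appear under their real names.
df = spark.read.format('orc').load('s3://my-bucket/plain-orc/')
df.printSchema()

After the conversion, the drop() of the ACID metadata columns in the original Glue code is no longer needed.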
