Unable to read ORC file using Glue context or Spark


Hi,
I am trying to read an ORC file from an S3 bucket. The Glue Catalog reports the same schema as shown below, and I can query the data via Athena without any issues. However, when I use a Glue dynamic frame (from the catalog) or a Spark dataframe to read or manipulate the ORC file, I get the error below.
Using Spark 2.4.
Glue Context code -
# Read the table through the Glue Data Catalog as a DynamicFrame
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "<database>", table_name = "<table>", transformation_ctx = "datasource0")
# Convert to a Spark DataFrame and drop the Hive ACID metadata columns
datasource0_df = datasource0.toDF()
datasource0_df = datasource0_df.drop("operation", "originaltransaction", "bucket", "rowid", "currenttransaction")

Spark code -
# Read the ORC file directly from S3 (placeholder path)
df = spark.read.format('orc').load('<S3 file path>')
df.printSchema()
df.show()

printSchema() shows the following schema -
root
|-- operation: integer (nullable = true)
|-- originalTransaction: long (nullable = true)
|-- bucket: integer (nullable = true)
|-- rowId: long (nullable = true)
|-- currentTransaction: long (nullable = true)
|-- row: struct (nullable = true)
| |-- _col0: string (nullable = true)
| |-- _col1: string (nullable = true)
| |-- _col2: string (nullable = true)
| |-- _col3: timestamp (nullable = true)
| |-- _col4: timestamp (nullable = true)
| |-- _col5: string (nullable = true)
| |-- _col6: string (nullable = true)
| |-- _col7: boolean (nullable = true)
| |-- _col8: timestamp (nullable = true)
| |-- _col9: timestamp (nullable = true)
| |-- _col10: string (nullable = true)
| |-- _col11: string (nullable = true)
| |-- _col12: timestamp (nullable = true)
| |-- _col13: string (nullable = true)
| |-- _col14: timestamp (nullable = true)
| |-- _col15: string (nullable = true)
| |-- _col16: string (nullable = true)
| |-- _col17: string (nullable = true)
| |-- _col18: string (nullable = true)
| |-- _col19: string (nullable = true)
| |-- _col20: string (nullable = true)
| |-- _col21: timestamp (nullable = true)
| |-- _col22: timestamp (nullable = true)
| |-- _col23: string (nullable = true)
| |-- _col24: string (nullable = true)
| |-- _col25: string (nullable = true)
| |-- _col26: string (nullable = true)
| |-- _col27: string (nullable = true)
| |-- _col28: string (nullable = true)
| |-- _col29: string (nullable = true)
| |-- _col30: string (nullable = true)
| |-- _col31: string (nullable = true)
| |-- _col32: integer (nullable = true)
| |-- _col33: integer (nullable = true)
| |-- _col34: integer (nullable = true)
| |-- _col35: integer (nullable = true)
| |-- _col36: integer (nullable = true)
| |-- _col37: integer (nullable = true)

However, any operation on the dataframe fails with the error below.
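For example (reusing df from the Spark snippet above; these exact calls are just illustrative triggers), even trivial actions hit the same exception:

df.count()                   # a simple action is enough to trigger the read
df.select("row.*").show()    # expanding the nested ACID payload struct fails too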

ERROR [Executor task launch worker for task 0] executor.Executor (Logging.scala:logError(91)) - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: Include vector the wrong length: {"category": "struct", "id": 0, "max": 44, "fields": [
"operation": {"category": "int", "id": 1, "max": 1},
"originalTransaction": {"category": "bigint", "id": 2, "max": 2},
"bucket": {"category": "int", "id": 3, "max": 3},
"rowId": {"category": "bigint", "id": 4, "max": 4},
"currentTransaction": {"category": "bigint", "id": 5, "max": 5},
"row": {"category": "struct", "id": 6, "max": 44, "fields": [
"_col0": {"category": "string", "id": 7, "max": 7},
"_col1": {"category": "string", "id": 8, "max": 8},
"_col2": {"category": "string", "id": 9, "max": 9},
"_col3": {"category": "timestamp", "id": 10, "max": 10},
"_col4": {"category": "timestamp", "id": 11, "max": 11},
"_col5": {"category": "string", "id": 12, "max": 12},
"_col6": {"category": "string", "id": 13, "max": 13},
"_col7": {"category": "boolean", "id": 14, "max": 14},
"_col8": {"category": "timestamp", "id": 15, "max": 15},
"_col9": {"category": "timestamp", "id": 16, "max": 16},
"_col10": {"category": "string", "id": 17, "max": 17},
"_col11": {"category": "string", "id": 18, "max": 18},
"_col12": {"category": "timestamp", "id": 19, "max": 19},
"_col13": {"category": "string", "id": 20, "max": 20},
"_col14": {"category": "timestamp", "id": 21, "max": 21},
"_col15": {"category": "string", "id": 22, "max": 22},
"_col16": {"category": "string", "id": 23, "max": 23},
"_col17": {"category": "string", "id": 24, "max": 24},
"_col18": {"category": "string", "id": 25, "max": 25},
"_col19": {"category": "string", "id": 26, "max": 26},
"_col20": {"category": "string", "id": 27, "max": 27},
"_col21": {"category": "timestamp", "id": 28, "max": 28},
"_col22": {"category": "timestamp", "id": 29, "max": 29},
"_col23": {"category": "string", "id": 30, "max": 30},
"_col24": {"category": "string", "id": 31, "max": 31},
"_col25": {"category": "string", "id": 32, "max": 32},
"_col26": {"category": "string", "id": 33, "max": 33},
"_col27": {"category": "string", "id": 34, "max": 34},
"_col28": {"category": "string", "id": 35, "max": 35},
"_col29": {"category": "string", "id": 36, "max": 36},
"_col30": {"category": "string", "id": 37, "max": 37},
"_col31": {"category": "string", "id": 38, "max": 38},
"_col32": {"category": "int", "id": 39, "max": 39},
"_col33": {"category": "int", "id": 40, "max": 40},
"_col34": {"category": "int", "id": 41, "max": 41},
"_col35": {"category": "int", "id": 42, "max": 42},
"_col36": {"category": "int", "id": 43, "max": 43},
"_col37": {"category": "int", "id": 44, "max": 44}]}]} with include length 45

Please let me know what needs to be done to fix this.

Asked 5 years ago · 648 views
1 Answer

Found the issue. Spark currently doesn't support reading Hive ACID transactional tables, and the table I am reading is a Hive ACID table, which is why it fails. Athena, on the other hand, uses Presto's ORC reader, which can read Hive ACID tables, so it was able to query the same file. As a workaround I am using a two-hop approach: convert the Hive ACID ORC data to plain ORC with an Athena CTAS query, then process the converted files with Glue. Hope this helps others.
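A minimal sketch of the two hops, assuming hypothetical placeholder names for the database, table, and output bucket (the CTAS is submitted from Python via boto3's standard Athena client calls):

import boto3

athena = boto3.client("athena")

# Hop 1: CTAS rewrites the Hive ACID table as plain, non-transactional
# ORC files that Spark and Glue can read directly.
ctas = """
CREATE TABLE <database>.<table>_plain
WITH (
    format = 'ORC',
    external_location = 's3://<output-bucket>/<table>_plain/'
) AS
SELECT * FROM <database>.<table>
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "<database>"},
    ResultConfiguration={"OutputLocation": "s3://<output-bucket>/athena-results/"},
)

# Hop 2: once the CTAS completes (poll get_query_execution for status),
# point the Glue job or spark.read.format('orc') at the new table or at
# external_location and read it as ordinary ORC.

In a real job you would wait for the query to reach SUCCEEDED before starting hop 2; the placeholders above mirror the ones in the question.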

Answered 4 years ago
