Hi,
I am trying to read an orc file from an S3 bucket. I tried using Glue Catalog which gives the same schema as below. I am able to query the data via athena without any issues but when using Glue dynamic frame and catalog or spark dataframe to read or manipulate the orc file it is giving the below error.
Using Spark 2.4
Glue Context code -
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "<database>", table_name = "<table>", transformation_ctx = "datasource0")
datasource0_df = datasource0.toDF()
datasource0_df = datasource0_df.drop("operation", "originaltransaction", "bucket", "rowid", "currenttransaction")
Spark code -
df = spark.read.format('orc').load('<S3 File path')
df.printSchema()
df.show()
The printSchema shows the schema as follows -
root
|-- operation: integer (nullable = true)
|-- originalTransaction: long (nullable = true)
|-- bucket: integer (nullable = true)
|-- rowId: long (nullable = true)
|-- currentTransaction: long (nullable = true)
|-- row: struct (nullable = true)
| |-- _col0: string (nullable = true)
| |-- _col1: string (nullable = true)
| |-- _col2: string (nullable = true)
| |-- _col3: timestamp (nullable = true)
| |-- _col4: timestamp (nullable = true)
| |-- _col5: string (nullable = true)
| |-- _col6: string (nullable = true)
| |-- _col7: boolean (nullable = true)
| |-- _col8: timestamp (nullable = true)
| |-- _col9: timestamp (nullable = true)
| |-- _col10: string (nullable = true)
| |-- _col11: string (nullable = true)
| |-- _col12: timestamp (nullable = true)
| |-- _col13: string (nullable = true)
| |-- _col14: timestamp (nullable = true)
| |-- _col15: string (nullable = true)
| |-- _col16: string (nullable = true)
| |-- _col17: string (nullable = true)
| |-- _col18: string (nullable = true)
| |-- _col19: string (nullable = true)
| |-- _col20: string (nullable = true)
| |-- _col21: timestamp (nullable = true)
| |-- _col22: timestamp (nullable = true)
| |-- _col23: string (nullable = true)
| |-- _col24: string (nullable = true)
| |-- _col25: string (nullable = true)
| |-- _col26: string (nullable = true)
| |-- _col27: string (nullable = true)
| |-- _col28: string (nullable = true)
| |-- _col29: string (nullable = true)
| |-- _col30: string (nullable = true)
| |-- _col31: string (nullable = true)
| |-- _col32: integer (nullable = true)
| |-- _col33: integer (nullable = true)
| |-- _col34: integer (nullable = true)
| |-- _col35: integer (nullable = true)
| |-- _col36: integer (nullable = true)
| |-- _col37: integer (nullable = true)
However when I am trying to do any operation on the dataframe it is giving the below error.
ERROR [Executor task launch worker for task 0] executor.Executor (Logging.scala:logError(91)) - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: Include vector the wrong length: {"category": "struct", "id": 0, "max": 44, "fields": [
"operation": {"category": "int", "id": 1, "max": 1},
"originalTransaction": {"category": "bigint", "id": 2, "max": 2},
"bucket": {"category": "int", "id": 3, "max": 3},
"rowId": {"category": "bigint", "id": 4, "max": 4},
"currentTransaction": {"category": "bigint", "id": 5, "max": 5},
"row": {"category": "struct", "id": 6, "max": 44, "fields": [
"_col0": {"category": "string", "id": 7, "max": 7},
"_col1": {"category": "string", "id": 8, "max": 8},
"_col2": {"category": "string", "id": 9, "max": 9},
"_col3": {"category": "timestamp", "id": 10, "max": 10},
"_col4": {"category": "timestamp", "id": 11, "max": 11},
"_col5": {"category": "string", "id": 12, "max": 12},
"_col6": {"category": "string", "id": 13, "max": 13},
"_col7": {"category": "boolean", "id": 14, "max": 14},
"_col8": {"category": "timestamp", "id": 15, "max": 15},
"_col9": {"category": "timestamp", "id": 16, "max": 16},
"_col10": {"category": "string", "id": 17, "max": 17},
"_col11": {"category": "string", "id": 18, "max": 18},
"_col12": {"category": "timestamp", "id": 19, "max": 19},
"_col13": {"category": "string", "id": 20, "max": 20},
"_col14": {"category": "timestamp", "id": 21, "max": 21},
"_col15": {"category": "string", "id": 22, "max": 22},
"_col16": {"category": "string", "id": 23, "max": 23},
"_col17": {"category": "string", "id": 24, "max": 24},
"_col18": {"category": "string", "id": 25, "max": 25},
"_col19": {"category": "string", "id": 26, "max": 26},
"_col20": {"category": "string", "id": 27, "max": 27},
"_col21": {"category": "timestamp", "id": 28, "max": 28},
"_col22": {"category": "timestamp", "id": 29, "max": 29},
"_col23": {"category": "string", "id": 30, "max": 30},
"_col24": {"category": "string", "id": 31, "max": 31},
"_col25": {"category": "string", "id": 32, "max": 32},
"_col26": {"category": "string", "id": 33, "max": 33},
"_col27": {"category": "string", "id": 34, "max": 34},
"_col28": {"category": "string", "id": 35, "max": 35},
"_col29": {"category": "string", "id": 36, "max": 36},
"_col30": {"category": "string", "id": 37, "max": 37},
"_col31": {"category": "string", "id": 38, "max": 38},
"_col32": {"category": "int", "id": 39, "max": 39},
"_col33": {"category": "int", "id": 40, "max": 40},
"_col34": {"category": "int", "id": 41, "max": 41},
"_col35": {"category": "int", "id": 42, "max": 42},
"_col36": {"category": "int", "id": 43, "max": 43},
"_col37": {"category": "int", "id": 44, "max": 44}]}]} with include length 45
Please let me know what needs to be done to fix this.
Edited by: amit94ece on Oct 23, 2019 8:48 PM
Edited by: amit94ece on Oct 23, 2019 9:08 PM
Edited by: amit94ece on Oct 24, 2019 1:15 AM