View created by Athena query not accessible to Glue Jobs spark context, unless I change ViewExpandedText. Is this ok?

0

When a view is created by running an Athena SQL query, it appears in the Glue catalog but errors are raised when querying the view in a Glue job or a Glue Spark context.

I am able to make the view available to Glue jobs and query it with PySpark by changing the Glue table's ViewExpandedText property to contain the SQL of the view (i.e. the SELECT statement of the view), and removing the "SerDeProperties.Location" from the view. (See the code below.)

This is the opposite of what is documented for ViewExpandedText - the official Glue API docs say this field is "Not used in the normal course of AWS Glue operations."

The question is: Is the documentation incorrect? Is the functionality I am taking advantage of officially supported? If it is not supported, should I assume that at some point in the future, changing ViewExpandedText to make the view usable in glue jobs will stop working?

Code to hack the view: gist

# Making an Athena view queryable in a Glue job
#
# When a view is created by running an Athena SQL query, it appears in the Glue catalog but errors are raised when
# querying the view in a Glue job or a Glue Spark context.
#
# Use this gist to programmatically update the view's metadata in the Glue catalog.
#
# After this you should be able to run `spark.sql("select * from mydb.myview")` without errors.
#
# Assumes you have created the view `mydb.myview` in Athena.
# For testing I used `CREATE VIEW mydb.myview AS SELECT 1 as Index;`
#
import boto3
import json
import base64

glue = boto3.client("glue")

# Change these
database_name="mydb"
table_name="myview"

athena_table = glue.get_table(DatabaseName=database_name, Name=table_name)["Table"]

# We can only pass these keys in a boto3 call to update_table().
# Docs: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue/client/update_table.html
GLUE_UPDATE_TABLE_INPUT_KEYS={
    "ViewOriginalText",
    "LastAccessTime",
    "StorageDescriptor",
    "LastAnalyzedTime",
    "Name",
    "PartitionKeys",
    "Parameters",
    "Description",
    "Retention",
    "ViewExpandedText",
    "Owner",
    "TableType",
    "TargetTable"
}

for att in set(athena_table.keys()) - GLUE_UPDATE_TABLE_INPUT_KEYS:
    del (athena_table[att])

# It appears Glue gets confused by a blank string for StorageDescriptor.Location, so remove it.
if "StorageDescriptor" in athena_table and "Location" in athena_table["StorageDescriptor"]:
    if athena_table["StorageDescriptor"]["Location"] != "":
        raise Exception("Athena table has non-blank value for StorageDescriptor.Location. Don't know what to do.")
    del(athena_table["StorageDescriptor"]["Location"])

# Glue expects the view SQL to be in ViewExpandedText, while Athena uses a Hive-encoded string from "ViewOriginalText".
# Decode the Hive-encoded string to update ViewExpandedText:
athena_table["ViewExpandedText"] = json.loads(base64.b64decode(athena_table["ViewOriginalText"][16:-3]).decode("utf-8"))["originalSql"]

# update the table!
glue.update_table(DatabaseName=database_name, TableInput=athena_table)

ecmons
asked 4 months ago298 views
1 Answer
0

I think the documentation is still valid, maybe the confusion there is that by Glue operations it means Glue jobs processing the table, and you are thinking it means Glue catalog acting as a Hive compatible metastore.
Athena is using the catalog to store a Presto view and you are manipulating the catalog using boto3, Glue is not doing any "operations" there, in fact Glue cannot use that kind of view. Hope that makes sense.

profile pictureAWS
EXPERT
answered 4 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions