View created by Athena query not accessible to Glue Jobs spark context, unless I change ViewExpandedText. Is this ok?

0

When a view is created by running an Athena SQL query, it appears in the Glue catalog but errors are raised when querying the view in a Glue job or a Glue Spark context.

I am able to make the view available to Glue jobs and query it with PySpark by changing the Glue table's ViewExpandedText property to contain the SQL of the view (i.e. the SELECT statement of the view), and removing the "SerDeProperties.Location" from the view. (See the code below.)

This is the opposite of what is documented for ViewExpandedText - the official Glue API docs say this field is "Not used in the normal course of AWS Glue operations."

The question is: Is the documentation incorrect? Is the functionality I am taking advantage of officially supported? If it is not supported, should I assume that at some point in the future, changing ViewExpandedText to make the view usable in glue jobs will stop working?

Code to hack the view: gist

# Making an Athena view queryable in a Glue job
#
# When a view is created by running an Athena SQL query, it appears in the Glue catalog but errors are raised when
# querying the view in a Glue job or a Glue Spark context.
#
# Use this gist to programmatically update the view's metadata in the Glue catalog.
#
# After this you should be able to run `spark.sql("select * from mydb.myview")` without errors.
#
# Assumes you have created the view `mydb.myview` in Athena.
# For testing I used `CREATE VIEW mydb.myview AS SELECT 1 as Index;`
#
import boto3
import json
import base64

glue = boto3.client("glue")

# Change these
database_name="mydb"
table_name="myview"

athena_table = glue.get_table(DatabaseName=database_name, Name=table_name)["Table"]

# We can only pass these keys in a boto3 call to update_table().
# Docs: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue/client/update_table.html
GLUE_UPDATE_TABLE_INPUT_KEYS={
    "ViewOriginalText",
    "LastAccessTime",
    "StorageDescriptor",
    "LastAnalyzedTime",
    "Name",
    "PartitionKeys",
    "Parameters",
    "Description",
    "Retention",
    "ViewExpandedText",
    "Owner",
    "TableType",
    "TargetTable"
}

for att in set(athena_table.keys()) - GLUE_UPDATE_TABLE_INPUT_KEYS:
    del (athena_table[att])

# It appears Glue gets confused by a blank string for StorageDescriptor.Location, so remove it.
if "StorageDescriptor" in athena_table and "Location" in athena_table["StorageDescriptor"]:
    if athena_table["StorageDescriptor"]["Location"] != "":
        raise Exception("Athena table has non-blank value for StorageDescriptor.Location. Don't know what to do.")
    del(athena_table["StorageDescriptor"]["Location"])

# Glue expects the view SQL to be in ViewExpandedText, while Athena uses a Hive-encoded string from "ViewOriginalText".
# Decode the Hive-encoded string to update ViewExpandedText:
athena_table["ViewExpandedText"] = json.loads(base64.b64decode(athena_table["ViewOriginalText"][16:-3]).decode("utf-8"))["originalSql"]

# update the table!
glue.update_table(DatabaseName=database_name, TableInput=athena_table)

ecmons
질문됨 5달 전318회 조회
1개 답변
0

I think the documentation is still valid, maybe the confusion there is that by Glue operations it means Glue jobs processing the table, and you are thinking it means Glue catalog acting as a Hive compatible metastore.
Athena is using the catalog to store a Presto view and you are manipulating the catalog using boto3, Glue is not doing any "operations" there, in fact Glue cannot use that kind of view. Hope that makes sense.

profile pictureAWS
전문가
답변함 5달 전

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠