AWS Glue Dynamid Dataframe relationalize

0

I load json data and use relationalize method on dynamic dataframe to flatten the otherwise nested json object and saving it into parquet format. The problem is that once saved into parquet format for faster Athena queries, the column names contain dots, which is against the Athena sql query syntax and thus I am unable to make column specific queries.

In order to tackle this problem I also rename the column names in the Glue job to exclude the dots and put underscores instead. My question is which approach of the two would be better and why? (Efficiency- memory? execution speed on nodes? etc.).

Also given the horrible aws glue documentation I could not come up with dynamic frame only solution. I have problems getting the column names in dynamic fashion, thus I am utilizing toDF().

  1. First approach is around getting the column names from df extracted from dynamic df
relationalize1 = Relationalize.apply(frame=datasource0, transformation_ctx="relationalize1").select("roottable")
    df_relationalize1 = relationalize1.toDF()
    for field in df_relationalize1.schema.fields:
        relationalize1 = RenameField.apply(frame = relationalize1, old_name = "`"+field.name+"`", new_name = field.name.replace(".","_"), transformation_ctx = "renamefield_" + field.name)
  1. Second approach would be to extract the df from dynamic df and perform the rename field on the pyspark df (instead of dynamic df), to then convert back to dynamic df and save it in parquet format.

Is there a better approach? Can a crawler rename columns? How fast is .fromDF() method? Is there a better documentation on functions and methods than the pdf developer guide?

cell
質問済み 4年前904ビュー
3回答
0
承認された回答

As far as I tried, I was able to run column specific queries against parquet table which contains dot in column name.

Table: parquet_table

root
|-- name: string
|-- url: string
|-- sample.key: string

Query:

SELECT "sample.key" FROM "parquet_table" limit 10;
SELECT * FROM "parquet_table" WHERE "sample.key" LIKE 'sample%' limit 10;

Can you explain bit more details? What schema does your table have? What query did you see errors in?

AWS
回答済み 4年前
0

It appears that when querying with Athena I did not enclose column names with dots into double quotes, thus the error.

I am still interested in removing the dots from column names and thus would like to know what would be good approach of renaming multiple columns in AWS Glue. I changed my approach to first converting the DynamicDataframe to PySpark dataframe and then using piece that I found on stackoverflow.

new_column_name_list= list(map(lambda x: x.replace(".", "_"), df_relationalized.columns))    df_renamed = df_relationalized.toDF(*new_column_name_list)
cell
回答済み 4年前
0

Do you have any STRUCT datatype in your columns, does this solution also change the fields inside the struct ?

回答済み 4年前

ログインしていません。 ログイン 回答を投稿する。

優れた回答とは、質問に明確に答え、建設的なフィードバックを提供し、質問者の専門分野におけるスキルの向上を促すものです。

質問に答えるためのガイドライン

関連するコンテンツ