Cloud Quest - "Cloud Data Warehouse" challenge - AWS Glue job script is not working

Hi,

AWS Cloud Quest (AWS Skill Builder) - Cloud Data Warehouse challenge - Step 22/64:

We are using Amazon Athena's query editor, where we should paste a script into the editor, make some changes to it (replacing bucket names and table names), save it, and run it. The script is provided by AWS.

I am unable to save the script. Each time I try, I get the following error message:

"Failed to update job putObject: Failed to update script because XMLParserError error on line 358 at column 8: Opening and ending tag mismatch: meta line 7 and head"

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

glueContext = GlueContext(SparkContext.getOrCreate())

df = glueContext.create_dynamic_frame.from_catalog(database="games-data-db", table_name="YOUR AWS GLUE RAW DATA TABLE NAME")


## The built-in transformation 'relationalize' is used to flatten the nested data structures. 
## This transformation relationalizes a DynamicFrame and produces a collection of DynamicFrames 
## that are generated by unnesting nested columns and pivoting array columns. 
dfc = df.relationalize("root", "s3://YOUR CONSUMPTION DATA S3 BUCKET NAME")

## In this case, two tables 'root' and 'root_game_details' are generated.
flatdf = dfc.select('root')
flatdf2 = dfc.select('root_game_details')

## Group all the partitions into one file
flatdf_group = flatdf.coalesce(1)

## In the root_game_details DynamicFrame, rename column names that have dots as separator to names without dots. 
flatdf2_reformat_1 = RenameField.apply(flatdf2,"`game_details.val.game_name`", "game_name")
flatdf2_reformat_2 = RenameField.apply(flatdf2_reformat_1,"`game_details.val.high_score`", "high_score")
flatdf2_reformat_3 = RenameField.apply(flatdf2_reformat_2,"`game_details.val.purchased_item`", "purchased_item")
flatdf2_reformat_4 = RenameField.apply(flatdf2_reformat_3,"`game_details.val.purchases`", "purchases")

## Group all the partitions into one file
flatdf2_group = flatdf2_reformat_4.coalesce(1)

## Write the two flattened tables into the S3 consumption data bucket in parquet format.
## Use the coalesced frames so that each table is written as a single file.
glueContext.write_dynamic_frame.from_options(flatdf_group, connection_type="s3", connection_options={"path": "s3://YOUR CONSUMPTION DATA S3 BUCKET NAME/parquet/players_data/"}, format="parquet")
glueContext.write_dynamic_frame.from_options(flatdf2_group, connection_type="s3", connection_options={"path": "s3://YOUR CONSUMPTION DATA S3 BUCKET NAME/parquet/games_data/"}, format="parquet")

job.commit()

I tried editing the query, but with no luck.

Can someone please point me in the right direction?

Thank you.

asked 14 days ago, 34 views
1 Answer

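The error you're encountering comes from an XML parser, not from Python, which suggests the problem lies in how the script is being saved or processed in the AWS Glue job editor rather than in the script's logic.

The error message "XMLParserError error on line 358 at column 8: Opening and ending tag mismatch: meta line 7 and head" describes a malformed markup document: a "meta" tag opened on line 7 was still unclosed when a closing "head" tag was reached on line 358. Both are HTML elements, and the script you posted is far shorter than 358 lines, so the content reaching the parser probably includes HTML markup that is not visible in the script as posted. This is likely happening because: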

  1. There might be some hidden HTML/XML tags in the editor or in how the script is being processed when saving
  2. The editor might be trying to parse the script as XML instead of plain text
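
To illustrate why HTML content triggers this kind of message, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample HTML page and the helper name xml_parse_error are made up for illustration; this is not the parser the Glue console uses, only a demonstration of the same class of error. HTML allows a "meta" tag with no closing tag, but a strict XML parser does not, so it reports a tag mismatch when it reaches the closing "head" tag.

```python
import xml.etree.ElementTree as ET


def xml_parse_error(text):
    """Return the parser's error message, or None if text is well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Hypothetical HTML page: <meta> is never closed, which is legal HTML
# but not well-formed XML.
html_page = """<html>
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>Something went wrong</body>
</html>"""

# Hypothetical well-formed XML document for comparison.
xml_doc = "<Error><Code>AccessDenied</Code></Error>"

print("HTML parsed as XML:", xml_parse_error(html_page))
print("XML parsed as XML: ", xml_parse_error(xml_doc))
```

If the first call reports a mismatched tag and the second reports nothing, that mirrors the situation in the error message: the parser was handed HTML where it expected XML.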

Here are some troubleshooting steps you can try:

  1. Try creating a new AWS Glue job from scratch instead of editing an existing one
  2. Copy your script to a plain text editor first to remove any potential hidden formatting, then copy it back to the AWS Glue editor
  3. Make sure you're using the correct editor in AWS Glue Studio (script editor) rather than the visual editor
  4. Try saving the script in smaller chunks to identify if a specific part is causing the issue
  5. Check if there are any special characters or encoding issues in your script
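
For steps 2 and 5, a small local check can help. The sketch below scans script text for non-ASCII characters such as smart quotes or non-breaking spaces, which sometimes sneak in when copying from a web page. The function name find_non_ascii and the sample string are hypothetical, and finding nothing does not rule out other causes.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line, column, character name) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col_no, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Hypothetical sample: smart quotes on line 1, a non-breaking space on line 2.
sample = 'df = load(database=\u201cgames-data-db\u201d)\nx\u00a0= 1\n'

for line_no, col_no, name in find_non_ascii(sample):
    print(f"line {line_no}, column {col_no}: {name}")
```

To use it on your own script, paste the script into a local file, read it with open(path, encoding="utf-8").read(), and pass the result to find_non_ascii.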

If you're specifically working on the Cloud Quest challenge, you might want to restart that particular step or contact AWS Cloud Quest support as this appears to be an issue with their lab environment rather than your script itself. The Python code you've provided looks structurally correct for an AWS Glue ETL job.
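
If you want to confirm for yourself that the script is syntactically valid Python, a rough sketch like the following works on any machine with Python installed. It only parses the source and never runs it, so the awsglue and pyspark packages do not need to be installed. The helper name syntax_error_of is made up for this example, and a clean result says nothing about whether the job will run successfully in Glue.

```python
import ast


def syntax_error_of(source):
    """Return a short description of the first syntax error, or None if the source parses."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


# A fragment of the script from the question; imports are parsed, not executed.
script_fragment = """
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
"""

# A deliberately broken line for comparison.
broken_fragment = "df = ("

print("script fragment:", syntax_error_of(script_fragment))
print("broken fragment:", syntax_error_of(broken_fragment))
```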

answered 14 days ago
