How to keep the source file name in the target output file with an AWS Glue job


A customer is running Glue jobs to convert source data files in S3, whose names contain date stamps, from CSV to Parquet. The target files are written to a different S3 bucket. When they ran a test, Glue did not retain the source file name. They want to keep the date stamp in the target file name. How can they do this?

asked 2 years ago, 1849 views
1 Answer
Accepted Answer

Glue is a data catalog and it organizes data in databases and tables. Although the underlying data storage layer can be S3, Glue does not care about object names on S3. As such, Glue is a higher level of abstraction on top of files or S3 objects. In this sense, the customer was using the wrong tool for what they were trying to achieve.

If the objective is to work with individual files or S3 objects, they might want to write a simple Python script to achieve what they want. This can be easily done with a combination of boto3 and pandas. In short, first use boto3 to ListObjects and then perform the conversion for each object.

Below is the sample code that performs the conversion from CSV to Parquet while retaining the filename. To run this code, they need to have boto3, pandas, fsspec, pyarrow, and s3fs:

import pandas as pd

def convert(src_bucket, src_key, dest_bucket, dest_prefix=None):
	# pandas reads and writes s3:// URLs through fsspec/s3fs
	src = 's3://' + src_bucket + '/' + src_key
	# extract the source filename
	filename = src_key[src_key.rfind("/")+1:]
	# drop the extension but keep the rest of the name, including the date stamp
	stem = filename.rsplit(".", 1)[0]
	# form the output destination
	if dest_prefix is None:
		dest = 's3://' + dest_bucket + '/' + stem + ".parquet"
	else:
		dest = 's3://' + dest_bucket + '/' + dest_prefix + '/' + stem + ".parquet"
	# perform the conversion
	df = pd.read_csv(src)
	df.to_parquet(dest)

# example call: writes s3://bucket-out/prefix-out/test12345.parquet
convert('bucket-in', 'prefix-in/test12345.csv', 'bucket-out', 'prefix-out')

With this convert() function, they only need to list the source objects with boto3 (ListObjects) and then call convert() once for each object key.
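The listing step could look roughly like the sketch below. It is only an illustration and has not been run against a real bucket: iter_csv_keys and convert_all are hypothetical helper names, the ".csv" suffix filter is an assumption about how the source files are named, and the convert function is passed in as an argument so the loop does not depend on how it is implemented. It uses the boto3 list_objects_v2 paginator so that more than 1000 objects are handled.

```python
def iter_csv_keys(s3_client, bucket, prefix=""):
    """Yield every object key under bucket/prefix that ends in .csv."""
    # A single ListObjectsV2 call returns at most 1000 keys; the paginator
    # keeps requesting pages until the listing is exhausted.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is missing from a page when nothing matched.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Assumption: source files are identified by a .csv suffix.
            if key.lower().endswith(".csv"):
                yield key


def convert_all(s3_client, src_bucket, src_prefix, dest_bucket, dest_prefix, convert):
    """Call convert(src_bucket, key, dest_bucket, dest_prefix) for each CSV key.

    Returns the number of objects that were passed to convert.
    """
    count = 0
    for key in iter_csv_keys(s3_client, src_bucket, src_prefix):
        convert(src_bucket, key, dest_bucket, dest_prefix)
        count += 1
    return count
```

With boto3 installed and AWS credentials configured, the call would be something like convert_all(boto3.client("s3"), "bucket-in", "prefix-in/", "bucket-out", "prefix-out", convert). Because each output name is derived from the source key, the date stamp in the file name carries over to the Parquet file.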

answered 2 years ago
