How to keep the source file name in the target output file with an AWS Glue job


A customer is running Glue jobs to convert source data files in S3, whose names contain date stamps, from CSV to Parquet. The target files will be placed in a different S3 bucket. When they ran a test, Glue did not retain the source file name. They want to retain the date stamp in the target file name. How can they do this?

AWS
Asked 3 years ago, viewed 4565 times
1 Answer
Accepted Answer

Glue is a data catalog that organizes data into databases and tables. Although the underlying storage layer can be S3, Glue does not care about S3 object names; it is a higher level of abstraction on top of files and S3 objects. In that sense, the customer was using the wrong tool for what they were trying to achieve.

If the objective is to work with individual files or S3 objects, they might want to write a simple Python script to achieve what they want. This can be easily done with a combination of boto3 and pandas. In short, first use boto3 to ListObjects and then perform the conversion for each object.

Below is sample code that converts a CSV file to Parquet while retaining the file name. Running it requires pandas, pyarrow, fsspec, and s3fs; boto3 is needed only for the listing step:

import posixpath

import pandas as pd  # s3:// URLs also need s3fs (via fsspec) and pyarrow installed


def convert(src_bucket, src_key, dest_bucket, dest_prefix=None):
    """Convert one CSV object to Parquet, keeping the source file's base name."""
    src = f"s3://{src_bucket}/{src_key}"
    # Extract the source file name without its key prefix or extension,
    # e.g. 'prefix-in/test12345.csv' -> 'test12345'
    stem = posixpath.splitext(posixpath.basename(src_key))[0]
    # Form the output destination, with or without a key prefix
    if dest_prefix:
        dest = f"s3://{dest_bucket}/{dest_prefix.strip('/')}/{stem}.parquet"
    else:
        dest = f"s3://{dest_bucket}/{stem}.parquet"
    # Perform the conversion
    df = pd.read_csv(src)
    df.to_parquet(dest)


convert('bucket-in', 'prefix-in/test12345.csv', 'bucket-out', 'prefix-out')

With this convert() function in place, they only need to list the source objects with boto3 and call convert() for each one.
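
The listing step could be sketched roughly as follows. This is a minimal sketch rather than a definitive implementation: the helper names iter_csv_keys and convert_all are hypothetical, the bucket and prefix names are the placeholders from the example above, and it assumes only objects whose key ends in .csv should be converted. It uses the boto3 list_objects_v2 paginator because a single list call returns at most 1,000 keys. The S3 client and the convert function are passed in as arguments so the helpers can be exercised without AWS access.

```python
def iter_csv_keys(s3_client, bucket, prefix=""):
    """Yield every key under `prefix` that ends in .csv (hypothetical helper)."""
    # The paginator follows continuation tokens, so buckets with more than
    # 1,000 objects are listed completely.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Assumption: only CSV objects are converted; anything else
            # (including "folder" placeholder keys) is skipped.
            if key.lower().endswith(".csv"):
                yield key


def convert_all(s3_client, src_bucket, src_prefix, dest_bucket, dest_prefix, convert_fn):
    """Call convert_fn once per CSV object and return the keys processed."""
    processed = []
    for key in iter_csv_keys(s3_client, src_bucket, src_prefix):
        convert_fn(src_bucket, key, dest_bucket, dest_prefix)
        processed.append(key)
    return processed
```

With boto3 installed and credentials configured, a call might look like convert_all(boto3.client("s3"), "bucket-in", "prefix-in/", "bucket-out", "prefix-out", convert), where convert is the function shown earlier.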

AWS
Answered 3 years ago
