How to keep the source file name in the target output file with an AWS Glue job


A customer is running Glue jobs to convert source data files in S3 (whose names contain date stamps) from CSV to Parquet. The target files are placed in a different S3 bucket. When they ran a test, Glue did not retain the source file name. They want to retain the date stamp in the target file name. How can they do this?

AWS
Asked 3 years ago, 4565 views
1 Answer
Accepted Answer

AWS Glue organizes data as databases and tables in its Data Catalog. Although the underlying storage layer can be S3, Glue does not track or preserve individual S3 object names; it is a higher level of abstraction on top of files and S3 objects. In that sense, Glue is not the right tool for what the customer is trying to achieve.

If the objective is to work with individual files or S3 objects, a simple Python script is a better fit. This is straightforward with a combination of boto3 and pandas: use boto3 to list the source objects (ListObjects), then perform the conversion for each object.

Below is sample code that converts a CSV file to Parquet while retaining the file name. To run it, they need boto3, pandas, fsspec, pyarrow, and s3fs installed:

import posixpath

import pandas as pd


def convert(src_bucket, src_key, dest_bucket, dest_prefix=None):
    """Convert one CSV object in S3 to Parquet, keeping the source file name."""
    src = f"s3://{src_bucket}/{src_key}"
    # Extract the source file name without its extension, e.g. "test12345".
    # splitext also handles keys that have no extension at all.
    stem = posixpath.splitext(posixpath.basename(src_key))[0]
    # Form the output destination
    if dest_prefix is None:
        dest = f"s3://{dest_bucket}/{stem}.parquet"
    else:
        dest = f"s3://{dest_bucket}/{dest_prefix}/{stem}.parquet"
    # Perform the conversion (pandas reads and writes s3:// URLs through s3fs)
    df = pd.read_csv(src)
    df.to_parquet(dest)


convert('bucket-in', 'prefix-in/test12345.csv', 'bucket-out', 'prefix-out')

With this convert() function, they only need to list the source objects with boto3 (ListObjects) and call convert() for each object.
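The listing step could look something like the sketch below. It is not part of the original answer: the helper names iter_csv_keys and convert_all are hypothetical, and it assumes the ListObjectsV2 paginator is used so that prefixes holding more than 1,000 objects are fully covered. The S3 client and the conversion function are passed in as arguments, so the helpers themselves import nothing.

```python
def iter_csv_keys(s3_client, bucket, prefix=""):
    """Yield the key of every .csv object under the given prefix."""
    # Paginate so that listings longer than one page are not truncated.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when no objects match.
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith(".csv"):
                yield obj["Key"]


def convert_all(s3_client, src_bucket, src_prefix, dest_bucket, dest_prefix, convert_fn):
    """Call convert_fn once per CSV object and return the keys processed."""
    converted = []
    for key in iter_csv_keys(s3_client, src_bucket, src_prefix):
        convert_fn(src_bucket, key, dest_bucket, dest_prefix)
        converted.append(key)
    return converted


# Example call (needs boto3 and the convert() function shown earlier):
#   import boto3
#   convert_all(boto3.client("s3"), "bucket-in", "prefix-in/",
#               "bucket-out", "prefix-out", convert)
```

Each source object produces one Parquet object with the same base name, so the date stamp in the file name carries over to the target bucket.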

AWS
Answered 3 years ago
