How to keep the source file name in the target output file with an AWS Glue job

A customer is running Glue jobs to convert source data files in S3 (whose names contain date stamps) from CSV to Parquet. The target files will be placed in a different S3 bucket. When they ran a test, Glue did not retain the source file name. They want to retain the date stamp in the target file name. How can they do this?

AWS
Asked 3 years ago, 4699 views
1 Answer
Accepted Answer

Glue organizes data as databases and tables in its Data Catalog. Although the underlying storage layer can be S3, Glue does not track or preserve individual S3 object names; it is a higher level of abstraction on top of files and S3 objects. In this sense, Glue is the wrong tool for what the customer is trying to achieve.

If the objective is to work with individual files or S3 objects, they might want to write a simple Python script instead. This can be done with a combination of boto3 and pandas: first use boto3 to list the objects (ListObjects), then perform the conversion for each object.

Below is sample code that converts a CSV object to Parquet while retaining the file name. To run it, they need boto3, pandas, fsspec, pyarrow, and s3fs installed:

import posixpath

import pandas as pd

def convert(src_bucket, src_key, dest_bucket, dest_prefix=None):
    """Convert one CSV object to Parquet, keeping the source file name."""
    src = f"s3://{src_bucket}/{src_key}"
    # Extract the source file name and drop its extension.
    # splitext() leaves names without a "." unchanged instead of
    # cutting off their last character.
    filename = posixpath.basename(src_key)
    stem = posixpath.splitext(filename)[0]
    # Form the output destination
    if dest_prefix is None:
        dest = f"s3://{dest_bucket}/{stem}.parquet"
    else:
        dest = f"s3://{dest_bucket}/{dest_prefix}/{stem}.parquet"
    # Perform the conversion (pandas reads and writes s3:// URLs via s3fs)
    df = pd.read_csv(src)
    df.to_parquet(dest)

convert('bucket-in', 'prefix-in/test12345.csv', 'bucket-out', 'prefix-out')

With this convert() function, they just need to list the objects with boto3 (ListObjects), then call convert() for each object.
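
The listing step could look something like the sketch below. It is only an illustration, not part of the original answer: the helper names iter_csv_keys and convert_all are made up for this example, the S3 client is passed in as a parameter, and the filter assumes every object to convert has a key ending in ".csv". It uses the boto3 list_objects_v2 paginator so that prefixes holding more than 1,000 objects are still fully listed.

```python
def iter_csv_keys(s3_client, bucket, prefix=""):
    """Yield the key of every .csv object under the prefix, page by page."""
    # list_objects_v2 returns at most 1,000 keys per call; the paginator
    # follows the continuation tokens for us.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page for an empty listing has no "Contents" entry at all.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Assumption: only keys ending in .csv should be converted.
            if key.lower().endswith(".csv"):
                yield key


def convert_all(s3_client, src_bucket, src_prefix, dest_bucket, dest_prefix, convert):
    """Call convert() once per CSV object and return the keys processed."""
    processed = []
    for key in iter_csv_keys(s3_client, src_bucket, src_prefix):
        convert(src_bucket, key, dest_bucket, dest_prefix)
        processed.append(key)
    return processed


# Example call (needs AWS credentials and the convert() function shown above):
#   import boto3
#   convert_all(boto3.client("s3"), "bucket-in", "prefix-in/",
#               "bucket-out", "prefix-out", convert)
```

Passing the client and the convert function in as arguments keeps the loop easy to try out with a stub before pointing it at real buckets.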

AWS
Answered 3 years ago
Expert
Reviewed 19 days ago
