Glue is a data catalog that organizes data into databases and tables. Although the underlying storage layer can be S3, Glue does not care about S3 object names; it is a higher level of abstraction on top of files and S3 objects. In that sense, the customer was using the wrong tool for what they were trying to achieve.
If the objective is to work with individual files or S3 objects, a simple Python script is a better fit, and it is easy to write with a combination of boto3 and pandas. In short, first use boto3 to call ListObjects, then perform the conversion for each object.
Below is sample code that converts a CSV object to Parquet while retaining the filename. To run it, they need boto3, pandas, fsspec, pyarrow, and s3fs installed:
import posixpath

import pandas as pd


def convert(src_bucket, src_key, dest_bucket, dest_prefix=None):
    src = f"s3://{src_bucket}/{src_key}"
    # Extract the source filename, without its prefix or extension.
    # splitext also copes with keys that have no extension at all.
    stem = posixpath.splitext(posixpath.basename(src_key))[0]
    # Form the output destination, keeping the original filename
    if dest_prefix is None:
        dest = f"s3://{dest_bucket}/{stem}.parquet"
    else:
        dest = f"s3://{dest_bucket}/{dest_prefix}/{stem}.parquet"
    # Perform the conversion (pandas handles s3:// URLs through fsspec/s3fs)
    df = pd.read_csv(src)
    df.to_parquet(dest)


convert('bucket-in', 'prefix-in/test12345.csv', 'bucket-out', 'prefix-out')
With this convert() function, they just need to call ListObjects with boto3 and then call convert() for each object returned.
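The listing step could look roughly like the sketch below. It is only an illustration: iter_csv_keys and convert_all are hypothetical helper names, filtering on a ".csv" suffix is an assumption about what should be converted, and the conversion function is passed in as an argument so the loop does not depend on any particular implementation. It uses the boto3 list_objects_v2 paginator because a single ListObjects call returns at most 1,000 keys.

```python
def iter_csv_keys(s3_client, bucket, prefix=""):
    """Yield the key of every .csv object under s3://bucket/prefix."""
    # A single list_objects_v2 call returns at most 1,000 keys,
    # so let the paginator follow the continuation tokens.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches
        for obj in page.get("Contents", []):
            # Assumption: only keys ending in .csv should be converted
            if obj["Key"].lower().endswith(".csv"):
                yield obj["Key"]


def convert_all(convert, src_bucket, src_prefix, dest_bucket,
                dest_prefix=None, s3_client=None):
    """Call convert() for each CSV object and return the keys processed."""
    if s3_client is None:
        # Imported here so the helpers can be exercised with a stub client
        import boto3
        s3_client = boto3.client("s3")
    processed = []
    for key in iter_csv_keys(s3_client, src_bucket, src_prefix):
        convert(src_bucket, key, dest_bucket, dest_prefix)
        processed.append(key)
    return processed
```

It would be called as convert_all(convert, 'bucket-in', 'prefix-in/', 'bucket-out', 'prefix-out'). Note that convert() keeps only the filename, so two source objects with the same filename under different prefixes would be written to the same destination key.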