S3 Select doesn't delimite records when file is JSONL and GZIP

0

Hey

I am trying to use S3 Select on a gzipped file, containing JSONL records. My file is like this:

{"foo": 1, "bar": True...}
{"foo": 1, "bar": True...}
{"foo": 1, "bar": True...}
...

I am using the following python code:

stream = s3_client.select_object_content(
            Bucket=bucket_name,
            Key=object_key,
            ExpressionType="SQL",
            Expression="SELECT * FROM s3object s",
            InputSerialization={"JSON": {"Type": "LINES"}, "CompressionType": "GZIP"},
            OutputSerialization={"JSON": {"RecordDelimiter": "\n"}},
            RequestProgress={"Enabled": False},
        )
        for record in stream["Payload"]:
            ...

My problem is that record is not a single JSONL, it is 65K bytes from the file.

I am able to query line-by-line when the file is not gzipped.

asked a year ago388 views
1 Answer
0

Given the limitations of S3 Select, I would suggest that you use Athena/Glue to CREATE TABLE - This can be done using boto3 Glue methods if the schema is know or using a crawler as shown below.

import boto3

# Create an AWS Glue client
glue_client = boto3.client('glue')

# Define the crawler settings
crawler_name = 'my-crawler'
s3_target_path = 's3://my-bucket/my-folder/'
database_name = 'my-database'
table_prefix = 'my-table-prefix'

# Create the database
glue_client.create_database(
    DatabaseInput={
        'Name': database_name
    }
)

# Create the crawler
glue_client.create_crawler(
    Name=crawler_name,
    Role='arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRoleDefault',
    DatabaseName=database_name,
    Targets={
        'S3Targets': [
            {
                'Path': s3_target_path
            }
        ]
    },
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'
    },
    TablePrefix=table_prefix
)

# Start the crawler
glue_client.start_crawler(Name=crawler_name)

profile pictureAWS
answered a year ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions