S3 Select doesn't delimite records when file is JSONL and GZIP

0

Hey

I am trying to use S3 Select on a gzipped file, containing JSONL records. My file is like this:

{"foo": 1, "bar": True...}
{"foo": 1, "bar": True...}
{"foo": 1, "bar": True...}
...

I am using the following python code:

stream = s3_client.select_object_content(
            Bucket=bucket_name,
            Key=object_key,
            ExpressionType="SQL",
            Expression="SELECT * FROM s3object s",
            InputSerialization={"JSON": {"Type": "LINES"}, "CompressionType": "GZIP"},
            OutputSerialization={"JSON": {"RecordDelimiter": "\n"}},
            RequestProgress={"Enabled": False},
        )
        for record in stream["Payload"]:
            ...

My problem is that record is not a single JSONL, it is 65K bytes from the file.

I am able to query line-by-line when the file is not gzipped.

gefragt vor einem Jahr445 Aufrufe
1 Antwort
0

Given the limitations of S3 Select, I would suggest that you use Athena/Glue to CREATE TABLE - This can be done using boto3 Glue methods if the schema is know or using a crawler as shown below.

import boto3

# Create an AWS Glue client
glue_client = boto3.client('glue')

# Define the crawler settings
crawler_name = 'my-crawler'
s3_target_path = 's3://my-bucket/my-folder/'
database_name = 'my-database'
table_prefix = 'my-table-prefix'

# Create the database
glue_client.create_database(
    DatabaseInput={
        'Name': database_name
    }
)

# Create the crawler
glue_client.create_crawler(
    Name=crawler_name,
    Role='arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRoleDefault',
    DatabaseName=database_name,
    Targets={
        'S3Targets': [
            {
                'Path': s3_target_path
            }
        ]
    },
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'
    },
    TablePrefix=table_prefix
)

# Start the crawler
glue_client.start_crawler(Name=crawler_name)

profile pictureAWS
beantwortet vor einem Jahr

Du bist nicht angemeldet. Anmelden um eine Antwort zu veröffentlichen.

Eine gute Antwort beantwortet die Frage klar, gibt konstruktives Feedback und fördert die berufliche Weiterentwicklung des Fragenstellers.

Richtlinien für die Beantwortung von Fragen