S3 Select doesn't delimite records when file is JSONL and GZIP

0

Hey

I am trying to use S3 Select on a gzipped file, containing JSONL records. My file is like this:

{"foo": 1, "bar": True...}
{"foo": 1, "bar": True...}
{"foo": 1, "bar": True...}
...

I am using the following python code:

stream = s3_client.select_object_content(
            Bucket=bucket_name,
            Key=object_key,
            ExpressionType="SQL",
            Expression="SELECT * FROM s3object s",
            InputSerialization={"JSON": {"Type": "LINES"}, "CompressionType": "GZIP"},
            OutputSerialization={"JSON": {"RecordDelimiter": "\n"}},
            RequestProgress={"Enabled": False},
        )
        for record in stream["Payload"]:
            ...

My problem is that record is not a single JSONL, it is 65K bytes from the file.

I am able to query line-by-line when the file is not gzipped.

질문됨 일 년 전412회 조회
1개 답변
0

Given the limitations of S3 Select, I would suggest that you use Athena/Glue to CREATE TABLE - This can be done using boto3 Glue methods if the schema is know or using a crawler as shown below.

import boto3

# Create an AWS Glue client
glue_client = boto3.client('glue')

# Define the crawler settings
crawler_name = 'my-crawler'
s3_target_path = 's3://my-bucket/my-folder/'
database_name = 'my-database'
table_prefix = 'my-table-prefix'

# Create the database
glue_client.create_database(
    DatabaseInput={
        'Name': database_name
    }
)

# Create the crawler
glue_client.create_crawler(
    Name=crawler_name,
    Role='arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRoleDefault',
    DatabaseName=database_name,
    Targets={
        'S3Targets': [
            {
                'Path': s3_target_path
            }
        ]
    },
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'
    },
    TablePrefix=table_prefix
)

# Start the crawler
glue_client.start_crawler(Name=crawler_name)

profile pictureAWS
답변함 일 년 전

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠