AWS Glue Include and Exclude Patterns: unexpected behaviour

I am writing an AWS Glue 3.0 job in PySpark that reads data from an S3 bucket, performs some transformations, and writes the result to an S3 bucket. To read the data I am using GlueContext.create_dynamic_frame_from_options with the connection type S3. It accepts an optional parameter, exclusions, which takes Include and Exclude Patterns based on glob syntax.

The S3 bucket contains deeply nested files. I only want to read files with the extension .json, and I want to exclude files with the extensions .csv and .txt. To do this I pass the following glob expressions: "exclusions": ['**/*.csv', '**/*.txt'].

When I run the PySpark script below, I get the following error: An error occurred while calling o90.pyWriteDynamicFrame. Unable to parse file: <file-name>.data.csv, where <file-name> stands for the name of the file.

dyf_read_source_s3 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="json",
    connection_options={
        "paths": [<path>],
        "exclusions": ['**/*.csv', '**/*.txt'],
        "recurse": True,
        "groupFiles": "inPartition",
    },
    transformation_ctx="dyf_read_source_s3",
)

I have re-created an example of the directory structure locally and run the same patterns through Python's glob module, which selects the correct files. This leads me to believe there is an issue/bug in the Glue source code.
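
For reference, the local check can be sketched roughly as follows. This is a minimal illustration rather than the exact test I ran: the sample file names and the helpers build_sample_tree and list_kept_files are hypothetical. Python's glob module is also a separate matcher from whatever Glue applies to S3 object keys, so agreement here only shows that the patterns behave as expected under Python's glob rules.

```python
import glob
import os
import tempfile


def build_sample_tree(root):
    """Create a small nested tree with .json, .csv and .txt files (hypothetical names)."""
    for rel in ["a/b/one.json", "a/b/one.data.csv", "a/notes.txt", "top.json"]:
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as fh:
            fh.write("{}")


def list_kept_files(root, exclusions):
    """Return relative paths under root that match none of the exclusion globs."""
    all_files = {
        os.path.relpath(p, root)
        for p in glob.glob(os.path.join(root, "**", "*"), recursive=True)
        if os.path.isfile(p)
    }
    excluded = set()
    for pattern in exclusions:
        # recursive=True lets "**" span zero or more directory levels
        for p in glob.glob(os.path.join(root, pattern), recursive=True):
            excluded.add(os.path.relpath(p, root))
    return sorted(all_files - excluded)


with tempfile.TemporaryDirectory() as tmp:
    build_sample_tree(tmp)
    kept = list_kept_files(tmp, ["**/*.csv", "**/*.txt"])
    print(kept)
```

With these patterns only the .json files remain in the local listing, which is the behaviour I expected from the exclusions option.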

Asked 2 years ago · 84 views
No answers