Glue job failing with exit code 10 (unable to decompress ~50gb file on S3)

Hello,
I have a ~50 GB .sql.gz file on S3. I'm attempting to download it, decompress it, and upload the decompressed contents back to S3 (as .sql).
The Glue job successfully decompresses and uploads smaller files (the largest I've tested is ~1 GB).
However, whenever I attempt to process the larger ~50 GB file, I get the following error:
"Command failed with exit code 10"

job_run_id:
jr_6517bedbe85935d03bca9b8797df4d357885398c86a3b907dee4c7f8dab42b6f

Some info about the job:
--No. of workers => 75 (G.1X)
--Glue Version => Spark 2.4, Python 3 (Glue Version 2.0)

Source code:

import gzip
from io import BytesIO

import boto3

s3_bucket_name = 'some-bucket'
stage_s3_key_prefix = 'prefix/to/stage'
source_s3_key = f'{stage_s3_key_prefix}/somefile.sql.gz'
target_s3_key = f'{stage_s3_key_prefix}/somefile.sql'

s3_client = boto3.client('s3')
s3_client.upload_fileobj(
    Fileobj=gzip.GzipFile(
        None,
        'rb',
        fileobj=BytesIO(
            # read() loads the entire compressed object into memory
            s3_client.get_object(
                Bucket=s3_bucket_name,
                Key=source_s3_key
            )['Body'].read()
        )
    ),
    Bucket=s3_bucket_name,
    Key=target_s3_key
)

Complete error message:

timestamp message
1603652066955awsglue-todworkers-iad-prod-2d-37f92aea.us-east-1.amazon.com Mon Sep 28 18:09:19 UTC 2020 gluetod
1603652066957Preparing ...
1603652067097Sun Oct 25 18:54:26 UTC 2020
1603652067098/usr/bin/java -cp /opt/amazon/conf:/opt/amazon/lib/hadoop-lzo/:/opt/amazon/lib/emrfs-lib/:/opt/amazon/spark/jars/:/opt/amazon/superjar/:/opt/amazon/lib/:/opt/amazon/Scala2.11/ com.amazonaws.services.glue.PrepareLaunch --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=74 --conf spark.executor.memory=10g --conf spark.executor.cores=8 --conf spark.driver.memory=10g --conf spark.default.parallelism=600 --conf spark.sql.shuffle.partitions=600 --conf spark.network.timeout=600 --JOB_ID j_4a14dc4e1fdbd099a4fb00ce7bfa27d1cfea60c075858eee61aa091355122e90 --JOB_RUN_ID jr_6517bedbe85935d03bca9b8797df4d357885398c86a3b907dee4c7f8dab42b6f --job-bookmark-option job-bookmark-disable --scriptLocation s3://roivant-data/scripts/unzip_gzip_on_s3 --job-language python --TempDir s3://aws-glue-temporary-692327028194-us-east-1/admin --JOB_NAME unzip_gzip_on_s3
16036520883201603652088317
1603652089451SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/amazon/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
1603652089451SLF4J: Found binding in [jar:file:/opt/amazon/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/amazon/lib/log4j-slf4j-impl-2.8.jar!/org/slf4j/impl/StaticLoggerBinder.class]
1603652089451SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
1603652089455SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
1603652093703WARN 2020-10-25 18:54:53,703 0 com.amazonaws.http.apache.utils.ApacheUtils [main] NoSuchMethodException was thrown when disabling normalizeUri. This indicates you are using an old version (< 4.5.8) of Apache http client. It is recommended to use http client version >= 4.5.9 to avoid the breaking change introduced in apache client 4.5.7 and the latency in exception handling. See https://github.com/aws/aws-sdk-java/issues/1919 for more information
1603652094245Launching ...
1603652094246Sun Oct 25 18:54:54 UTC 2020
1603652094411/usr/bin/java -cp /tmp:/opt/amazon/conf:/opt/amazon/lib/hadoop-lzo/:/opt/amazon/lib/emrfs-lib/:/opt/amazon/lib/emr-goodies/:/opt/amazon/lib/hive-jars/:/opt/amazon/spark/jars/:/opt/amazon/superjar/:/opt/amazon/lib/:/opt/amazon/Scala2.11/:/tmp/** -Dlog4j.configuration=log4j -server -Xmx10g -XX:_UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:_CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p' -XX:+UseCompressedOops -Djavax.net.ssl.trustStore=/opt/amazon/certs/ExternalAndAWSTrustStore.jks -Djavax.net.ssl.trustStoreType=JKS -Djavax.net.ssl.trustStorePassword=amazon -DRDS_ROOT_CERT_PATH=/opt/amazon/certs/rds-combined-ca-bundle.pem -DREDSHIFT_ROOT_CERT_PATH=/opt/amazon/certs/redshift-ssl-ca-cert.pem -DRDS_TRUSTSTORE_URL=file:/opt/amazon/certs/RDSTrustStore.jks -Dspark.network.timeout=600 -Dspark.dynamicAllocation.enabled=false -Dspark.dynamicAllocation.minExecutors=1 -Dspark.shuffle.service.enabled=false -Dspark.hadoop.mapred.output.committer.class=org.apache.hadoop.mapred.DirectOutputCommitter -Dspark.driver.extraClassPath=/tmp:/opt/amazon/conf:/opt/amazon/lib/hadoop-lzo/:/opt/amazon/lib/emrfs-lib/:/opt/amazon/lib/emr-goodies/:/opt/amazon/lib/hive-jars/:/opt/amazon/spark/jars/:/opt/amazon/superjar/:/opt/amazon/lib/:/opt/amazon/Scala2.11/ -Dspark.glue.JOB_NAME=unzip_gzip_on_s3 -Dspark.dynamicAllocation.maxExecutors=74 -Dspark.default.parallelism=600 -Dspark.hadoop.lakeformation.credentials.url=http://localhost:9998/lakeformationcredentials -Dspark.sql.shuffle.partitions=600 -Dspark.app.name=nativespark-unzip_gzip_on_s3-jr_6517bedbe85935d03bca9b8797df4d357885398c86a3b907dee4c7f8dab42b6f -Dspark.glue.GLUE_TASK_GROUP_ID=54429570-df1a-4dd3-9db8-61e6984666f5 -Dspark.hadoop.mapred.output.direct.EmrFileSystem=true -Dspark.glue.USE_PROXY=false -Dspark.eventLog.dir=/tmp/spark-event-logs/ -Dspark.rpc.askTimeout=600 -Dspark.executor.instances=74 -Dspark.executor.cores=8 -Dspark.driver.host=172.36.138.239 
-Dspark.hadoop.fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem -Dspark.authenticate.secret=<HIDDEN> -Dspark.glue.JOB_RUN_ID=jr_6517bedbe85935d03bca9b8797df4d357885398c86a3b907dee4c7f8dab42b6f -Dspark.executor.memory=10g -Dspark.hadoop.mapred.output.direct.NativeS3FileSystem=true -Dspark.driver.memory=10g -Dspark.pyspark.python=/usr/bin/python3 -Dspark.glue.GLUE_COMMAND_CRITERIA=glueetl -Dspark.master=jes -Dspark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false -Dspark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 -Dspark.unsafe.sorter.spill.read.ahead.enabled=false -Dspark.hadoop.parquet.enable.summary-metadata=false -Dspark.hadoop.glue.michiganCredentialsProviderProxy=com.amazonaws.services.glue.remote.LakeformationCredentialsProvider -Dspark.executor.extraClassPath=/tmp:/opt/amazon/conf:/opt/amazon/lib/hadoop-lzo/:/opt/amazon/lib/emrfs-lib/:/opt/amazon/lib/emr-goodies/:/opt/amazon/lib/hive-jars/:/opt/amazon/spark/jars/:/opt/amazon/superjar/:/opt/amazon/lib/**:/opt/amazon/Scala2.11/* -Dspark.glue.GLUE_VERSION=2.0 -Dspark.glue.endpoint=https://glue-jes-prod.us-east-1.amazonaws.com -Dspark.ui.enabled=false -Dspark.files.overwrite=true -Dspark.authenticate=true com.amazonaws.services.glue.ProcessLauncher --launch-class org.apache.spark.deploy.PythonRunner /opt/amazon/bin/runscript.py /tmp/unzip_gzip_on_s3 --JOB_ID j_4a14dc4e1fdbd099a4fb00ce7bfa27d1cfea60c075858eee61aa091355122e90 --JOB_RUN_ID jr_6517bedbe85935d03bca9b8797df4d357885398c86a3b907dee4c7f8dab42b6f --job-bookmark-option job-bookmark-disable --TempDir s3://aws-glue-temporary-692327028194-us-east-1/admin --JOB_NAME unzip_gzip_on_s3
1603652095339SLF4J: Class path contains multiple SLF4J bindings.
1603652095339SLF4J: Found binding in [jar:file:/opt/amazon/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
1603652095339SLF4J: Found binding in [jar:file:/opt/amazon/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
1603652095339SLF4J: Found binding in [jar:file:/opt/amazon/lib/log4j-slf4j-impl-2.8.jar!/org/slf4j/impl/StaticLoggerBinder.class]
1603652095339SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
1603652095342SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16036524213922020-10-25 19:00:21,384 INFO [main] glue.ProcessLauncher (Logging.scala:logInfo(54)): postprocessing
16036524214192020-10-25 19:00:21,410 INFO [pool-1-thread-1] util.ShutdownHookManager (Logging.scala:logInfo(54)): Shutdown hook called
---------------------------------------------------------------------------------------------------------------------
asked 4 years ago, 3442 views
4 Answers

Hi @ablange93,

Were you able to resolve this? I have a similar problem with Glue.
I'm trying to read 15,000,000 records from a JDBC source.
I get "Command failed with exit code 10", and the CloudWatch Logs show no errors.

My code works with 10,000 records.

kevdfs
answered 4 years ago

On my Glue job, I got this error when I imported the same external Python library twice. Removing the duplicate library from the job parameter 'Python lib path' fixed it, and the job now runs.
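
A quick way to check for this kind of duplicate, as a rough sketch only: it assumes the 'Python lib path' value is a comma-separated list of S3 paths, and the function name and example paths below are made up for illustration. It flags any library file name that appears more than once, even under different prefixes.

```python
import posixpath
from collections import Counter


def duplicate_lib_names(lib_paths):
    """Return library file names that occur more than once in a
    comma-separated list of paths (hypothetical helper)."""
    names = [
        posixpath.basename(path.strip())
        for path in lib_paths.split(",")
        if path.strip()
    ]
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# Example with made-up paths: a.zip is listed under two prefixes.
example = "s3://my-bucket/libs/a.zip, s3://my-bucket/libs/b.zip,s3://my-bucket/other/a.zip"
print(duplicate_lib_names(example))
```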

cleggr2
answered 4 years ago

Did you find a solution? Thanks.

answered 3 years ago

The error "Command failed with exit code 10" here is an out-of-memory (OOM) issue. You are trying to read a 50 GB gzip file, and gzip is not splittable, so even if you use 100 workers for your job, only one worker will read the whole file. Your script also calls read() on the entire object and wraps the result in BytesIO, so all ~50 GB of compressed data must fit in memory before decompression even starts. The G.1X worker type you are using has only 4 vCPUs and 16 GB of memory, which is not enough to hold this 50 GB file, hence the OOM error.

Side note: the code you shared is pure Python, which runs only on the driver node. Of the 75 workers you have, the code uses one, and you are paying for 74 idle workers. A Python shell job type is a better fit for this script. Use the Glue ETL job type only when your script actually uses Spark (PySpark).
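
One way to avoid holding the whole file in memory is to decompress it as a stream and upload it in parts. The following is a minimal sketch, not a tested production job: the helper names and the 128 MiB part size are my own choices, and it assumes the single GET connection stays open for the whole download. It uses the S3 multipart upload calls in boto3, so only one decompressed part is in memory at a time.

```python
import gzip
import io


def iter_decompressed_parts(compressed_stream, part_size):
    """Yield decompressed chunks of part_size bytes (the last may be shorter).

    compressed_stream only needs a read(n) method, so the 'Body' returned
    by get_object can be passed in directly.
    """
    with gzip.GzipFile(fileobj=compressed_stream, mode="rb") as gz:
        while True:
            buf = bytearray()
            # Keep reading until the part is full or the stream ends.
            while len(buf) < part_size:
                piece = gz.read(part_size - len(buf))
                if not piece:
                    break
                buf += piece
            if not buf:
                return
            yield bytes(buf)


def gunzip_s3_object(bucket, source_key, target_key,
                     part_size=128 * 1024 * 1024):
    """Stream-decompress one S3 object into another (hypothetical helper).

    S3 allows at most 10,000 parts, each at least 5 MiB except the last,
    so 128 MiB parts cover outputs up to about 1.25 TiB.
    """
    import boto3  # imported here so the generator above runs without AWS

    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=source_key)["Body"]
    upload_id = s3.create_multipart_upload(
        Bucket=bucket, Key=target_key)["UploadId"]
    parts = []
    try:
        chunks = iter_decompressed_parts(body, part_size)
        for number, chunk in enumerate(chunks, start=1):
            response = s3.upload_part(
                Bucket=bucket, Key=target_key, UploadId=upload_id,
                PartNumber=number, Body=chunk)
            parts.append({"ETag": response["ETag"], "PartNumber": number})
        s3.complete_multipart_upload(
            Bucket=bucket, Key=target_key, UploadId=upload_id,
            MultipartUpload={"Parts": parts})
    except Exception:
        # Abort so the already-uploaded parts do not keep accruing charges.
        s3.abort_multipart_upload(
            Bucket=bucket, Key=target_key, UploadId=upload_id)
        raise
```

With the names from the question, the call would look like gunzip_s3_object(s3_bucket_name, source_s3_key, target_s3_key). Peak memory is roughly one part plus the decompressor's buffers, regardless of the file size.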

answered a year ago
