How to add certificate to TrustStore for EMR Serverless application?

0

How can a root CA certificate required by an EMR Serverless application be added to Java's TrustStore, or, alternatively, how can the location of a custom TrustStore be specified? In my case the TrustStore must contain the Amazon RDS SSL certificate in order to connect to a DocumentDB cluster via the MongoDB Spark Connector using TLS. Ordinarily the certificate could be added to the TrustStore on each node by a bootstrap script, but EMR Serverless does not provide this option.

I've downloaded the certificate from https://s3.amazonaws.com/rds-downloads/rds-ca-2019-root.pem and added it to a JKS file that can be sent to the cluster using the --files option. I have attempted to specify the location of this file by using the configuration properties spark.driver.extraJavaOptions and spark.executor.extraJavaOptions to pass the JVM option -Djavax.net.ssl.trustStore=cacerts.jks, but the file's absolute path is not the same across nodes, and cannot be known until after the job has been submitted and the SparkSession initialised. What, then, is the correct approach?
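For reference, a TrustStore like this can be produced with keytool; below is a minimal sketch, where "rds" and "changeit" are placeholder alias and password values rather than the ones actually used:

    # Download the RDS root CA and import it into a new JKS TrustStore.
    curl -sO https://s3.amazonaws.com/rds-downloads/rds-ca-2019-root.pem
    keytool -importcert \
        -keystore cacerts.jks -storetype jks -storepass changeit \
        -alias rds -file rds-ca-2019-root.pem -noprompt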

ali-m-d
asked a year ago · 2,312 views
7 Answers
1

Hello there,

Thank you for raising this question on re:Post.

I would like to answer the question in two parts.

  1. If you would like to refer to a file in an EMR Serverless Spark application, you will need to use the --archives option. The --files option does not help here, as the files are downloaded to /tmp/spark-${UUID}/, a path that differs per node and is not known in advance. Please find below an example of how you can use the --archives option.

    a) Use the --archives/spark.yarn.dist.archives option in Spark to download the JKS file archive under the hadoop user's home directory (/home/hadoop). Set the archives config as below; note that #rds sets the name of the folder the zip will be extracted to, and you may name it differently as per your preference:

    --archives 's3://YOUR-BUCKET/your-prefix/cacerts.jks.zip#rds'

    b) You can now refer to the file under that folder, e.g. /home/hadoop/rds/cacerts.jks.
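    Because the extracted path is fixed across driver and executors, it can then be passed to the JVM options, for example (an untested sketch; note the follow-up below, where replacing the default store wholesale breaks the AWS SDK's own TLS connections):

    --conf spark.driver.extraJavaOptions=-Djavax.net.ssl.trustStore=/home/hadoop/rds/cacerts.jks
    --conf spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/home/hadoop/rds/cacerts.jks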

  2. If you would like to use the SSL certificate bundle in the connection URL, you can follow steps similar to those below.

    Please note: the steps below are an example of a connection to an external Hive metastore with SSL in an EMR Serverless Spark application, and are shared for reference only. You will need to adapt them to how you would form the connection URL for the MongoDB Spark connector you are using.

    a) Prepare the certificates zip file and copy it to your S3 bucket

    zip -r rds-combined-ca-bundle.pem.zip rds-combined-ca-bundle.pem
    aws s3 cp rds-combined-ca-bundle.pem.zip s3://your-bucket/prefix/
    

    b) Similar to the previous option, we will use the --archives option here as well.

    Sample CLI:

    aws emr-serverless start-job-run \
      --application-id "00f2hg0781i57409" \
      --execution-role-arn "arn:aws:iam::111111111111:role/emrsvrlss" \
      --job-driver '{
            "sparkSubmit": {
                "entryPoint": "s3://your-bucket/prefix/spark-jdbc.py",
                "sparkSubmitParameters": "--jars s3://your-bucket/prefix/mariadb-connector-java.jar --conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.mariadb.jdbc.Driver --archives 's3://your-bucket/prefix/rds-combined-ca-bundle.pem.zip#rds' --conf spark.hadoop.javax.jdo.option.ConnectionUserName=hive --conf spark.hadoop.javax.jdo.option.ConnectionPassword=******* --conf spark.hadoop.javax.jdo.option.ConnectionURL='jdbc:mysql://database-1.************.us-east-1.rds.amazonaws.com:3306/hive\?createDatabaseIfNotExist=false\&useSSL=true\&serverSslCert=/home/hadoop/rds/rds-combined-ca-bundle.pem\&enabledSslProtocolSuites=TLSv1.2' --conf spark.driver.cores=2 --conf spark.executor.memory=10G --conf spark.driver.memory=6G --conf spark.executor.cores=4"
            }
        }' \
        --configuration-overrides '{
            "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://DOC-EXAMPLE-BUCKET/emrlogs/"
            }
        }
    }'
    

    Important:

    • Make sure you escape special characters and reserved keywords within your AWS CLI command (as shown in the working example above).
    • Zip the appropriate .pem file, upload it to your S3 bucket, and make sure the execution role ARN has access to that bucket.
    • Make sure you download the appropriate JAR file for connecting to your database.
AWS
SUPPORT ENGINEER
answered a year ago
  • Thank you for your answer Krishnadas M. I'm still struggling with this, however, due to the problem detailed in the post below.

0

Thank you for your answer Krishnadas M. I've used the approach you suggested to extract the zipped JKS archive to the /home/hadoop directory. However, passing the JKS file location to the -Djavax.net.ssl.trustStore JVM option causes a com.amazonaws.SdkClientException error, presumably because the AWS client requires a CA certificate that is present in the default cacerts store, and the default store can no longer be located due to my system-wide override. The connection URI approach won't work in the case of the MongoDB Spark Connector, as the tlsCAFile parameter is not supported (https://jira.mongodb.org/browse/DOCS-14874).

Is there any way that I could add the CA certificate for my DocumentDB cluster to the default cacerts store, or obtain a copy of the default cacerts store that I could modify and submit using the --archives option? (A sketch of the latter idea appears after the trace.) A partial trace is included below.

22/12/06 13:29:29 ERROR DefaultEmrServerlessRMClient: Encountered exception when requesting SPARK_EXECUTOR containers. Future container launch rate will be slowed down until recovered. Next call will be allowed after 4000ms.
com.amazonaws.SdkClientException: Unable to execute HTTP request: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1216)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1162)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539)
	at com.amazonaws.services.emrserverlessresourcemanager.EmrServerlessResourceManagerClient.doInvoke(EmrServerlessResourceManagerClient.java:682)
	at com.amazonaws.services.emrserverlessresourcemanager.EmrServerlessResourceManagerClient.invoke(EmrServerlessResourceManagerClient.java:649)
	at com.amazonaws.services.emrserverlessresourcemanager.EmrServerlessResourceManagerClient.invoke(EmrServerlessResourceManagerClient.java:638)
	at com.amazonaws.services.emrserverlessresourcemanager.EmrServerlessResourceManagerClient.executeRequestContainers(EmrServerlessResourceManagerClient.java:604)
	at com.amazonaws.services.emrserverlessresourcemanager.EmrServerlessResourceManagerClient.requestContainers(EmrServerlessResourceManagerClient.java:573)
	at org.apache.spark.deploy.emrserverless.client.DefaultEmrServerlessRMClient.createSingleBatchContainers(DefaultEmrServerlessRMClient.scala:136)
	at org.apache.spark.deploy.emrserverless.client.DefaultEmrServerlessRMClient.$anonfun$createContainers$2(DefaultEmrServerlessRMClient.scala:64)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at org.apache.spark.deploy.emrserverless.client.DefaultEmrServerlessRMClient.createContainers(DefaultEmrServerlessRMClient.scala:63)
	at org.apache.spark.scheduler.cluster.emrserverless.ExecutorContainerAllocator.requestNewExecutors(ExecutorContainerAllocator.scala:282)
	at org.apache.spark.scheduler.cluster.emrserverless.ExecutorContainerAllocator.processSingleResourceProfile(ExecutorContainerAllocator.scala:259)
	at org.apache.spark.scheduler.cluster.emrserverless.ExecutorContainerAllocator.$anonfun$processExecutorsForAllResourceProfiles$5(ExecutorContainerAllocator.scala:182)
	at org.apache.spark.scheduler.cluster.emrserverless.ExecutorContainerAllocator.$anonfun$processExecutorsForAllResourceProfiles$5$adapted(ExecutorContainerAllocator.scala:180)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.scheduler.cluster.emrserverless.ExecutorContainerAllocator.processExecutorsForAllResourceProfiles(ExecutorContainerAllocator.scala:180)
	at org.apache.spark.scheduler.cluster.emrserverless.ExecutorContainerAllocator.onNewSnapshot(ExecutorContainerAllocator.scala:130)
	at org.apache.spark.scheduler.cluster.emrserverless.ExecutorContainerAllocator.$anonfun$start$1(ExecutorContainerAllocator.scala:79)
	at org.apache.spark.scheduler.cluster.emrserverless.ExecutorContainerAllocator.$anonfun$start$1$adapted(ExecutorContainerAllocator.scala:79)
	at org.apache.spark.scheduler.cluster.emrserverless.store.ExecutorContainerStoreImpl$SnapshotsSubscriber.org$apache$spark$scheduler$cluster$emrserverless$store$ExecutorContainerStoreImpl$SnapshotsSubscriber$$processSnapshotsInternal(ExecutorContainerStoreImpl.scala:129)
	at org.apache.spark.scheduler.cluster.emrserverless.store.ExecutorContainerStoreImpl$SnapshotsSubscriber.processSnapshots(ExecutorContainerStoreImpl.scala:119)
	at org.apache.spark.scheduler.cluster.emrserverless.store.ExecutorContainerStoreImpl.$anonfun$addSubscriber$1(ExecutorContainerStoreImpl.scala:53)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.ssl.Alert.createSSLException(Alert.java:131)
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:324)
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:267)
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:262)
	at sun.security.ssl.CertificateMessage$T12CertificateConsumer.checkServerCerts(CertificateMessage.java:654)
	at sun.security.ssl.CertificateMessage$T12CertificateConsumer.onCertificate(CertificateMessage.java:473)
	at sun.security.ssl.CertificateMessage$T12CertificateConsumer.consume(CertificateMessage.java:369)
	at sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:377)
	at sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:444)
	at sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:422)
	at sun.security.ssl.TransportContext.dispatch(TransportContext.java:182)
	at sun.security.ssl.SSLTransport.decode(SSLTransport.java:152)
	at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1397)
	at sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1305)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:440)
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
	at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:142)
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
	at com.amazonaws.http.conn.$Proxy33.connect(Unknown Source)
	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
	at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1343)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154)
	... 46 more
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:456)
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:323)
	at sun.security.validator.Validator.validate(Validator.java:271)
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:315)
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:223)
	at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:129)
	at sun.security.ssl.CertificateMessage$T12CertificateConsumer.checkServerCerts(CertificateMessage.java:638)
	... 76 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
	at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
	at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:451)
	... 82 more
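One possible way to achieve the latter, sketched here as an untested idea: copy the local JDK's default cacerts store (preserving the default trusted roots that the AWS SDK relies on), import the RDS CA into the copy, and ship the merged store via --archives. The $JAVA_HOME-relative path and the default "changeit" password are assumptions about the JDK used to build the archive.

    # Untested sketch: merge the RDS CA into a copy of the default cacerts
    # so the default trusted roots are kept alongside the new certificate.
    cp "$JAVA_HOME/lib/security/cacerts" ./cacerts.jks
    keytool -importcert -keystore ./cacerts.jks -storepass changeit \
        -alias rds -file rds-ca-2019-root.pem -noprompt
    zip cacerts.jks.zip cacerts.jks
    aws s3 cp cacerts.jks.zip s3://YOUR-BUCKET/prefix/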
ali-m-d
answered a year ago
0

Hello there,

Thank you for your reply.

While I'm not able to test this currently, I can see that the documentation for the MongoDB connection URL does have a reference to tlsCAFile for TLS connections. Would you be able to quickly run a test using the same? That is, use the 's3://your-bucket/prefix/rds-combined-ca-bundle.pem.zip#rds' archive as before, and then reference the extracted file in the connection URL, e.g.:

    mongodb://db0.example.com,db1.example.com,db2.example.com/?tls=true&tlsCAFile=/home/hadoop/rds/rds-combined-ca-bundle.pem

Please note, this is untested and shared on the assumption that it may help; please test and confirm. If you need further support on this, I would highly recommend reaching out to AWS Support by raising a case with us.

AWS
SUPPORT ENGINEER
answered a year ago
0

Krishnadas M, the MongoDB Spark Connector uses the MongoDB Java Driver, which, unlike the mongo shell and other drivers, doesn't support tlsCAFile as a connection option. Accordingly, the driver logs WARN uri: Connection string contains unsupported option 'tlscafile' when the parameter is included in the connection URI (https://jira.mongodb.org/browse/JAVA-3066).

ali-m-d
answered a year ago
0

Thank you for your reply.

I understand the issue better now and I'll need to run tests at my end to see if there is any workaround to accomplish what you are asking here.

Is it possible for you to share the steps you followed including the code (please remove any sensitive information before sharing) that I can use as a template to run my tests?

I can also use the sample available in the documentation to prepare and run my tests, but unsure if that would be the same as what you are using. Please advise.

AWS
SUPPORT ENGINEER
answered a year ago
0

Sorry for the delay in replying, Krishnadas M. Below is the CDK code I used to create the ZIP archive containing the JKS file with the SSL certificate (cacerts.Dockerfile contains only the line FROM --platform=linux/amd64 amazoncorretto:11.0.17-alpine), along with the CLI command for starting the Spark job on EMR Serverless.

Since your last message, the AWS team has announced the ability to customise the EMR Serverless base image. This allows me to use a custom Java image that has the certificate pre-installed into the TrustStore, so that approach, rather than the one demonstrated below, is the one I am now using to solve the problem (a sketch of such an image follows the CLI command).

Thank you for your help!

// Bundle the RDS certificate into a JKS TrustStore inside a Corretto
// container, producing a ZIP asset that can later be passed to --archives.
const certificatesAsset = new assets.Asset(
    this,
    'BundledCertificates',
    {
        path: path.join(os.homedir(), '.ssh'),
        bundling: {
            image: DockerImage.fromBuild('../ml-pipeline', {
                file: 'cacerts.Dockerfile'
            }),
            // Import the PEM into a PKCS12 store, then convert it to JKS.
            command: [
                'sh',
                '-c',
                `keytool -importcert \
                -keystore /asset-output/cacerts.p12 \
                -storepass ${process.env.TRUSTSTORE_PASSWORD} \
                -file rds-ca-2019-root.pem \
                -alias RDS \
                -noprompt; \
                keytool -importkeystore \
                -srckeystore /asset-output/cacerts.p12 \
                -srcstoretype pkcs12 \
                -srcstorepass ${process.env.TRUSTSTORE_PASSWORD} \
                -destkeystore /asset-output/cacerts.jks \
                -deststorepass ${process.env.TRUSTSTORE_PASSWORD} \
                -deststoretype jks`
            ]
        }
    }
);

// Deploy the bundled ZIP to the destination bucket without extracting it,
// so that it can be referenced directly via --conf spark.archives.
const certificatesDeployment = new s3deploy.BucketDeployment(
    this,
    'CertificatesDeployment',
    {
        sources: [
            s3deploy.Source.bucket(
                certificatesAsset.bucket,
                certificatesAsset.s3ObjectKey
            )
        ],
        destinationBucket: bucket,
        destinationKeyPrefix: 'cacerts',
        extract: false
    }
);

// Export the S3 URI of the uploaded archive for use in start-job-run.
new CfnOutput(this, 'CertificatesURI', {
    value: certificatesDeployment.deployedBucket.s3UrlForObject(
        path.join(
            'cacerts',
            Fn.select(0, certificatesDeployment.objectKeys)
        )
    ),
    exportName: 'CertificatesURI'
});

aws emr-serverless start-job-run \
    --region eu-west-1 \
    --application-id <application-ID> \
    --execution-role-arn <role-ARN> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<bucket-name>/jobs/<filename>.py",
            "entryPointArguments": [],
            "sparkSubmitParameters": "--conf spark.archives=s3://<bucket-name>/cacerts/<asset-hash>.zip#cacerts,s3://<bucket-name>/artifacts/packages.tar.gz#environment --conf spark.jars=s3://<bucket-name>/artifacts/uber-JAR.jar --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.mongodb.input.uri=mongodb://<username>:<password>@<instance-name>.••••••.eu-west-1.docdb.amazonaws.com:27017/••••••?tls=true&replicaSet=rs0&readPreference=secondaryPreferred&directConnection=true&retryWrites=false --conf spark.mongodb.output.uri=mongodb://<username>:<password>@<instance-name>.••••••.eu-west-1.docdb.amazonaws.com:27017/••••••?tls=true&replicaSet=rs0&readPreference=secondaryPreferred&directConnection=true&retryWrites=false --conf spark.driver.extraJavaOptions=-Djavax.net.ssl.trustStore=./cacerts/cacerts.jks --conf spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=./cacerts/cacerts.jks"
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://<bucket-name>/logs/"
            }
        }
    }'
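For reference, the custom-image approach mentioned above might look roughly like the Dockerfile below. This is an untested sketch: the base-image tag, the $JAVA_HOME-relative cacerts path, and the default "changeit" store password are all assumptions to be checked against the EMR Serverless custom-image documentation.

    FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest
    USER root
    # Import the RDS root CA into the image's default Java TrustStore, so no
    # -Djavax.net.ssl.trustStore override is needed at job submission time.
    COPY rds-ca-2019-root.pem /tmp/rds-ca-2019-root.pem
    RUN keytool -importcert \
        -keystore "$JAVA_HOME/lib/security/cacerts" -storepass changeit \
        -alias rds -file /tmp/rds-ca-2019-root.pem -noprompt
    # EMR Serverless requires the image to run as the hadoop user.
    USER hadoop:hadoop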
ali-m-d
answered a year ago
0

@ali-m-d, could you please provide a detailed answer on how you solved this issue?

Rajesh
answered 9 months ago
