Configuration for EMR Serverless step using LibPostal

Hi, we are in the process of moving from EMR to EMR Serverless. We are currently converting our Airflow DAGs from the EMR format to the EMR Serverless format.

We are having trouble getting one of our steps to work with a LibPostal tar file. The step worked with LibPostal on regular EMR, but on EMR Serverless it fails with an error saying the native library cannot be found.

This is the error we are receiving:

23/12/06 05:43:47 WARN TaskSetManager: Lost task 0.0 in stage 15.0 (TID 16) ([2406:da1c:335:9401:3576:7d9e:47d7:8c2c] executor 5): org.apache.spark.SparkException: Task failed while writing rows.
    at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:500)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:322)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$16(FileFormatWriter.scala:230)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:133)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1474)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.UnsatisfiedLinkError: no jpostal_expander in java.library.path
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
    at java.lang.Runtime.loadLibrary0(Runtime.java:843)
    at java.lang.System.loadLibrary(System.java:1136)
    at com.mapzen.jpostal.ExpanderOptions$Builder.<clinit>(ExpanderOptions.java:180)

We have checked with our own script that the uploaded LibPostal tar file contains all the files needed and extracts successfully.

Here is the configuration for our step method on EMR:

                    "--conf","spark.executor.extraJavaOptions=-XX:InitiatingHeapOccupancyPercent=" + str(config[data_source]['cleanseCaseClass']['initiatingHeapOccupancyPercent']) + " -DlibpostalDataDir=./libpostal-1.1_libpostal_datadir.tar.gz",                
                    "--conf","spark.executor.memoryOverhead=" + str(config[data_source]['cleanseCaseClass']['spark.executor.memoryOverhead']), 
                    "--conf","spark.yarn.dist.archives=/home/hadoop/libpostal/libpostal-1.1_joint.tar.gz,/home/hadoop/libpostal/libpostal-1.1_libpostal_datadir.tar.gz",
                    "--conf","spark.executor.extraLibraryPath=./libpostal-1.1_joint.tar.gz",

Here is the configuration we have set for EMR Serverless:

                "--conf "+"spark.executor.extraJavaOptions=-XX:InitiatingHeapOccupancyPercent=35 "+
                "--conf "+"spark.executor.memoryOverhead=1g "+
                "--conf "+"spark.archives=s3://[bucket]/jars/libpostal/libpostal-1.1_joint.tar.gz,s3://[bucket]/jars/libpostal/libpostal-1.1_libpostal_datadir.tar.gz "+
                "--conf "+"spark.executor.extraJavaOptions=-Djava.library.path=./libpostal-1.1_joint.tar.gz "+
                "--conf "+"spark.executor.extraJavaOptions=-DlibpostalDataDir=./libpostal-1.1_libpostal_datadir.tar.gz "+

Fixes we have tried:

  • Splitting -DlibpostalDataDir and -Djava.library.path in spark.executor.extraJavaOptions into two separate lines, because -DlibpostalDataDir is no longer recognised when both are on one line
  • Changing the file paths given to -DlibpostalDataDir and -Djava.library.path
  • Including the tar files in the --jars and --files config
  • Changing spark.yarn.dist.archives to spark.archives
  • Setting an extraction location in spark.archives

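A note on the first fix in the list: Spark keeps one value per property, so when the same key is passed in several --conf entries, only the last one survives. The sketch below illustrates this with the Serverless parameters from the question. The helper `parse_conf` is hypothetical and only mimics that "last value wins" behaviour with a plain dictionary; the `shlex` tokenization is an assumption and may differ from how EMR Serverless actually splits the parameter string.

```python
import shlex


def parse_conf(params: str) -> dict:
    """Collect --conf key=value pairs; a repeated key keeps only its last value."""
    tokens = shlex.split(params)  # assumption: shell-like, whitespace-based splitting
    conf = {}
    i = 0
    while i < len(tokens):
        if tokens[i] == "--conf" and i + 1 < len(tokens):
            # Split on the first "=" only, since JVM options contain "=" themselves.
            key, _, value = tokens[i + 1].partition("=")
            conf[key] = value  # overwrites any earlier value for the same key
            i += 2
        else:
            i += 1
    return conf


# The Serverless parameters from the question, minus the spark.archives entry.
params = (
    "--conf spark.executor.extraJavaOptions=-XX:InitiatingHeapOccupancyPercent=35 "
    "--conf spark.executor.memoryOverhead=1g "
    "--conf spark.executor.extraJavaOptions=-Djava.library.path=./libpostal-1.1_joint.tar.gz "
    "--conf spark.executor.extraJavaOptions=-DlibpostalDataDir=./libpostal-1.1_libpostal_datadir.tar.gz "
)

conf = parse_conf(params)
print(conf["spark.executor.extraJavaOptions"])
# prints: -DlibpostalDataDir=./libpostal-1.1_libpostal_datadir.tar.gz
```

If the real parser behaves this way, the -Djava.library.path setting never reaches the executors, which would be consistent with the UnsatisfiedLinkError above.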
Any help is appreciated, thank you

  • Have you tried using a custom image to bundle these dependencies?

  • Hi Yokesh, thank you for your reply. Could you please elaborate on this? LibPostal is a C library; can it be bundled and used this way? And what would the input configuration be once these dependencies are bundled, e.g. the jar and the Java library path? Thank you

1 Answer
Hello,

The error you are seeing, java.lang.UnsatisfiedLinkError: no jpostal_expander in java.library.path, means that the Java Virtual Machine (JVM) cannot find the LibPostal native library (jpostal_expander) on the specified library path. This usually comes from a misconfigured library path or from the native libraries not being deployed where the executors look for them. Could you please try a configuration like the following example, which sets spark.executor.extraJavaOptions once with all of the JVM options together:

"--conf", "spark.executor.extraJavaOptions=-XX:InitiatingHeapOccupancyPercent=35 -Djava.library.path=<extracted_libpostal_library_path> -DlibpostalDataDir=<extracted_libpostal_data_directory>",
"--conf", "spark.executor.memoryOverhead=1g",
"--conf", "spark.archives=s3://[bucket]/jars/libpostal/libpostal-1.1_joint.tar.gz,s3://[bucket]/jars/libpostal/libpostal-1.1_libpostal_datadir.tar.gz",
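To keep every JVM option in a single spark.executor.extraJavaOptions entry when the arguments are generated from a DAG, something along these lines may help. This is a minimal sketch, not a tested EMR Serverless setup: the helper `spark_conf_args` is hypothetical, and the two angle-bracket paths are the unresolved placeholders from the example above, which still need to be replaced with the directories where the archives are actually extracted.

```python
def spark_conf_args(jvm_options, extra_conf):
    """Render Spark properties as ["--conf", "key=value", ...].

    All executor JVM options are joined into ONE extraJavaOptions value,
    so no later --conf entry can silently replace an earlier one.
    """
    conf = {"spark.executor.extraJavaOptions": " ".join(jvm_options)}
    conf.update(extra_conf)
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Placeholders copied from the example above; replace with the real extracted paths.
LIB_DIR = "<extracted_libpostal_library_path>"
DATA_DIR = "<extracted_libpostal_data_directory>"

args = spark_conf_args(
    jvm_options=[
        "-XX:InitiatingHeapOccupancyPercent=35",
        f"-Djava.library.path={LIB_DIR}",
        f"-DlibpostalDataDir={DATA_DIR}",
    ],
    extra_conf={
        "spark.executor.memoryOverhead": "1g",
        "spark.archives": ",".join([
            "s3://[bucket]/jars/libpostal/libpostal-1.1_joint.tar.gz",
            "s3://[bucket]/jars/libpostal/libpostal-1.1_libpostal_datadir.tar.gz",
        ]),
    },
)

for flag, setting in zip(args[0::2], args[1::2]):
    print(flag, setting)
```

If the parameters are instead passed as one space-separated string, as in the question, the extraJavaOptions value contains spaces and will probably need quoting; how that string is tokenized is not covered by this sketch.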

Alternatively, you can use a custom image with libpostal installed. Please see: https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/application-custom-image.html

AWS
Support Engineer
answered a year ago