AWS Glue Notebook Read Excel

Hi, I am trying to read an Excel file in a Glue PySpark notebook using the Crealytics Spark Excel package. I added all the extra jars and packages through magic commands before creating the session, but I still get this error:

Py4JJavaError: An error occurred while calling o91.load.
: java.lang.NoSuchMethodError: org.apache.xmlbeans.XmlOptions.setDisallowDocTypeDeclaration(Z)Lorg/apache/xmlbeans/XmlOptions;
	at org.apache.poi.ooxml.POIXMLTypeLoader.<clinit>(POIXMLTypeLoader.java:44)
	at org.apache.poi.xssf.model.ThemesTable.readFrom(ThemesTable.java:119)
	at org.apache.poi.xssf.model.ThemesTable.<init>(ThemesTable.java:87)
	at org.apache.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:61)
	at org.apache.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:661)
	at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:165)
	at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:260)
	at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:118)
	at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:98)
	at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
	at org.apache.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
	at org.apache.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
	at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$3(WorkbookReader.scala:107)
	at scala.Option.fold(Option.scala:251)
	at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:107)
	at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:34)
	at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:33)
	at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:92)
	at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:48)
	at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:48)
	at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:121)
	at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:120)
	at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:189)
	at scala.Option.getOrElse(Option.scala:189)
	at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:188)
	at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:52)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:53)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:24)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)

I am using Glue version 4.0 with Spark 3.3.0.

I have also added the following dependency jars:

com.crealytics_spark-excel_2.12-3.3.1_0.18.7.jar,
com.github.pjfanning_excel-streaming-reader-4.0.5.jar,
com.github.pjfanning_poi-shared-strings-2.5.6.jar,
com.github.tototoshi_scala-csv_2.12-1.3.10.jar,
com.github.virtuald_curvesapi-1.07.jar,
com.h2database_h2-2.1.214.jar,
com.norbitltd_spoiwo_2.12-2.2.1.jar,
com.zaxxer_SparseBitSet-1.2.jar,
commons-codec_commons-codec-1.15.jar,
commons-io_commons-io-2.11.0.jar,
org.apache.commons_commons-collections4-4.4.jar,
org.apache.commons_commons-compress-1.23.0.jar,
org.apache.commons_commons-lang3-3.12.0.jar,
org.apache.commons_commons-math3-3.6.1.jar,
org.apache.commons_commons-text-1.10.0.jar,
org.apache.logging.log4j_log4j-api-2.20.0.jar,
org.apache.logging.log4j_log4j-core-2.20.0.jar,
org.apache.poi_poi-5.2.3.jar,
org.apache.poi_poi-ooxml-5.2.3.jar,
org.apache.poi_poi-ooxml-lite-5.2.3.jar,
org.apache.xmlbeans_xmlbeans-5.1.1.jar,
org.scala-lang.modules_scala-collection-compat_2.12-2.9.0.jar,
org.slf4j_slf4j-api-1.7.36.jar,
xml-apis_xml-apis-1.4.01.jar 

Can you please help me solve this issue? Also, how do you go about resolving this kind of dependency issue in Glue? In a local environment or in Databricks, these dependencies are resolved automatically.

Regards

Asked 1 year ago, 813 views
1 Answer
Accepted Answer

The error means that the XmlBeans version bundled with Glue 4.0 is 3.1.0, which is much older than the 5.1.1 jar you supply, so the older class is loaded and the method POI expects is missing.
You can tell Glue to prefer your version by setting the job argument --user-jars-first=true.
Bear in mind that this carries some risk, but as long as all the libraries you add are newer than the bundled ones, it should work.
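
To make the "newer than what Glue ships" condition concrete, here is a rough Python sketch that compares the jar file names from the question against the runtime versions you know about. The helper names (`parse_jar`, `version_key`, `overlapping_jars`) and the `RUNTIME_VERSIONS` table are hypothetical. The only entry in that table is the XmlBeans 3.1.0 figure from this answer, so any other library would have to be checked against the actual runtime. The last lines print a JSON body carrying the argument; how you pass it to your session (for example through the notebook's session configuration) is an assumption you should verify for your setup.

```python
import json
import re

# Version that, according to the answer above, Glue 4.0 ships for XmlBeans.
# Any other entry added here would have to be verified against the real runtime.
RUNTIME_VERSIONS = {"xmlbeans": "3.1.0"}

# Matches file names such as "org.apache.xmlbeans_xmlbeans-5.1.1.jar".
JAR_NAME = re.compile(
    r"^(?P<group>[^_]+)_(?P<artifact>.+)-(?P<version>\d[^-]*)\.jar$"
)


def parse_jar(name):
    """Split a '<group>_<artifact>-<version>.jar' name into (artifact, version)."""
    match = JAR_NAME.match(name.strip().rstrip(","))
    if match is None:
        raise ValueError(f"unrecognised jar name: {name!r}")
    return match["artifact"], match["version"]


def version_key(version):
    """Turn '5.1.1' into (5, 1, 1) so that versions compare numerically."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def overlapping_jars(user_jars, runtime_versions=RUNTIME_VERSIONS):
    """Map each overlapping artifact to (runtime_version, user_version, user_is_newer)."""
    overlaps = {}
    for name in user_jars:
        artifact, user_version = parse_jar(name)
        runtime_version = runtime_versions.get(artifact)
        if runtime_version is not None:
            newer = version_key(user_version) > version_key(runtime_version)
            overlaps[artifact] = (runtime_version, user_version, newer)
    return overlaps


# Two of the jars listed in the question.
user_jars = [
    "org.apache.poi_poi-ooxml-5.2.3.jar",
    "org.apache.xmlbeans_xmlbeans-5.1.1.jar",
]
overlaps = overlapping_jars(user_jars)
print(overlaps)

# Only suggest the argument when every overlapping jar is newer than the bundled one.
if overlaps and all(newer for _, _, newer in overlaps.values()):
    print(json.dumps({"--user-jars-first": "true"}, indent=2))
```

With the two jars above, only `xmlbeans` overlaps with the table, and it is reported as newer than the bundled version. A check like this only covers libraries you have listed yourself, so it does not replace testing the job after changing the classpath order.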

Answered 1 year ago by an AWS Expert
Reviewed 3 months ago by an AWS Support Engineer
