AWS Glue Notebook Read Excel

Hi, I am trying to read an Excel file in a Glue PySpark notebook using the Crealytics Spark Excel package. I added all the extra jars and packages through magic commands before creating the session, but I still get this error:

Py4JJavaError: An error occurred while calling o91.load.
: java.lang.NoSuchMethodError: org.apache.xmlbeans.XmlOptions.setDisallowDocTypeDeclaration(Z)Lorg/apache/xmlbeans/XmlOptions;
	at org.apache.poi.ooxml.POIXMLTypeLoader.<clinit>(POIXMLTypeLoader.java:44)
	at org.apache.poi.xssf.model.ThemesTable.readFrom(ThemesTable.java:119)
	at org.apache.poi.xssf.model.ThemesTable.<init>(ThemesTable.java:87)
	at org.apache.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:61)
	at org.apache.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:661)
	at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:165)
	at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:260)
	at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:118)
	at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:98)
	at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
	at org.apache.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
	at org.apache.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
	at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$3(WorkbookReader.scala:107)
	at scala.Option.fold(Option.scala:251)
	at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:107)
	at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:34)
	at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:33)
	at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:92)
	at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:48)
	at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:48)
	at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:121)
	at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:120)
	at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:189)
	at scala.Option.getOrElse(Option.scala:189)
	at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:188)
	at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:52)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:53)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:24)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)

I am using Glue version 4, Spark 3.3.0.

I have also added the following dependent jars:

com.crealytics_spark-excel_2.12-3.3.1_0.18.7.jar,
com.github.pjfanning_excel-streaming-reader-4.0.5.jar,
com.github.pjfanning_poi-shared-strings-2.5.6.jar,
com.github.tototoshi_scala-csv_2.12-1.3.10.jar,
com.github.virtuald_curvesapi-1.07.jar,
com.h2database_h2-2.1.214.jar,
com.norbitltd_spoiwo_2.12-2.2.1.jar,
com.zaxxer_SparseBitSet-1.2.jar,
commons-codec_commons-codec-1.15.jar,
commons-io_commons-io-2.11.0.jar,
org.apache.commons_commons-collections4-4.4.jar,
org.apache.commons_commons-compress-1.23.0.jar,
org.apache.commons_commons-lang3-3.12.0.jar,
org.apache.commons_commons-math3-3.6.1.jar,
org.apache.commons_commons-text-1.10.0.jar,
org.apache.logging.log4j_log4j-api-2.20.0.jar,
org.apache.logging.log4j_log4j-core-2.20.0.jar,
org.apache.poi_poi-5.2.3.jar,
org.apache.poi_poi-ooxml-5.2.3.jar,
org.apache.poi_poi-ooxml-lite-5.2.3.jar,
org.apache.xmlbeans_xmlbeans-5.1.1.jar,
org.scala-lang.modules_scala-collection-compat_2.12-2.9.0.jar,
org.slf4j_slf4j-api-1.7.36.jar,
xml-apis_xml-apis-1.4.01.jar 

Can you please help me solve this issue? Also, how do you go about resolving such dependency issues in Glue? In a local environment or in Databricks, these dependencies are resolved automatically.

Regards

Asked 1 year ago, 823 views
1 Answer

Accepted Answer

That error means the XmlBeans version that Glue 4.0 ships is 3.1.0, which is much older than the 5.1.1 you are providing, and Glue's older copy is the one being loaded.
You can tell Glue to prefer your version by setting the job argument --user-jars-first=true.
Bear in mind that this carries some risk, but as long as all the libraries you add are newer than the ones Glue provides, it should work.
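
In a notebook there is no job definition to edit, so the argument above is usually passed through the `%%configure` magic before the session starts. The sketch below is only a convenience for generating that cell text; `build_configure_cell` is a hypothetical helper name, not part of any Glue library, and it assumes the session accepts job arguments as a JSON dictionary of string values.

```python
import json


def build_configure_cell(job_args):
    """Return the text of a %%configure cell for the given job arguments.

    Hypothetical helper for illustration; not part of AWS Glue.
    """
    # Glue job arguments are strings, so booleans become "true"/"false".
    normalized = {
        key: str(value).lower() if isinstance(value, bool) else str(value)
        for key, value in job_args.items()
    }
    # The magic name goes on the first line, followed by the JSON dictionary.
    return "%%configure\n" + json.dumps(normalized, indent=4)


cell_text = build_configure_cell({"--user-jars-first": True})
print(cell_text)
```

The printed text is meant to be pasted into its own cell and run before the session is created; if a session is already running, it most likely has to be stopped and restarted for the setting to take effect.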

Answered 1 year ago by AWS (Expert)
Reviewed 3 months ago by AWS (Support Engineer)
