AWS Glue Notebook Read Excel

0

Hi, I am trying to read Excel file in Glue PySpark notebook. I am using Crealytics Spark Excel package. I have added all the extra jars and packages through the magic commands, before creating the sessions. But still I get this error

Py4JJavaError: An error occurred while calling o91.load.
: java.lang.NoSuchMethodError: org.apache.xmlbeans.XmlOptions.setDisallowDocTypeDeclaration(Z)Lorg/apache/xmlbeans/XmlOptions;
	at org.apache.poi.ooxml.POIXMLTypeLoader.<clinit>(POIXMLTypeLoader.java:44)
	at org.apache.poi.xssf.model.ThemesTable.readFrom(ThemesTable.java:119)
	at org.apache.poi.xssf.model.ThemesTable.<init>(ThemesTable.java:87)
	at org.apache.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:61)
	at org.apache.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:661)
	at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:165)
	at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:260)
	at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:118)
	at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:98)
	at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
	at org.apache.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
	at org.apache.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
	at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$3(WorkbookReader.scala:107)
	at scala.Option.fold(Option.scala:251)
	at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:107)
	at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:34)
	at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:33)
	at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:92)
	at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:48)
	at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:48)
	at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:121)
	at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:120)
	at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:189)
	at scala.Option.getOrElse(Option.scala:189)
	at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:188)
	at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:52)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:53)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:24)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)

I am using Glue version 4, Spark 3.3.0.

I have added following dependent jars as well.

com.crealytics_spark-excel_2.12-3.3.1_0.18.7.jar,
com.github.pjfanning_excel-streaming-reader-4.0.5.jar,
com.github.pjfanning_poi-shared-strings-2.5.6.jar,
com.github.tototoshi_scala-csv_2.12-1.3.10.jar,
com.github.virtuald_curvesapi-1.07.jar,
com.h2database_h2-2.1.214.jar,
com.norbitltd_spoiwo_2.12-2.2.1.jar,
com.zaxxer_SparseBitSet-1.2.jar,
commons-codec_commons-codec-1.15.jar,
commons-io_commons-io-2.11.0.jar,
org.apache.commons_commons-collections4-4.4.jar,
org.apache.commons_commons-compress-1.23.0.jar,
org.apache.commons_commons-lang3-3.12.0.jar,
org.apache.commons_commons-math3-3.6.1.jar,
org.apache.commons_commons-text-1.10.0.jar,
org.apache.logging.log4j_log4j-api-2.20.0.jar,
org.apache.logging.log4j_log4j-core-2.20.0.jar,
org.apache.poi_poi-5.2.3.jar,
org.apache.poi_poi-ooxml-5.2.3.jar,
org.apache.poi_poi-ooxml-lite-5.2.3.jar,
org.apache.xmlbeans_xmlbeans-5.1.1.jar,
org.scala-lang.modules_scala-collection-compat_2.12-2.9.0.jar,
org.slf4j_slf4j-api-1.7.36.jar,
xml-apis_xml-apis-1.4.01.jar 

Can you please help me in solving this issue? And how do you go about resolving such dependency issue in Glue. As in local environment or in Databricks these dependencies get resolved automatically.

Regards

已提問 10 個月前檢視次數 700 次
1 個回答
1
已接受的答案

That means the version XmlBeans Glue 4.0 brings is 3.1.0 which is much older than the one you bring 5.1.1
You can tell Glue to use your version by using the job argument --user-jars-first=true
Bear in mind that is a risk but as long as all the libraries you add are newer, it should work.

profile pictureAWS
專家
已回答 10 個月前
AWS
支援工程師
已審閱 1 個月前

您尚未登入。 登入 去張貼答案。

一個好的回答可以清楚地回答問題並提供建設性的意見回饋,同時有助於提問者的專業成長。

回答問題指南