Hi,
I am trying to read an Excel file in a Glue PySpark notebook.
I am using the Crealytics Spark Excel package. I added all the extra jars and packages through the magic commands before creating the session, but I still get this error:
Py4JJavaError: An error occurred while calling o91.load.
: java.lang.NoSuchMethodError: org.apache.xmlbeans.XmlOptions.setDisallowDocTypeDeclaration(Z)Lorg/apache/xmlbeans/XmlOptions;
at org.apache.poi.ooxml.POIXMLTypeLoader.<clinit>(POIXMLTypeLoader.java:44)
at org.apache.poi.xssf.model.ThemesTable.readFrom(ThemesTable.java:119)
at org.apache.poi.xssf.model.ThemesTable.<init>(ThemesTable.java:87)
at org.apache.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:61)
at org.apache.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:661)
at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:165)
at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:260)
at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:118)
at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:98)
at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
at org.apache.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
at org.apache.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$3(WorkbookReader.scala:107)
at scala.Option.fold(Option.scala:251)
at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:107)
at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:34)
at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:33)
at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:92)
at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:48)
at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:48)
at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:121)
at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:120)
at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:189)
at scala.Option.getOrElse(Option.scala:189)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:188)
at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:52)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:53)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:24)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
I am using Glue version 4 with Spark 3.3.0.
I have also added the following dependent jars:
com.crealytics_spark-excel_2.12-3.3.1_0.18.7.jar,
com.github.pjfanning_excel-streaming-reader-4.0.5.jar,
com.github.pjfanning_poi-shared-strings-2.5.6.jar,
com.github.tototoshi_scala-csv_2.12-1.3.10.jar,
com.github.virtuald_curvesapi-1.07.jar,
com.h2database_h2-2.1.214.jar,
com.norbitltd_spoiwo_2.12-2.2.1.jar,
com.zaxxer_SparseBitSet-1.2.jar,
commons-codec_commons-codec-1.15.jar,
commons-io_commons-io-2.11.0.jar,
org.apache.commons_commons-collections4-4.4.jar,
org.apache.commons_commons-compress-1.23.0.jar,
org.apache.commons_commons-lang3-3.12.0.jar,
org.apache.commons_commons-math3-3.6.1.jar,
org.apache.commons_commons-text-1.10.0.jar,
org.apache.logging.log4j_log4j-api-2.20.0.jar,
org.apache.logging.log4j_log4j-core-2.20.0.jar,
org.apache.poi_poi-5.2.3.jar,
org.apache.poi_poi-ooxml-5.2.3.jar,
org.apache.poi_poi-ooxml-lite-5.2.3.jar,
org.apache.xmlbeans_xmlbeans-5.1.1.jar,
org.scala-lang.modules_scala-collection-compat_2.12-2.9.0.jar,
org.slf4j_slf4j-api-1.7.36.jar,
xml-apis_xml-apis-1.4.01.jar
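A NoSuchMethodError like the one above often means that a different version of the class was loaded than the one the caller was compiled against. One way to check whether more than one of the supplied jars ships org.apache.xmlbeans.XmlOptions is to scan the archives. This is only a minimal sketch, assuming the jars have been downloaded to a local directory; the function name find_jars_containing and the directory are hypothetical and not part of Glue or spark-excel.

```python
import zipfile
from pathlib import Path


def find_jars_containing(class_name, jar_dir):
    """Return the sorted file names of jars in jar_dir that contain class_name."""
    # A class a.b.C is stored inside a jar as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        # A jar is a zip archive, so the standard zipfile module can read it.
        with zipfile.ZipFile(jar) as archive:
            if entry in archive.namelist():
                matches.append(jar.name)
    return matches
```

Calling find_jars_containing("org.apache.xmlbeans.XmlOptions", "jars") would list the matching jars in a local "jars" folder. It only covers the jars supplied by hand; a copy of the class that is already on the Glue classpath would not show up in this scan.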
Can you please help me solve this issue?
Also, how do you go about resolving this kind of dependency issue in Glue? In a local environment or in Databricks, these dependencies get resolved automatically.
Regards