AWS Glue Notebook Read Excel


Hi, I am trying to read an Excel file in a Glue PySpark notebook using the Crealytics Spark Excel package. I added all the extra jars and packages through magic commands before creating the session, but I still get this error:

Py4JJavaError: An error occurred while calling o91.load.
: java.lang.NoSuchMethodError: org.apache.xmlbeans.XmlOptions.setDisallowDocTypeDeclaration(Z)Lorg/apache/xmlbeans/XmlOptions;
	at org.apache.poi.ooxml.POIXMLTypeLoader.<clinit>(POIXMLTypeLoader.java:44)
	at org.apache.poi.xssf.model.ThemesTable.readFrom(ThemesTable.java:119)
	at org.apache.poi.xssf.model.ThemesTable.<init>(ThemesTable.java:87)
	at org.apache.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:61)
	at org.apache.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:661)
	at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:165)
	at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:260)
	at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:118)
	at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:98)
	at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
	at org.apache.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
	at org.apache.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
	at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$3(WorkbookReader.scala:107)
	at scala.Option.fold(Option.scala:251)
	at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:107)
	at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:34)
	at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:33)
	at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:92)
	at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:48)
	at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:48)
	at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:121)
	at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:120)
	at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:189)
	at scala.Option.getOrElse(Option.scala:189)
	at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:188)
	at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:52)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:53)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:24)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)

I am using Glue version 4, Spark 3.3.0.

I have also added the following dependent jars:

com.crealytics_spark-excel_2.12-3.3.1_0.18.7.jar,
com.github.pjfanning_excel-streaming-reader-4.0.5.jar,
com.github.pjfanning_poi-shared-strings-2.5.6.jar,
com.github.tototoshi_scala-csv_2.12-1.3.10.jar,
com.github.virtuald_curvesapi-1.07.jar,
com.h2database_h2-2.1.214.jar,
com.norbitltd_spoiwo_2.12-2.2.1.jar,
com.zaxxer_SparseBitSet-1.2.jar,
commons-codec_commons-codec-1.15.jar,
commons-io_commons-io-2.11.0.jar,
org.apache.commons_commons-collections4-4.4.jar,
org.apache.commons_commons-compress-1.23.0.jar,
org.apache.commons_commons-lang3-3.12.0.jar,
org.apache.commons_commons-math3-3.6.1.jar,
org.apache.commons_commons-text-1.10.0.jar,
org.apache.logging.log4j_log4j-api-2.20.0.jar,
org.apache.logging.log4j_log4j-core-2.20.0.jar,
org.apache.poi_poi-5.2.3.jar,
org.apache.poi_poi-ooxml-5.2.3.jar,
org.apache.poi_poi-ooxml-lite-5.2.3.jar,
org.apache.xmlbeans_xmlbeans-5.1.1.jar,
org.scala-lang.modules_scala-collection-compat_2.12-2.9.0.jar,
org.slf4j_slf4j-api-1.7.36.jar,
xml-apis_xml-apis-1.4.01.jar 

Can you please help me solve this issue? Also, how do you go about resolving such dependency issues in Glue? In a local environment or in Databricks, these dependencies get resolved automatically.

Regards

1 Answer
Accepted Answer

The NoSuchMethodError means that the XmlBeans version Glue 4.0 ships is 3.1.0, which is much older than the 5.1.1 you are adding, and the older copy is the one being loaded.
You can tell Glue to prefer your version by setting the job argument --user-jars-first=true
Bear in mind that this carries some risk, but as long as all the libraries you add are newer than the ones Glue provides, it should work.
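
As a rough sketch of how that argument could be set for a notebook session: the snippet below builds the job arguments as a dictionary and prints a cell you could paste at the top of the notebook, before the session starts. It assumes the session picks up job arguments through the %%configure magic. The helper name build_glue_session_args and the S3 paths are hypothetical placeholders, not something from the question.

```python
import json


def build_glue_session_args(extra_jars):
    """Return Glue job arguments that put user-supplied jars first on the classpath.

    build_glue_session_args is a hypothetical helper for illustration only.
    """
    return {
        # Prefer the jars you supply over the versions bundled with Glue.
        "--user-jars-first": "true",
        # Comma-separated S3 paths of the jars to add.
        "--extra-jars": ",".join(extra_jars),
    }


# Hypothetical S3 locations; replace with wherever your jars actually live.
jars = [
    "s3://my-bucket/jars/com.crealytics_spark-excel_2.12-3.3.1_0.18.7.jar",
    "s3://my-bucket/jars/org.apache.xmlbeans_xmlbeans-5.1.1.jar",
]

args = build_glue_session_args(jars)

# Print a cell body to paste into the notebook before the session is created.
print("%%configure")
print(json.dumps(args, indent=2))
```

If you already add the jars with another magic command, the "--user-jars-first" entry alone may be enough. The same dictionary could also serve as the default arguments of a regular Glue job.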

AWS
EXPERT
answered 10 months ago
AWS
SUPPORT ENGINEER
reviewed a month ago
