Issue with accessing Glue Data Catalog with Spark

0

Hello!
I'm forwarding my issue from Stack Overflow, where I was unable to find a correct answer.
I'm using Spark 2.4.0 on EMR from spark-shell (run as user hadoop on the master node) and trying to store a simple DataFrame in S3 using the AWS Glue Data Catalog. EMR generated the default IAM roles automatically. The code is below:

val peopleTable = spark.sql("select * from emrdb.testtableemr")
val filtered = peopleTable.filter("name = 'Andrzej'")
filtered.repartition(1).write.format("hive").mode("append").saveAsTable("emrdb.destDir")

The table emrdb.testtableemr exists in my Glue Data Catalog and was created by a Glue crawler over an S3 directory containing a single JSON file:

{"Name": "Andrzej", "Surname": "WenWen", "age": "32"}
{"Name": "Tomasz", "Surname": "Tomtom", "age": "42"}
{"Name": "Andrzej", "Surname": "Golota", "age": "52"}

The above code works as expected: the data is filtered and stored in the S3 directory linked with the AWS Glue table emrdb.destDir. (The emrdb.destDir table was also created by the crawler; I put the same file in the table's directory so the crawler would produce the same structure.) The issue is that although the write succeeds, it still throws the exception below:

scala> filtered.repartition(1).write.format("hive").mode("append").saveAsTable("emrdb.destDir")
org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Can not create a Path from an empty string;
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
  at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:843)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadTable(ExternalCatalogWithListener.scala:159)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:259)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
  at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:66)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
  at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:465)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:444)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:400)
  ... 49 elided
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:163)
  at org.apache.hadoop.fs.Path.<init>(Path.java:175)
  at org.apache.hadoop.hive.metastore.Warehouse.getDatabasePath(Warehouse.java:172)
  at org.apache.hadoop.hive.metastore.Warehouse.getTablePath(Warehouse.java:184)
  at org.apache.hadoop.hive.metastore.Warehouse.getFileStatusesForUnpartitionedTable(Warehouse.java:520)
  at org.apache.hadoop.hive.metastore.MetaStoreUtils.updateUnpartitionedTableStatsFast(MetaStoreUtils.java:180)
  at com.amazonaws.glue.shims.AwsGlueSparkHiveShims.updateTableStatsFast(AwsGlueSparkHiveShims.java:62)
  at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.alterTable(GlueMetastoreClientDelegate.java:534)
  at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:400)
  at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:497)
  at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:485)
  at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1669)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:878)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:780)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:780)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:780)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
  at org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:779)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:845)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:843)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:843)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
  ... 74 more

I get the same error when executing:

filtered.repartition(1).write.mode("append").insertInto("emrdb.destDir")

but I'm able to store the data when bypassing the Glue Data Catalog:

filtered.repartition(1).write.format("json").mode("append").save("s3://awenclaw-emr-test/destDir/")

This makes me think the issue is on the Data Catalog side.

The solution suggested on Stack Overflow throws the same error:

filtered.repartition(1).write.option("path", "s3://awenclaw-emr-test/destDir/").format("hive").mode("append").saveAsTable("emrdb.destDir")

So my question is: how do I correctly store a Spark DataFrame into a Glue Data Catalog table without the errors mentioned above?

And here is the link to Stack Overflow in case you need more details on the answer I received:
https://stackoverflow.com/questions/54441163/writing-spark-dataframe-to-hive-table-through-aws-glue-data-cataloug
Thanks in advance.
Andrzej

Edited by: awenclaw on Feb 6, 2019 1:25 AM

asked 5 years ago · 2623 views
1 Answer
0

I had a lot of issues like this recently, and after dozens of hours of investigation I found the solution:
Set a location URI for the database you are using in the Glue Data Catalog console (in your case, for "emrdb"). A programmatic sketch follows.
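If you would rather apply the same change from code, here is a minimal sketch in Scala. It assumes the AWS Java SDK Glue classes are available on the spark-shell classpath, and the emrdb/ prefix under the question's bucket is only an example location, so adjust both to your environment.

// Hedged sketch: give the Glue database "emrdb" an explicit LocationUri.
// The SDK availability and the exact S3 prefix are assumptions, not part of the original setup.
import com.amazonaws.services.glue.AWSGlueClientBuilder
import com.amazonaws.services.glue.model.{DatabaseInput, UpdateDatabaseRequest}

val glue = AWSGlueClientBuilder.defaultClient()
val dbInput = new DatabaseInput()
  .withName("emrdb")
  .withLocationUri("s3://awenclaw-emr-test/emrdb/")
glue.updateDatabase(new UpdateDatabaseRequest().withName("emrdb").withDatabaseInput(dbInput))

The same one-time change can also be made in the Glue console or with the AWS CLI (aws glue update-database).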

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
The documentation mentions this problem, but it was not very clear to me:

"Having a default database without a location URI causes failures when you create a table. As a workaround, use the LOCATION clause to specify a bucket location, such as s3://mybucket, when you use CREATE TABLE. Alternatively create tables within a database other than the default database."

mpierre
answered 4 years ago
