Issue with accessing Glue Data Catalog with Spark


Hello!
I'm forwarding my issue from StackOverflow, where I was unable to find a correct answer.
I'm using Spark 2.4.0 on EMR from spark-shell (run as the hadoop user on the master node) and trying to store a simple DataFrame in S3 using the AWS Glue Data Catalog. The EMR cluster uses the automatically generated default IAM roles. The code is below:

val peopleTable = spark.sql("select * from emrdb.testtableemr")
val filtered = peopleTable.filter("name = 'Andrzej'")
filtered.repartition(1).write.format("hive").mode("append").saveAsTable("emrdb.destDir")

The table emrdb.testtableemr exists in my Glue Data Catalog and was created by a Glue Crawler over an S3 directory that contains a single JSON file:

{"Name": "Andrzej", "Surname": "WenWen", "age": "32"}
{"Name": "Tomasz", "Surname": "Tomtom", "age": "42"}
{"Name": "Andrzej", "Surname": "Golota", "age": "52"}

The code above works as expected: the data is filtered and stored in the S3 directory linked to the AWS Glue table emrdb.destDir. (The emrdb.destDir table was also created by a crawler; I put the same file into that table's directory so the crawler would create the same structure.) The issue is that, although the write completes correctly, it still throws the exception below:

scala> filtered.repartition(1).write.format("hive").mode("append").saveAsTable("emrdb.destDir")
org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Can not create a Path from an empty string;
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
  at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:843)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadTable(ExternalCatalogWithListener.scala:159)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:259)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
  at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:66)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
  at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:465)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:444)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:400)
  ... 49 elided
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:163)
  at org.apache.hadoop.fs.Path.<init>(Path.java:175)
  at org.apache.hadoop.hive.metastore.Warehouse.getDatabasePath(Warehouse.java:172)
  at org.apache.hadoop.hive.metastore.Warehouse.getTablePath(Warehouse.java:184)
  at org.apache.hadoop.hive.metastore.Warehouse.getFileStatusesForUnpartitionedTable(Warehouse.java:520)
  at org.apache.hadoop.hive.metastore.MetaStoreUtils.updateUnpartitionedTableStatsFast(MetaStoreUtils.java:180)
  at com.amazonaws.glue.shims.AwsGlueSparkHiveShims.updateTableStatsFast(AwsGlueSparkHiveShims.java:62)
  at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.alterTable(GlueMetastoreClientDelegate.java:534)
  at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:400)
  at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:497)
  at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:485)
  at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1669)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:878)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:780)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:780)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:780)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
  at org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:779)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:845)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:843)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:843)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
  ... 74 more

I get the same error when executing the following:

filtered.repartition(1).write.mode("append").insertInto("emrdb.destDir")

but I am able to store the data when I bypass the Glue Data Catalog:

filtered.repartition(1).write.format("json").mode("append").save("s3://awenclaw-emr-test/destDir/")

This makes me think the issue is on the Data Catalog side.

The solution suggested on StackOverflow throws the same error:

filtered.repartition(1).write.option("path", "s3://awenclaw-emr-test/destDir/").format("hive").mode("append").saveAsTable("emrdb.destDir")

So my question is: how do I correctly store a Spark DataFrame into a Glue Data Catalog table without the error messages mentioned above?

Here is the link to StackOverflow in case you need more details on the answers I received:
https://stackoverflow.com/questions/54441163/writing-spark-dataframe-to-hive-table-through-aws-glue-data-cataloug
Thanks in advance.
Andrzej

Edited by: awenclaw on Feb 6, 2019 1:25 AM

Asked 5 years ago, 2,672 views

1 Answer

I had a lot of issues like this recently, and after dozens of hours of investigating I found the solution:
Set a Location URI for the database you are using in the Glue Data Catalog console (in your case, for "emrdb").

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
The documentation talks about this problem, but it was not very clear to me.

"Having a default database without a location URI causes failures when you create a table. As a workaround, use the LOCATION clause to specify a bucket location, such as s3://mybucket, when you use CREATE TABLE. Alternatively create tables within a database other than the default database."

mpierre
Answered 4 years ago
