Issue with accessing Glue Data Catalog with Spark


Hello!
I'm forwarding my issue from StackOverflow, where I was unable to find a correct answer.
I'm using Spark 2.4.0 on EMR from spark-shell (run as the hadoop user on the master node) and trying to store a simple DataFrame in S3 using the AWS Glue Data Catalog. The EMR cluster uses the automatically generated default IAM roles. The code is below:

val peopleTable = spark.sql("select * from emrdb.testtableemr")
val filtered = peopleTable.filter("name = 'Andrzej'")
filtered.repartition(1).write.format("hive").mode("append").saveAsTable("emrdb.destDir")

The table emrdb.testtableemr exists in my Glue Data Catalog and was created by a Glue Crawler over an S3 directory that contains a single JSON file:

{"Name": "Andrzej", "Surname": "WenWen", "age": "32"}
{"Name": "Tomasz", "Surname": "Tomtom", "age": "42"}
{"Name": "Andrzej", "Surname": "Golota", "age": "52"}

The code above works as expected: the data is filtered and stored in the S3 directory linked to the AWS Glue table emrdb.destDir. (The emrdb.destDir table was also created by a crawler; I put the same file into that table's directory so the crawler would create the same structure.) The issue is that, although the write completes correctly, it still throws the exception below:

scala> filtered.repartition(1).write.format("hive").mode("append").saveAsTable("emrdb.destDir")
org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Can not create a Path from an empty string;
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
  at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:843)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadTable(ExternalCatalogWithListener.scala:159)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:259)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
  at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:66)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
  at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:465)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:444)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:400)
  ... 49 elided
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:163)
  at org.apache.hadoop.fs.Path.<init>(Path.java:175)
  at org.apache.hadoop.hive.metastore.Warehouse.getDatabasePath(Warehouse.java:172)
  at org.apache.hadoop.hive.metastore.Warehouse.getTablePath(Warehouse.java:184)
  at org.apache.hadoop.hive.metastore.Warehouse.getFileStatusesForUnpartitionedTable(Warehouse.java:520)
  at org.apache.hadoop.hive.metastore.MetaStoreUtils.updateUnpartitionedTableStatsFast(MetaStoreUtils.java:180)
  at com.amazonaws.glue.shims.AwsGlueSparkHiveShims.updateTableStatsFast(AwsGlueSparkHiveShims.java:62)
  at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.alterTable(GlueMetastoreClientDelegate.java:534)
  at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:400)
  at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:497)
  at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:485)
  at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1669)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:878)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:780)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:780)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:780)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
  at org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:779)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:845)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:843)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:843)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
  ... 74 more

I get the same error when executing the following:

filtered.repartition(1).write.mode("append").insertInto("emrdb.destDir")

but I am able to store the data when I bypass the Glue Data Catalog:

filtered.repartition(1).write.format("json").mode("append").save("s3://awenclaw-emr-test/destDir/")

This makes me think the issue is on the Data Catalog side.

The solution suggested on StackOverflow throws the same error:

filtered.repartition(1).write.option("path", "s3://awenclaw-emr-test/destDir/").format("hive").mode("append").saveAsTable("emrdb.destDir")

So my question is: how do I correctly store a Spark DataFrame into a Glue Data Catalog table without the error messages mentioned above?

Here is the link to StackOverflow in case you need more details on the answers I received:
https://stackoverflow.com/questions/54441163/writing-spark-dataframe-to-hive-table-through-aws-glue-data-cataloug
Thanks in advance.
Andrzej

Edited by: awenclaw on Feb 6, 2019 1:25 AM

Asked 5 years ago, 2,672 views

1 Answer

I had a lot of issues like this recently, and after dozens of hours of investigating I found the solution:
Set a Location URI for the database you are using in the Glue Data Catalog console (in your case, for "emrdb").

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
The documentation talks about this problem, but it was not very clear to me.

"Having a default database without a location URI causes failures when you create a table. As a workaround, use the LOCATION clause to specify a bucket location, such as s3://mybucket, when you use CREATE TABLE. Alternatively create tables within a database other than the default database."

mpierre
Answered 4 years ago
