I've been setting up an AWS Glue test environment in which business users should use tables available in the Glue catalog as data sources for their Glue jobs. The tables in the catalog come from data marts in Snowflake, which are crawled with a Glue crawler via a JDBC connection.
The crawler runs successfully with the provided connection, creating all the tables expected from the database into Glue catalog.
My issue is that when I try to use the tables from the catalog, the job throws the following error:
An error occurred while calling o87.getCatalogSource.
: java.io.InvalidObjectException: JDBC connection URL jdbc:snowflake://<ORG>-<ACC>.snowflakecomputing.com/?user=<USER>&db=<DB>&role=<ROLE>&warehouse=<VW> is not supported. Check the Developer Guide for the list of supported data stores / URL formatting.
As for the connection details, it was created via the management console, using the JDBC URL shown in the error log. Since AWS Glue detects the Snowflake protocol in the URL, I let it handle the driver and related configuration. The same failure occurs whether I use Secrets Manager or store the credentials directly in the connection. The network configuration was also provided correctly.
Other aspects worth mentioning: the IAM role provided has all the policies necessary to run the jobs, as suggested in the docs; the Snowflake credentials stored in Secrets Manager have all the rights needed to access the data; and the job uses Glue 4.0.
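To rule out a malformed URL on my side, I reproduced the scheme check locally. This is only a rough sketch: the supported-scheme list below is my reading of the Developer Guide's "supported data stores" section, not Glue's actual validation code (which lives in JdbcUrl.getMatcherForUrl per the stack trace):

```python
# Schemes I believe Glue's native JDBC catalog source accepts, based on the
# Developer Guide (assumption -- the real validation is inside Glue itself).
SUPPORTED_JDBC_SCHEMES = (
    "jdbc:mysql://",
    "jdbc:postgresql://",
    "jdbc:oracle:thin://@",
    "jdbc:sqlserver://",
    "jdbc:redshift://",
)

def is_supported_jdbc_url(url: str) -> bool:
    """Return True if the URL starts with one of the supported native schemes."""
    return url.startswith(SUPPORTED_JDBC_SCHEMES)

# The Snowflake URL from my connection is rejected, consistent with the error:
print(is_supported_jdbc_url("jdbc:snowflake://org-acc.snowflakecomputing.com/?user=u&db=d"))  # False
print(is_supported_jdbc_url("jdbc:postgresql://host:5432/db"))  # True
```

So if my reading is right, the crawler accepts the Snowflake JDBC URL but the catalog source path in the job does not.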
The generated script is as simple as this:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node Data Catalog table
DataCatalogtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="northwind",
    table_name="northwind_database_public_categories",
    transformation_ctx="DataCatalogtable_node1",
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=DataCatalogtable_node1,
    mappings=[
        ("description", "string", "description", "string"),
        ("picture", "binary", "picture", "boolean"),
        ("category_name", "string", "category_name", "string"),
        ("category_id", "long", "category_id", "bigint"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

# Script generated for node Snowflake Connector 2.11.1 for AWS Glue 4.0
SnowflakeConnector2111forAWSGlue40_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="marketplace.spark",
    connection_options={
        "sfWarehouse": "VW",
        "dbtable": "test",
        "sfDatabase": "GLUE_TEST",
        "sfSchema": "PUBLIC",
        "connectionName": "marketplace_conn",
    },
    transformation_ctx="SnowflakeConnector2111forAWSGlue40_node3",
)
job.commit()
and the complete exception raised:
ERROR ProcessLauncher: Error from Python:Traceback (most recent call last):
File "/tmp/catalog_test.py", line 16, in <module>
DataCatalogtable_node1 = glueContext.create_dynamic_frame.from_catalog(
File "/opt/amazon/lib/python3.7/site-packages/awsglue/dynamicframe.py", line 629, in from_catalog
return self._glue_context.create_dynamic_frame_from_catalog(db, table_name, redshift_tmp_dir, transformation_ctx, push_down_predicate, additional_options, catalog_id, **kwargs)
File "/opt/amazon/lib/python3.7/site-packages/awsglue/context.py", line 184, in create_dynamic_frame_from_catalog
source = DataSource(self._ssql_ctx.getCatalogSource(db, table_name, redshift_tmp_dir, transformation_ctx,
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
return f(*a, **kw)
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o87.getCatalogSource.
: java.io.InvalidObjectException: JDBC connection URL jdbc:snowflake://<ORG>-<ACC>.snowflakecomputing.com/?user=<USER>&db=<DB>&role=<ROLE>&warehouse=<VW> is not supported. Check the Developer Guide for the list of supported data stores / URL formatting.
at com.amazonaws.glue.jdbc.commons.JdbcUrl.getMatcherForUrl(JdbcUrl.java:181)
at com.amazonaws.services.glue.util.DataCatalogWrapper.$anonfun$getJDBCConf$1(DataCatalogWrapper.scala:263)
at scala.util.Try$.apply(Try.scala:209)
at com.amazonaws.services.glue.util.DataCatalogWrapper.getJDBCConf(DataCatalogWrapper.scala:223)
at com.amazonaws.services.glue.GlueContext.getGlueNativeJDBCSource(GlueContext.scala:494)
at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:327)
at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:198)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
I've managed to use catalog tables crawled from native sources such as S3, but couldn't find a way to meet the requirements described above.
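The only workaround I can think of is bypassing the catalog and reading straight from Snowflake with from_options, reusing the same connection options that the write node above already uses. A minimal sketch of how I'd assemble those options (the names, like marketplace_conn, mirror my setup and are assumptions, not a confirmed fix):

```python
def snowflake_read_options(connection_name, database, schema, table, warehouse):
    """Build the connection_options dict for a marketplace.spark Snowflake read,
    mirroring the options already used by the write node in the script above."""
    return {
        "connectionName": connection_name,
        "sfDatabase": database,
        "sfSchema": schema,
        "dbtable": table,
        "sfWarehouse": warehouse,
    }

opts = snowflake_read_options("marketplace_conn", "GLUE_TEST", "PUBLIC", "test", "VW")
# In the job this would feed something like:
# glueContext.create_dynamic_frame.from_options(
#     connection_type="marketplace.spark",
#     connection_options=opts,
#     transformation_ctx="snowflake_read",
# )
```

But that defeats the purpose of exposing the crawled catalog tables to business users, so I'd rather find a way to make the catalog source work.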
Thanks in advance for the support.