I have a Glue Job that migrates data from a Postgres database in RDS to an S3 bucket in Parquet format, and a Crawler that connects to Postgres to infer the table schema. Previously, the Glue Connection used by both authenticated to Postgres with a username and password; now I would like it to authenticate with credentials stored in Secrets Manager instead. After updating the Glue Connection to use Secrets Manager, the Glue Job fails with the following error:
2022-12-14 14:58:07,092 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Error from Python:Traceback (most recent call last):
File "/tmp/parquet-job.py", line 25, in <module>
database=glue_source_database, table_name=table, transformation_ctx="Datasource")
File "/opt/amazon/lib/python3.6/site-packages/awsglue/dynamicframe.py", line 787, in from_catalog
return self._glue_context.create_dynamic_frame_from_catalog(db, table_name, redshift_tmp_dir, transformation_ctx, push_down_predicate, additional_options, catalog_id, **kwargs)
File "/opt/amazon/lib/python3.6/site-packages/awsglue/context.py", line 186, in create_dynamic_frame_from_catalog
makeOptions(self._sc, additional_options), catalog_id),
File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o75.getCatalogSource.
: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:349)
at scala.None$.get(Option.scala:347)
at com.amazonaws.services.glue.util.DataCatalogWrapper.$anonfun$getJDBCConf$1(DataCatalogWrapper.scala:218)
at scala.util.Try$.apply(Try.scala:209)
at com.amazonaws.services.glue.util.DataCatalogWrapper.getJDBCConf(DataCatalogWrapper.scala:209)
at com.amazonaws.services.glue.GlueContext.getGlueNativeJDBCSource(GlueContext.scala:487)
at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:320)
at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:185)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
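For reference, this is roughly how I switched the connection over to Secrets Manager; the connection name, JDBC URL, secret name, and VPC settings below are all placeholders:
import boto3

glue = boto3.client("glue")
glue.update_connection(
    Name="postgres-connection",  # placeholder: the connection shared by the Job and the Crawler
    ConnectionInput={
        "Name": "postgres-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://example-host.rds.amazonaws.com:5432/postgres",  # placeholder endpoint
            "JDBC_ENFORCE_SSL": "false",
            # SECRET_ID replaces the previous USERNAME and PASSWORD properties
            "SECRET_ID": "my-postgres-secret",  # placeholder secret name
        },
        # VPC settings carried over unchanged from the existing connection
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    },
)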
This is the code for the Glue Job script:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'TABLE_NAME', 'PARQUET_LOCATION', 'PARQUET_TABLE_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Catalog database populated by the Crawler; catalog table names use underscores
glue_source_database = "postgres-db"
table = args['TABLE_NAME'].replace("-", "_")
output_s3_path = args['PARQUET_LOCATION'] + "/" + args['PARQUET_TABLE_NAME']

# Read the source table through the Data Catalog, which uses the Glue Connection
Datasource = glueContext.create_dynamic_frame.from_catalog(
    database=glue_source_database, table_name=table, transformation_ctx="Datasource")  # the script fails at this line
print("Items Count: ", Datasource.count())
Datasource.printSchema()

# Keep only the id and name columns, types unchanged
Transform = ApplyMapping.apply(frame=Datasource, mappings=[
    ("id", "int", "id", "int"),
    ("name", "string", "name", "string")],
    transformation_ctx="Transform")

# Write the result to S3 as Parquet
DF = Transform.toDF()
formatted_path = output_s3_path.replace("_", "-")
DF.write.mode('overwrite').format('parquet').save(formatted_path)
job.commit()
I have confirmed that the IAM role for AWS Glue has permission to access my secret. The Crawler runs successfully using the same connection as the Glue Job, and reverting the connection to username and password authentication makes the job succeed again. Is there anything I am missing? Could this be a bug in AWS Glue?
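To rule out permissions, the check I ran under the same IAM role as the Glue Job was along these lines (the secret name is a placeholder):
import boto3
import json

secrets = boto3.client("secretsmanager")
# Succeeds only if the role is allowed secretsmanager:GetSecretValue on this secret
response = secrets.get_secret_value(SecretId="my-postgres-secret")  # placeholder secret name
creds = json.loads(response["SecretString"])
print("Secret is readable; fields:", sorted(creds.keys()))  # e.g. ['password', 'username']; never print the values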
That makes sense. Hopefully that feature will be implemented soon. Thank you!
I get the same error when using from_options, calling it roughly as sketched below.
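The connection name and table here are placeholders; the rest of the setup matches the script above:
# Same glueContext as in the posted script
Datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "useConnectionProperties": "true",  # pull the URL and credentials from the Glue Connection
        "connectionName": "postgres-connection",  # placeholder connection name
        "dbtable": "my_table",  # placeholder table name
    },
    transformation_ctx="Datasource",
)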
Is it a similar issue?
Is there an ETA for this fix now? I ran into the same issue recently and was wondering if something would change soon.
Thanks a lot for the help