Sounds like that's just a connectivity issue.
If the server URL is not public, you will need to run the Glue job inside a VPC (by creating a Network type connection and assigning it to the Glue job).
More info: https://docs.aws.amazon.com/glue/latest/dg/connection-JDBC-VPC.html
BTW, you can tell it to take the config from the connection directly instead of extracting it yourself. https://docs.amazonaws.cn/en_us/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc
It runs for 41 minutes and then times out:
ConnectTimeoutError: Connect timeout on endpoint URL: "https://glue.us-east-1.amazonaws.com/"
That means your job runs in a VPC that has neither internet connectivity nor a VPC endpoint for the regional Glue service (com.amazonaws.us-east-1.glue).
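As a rough sketch of adding the endpoint named above, the snippet below builds the arguments for boto3's `ec2.create_vpc_endpoint` call. The function names and the example IDs are hypothetical, and it assumes an interface endpoint with private DNS enabled, which in turn needs DNS support and DNS hostnames turned on for the VPC and a security group that allows HTTPS (443) from the job.

```python
def glue_endpoint_request(region, vpc_id, subnet_ids, security_group_ids):
    """Build the keyword arguments for ec2.create_vpc_endpoint (Glue service)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        # Service name format taken from the answer above.
        "ServiceName": f"com.amazonaws.{region}.glue",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Assumption: private DNS is wanted so that the default
        # glue.<region>.amazonaws.com hostname resolves inside the VPC.
        "PrivateDnsEnabled": True,
    }


def create_glue_endpoint(region, vpc_id, subnet_ids, security_group_ids):
    """Create the endpoint. Needs AWS credentials and ec2:CreateVpcEndpoint."""
    import boto3  # imported lazily so the builder above has no dependencies

    ec2 = boto3.client("ec2", region_name=region)
    request = glue_endpoint_request(region, vpc_id, subnet_ids, security_group_ids)
    return ec2.create_vpc_endpoint(**request)
```

For example, `create_glue_endpoint("us-east-1", "vpc-123", ["subnet-a"], ["sg-1"])` would request the endpoint for the placeholder VPC; the same endpoint can also be created from the VPC console.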
Thank you for that info. Let me be more specific. I get that error when I specify the connector in the Glue job configuration. If I remove it, I get further along. Without it, I am able to use two other methods to retrieve the connection details:
# Excerpt from the body of a helper function
import json
import boto3

secrets_manager_client = boto3.client(
    service_name="secretsmanager",
    region_name="us-east-1",
)
secret = secrets_manager_client.get_secret_value(SecretId=secret_name)
connection_info = json.loads(secret["SecretString"])
return connection_info
# Excerpt from the body of a helper function
import boto3

glue_client = boto3.client(service_name="glue", region_name="us-east-1")
connection_info = glue_client.get_connection(Name=connection_name)
return connection_info
But I am unable to connect with any of the following methods, which differ in the credentials format they take as input:
df = spark.read.jdbc(
    url="<url>",
    table=db_table,
    properties={
        "driver": "org.postgresql.Driver",
        # "user": connection_details["username"],
        # "password": connection_details["password"],
        "user": connection_details["Connection"]["username"],
        "password": connection_details["Connection"]["password"],
    },
)
dynamic_frame = (
    glue_context
    .read.format("jdbc")
    .option("driver", "org.postgresql.Driver")
    .option("url", "<url>")
    # .option("url", connection_details["Connection"]["ConnectionProperties"]["JDBC_CONNECTION_URL"])
    .option("dbtable", db_table)
    .option("user", connection_details["username"])
    .option("password", connection_details["password"])
    # .option("user", connection_details["Connection"]["ConnectionProperties"]["USERNAME"])
    # .option("password", connection_details["Connection"]["ConnectionProperties"]["PASSWORD"])
    .load()
)
connection_details = {
    "dbTable": db_table,
    "connectionName": connector_name,
    "useConnectionProperties": True,
    "url": "<url>",
}
dynamic_frame = glue_context.create_dynamic_frame_from_options(
    connection_type="postgresql",
    connection_options=connection_details,
)
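Since the two lookups return differently shaped dictionaries, one way to avoid toggling commented-out lines is to normalize them first. This is only a sketch: `jdbc_options` is a hypothetical helper, it assumes the secret stores `username` and `password` keys as in the snippets above, and it assumes the Glue connection stores `USERNAME` and `PASSWORD` directly in its `ConnectionProperties` (a connection that references a secret instead will not have those keys).

```python
def jdbc_options(details, url=None):
    """Return {"url", "user", "password"} from either lookup result.

    `details` is either the parsed Secrets Manager secret or the response
    of glue_client.get_connection(Name=...).
    """
    if "Connection" in details:
        # Shape of the Glue get_connection response.
        props = details["Connection"]["ConnectionProperties"]
        return {
            "url": url or props["JDBC_CONNECTION_URL"],
            "user": props["USERNAME"],
            "password": props["PASSWORD"],
        }
    if url is None:
        # Assumption: the secret holds credentials only, not the JDBC URL.
        raise ValueError("url is required when details come from a secret")
    return {
        "url": url,
        "user": details["username"],
        "password": details["password"],
    }
```

The result can then be passed on unchanged, for example `properties={"driver": "org.postgresql.Driver", "user": opts["user"], "password": opts["password"]}` with `opts = jdbc_options(connection_details, url="<url>")`.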
Whether the catalog gets checked depends on the API you use. The point is that the job should be able to do that: keep the Glue service endpoint accessible, or you can run into issues at any time.
The VPC issue only shows up when trying to write data out to S3. I can call df.collect() and log a sample of the data just fine using:
connection_details = {
    "dbTable": db_table,
    "connectionName": connector_name,
    "useConnectionProperties": True,
    "url": "<url>",
}
dynamic_frame = glue_context.create_dynamic_frame_from_options(
    connection_type="postgresql",
    connection_options=connection_details,
)
df = dynamic_frame.toDF()
Yes, as long as your job doesn't try to access the catalog for any reason, directly or indirectly (for instance, using a SparkSession might list the databases in the catalog). It's better to let the job reach the catalog through the endpoint; otherwise it can break at any moment.
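To find out early whether the job can reach the Glue API, rather than waiting for a long connect timeout, a small preflight check at the top of the script might look like the sketch below. Both helper names are hypothetical, and a successful TCP connection only shows that the network path exists, not that the job's IAM permissions are sufficient.

```python
import socket


def glue_api_host(region):
    """Hostname of the regional Glue API, as seen in the error message."""
    return f"glue.{region}.amazonaws.com"


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

Calling `can_reach(glue_api_host("us-east-1"))` at startup and raising an error when it returns False makes a missing endpoint visible within a few seconds.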
I ended up going with the following because converting a dynamic_frame to a Spark dataframe is eager, which caused performance issues in some of my jobs that use this util function I created.
Allowing the VPC access to the Glue endpoint enabled some Glue jobs to now work properly with the connection. However, jobs requiring a public Python package (e.g., apache-sedona) no longer have access to install the public package when running within the connector VPC.
Do you have any specific recommendation as to how to account for this, in addition to accounting for com.amazonaws.us-east-1.glue?