Questions tagged with AWS Glue
Connecting to a MySQL database hosted on an AWS EC2 instance from another AWS account using a Glue connection
I have two AWS accounts, say S1 and S2. In account S1 I created an EC2 instance and hosted a MySQL database with some tables on it. Now I want to connect to this data store from the other account, S2, using an AWS Glue connection. I created a connection in account S2 using JDBC and the related credentials, but when I test the connection it fails. Can you please guide me on how to create the connection successfully? Thank you.
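For reference, below is a minimal sketch of creating such a cross-account JDBC connection with boto3; all names, IDs, and the endpoint are placeholders, and in practice the usual failure point is networking rather than the connection definition itself (the S2 subnet must be able to reach the S1 instance, e.g. via VPC peering or Transit Gateway, and the security groups must allow MySQL traffic):

~~~~
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# All names, IDs, and the endpoint below are placeholders.
glue.create_connection(
    ConnectionInput={
        "Name": "mysql-on-ec2-crossaccount",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            # Private address of the EC2 instance in account S1, reachable
            # from the S2 subnet (e.g. via VPC peering or Transit Gateway).
            "JDBC_CONNECTION_URL": "jdbc:mysql://10.0.1.25:3306/mydb",
            "USERNAME": "glue_user",
            "PASSWORD": "********",
        },
        "PhysicalConnectionRequirements": {
            # Subnet/security group in the S2 account; the security group
            # needs a self-referencing inbound rule, and the S1 instance's
            # security group must allow inbound 3306 from the S2 side.
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
~~~~

Note that the connection test runs from the subnet in PhysicalConnectionRequirements, so an unreachable private address fails even when the credentials are correct.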
AWS Glue connection does not have a "Test connection" option
Hello, I have created a JDBC connection for a MySQL database that is hosted on another AWS EC2 instance. When I create the connection, I don't find an option to test it. I am following a YouTube video in which the author has a "Test connection" option, but in my console there is no such option. Can you help me in this regard? Thank you.
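While the console UI varies, one workaround is to verify basic reachability yourself from a small Glue Python shell job (or interactive session) that has the connection attached. This sketch only checks that the host/port from your JDBC URL accepts TCP connections; the host and port below are placeholders:

~~~~
import socket

# Host and port from your JDBC URL (placeholders here).
host, port = "10.0.1.25", 3306

try:
    with socket.create_connection((host, port), timeout=10):
        print(f"TCP connection to {host}:{port} succeeded")
except OSError as exc:
    print(f"TCP connection to {host}:{port} failed: {exc}")
~~~~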
Cannot install external Python packages in an AWS Glue Spark script?
I cannot install Python packages like opencv-python (cv2) and scikit-image in an AWS Glue script. I tried .whl and .zip files in the job parameters but am still facing issues installing the packages. Any help would be great. I am also facing issues converting a list of tuples to an RDD using parallelize; any help in this regard would also be really appreciated.
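On Glue 2.0 and later, external packages are usually installed via the --additional-python-modules job parameter rather than raw .whl/.zip uploads. A minimal sketch follows; the package versions are illustrative, and opencv-python-headless is assumed here to avoid GUI library dependencies:

~~~~
# Job parameter (console > Job details > Job parameters), values illustrative:
#   Key:   --additional-python-modules
#   Value: opencv-python-headless==4.5.5.64,scikit-image==0.19.3
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()

import cv2       # importable once the pip install above succeeds
import skimage

# Converting a list of tuples to an RDD with parallelize:
pairs = [("a", 1), ("b", 2), ("c", 3)]
rdd = sc.parallelize(pairs)
print(rdd.collect())   # [('a', 1), ('b', 2), ('c', 3)]
~~~~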
Glue start-blueprint-run running into timeout issues with an increased number of jobs
**Describe the bug**

Using the AWS CLI (version: aws-cli/2.8.6 Python/3.9.11 Windows/10 exe/AMD64 prompt/off), starting a Glue Blueprint run works fine when the number of objects generated inside the workflow (triggers/Glue jobs) is under 30-40 in total. But when more objects are generated inside the Glue workflow, the blueprint run seems to time out and gets stuck in the RUNNING state.

**Expected behavior**

We expect the blueprint run to spin up the workflow with as many jobs as needed. This is a sample of the workflow, where each row consists of 2 jobs with a trigger in the middle for each task, and there could be N tasks like this:

![AWS Glue workflow from Blueprint run](/media/postImages/original/IMdi9yk9dcS7m5jVqlSi8u-A)

**Current behavior**

With fewer than 8-10 rows of tasks, the blueprint run succeeds and doesn't time out, but with more tasks the blueprint run is stuck in the RUNNING state and we never get the workflow generated.

**Reproduction steps**

This is the AWS CLI command we're using right now:

~~~~
aws glue start-blueprint-run --blueprint-name BLUEPRINT_NAME --role-arn IAMRoleARN --parameters "file://FILE_PATH.json" --region us-east-1 --profile test-naga --cli-connect-timeout 900 --cli-read-timeout 900
~~~~

The JSON object takes in a collection of table names, loops over them, and the layout file creates the workflow shown in the image above. It's not easy to reproduce this with the exact same example, but one of the samples at https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/samples/crawl_s3_locations could be used with a higher number of jobs/objects created through the blueprint run.

**Possible solution**

I wonder if this is related to --cli-connect-timeout and --cli-read-timeout, whose default value is 60 seconds: it seems like the blueprint run tries to spin up all the resources in that time, and if there are more objects to spin up and it crosses that limit, the whole process times out and gets stuck in the RUNNING state without doing anything. We also tried setting these values to 0 and still hit the same issue. The number of objects spun up before timing out seems to be random across runs.

**CLI version used**

2.8.6

**Environment details (OS name and version, etc.)**

Windows 10
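One way to rule the CLI timeouts in or out is to start the run and poll its state server-side with boto3 instead of holding a CLI connection open. A minimal sketch, with the blueprint name, role ARN, and parameter file as placeholders taken from the command above:

~~~~
import time

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Blueprint name, role ARN, and parameter file are placeholders.
with open("FILE_PATH.json") as f:
    params = f.read()

run_id = glue.start_blueprint_run(
    BlueprintName="BLUEPRINT_NAME",
    Parameters=params,          # JSON string, same payload as the CLI file://
    RoleArn="arn:aws:iam::123456789012:role/GlueBlueprintRole",
)["RunId"]

# Poll the run state server-side instead of holding a client connection open.
while True:
    state = glue.get_blueprint_run(
        BlueprintName="BLUEPRINT_NAME", RunId=run_id
    )["BlueprintRun"]["State"]
    print("Blueprint run state:", state)
    if state != "RUNNING":
        break
    time.sleep(30)
~~~~

If the run still gets stuck in RUNNING when started this way, the problem is on the service side rather than a client timeout.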
Unable to Relaunch Elasticsearch Connector for AWS Glue from Marketplace
Previously, I was able to subscribe to the [Relaunch Elasticsearch Connector](https://aws.amazon.com/marketplace/pp/prodview-v5ygernwn2gb6). After subscribing and following the instructions on that page, I ended up at "Configure this software" in the Elasticsearch Connector Marketplace subscription. After choosing "Glue version 3.0" and selecting software version "7.13.4-2", I was initially shown a box labeled "Usage instructions", which I was able to follow to get Secrets Manager set up successfully. Below "Usage instructions" there was a generated link reading "Deployment template: Activate connector in AWS Glue". Clicking this convenient link took me to the "AWS Glue Studio > Connectors" page in my account, with a generated MARKETPLACE-type connector.

Through trial and error while setting up the connector, I deleted this MARKETPLACE connector at some point, with seemingly no obvious way to restore or retrieve it. After googling around, I was unable to find any similar issues, so I attempted to unsubscribe from and re-subscribe to the Marketplace connector. Upon doing so and reaching "Usage instructions" again, the "Deployment template: Activate connector in AWS Glue" option had disappeared, with no apparent way to re-create this MARKETPLACE connector in Glue.

Is this expected behavior for Marketplace custom Glue connectors, and, if so, are there any steps to properly recreate a MARKETPLACE custom connector within my account?
EventBridge and Glue Workflow
Hi, I don't have much expertise with EventBridge + Glue workflows. I have AWS DMS configured to migrate our database to an S3 bucket, and I want to perform ETL on the landed data. I can enable event notifications whenever a file is written to the bucket and create EventBridge rules to filter by the S3 key. Is it possible for multiple EventBridge rules to trigger the same Glue workflow with different parameters, or should I have one EventBridge rule and one Glue workflow per table?

1st approach:

~~~~
/database/table1/file1.csv -> EventBridge Rule 1 -> Glue Workflow 1
/database/table2/file2.csv -> EventBridge Rule 2 -> Glue Workflow 2
~~~~

2nd approach, where different events share the same Glue workflow but pass different parameters:

~~~~
/database/table1/file1.csv -> EventBridge Rule 1 -> Glue Workflow
/database/table2/file2.csv -> EventBridge Rule 2 -> Glue Workflow
~~~~

The Glue job will perform deduplication and upsert into an S3 bucket using Apache Iceberg.
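If you go with the 2nd approach, a common pattern is to have each rule invoke a small Lambda that starts the shared workflow and attaches the table name as a workflow run property. A minimal sketch, assuming an S3 "Object Created" EventBridge event and a hypothetical workflow name:

~~~~
import boto3

glue = boto3.client("glue")

# Hypothetical shared workflow name.
WORKFLOW_NAME = "dms-etl-workflow"

def handler(event, context):
    # S3 "Object Created" EventBridge events carry the key in event["detail"].
    key = event["detail"]["object"]["key"]   # e.g. "database/table1/file1.csv"
    table = key.split("/")[1]                # extract the table name

    run_id = glue.start_workflow_run(Name=WORKFLOW_NAME)["RunId"]
    # Pass the table as a run property; jobs in the workflow can read it
    # back via get_workflow_run_properties.
    glue.put_workflow_run_properties(
        Name=WORKFLOW_NAME,
        RunId=run_id,
        RunProperties={"table_name": table},
    )
    return {"workflow_run_id": run_id, "table": table}
~~~~

This keeps one workflow definition to maintain while still letting each event parameterize its own run.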
Glue jobs do not see the current schema (Glue Data Catalog)
Hi,

Context: I would like to extract a table from an Oracle database and write the data in Parquet format to S3. I use a Glue connection, a Glue database, and a Glue crawler. All of that works fine!

Issue: Glue set decimal(38,0) as the column type in the Data Catalog rather than string. I updated the Data Catalog with the new column type. Nevertheless, in my Glue ETL job the column type is still "decimal" (I can see it because I print the schema). I extract the data using create_dynamic_frame.from_catalog(database="", table_name="", transformation_ctx=""). The role has full access to Glue/S3/CloudWatch/EC2. When I print the schema, there is no difference after changing the column type in the Data Catalog. Could you help me?
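Note that for JDBC sources, create_dynamic_frame.from_catalog generally infers the runtime schema from the source itself, so an edit to the catalog type may not show up in the job. One workaround is to cast the column in the job with apply_mapping. A minimal sketch, with the database, table, and column names as placeholders:

~~~~
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Placeholders for database/table/column names.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_oracle_table",
    transformation_ctx="read_oracle",
)

# Cast in the job itself, regardless of what the JDBC driver reports:
# (source column, source type, target column, target type)
dyf_casted = dyf.apply_mapping(
    [("my_decimal_col", "decimal", "my_decimal_col", "string")]
)
dyf_casted.printSchema()
~~~~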
How to pass credentials in Glue notebooks (interactive sessions) using magic commands to override the 1-hour temporary token expiration
Hi, if I start the notebook via the console, the token/credentials expire after one hour, giving the following error: "Exception encountered while retrieving session: An error occurred (ExpiredTokenException) when calling the GetSession operation: The security token included in the request is expired". I am guessing this happens because it uses temporary credentials by default. How does one pass credentials using the magic commands so that they do not expire, or is there a workaround? I can run notebooks locally using a profile in the local .aws folder, but then I can't use tags on the sessions to account for costs.
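While I can't point to an official way to extend the token itself, interactive-session magics do let you control which credentials, role, and tags a session uses. A rough sketch of the first notebook cell; the profile name, role ARN, and tag values are placeholders, and you should run %help in your kernel to confirm which magics your aws-glue-sessions version supports:

~~~~
# First cell, before any Spark statement starts the session.
# Run %help to confirm availability and exact syntax of each magic;
# the profile name, role ARN, and tag values below are placeholders.

%profile my-long-lived-profile
%iam_role arn:aws:iam::123456789012:role/GlueInteractiveSessionRole
%idle_timeout 120
%tags {"team": "analytics", "cost-center": "1234"}
~~~~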
AWS Glue Security Group error confusing
I am receiving the following error from a Glue job I am trying to run:

> JobName:... and JobRunId:... failed to execute with exception At least one security group must open all egress ports.To limit traffic, the source security group in your outbound rule can be restricted to the same security group (Service: AWSGlueJobExecutor ...

I have verified that creating an outbound rule for All traffic, all ports, and destination 0.0.0.0/0 resolves the problem, but I would ideally like to restrict the traffic as much as possible, and I am stuck on the second part of the error, where it claims:

> To limit traffic, the source security group in your outbound rule can be restricted to the same security group

Problem is, last time I checked, outbound (egress) security group rules don't have a "source"; they have a "destination". Am I missing something here, or is the error message problematic?
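What the message seems to intend is that the *destination* of the egress rule can be the security group itself, i.e. a self-referencing rule. A minimal sketch with boto3; the security group ID is a placeholder:

~~~~
import boto3

ec2 = boto3.client("ec2")

SG_ID = "sg-0123456789abcdef0"  # placeholder

# Self-referencing egress rule: all protocols/ports, but only to members
# of the same security group, which is what the error message suggests.
ec2.authorize_security_group_egress(
    GroupId=SG_ID,
    IpPermissions=[
        {"IpProtocol": "-1", "UserIdGroupPairs": [{"GroupId": SG_ID}]}
    ],
)
~~~~

Keep in mind that with only a self-referencing egress rule in place, the job still needs a path to S3 and the Glue API, which is typically provided through VPC endpoints in the same VPC.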
Way to split up lengthy glue job scripts?
There are many lengthy (> 1000 LOC) Glue job scripts for our customers, and they **share common code blocks** that are identical in all of them. The Glue jobs are not executed interactively by a person, but triggered for execution at a particular point in time. Is there a way to split up these lengthy Glue job scripts into several smaller scripts, in order **to isolate the common code blocks** as Python functions? Can you give an example? Having these common code blocks **in one place** instead of in **all of the Glue scripts** would make maintenance much easier.
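One common pattern is to put the shared functions in a separate module on S3 and attach it to each job with the --extra-py-files job parameter. A minimal sketch; the bucket, path, and function are illustrative:

~~~~
# --- common_utils.py, uploaded once to S3 (bucket/path are placeholders) ---
#     e.g. s3://my-bucket/libs/common_utils.py

def normalize_columns(df):
    """Lower-case and trim all column names; shared by every job."""
    for name in df.columns:
        df = df.withColumnRenamed(name, name.strip().lower())
    return df

# --- in each Glue job script ---
# Add the job parameter:
#   Key:   --extra-py-files
#   Value: s3://my-bucket/libs/common_utils.py
# and then the module imports like any other:
#
#   import common_utils
#   df = common_utils.normalize_columns(df)
~~~~

For several modules, the same parameter accepts a .zip containing them; alternatively, you can publish the shared code as a wheel and install it via --additional-python-modules.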
Cause and Solution for com.amazonaws.services.gluejobexecutor.model.InvalidInputException: Entity size has exceeded the maximum allowed size
We have a Glue job using workload partitioning by bounded execution. In a recent run the job failed during the Job.commit call. Based on the message, I assume that the bookmark was too large to save.

1) How would this occur?
2) What options are available to prevent this from happening?
3) How would we recover from this if it occurred in a production environment?

The error stack trace:

~~~~
2022-10-20 17:05:44,413 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(91)): Exception in User Class
com.amazonaws.services.gluejobexecutor.model.InvalidInputException: Entity size has exceeded the maximum allowed size. (Service: AWSGlueJobExecutor; Status Code: 400; Error Code: InvalidInputException; Request ID: xxx; Proxy: null)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1819)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1403)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1372)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
    at com.amazonaws.services.gluejobexecutor.AWSGlueJobExecutorClient.doInvoke(AWSGlueJobExecutorClient.java:6964)
    at com.amazonaws.services.gluejobexecutor.AWSGlueJobExecutorClient.invoke(AWSGlueJobExecutorClient.java:6931)
    at com.amazonaws.services.gluejobexecutor.AWSGlueJobExecutorClient.invoke(AWSGlueJobExecutorClient.java:6920)
    at com.amazonaws.services.gluejobexecutor.AWSGlueJobExecutorClient.executeUpdateJobBookmark(AWSGlueJobExecutorClient.java:6610)
    at com.amazonaws.services.gluejobexecutor.AWSGlueJobExecutorClient.updateJobBookmark(AWSGlueJobExecutorClient.java:6580)
    at com.amazonaws.services.glue.util.AWSGlueJobBookmarkService$$anonfun$commit$1.apply(AWSGlueJobBookmarkService.scala:184)
    at com.amazonaws.services.glue.util.AWSGlueJobBookmarkService$$anonfun$commit$1.apply(AWSGlueJobBookmarkService.scala:183)
    at scala.Option.foreach(Option.scala:257)
    at com.amazonaws.services.glue.util.AWSGlueJobBookmarkService.commit(AWSGlueJobBookmarkService.scala:183)
    at com.amazonaws.services.glue.util.JobBookmark$.commit(JobBookmarkUtils.scala:88)
    at com.amazonaws.services.glue.util.Job$.commit(Job.scala:121)
    at ...
~~~~
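For question 3, one recovery path (assuming you can tolerate reprocessing) is to reset the job bookmark so the next run does not try to load the oversized state. A minimal sketch with boto3; the job name is a placeholder:

~~~~
import boto3

glue = boto3.client("glue")

# Recovery option (job name is a placeholder): clear the oversized bookmark
# so the next run starts with fresh state. NOTE: this makes the job
# reprocess previously bookmarked input, so scope the input path or
# deduplicate downstream before doing this in production.
glue.reset_job_bookmark(JobName="my-partitioned-job")
~~~~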