Questions tagged with AWS Glue

Browse through the questions and answers listed below or filter and sort to narrow down your results.

Hi community, I am trying to perform an ETL job using AWS Glue. Our data is stored in MongoDB Atlas, inside a VPC, and our AWS account is connected to MongoDB Atlas using VPC peering. To perform the ETL job in AWS Glue, I first created a connection using the VPC details and the MongoDB Atlas URI along with the username and password. The connection is used by the AWS Glue crawlers to extract the schema into AWS Data Catalog tables, and this connection works! However, when I attempt to perform the actual ETL job using the following PySpark code:

```python
# My temp variables
source_database = "d*********a"
source_table_name = "main_businesses"
source_mongodb_db_name = "main"
source_mongodb_collection = "businesses"

glueContext.create_dynamic_frame.from_catalog(
    database=source_database,
    table_name=source_table_name,
    additional_options={"database": source_mongodb_db_name, "collection": source_mongodb_collection},
)
```

the connection times out and for some reason MongoDB Atlas blocks the connection from the ETL job. It's as if the ETL job uses the connection differently than the crawler does. Maybe the ETL job is not running inside our AWS VPC that is peered with the MongoDB Atlas VPC (VPC peering is not possible?). Does anyone have any idea what might be going on or how I can fix this? Thank you!
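One thing worth checking is whether the Glue connection is attached to the ETL job itself (under the job's "Connections" property): only then does the job run inside the peered VPC the way the crawler does. As a minimal sketch (the URI, username, and password below are placeholders, not values from the post), reading the collection directly with the documented MongoDB connection options makes any networking or auth failure surface immediately in the job logs:

```python
# Minimal sketch with placeholder values: read the MongoDB Atlas collection directly
# so a timeout or auth error shows up right away, independent of the Data Catalog table.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mongodb",
    connection_options={
        "uri": "mongodb://<atlas-host>:27017",  # placeholder URI
        "database": "main",
        "collection": "businesses",
        "username": "<user>",                   # placeholder
        "password": "<password>",               # placeholder
    },
)
print(dyf.count())
```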
0
answers
0
votes
3
views
asked 3 minutes ago
My files are CSV files with 3 fields using tab separation. The built-in CSV classifier creates a schema of 3 strings for the 3 attributes: a:string, b:string, c:string. However, my last attribute **c** is a JSON string. I would like to know if it's possible, using a custom classifier, to create an extra attribute **endpoint** that results from some pattern matching (grok or regex). Let's say the JSON string **c** looks like this:
```
{"log":{"timestampReceived":"2023-03-10 01:43:24,563 +0000","component":"ABC","subp":"0","eventStream":"DEF","endpoint":"sales"}}
```
I would like to pull endpoint:sales into a new attribute of my table and use it as a partition, so the table would end up like: a:string b:string c:string **endpoint**:string (partitioned)
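If a classifier turns out not to support derived columns, one alternative is to extract the field in an ETL job and partition the output on it. A rough sketch under assumed names (the database `my_db`, table `my_table`, and the S3 output path are hypothetical placeholders):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import get_json_object

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog names; replace with the crawled table.
dyf = glue_context.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")
df = dyf.toDF()

# Pull "endpoint" out of the JSON string held in column "c".
df = df.withColumn("endpoint", get_json_object("c", "$.log.endpoint"))

# Write the data partitioned by the derived column (placeholder bucket).
df.write.mode("append").partitionBy("endpoint").parquet("s3://my-bucket/output/")
```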
0
answers
0
votes
8
views
asked 6 hours ago
How to get the "RuleResults" generated by a glue studio job that have some dataquality rules, in the dataQuality Tab of the job i can manually download the results to a file when appears the "RuleResults". i have one step function where calls that job, i would like to know the output of that file(where was generated and the key in s3, too) to evaluate in a next step(i.e lambda function) which rules were'nt passed and which ones yes. tkx
0
answers
0
votes
3
views
Willi5
asked 8 hours ago
Hi, I have a Glue job running with PySpark. It is taking too long to write the dynamic frame to S3: for around 1200 records, the write alone takes around 500 seconds. I have observed that even if the data frame is empty, it still takes the same amount of time to write to S3. Below are the code snippets:

```python
test1_df = test_df.repartition(1)
invoice_extract_final_dyf = DynamicFrame.fromDF(test1_df, glueContext, "invoice_extract_final_dyf")
glueContext.write_dynamic_frame.from_options(
    frame=invoice_extract_final_dyf,
    connection_type="s3",
    connection_options={"path": destination_path},
    format="json",
)
```

The conversion in the 2nd line and the write to S3 consume most of the time. Any help will be appreciated. Let me know if any further details are needed.
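Because Spark evaluates lazily, the `repartition(1)` and every upstream transformation only execute when the write is triggered, so the 500 seconds may include the whole pipeline rather than the S3 write alone. A rough sketch (not from the original post) that forces the upstream work first and then times just the write, using the plain DataFrame writer for comparison:

```python
import time

# Force upstream transformations so their cost is not attributed to the write.
test1_df = test_df.repartition(1).cache()
print("row count:", test1_df.count())

# Time only the S3 write; destination_path is the same S3 path as in the original snippet.
start = time.time()
test1_df.write.mode("append").json(destination_path)
print("write took", time.time() - start, "seconds")
```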
1
answers
0
votes
15
views
asked 11 hours ago
I set up the resources to trigger a Glue job through EventBridge. But when I tested it in the console, Invocations == FailedInvocations == TriggeredRules == 1. What can I do to fix it?
```
######### AWS Glue Workflow ############
# Create a Glue workflow that triggers the Glue job
resource "aws_glue_workflow" "example_glue_workflow" {
  name        = "example_glue_workflow"
  description = "Glue workflow that triggers the example_glue_job"
}

resource "aws_glue_trigger" "example_glue_trigger" {
  name          = "example_glue_trigger"
  workflow_name = aws_glue_workflow.example_glue_workflow.name
  type          = "EVENT"

  actions {
    job_name = aws_glue_job.example_glue_job.name
  }
}

######### AWS EventBridge ##############
resource "aws_cloudwatch_event_rule" "example_etl_trigger" {
  name        = "example_etl_trigger"
  description = "Trigger Glue job when a request is made to the API endpoint"

  event_pattern = jsonencode({
    "source": ["example_api"]
  })
}

resource "aws_cloudwatch_event_target" "glue_job_target" {
  rule      = aws_cloudwatch_event_rule.example_etl_trigger.name
  target_id = "example_event_target"
  arn       = aws_glue_workflow.example_glue_workflow.arn
  role_arn  = local.example_role_arn
}
```
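One possible angle (a guess, not a confirmed diagnosis): FailedInvocations on an EventBridge rule targeting a Glue workflow can mean the target role is not allowed to notify the workflow (the `glue:notifyEvent` action used in the AWS tutorial policy), or that the test event does not carry the shape the rule expects. A minimal sketch for sending a custom event that matches the rule's `source` pattern via boto3 — the `DetailType` and detail body are hypothetical values:

```python
import json
import boto3

events = boto3.client("events")

# Send a custom event whose "Source" matches the rule's event_pattern ("example_api").
response = events.put_events(
    Entries=[
        {
            "Source": "example_api",
            "DetailType": "etl-request",              # free-form; hypothetical value
            "Detail": json.dumps({"reason": "manual test"}),
        }
    ]
)
print(response["FailedEntryCount"], response["Entries"])
```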
0
answers
0
votes
4
views
asked 12 hours ago
Unable to add Custom Data Types of a classifier through a CloudFormation template. It can only be done through the console; there is no such parameter available in the CloudFormation template.
0
answers
1
votes
14
views
nikhil
asked 2 days ago
Using Glue we can crawl a Snowflake table properly into the catalog, but Athena fails to query the table data: HIVE_UNSUPPORTED_FORMAT: Unable to create input format. Search results suggest it's because the table created by the crawler has an empty "input format" / "output format", and yes, they are empty for this table crawled from Snowflake. So the questions are: 1) Why didn't the crawler set them? (The crawler classifies the table as Snowflake correctly.) 2) What should the values be if a manual edit is needed? Can a Snowflake table be queried by Athena at all? Any idea? Thanks.
1
answers
0
votes
32
views
asked 2 days ago
Hi AWS experts, I have code that reads data from AWS Aurora PostgreSQL, and I want to bookmark the table on a custom column named 'ceres_mono_index'. But it seems the bookmark still uses the primary key as the bookmark key instead of the column 'ceres_mono_index'. Here is the code:
```python
cb_ceres = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": f"jdbc:postgresql://{ENDPOINT}:5432/{DBNAME}",
        "dbtable": "xxxxx_raw_ceres",
        "user": username,
        "password": password,
    },
    additional_options={"jobBookmarkKeys": ["ceres_mono_index"], "jobBookmarkKeysSortOrder": "asc"},
    transformation_ctx="cb_ceres_bookmark",
)
```
How could I fix this issue? Thank you.
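One thing that may be worth ruling out (a guess, not a confirmed fix): `additional_options` is the parameter used with `from_catalog`, while for `from_options` on a JDBC source the bookmark keys are normally passed inside `connection_options`. A minimal sketch of that variant, keeping everything else from the snippet above:

```python
cb_ceres = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": f"jdbc:postgresql://{ENDPOINT}:5432/{DBNAME}",
        "dbtable": "xxxxx_raw_ceres",
        "user": username,
        "password": password,
        # Bookmark settings moved into connection_options (assumption to verify
        # against the Glue JDBC connection-options documentation).
        "jobBookmarkKeys": ["ceres_mono_index"],
        "jobBookmarkKeysSortOrder": "asc",
    },
    transformation_ctx="cb_ceres_bookmark",
)
```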
1
answers
0
votes
31
views
asked 3 days ago
Hello, I am trying to run my first job in AWS Glue, but I am encountering the following error: "An error occurred while calling o103.pyWriteDynamicFrame. /run-1679066163418-part-r-00000 (Permission denied)". The error message indicates that permission has been denied. I am using an IAM role that has AmazonS3FullAccess, AWSGlueServiceRole, and even AdministratorAccess. Although I understand that this is not ideal for security reasons, I added this policy to ensure that the IAM role is not the issue. I have attempted to use different sources (such as DynamoDB and S3) and targets (such as Redshift and the Data Catalog), but I consistently receive the same error. Does anyone know how I can resolve this issue? Thank you in advance!
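For what it's worth (an observation, not a confirmed diagnosis): the failing path in the error, `/run-1679066163418-part-r-00000`, has no `s3://` scheme, which is what you see when a sink path resolves to the local filesystem instead of S3. A minimal sketch of an S3 sink with an explicit URI, using placeholder names for the frame and bucket:

```python
# Hedged sketch with placeholder names: confirm the sink path is a full S3 URI.
glueContext.write_dynamic_frame.from_options(
    frame=output_dyf,                                        # hypothetical frame name
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},   # placeholder bucket/prefix
    format="json",
)
```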
1
answers
0
votes
18
views
asked 4 days ago
How can I apply a negative exclusion pattern in the configuration of my crawler? I would like to negate every folder that does not match the following: !prd**/queries/** In other words, I want to exclude everything that does not match this pattern.
0
answers
0
votes
9
views
asked 4 days ago
I tried to set up cross-account Athena access. I can see the database in Lake Formation, Glue, and Athena under the target account. At the beginning I did not see any tables in the target Athena console. After I did something in the Lake Formation console (target account) I could see one table in the target Athena console and query it successfully. But I could not see the other tables from the same database, even though I tried many ways. I always get the error below, even though I granted the KMS access everywhere (both on the KMS key and the IAM role) and even turned off KMS encryption in Glue. I don't know what the actual reason is. Below is an example of the error message: The ciphertext refers to a customer master key that does not exist, does not exist in this region, or you are not allowed to access. (Service: AWSKMS; Status Code: 400; Error Code: AccessDeniedException; Request ID: cb9a754f-fc1c-414d-b526-c43fa96d3c13; Proxy: null) (Service: AWSGlue; Status Code: 400; Error Code: GlueEncryptionException; Request ID: 0c785fdf-e3f7-45b2-9857-e6deddecd6f9; Proxy: null) This query ran against the "xxx_lakehouse" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: b2c74c7e-21ed-4375-8712-cd1579eab9a7. I have already added the permissions pointed out in https://repost.aws/knowledge-center/cross-account-access-denied-error-s3. Does anyone know how to fix the error and see the cross-account tables in Athena? Thank you very much.
1
answers
0
votes
30
views
asked 5 days ago
I followed all the steps mentioned in https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html and the steps mentioned in https://www.youtube.com/watch?v=04LMQxDxjGM. When I run the `jupyter notebook` command from PyCharm, it opens in Internet Explorer. But when I try to create a Jupyter notebook directly in PyCharm, I am not getting the Glue PySpark or Glue Spark kernel options as shown in the video. ![Enter image description here](/media/postImages/original/IMnbE0t-KjQFqO5Ux1CTKNYQ) ![Enter image description here](/media/postImages/original/IMVjMw2IzQQn6b7LDUrY935A) Also, it never showed me https://localhost as shown in the video.
1
answers
0
votes
31
views
asked 5 days ago