Questions tagged with AWS Glue
When I click the "Create Crawler" button in the AWS Glue service, it fails, even though I set up the generated IAM role with the "AdministratorAccess" permission policy for this account. Please help me solve this issue. Thank you so much.
My error: Account xxxxxxxxxxxx denied access
Hi Team,
I'm new to AWS Data Lake configuration. As per the documentation, it refers to creating a Glue connection which will be used internally by the AWS Blueprint.
My database is running on an EC2 instance in a different AWS account.
I have set up the connection using JDBC with the connection string "jdbc:sqlserver://SRV_IP:1433;database=db_name", but the test connection failed with the following error:
Check that your connection definition references your JDBC database with correct URL syntax, username, and password. The TCP/IP connection to the host SRV_IP, port 1433 has failed. Error: "Connection timed out: no further information. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall.".
The following has already been configured/tested:
1) The connection security group has an inbound rule for the port.
2) The subnet used in the connection has a route table that points to a NAT gateway.
3) I am able to connect to the DB over the internet, so there is no issue with the DB/EC2 security group.
Please advise.
I am referring to: https://aws.amazon.com/blogs/big-data/integrating-aws-lake-formation-with-amazon-rds-for-sql-server/
Except that the DB is self-managed on AWS EC2.
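For reference, this is essentially the connection I created, expressed with boto3 (the connection name, credentials, subnet, security group, and availability zone below are all placeholders for the values used in the console):
```python
import boto3

glue = boto3.client("glue")

# Sketch of the Glue JDBC connection definition; all identifiers below are placeholders.
glue.create_connection(
    ConnectionInput={
        "Name": "sqlserver-ec2-connection",  # placeholder name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:sqlserver://SRV_IP:1433;database=db_name",
            "USERNAME": "db_user",      # placeholder
            "PASSWORD": "db_password",  # placeholder
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",           # placeholder
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],  # placeholder
            "AvailabilityZone": "us-east-1a",                 # placeholder
        },
    }
)
```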
Regards,
Nikhil Shah
Hello,
Within a Step Function, I want to pass parameters received from the input to a Glue:ListCrawlers state to run the related crawlers in Glue, and after running all the crawlers I also need to pass the input parameters to a Choice state to run Glue jobs. Because the output of Glue:ListCrawlers is only the crawler names, I cannot use OutputPath to keep the input parameters. Can anyone help me with how to receive the input parameters within the Choice state?
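To illustrate why the input gets lost: the ListCrawlers call only returns crawler names, as the boto3 equivalent shows (the tag values here are taken from my input, shown below):
```python
import boto3

glue = boto3.client("glue")

# Equivalent of the Glue:ListCrawlers task: filter crawlers by tag.
# The response contains only the matching crawler names, nothing from the state input.
response = glue.list_crawlers(Tags={"client": "test", "environment": "qa"})
print(response["CrawlerNames"])  # e.g. ["crawler-a", "crawler-b"]
```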
The Step Function is run with this JSON input:
{
"Comment": "Run Glue workflow",
"tags": {
"client": "test",
"environment": "qa"},
"glue":{
"client": "test",
"environment": "qa",
"partitionDate" : "2023-03-2"
}}
Step function definition:
{
"Comment": "A utility state machine to run all Glue Crawlers that match tags",
"StartAt": "Pass",
"States": {
"Pass": {
"Type": "Pass",
"Next": "Get Glue Crawler List based on Client, Environment"
},
"Get Glue Crawler List based on Client, Environment": {
"Next": "Run Glue Crawlers for all LOB",
"OutputPath": "$.CrawlerNames",
"Parameters": {
"Tags.$": "$.tags"
},
"Resource": "arn:aws:states:::aws-sdk:glue:listCrawlers",
"Retry": [
{
"BackoffRate": 5,
"ErrorEquals": [
"States.ALL"
],
"IntervalSeconds": 2,
"MaxAttempts": 3
}
],
"Type": "Task"
},
"Run Glue Crawlers for all LOB": {
"Iterator": {
"StartAt": "Run Glue Crawler for each LOB",
"States": {
"Run Glue Crawler for each LOB": {
"Catch": [
{
"ErrorEquals": [
"States.ALL"
],
"Next": "Crawler Success"
}
],
"Next": "Crawler Success",
"Parameters": {
"Input": {
"crawler_name.$": "$"
},
"StateMachineArn": "arn:aws:states:ca-central-1:467688788830:stateMachine:run_each_crawler"
},
"Resource": "arn:aws:states:::states:startExecution.sync:2",
"Retry": [
{
"BackoffRate": 5,
"ErrorEquals": [
"States.ALL"
],
"IntervalSeconds": 2,
"MaxAttempts": 1
}
],
"Type": "Task"
},
"Crawler Success": {
"Type": "Succeed"
}
}
},
"ResultPath": null,
"Type": "Map",
"Next": "Choice",
"MaxConcurrency": 4
},
"Choice": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.environment",
"StringEquals": "qa",
"Next": "start glue-qa-dh"
},
{
"Variable": "$.environment",
"StringEquals": "uat",
"Next": "start glue-uat-dh"
},
Hi community,
I am trying to perform an ETL job using AWS Glue.
Our data is stored in MongoDB Atlas, inside a VPC.
Our AWS account is connected to our MongoDB Atlas using VPC peering.
To perform the ETL job in AWS Glue, I first created a connection using the VPC details and the MongoDB Atlas URI along with the username and password. The connection is used by the AWS Glue crawlers to extract the schema into AWS Data Catalog tables.
This connection works!
However, I am then attempting to perform the actual ETL job using the following PySpark code:
# My temp variables
source_database = "d*********a"
source_table_name = "main_businesses"
source_mongodb_db_name = "main"
source_mongodb_collection = "businesses"
glueContext.create_dynamic_frame.from_catalog(database=source_database, table_name=source_table_name, additional_options={"database": source_mongodb_db_name, "collection": source_mongodb_collection})
However, the connection times out, and for some reason MongoDB Atlas is blocking the connection from the ETL job.
It's as if the ETL job is using the connection differently than the crawler does. Maybe the ETL job is not able to run inside our AWS VPC that is peered with the MongoDB Atlas VPC (is VPC peering not possible for jobs?).
Does anyone have any idea what might be going on or how I can fix this?
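For completeness, my understanding is that the MongoDB connector can also be called directly with from_options instead of going through the catalog; a sketch of that (the URI and credentials are placeholders) is:
```python
# Sketch of reading MongoDB Atlas directly with from_options instead of from_catalog.
# The uri, username, and password values are placeholders.
dynamic_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="mongodb",
    connection_options={
        "uri": "mongodb://atlas-host:27017",  # placeholder Atlas URI
        "database": "main",
        "collection": "businesses",
        "username": "my_user",      # placeholder
        "password": "my_password",  # placeholder
    },
)
```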
Thank you!
My files are CSV files with 3 fields using tab separation.
The built-in CSV classifier creates a schema of 3 strings for the 3 attributes:
a:string
b:string
c:string
However, my last attribute **c** is a JSON string.
I would like to know if it's possible, using a custom classifier, to create an extra attribute **endpoint** that results from some pattern matching (grok or regex).
Let's say the JSON string **c** looks like the example below:
```
{"log":{"timestampReceived":"2023-03-10 01:43:24,563 +0000","component":"ABC","subp":"0","eventStream":"DEF","endpoint":"sales"}}
```
I would like to grab the endpoint value ("sales") into a new attribute of my table and use it as a partition.
It would end up looking something like this:
a:string
b:string
c:string
**endpoint**:string (partitioned)
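If a custom classifier can't do this, a minimal sketch of deriving the attribute inside the job itself with Spark SQL functions might look like the following (the database, table, and output path are placeholders):
```python
from pyspark.sql.functions import get_json_object

# Read the catalog table, derive an "endpoint" column from the JSON in "c",
# and write the data out partitioned by that column. Names and paths are placeholders.
df = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
).toDF()

df = df.withColumn("endpoint", get_json_object("c", "$.log.endpoint"))

df.write.mode("append").partitionBy("endpoint").parquet("s3://my-bucket/output/")
```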
How can I get the "RuleResults" generated by a Glue Studio job that has some data quality rules? In the Data Quality tab of the job I can manually download the results to a file once the "RuleResults" appear. I have a Step Function that calls that job, and I would like to know where that file is generated (the S3 bucket and key) so that a next step (e.g. a Lambda function) can evaluate which rules passed and which did not. Thanks.
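In case it helps, a sketch of pulling the same results through the Glue Data Quality API with boto3 (the job name is a placeholder, and I have not confirmed this covers every Glue Studio data quality transform):
```python
import boto3

glue = boto3.client("glue")

# List the data quality results produced by a job (placeholder name),
# then fetch the rule-by-rule outcomes for each result.
results = glue.list_data_quality_results(Filter={"JobName": "my-glue-studio-job"})

for summary in results["Results"]:
    detail = glue.get_data_quality_result(ResultId=summary["ResultId"])
    for rule in detail["RuleResults"]:
        print(rule["Name"], rule["Result"])  # e.g. "Rule_1 PASS"
```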
Hi, I have a Glue job running with PySpark. It is taking too long to write the dynamic frame to S3: for around 1200 records, the write to S3 alone took around 500 seconds. I have observed that even if the data frame is empty, it still takes the same amount of time to write to S3.
Below are the code snippets:
> test1_df = test_df.repartition(1)
> invoice_extract_final_dyf = DynamicFrame.fromDF(test1_df, glueContext, "invoice_extract_final_dyf")
> glueContext.write_dynamic_frame.from_options(frame=invoice_extract_final_dyf,
>                                              connection_type="s3",
>                                              connection_options={"path": destination_path},
>                                              format="json")
The conversion in the 2nd line and the write to S3 consume most of the time. Any help will be appreciated. Let me know if any further details are needed.
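For reference, a plain Spark writer on the same repartitioned DataFrame would look roughly like this (a sketch; destination_path is the same S3 path as above):
```python
# Sketch: write the repartitioned DataFrame directly with the Spark writer,
# skipping the DynamicFrame conversion. destination_path is the same S3 path.
test1_df = test_df.repartition(1)
test1_df.write.mode("append").json(destination_path)
```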
I set up the resources to trigger a Glue job through EventBridge, but when I tested it in the console, Invocations == FailedInvocations == TriggeredRules == 1.
What can I do to fix this?
```
######### AWS Glue Workflow ############
# Create a Glue workflow that triggers the Glue job
resource "aws_glue_workflow" "example_glue_workflow" {
  name        = "example_glue_workflow"
  description = "Glue workflow that triggers the example_glue_job"
}

resource "aws_glue_trigger" "example_glue_trigger" {
  name          = "example_glue_trigger"
  workflow_name = aws_glue_workflow.example_glue_workflow.name
  type          = "EVENT"

  actions {
    job_name = aws_glue_job.example_glue_job.name
  }
}

######### AWS EventBridge ##############
resource "aws_cloudwatch_event_rule" "example_etl_trigger" {
  name        = "example_etl_trigger"
  description = "Trigger Glue job when a request is made to the API endpoint"

  event_pattern = jsonencode({
    "source" : ["example_api"]
  })
}

resource "aws_cloudwatch_event_target" "glue_job_target" {
  rule      = aws_cloudwatch_event_rule.example_etl_trigger.name
  target_id = "example_event_target"
  arn       = aws_glue_workflow.example_glue_workflow.arn
  role_arn  = local.example_role_arn
}
```
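For reference, a test event whose source matches the rule can be sent with boto3 like this (the detail type and payload are placeholders; only the source matters for the pattern):
```python
import boto3
import json

events = boto3.client("events")

# Send a test event whose "source" matches the EventBridge rule pattern above.
events.put_events(
    Entries=[
        {
            "Source": "example_api",
            "DetailType": "test",  # placeholder
            "Detail": json.dumps({"trigger": "manual-test"}),  # placeholder payload
        }
    ]
)
```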
I am unable to add the Custom Data Types of a classifier through a CloudFormation template. It can be done only through the console; there is no such parameter available in the CloudFormation template.
Using Glue, we can crawl a Snowflake table properly into the catalog, but Athena fails to query the table data:
HIVE_UNSUPPORTED_FORMAT: Unable to create input format
Search results suggest this is because the table created by the crawler has an empty "input format" and "output format"... and yes, they are empty for this table crawled from Snowflake. So the questions are:
1) Why didn't the crawler set them? (The crawler can classify the table as Snowflake correctly.)
2) What should the values be if a manual edit is needed?
Can a Snowflake table be queried by Athena at all?
Any ideas? Thanks,
Hi AWS experts, I have code that reads data from AWS Aurora PostgreSQL. I want to bookmark the table with a custom column named 'ceres_mono_index', but it seems like the bookmark still uses the primary key as the bookmark key instead of the column 'ceres_mono_index'. Here is the code:
```python
cb_ceres = glueContext.create_dynamic_frame.from_options(
connection_type="postgresql",
connection_options={
"url": f"jdbc:postgresql://{ENDPOINT}:5432/{DBNAME}",
"dbtable": "xxxxx_raw_ceres",
"user": username,
"password": password,
},
additional_options={"jobBookmarkKeys": ["ceres_mono_index"], "jobBookmarkKeysSortOrder": "asc"},
transformation_ctx="cb_ceres_bookmark",
)
```
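From the JDBC bookmark examples I have seen, the bookmark keys go inside connection_options rather than in a separate additional_options argument; a sketch of that variant (untested) is:
```python
# Sketch (untested): pass the bookmark keys inside connection_options
# instead of a separate additional_options argument.
cb_ceres = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": f"jdbc:postgresql://{ENDPOINT}:5432/{DBNAME}",
        "dbtable": "xxxxx_raw_ceres",
        "user": username,
        "password": password,
        "jobBookmarkKeys": ["ceres_mono_index"],
        "jobBookmarkKeysSortOrder": "asc",
    },
    transformation_ctx="cb_ceres_bookmark",
)
```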
How could I fix the issue? Thank you
Hello,
I am trying to run my first job in AWS Glue, but I am encountering the following error: "An error occurred while calling o103.pyWriteDynamicFrame. /run-1679066163418-part-r-00000 (Permission denied)".
The error message indicates that permission has been denied. I am using an IAM role that has AmazonS3FullAccess, AWSGlueServiceRole, and even AdministratorAccess. Although I understand that this is not ideal for security reasons, I added this policy to ensure that the IAM role is not the issue.
I have attempted to use different sources (such as DynamoDB and S3) and targets (such as Redshift and the Data Catalog), but I consistently receive the same error. Does anyone know how I can resolve this issue?
Thank you in advance!