Questions tagged with AWS Glue


How can one set the Execution Class to FLEX on a Jupyter job run? I'm using the %%configure magic in my notebook as below, and I am also setting the input argument --execution_class = FLEX, but the jobs still start as STANDARD.

```
%%configure
{
    "region": "us-east-1",
    "idle_timeout": "480",
    "glue_version": "3.0",
    "number_of_workers": 10,
    "execution_class": "FLEX",
    "worker_type": "G.1X"
}
```
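A hedged cross-check: the StartJobRun API accepts an ExecutionClass parameter, so requesting FLEX explicitly through boto3 shows whether the run actually gets it. A minimal sketch, assuming a hypothetical job name:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# "my-glue-job" is a placeholder; use the job that backs the notebook run.
response = glue.start_job_run(
    JobName="my-glue-job",
    ExecutionClass="FLEX",   # FLEX applies to Glue 3.0+ Spark jobs
    NumberOfWorkers=10,
    WorkerType="G.1X",
)

# Read back the execution class the run was actually given.
run = glue.get_job_run(JobName="my-glue-job", RunId=response["JobRunId"])
print(run["JobRun"].get("ExecutionClass"))
```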
1 answer · 0 votes · 15 views · asked a day ago
I attempted to create a partition index on a table. The index failed to create with a backfill error, which I can see by calling client.get_partition_indexes(): 'IndexStatus': 'FAILED', 'BackfillErrors': [{'Code': 'ENCRYPTED_PARTITION_ERROR'..... I cannot delete this failed index, either through the console or via the API client.delete_partition_index(). The delete attempt returns: EntityNotFoundException: An error occurred (EntityNotFoundException) when calling the DeletePartitionIndex operation: Index with the given indexName : <index_name> does not exist. The failed index remains visible in the console. Two questions: 1. How do I get rid of this failed index? 2. Which encryption is causing the error? Is it the S3 bucket where the data is stored, the Data Catalog metadata encryption, or something else?
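The two encryption settings the second question asks about can at least be inspected via boto3. A minimal diagnostic sketch, assuming placeholder database, table, and bucket names:

```python
import boto3

glue = boto3.client("glue")
s3 = boto3.client("s3")

# Data Catalog metadata encryption settings (catalog level).
catalog = glue.get_data_catalog_encryption_settings()
print(catalog["DataCatalogEncryptionSettings"])

# Encryption on the S3 bucket that holds the table data (bucket name is a placeholder).
bucket_enc = s3.get_bucket_encryption(Bucket="my-table-bucket")
print(bucket_enc["ServerSideEncryptionConfiguration"])

# Re-check the index status for the affected table (names are placeholders).
indexes = glue.get_partition_indexes(DatabaseName="my_db", TableName="my_table")
for idx in indexes["PartitionIndexDescriptorList"]:
    print(idx["IndexName"], idx["IndexStatus"], idx.get("BackfillErrors"))
```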
1 answer · 0 votes · 18 views · asked 2 days ago
Hello, I have two issues.

1. I create the table like this:
```
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", f"s3://co-raw-sales-dev")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .enableHiveSupport()
    .getOrCreate()
)

df.writeTo("glue_catalog.co_raw_sales_dev.new_test").using("iceberg").create()
```
DDL of the created table:
```
CREATE TABLE co_raw_sales_dev.new_test (
  id bigint,
  name string,
  points bigint)
LOCATION 's3://co-raw-sales-dev//new_test'
TBLPROPERTIES (
  'table_type'='iceberg'
);
```
The problem is that there is a double // in the S3 location between the bucket and the table name.

2. This works:
```
df.writeTo("glue_catalog.co_raw_sales_dev.new_test2").using("iceberg").create()
```
but if I remove the "glue_catalog" prefix:
```
df.writeTo("co_raw_sales_dev.new_test2").using("iceberg").create()
```
I get the error: An error occurred while calling o339.create. Table implementation does not support writes: co_raw_sales_dev.new_test2. Am I missing some parameter in the SparkSession config? Thank you, Adas.
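For the second issue, one hedged possibility is that without the glue_catalog prefix Spark resolves the table through the default session catalog (spark_catalog), which in this session is not Iceberg-aware. A minimal sketch of two common ways to handle that, under that assumption (warehouse path taken from the post; the double // often comes from how the warehouse or database LocationUri is concatenated with the table name, but that is a guess here):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Option A: make glue_catalog the default catalog, so unprefixed names
    # like "co_raw_sales_dev.new_test2" resolve against it.
    .config("spark.sql.defaultCatalog", "glue_catalog")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://co-raw-sales-dev")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    # Option B (alternative): wrap the built-in session catalog so unprefixed
    # names become Iceberg-aware instead of changing the default catalog.
    # .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
    .getOrCreate()
)

# With Option A, the unprefixed identifier should now resolve through glue_catalog:
# df.writeTo("co_raw_sales_dev.new_test2").using("iceberg").create()
```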
1 answer · 0 votes · 11 views · asked 3 days ago
I set up AdministratorAccess for my role, which is a full-access policy that should cover all services, including AWS Glue. I want to create a crawler to build an ETL pipeline and load data into a database in the AWS Glue Data Catalog, but I am stuck on a 400 access-denied error. I have tried several things:
- Changing the credit card and setting it as the default
- Adding permissions multiple times
but it still fails.
0 answers · 0 votes · 15 views · asked 3 days ago
I have cluster A and cluster B. Cluster A has an external schema called 'landing_external' that contains many tables from our Glue Data Catalog. Cluster A also has a local schema, called 'landing', that consists of views built on data from 'landing_external'. Cluster A has a datashare that Cluster B consumes. The 'landing' schema is shared with Cluster B; however, any time a user attempts to select data from any of the views in the 'landing' schema, they receive the error `ERROR: permission denied for schema landing_external`. I thought that creating all of the views with the 'WITH NO SCHEMA BINDING' option would address this permission gap, but it does not. Any ideas on what I am missing?
3 answers · 0 votes · 24 views · tjtoll · asked 4 days ago
I am trying to create an ETL job where I need to bring in data from Redshift tables, but the dataset is too large and I need to filter it before applying transformations. The Glue Filter node and the SQL query option do not filter the data the way I need: the job keeps running for a long time and then fails, possibly because of the data volume. It seems that Glue brings in all the data and only then tries to apply the filter, but the job fails before the filter is applied. Is there a way to bring in only filtered data from Redshift and then apply transformations to it?
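One hedged approach is to push the filter into the query that Redshift itself executes, so only matching rows ever reach Glue. A minimal sketch using Spark's JDBC reader with a pushdown query, assuming placeholder connection details, table, and filter, and the Redshift JDBC driver being available in the Glue environment:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Placeholder connection details; in practice read these from the Glue
# connection or Secrets Manager rather than hard-coding them.
jdbc_url = "jdbc:redshift://my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/mydb"

# The WHERE clause runs inside Redshift, so only filtered rows are transferred.
filtered_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .option("query", "SELECT * FROM sales WHERE sale_date >= '2023-01-01'")
    .option("user", "my_user")
    .option("password", "my_password")
    .load()
)

# Continue with transformations on the already-filtered DataFrame.
print(filtered_df.count())
```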
1 answer · 0 votes · 19 views · aneeq10 · asked 4 days ago
Hi there, I was adding a VPC network connection to an AWS Glue job and got this error: JobRunId: jr_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx failed to execute with exception Could not find connection for the given criteria Failed to get catalog connections given names: xxx-xxxxx-xxxx,none Service: AWSGlueJobExecutor; Status Code: 400; Error Code: InvalidInputException; Request ID: xxxxxxx-xxxxx-xxxxx-xxxxxxxxx; Proxy: null. I checked my VPC connection and it all looked fine, with the correct security settings. Eventually I realized that I had also added the "None" connection to the job. Surely the "None" connection should be ignored, or else should not be selectable. Thanks, John
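A quick way to spot a stray entry like "None" in a job's connection list is to read the job back through the API. A minimal sketch, assuming a placeholder job name:

```python
import boto3

glue = boto3.client("glue")

# Placeholder job name.
job = glue.get_job(JobName="my-etl-job")["Job"]

# The connections attached to the job; a literal "None" entry here is what
# produces the "Failed to get catalog connections given names" error above.
print(job.get("Connections", {}).get("Connections", []))
```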
1 answer · 0 votes · 15 views · asked 5 days ago
When I click the "Create crawler" button in the AWS Glue console, it fails, even though I set up the generated IAM role with the "AdministratorAccess" permission policy for this subscriber account. Please help me solve this issue. Thank you so much. My error: Account xxxxxxxxxxxx denied access
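One way to narrow down whether the failure is specific to the console or to the account/role itself is to attempt the same create through the API with an explicit role. A minimal sketch, where the role ARN, database, and S3 path are all placeholders:

```python
import boto3

glue = boto3.client("glue")

# All names below are placeholders.
glue.create_crawler(
    Name="test-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",
    DatabaseName="my_catalog_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/"}]},
)

# If this succeeds, the role and account can create crawlers via the API.
print(glue.get_crawler(Name="test-crawler")["Crawler"]["State"])
```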
1 answer · 0 votes · 13 views · asked 5 days ago
Hi team, I am new to AWS Data Lake configuration. The documentation refers to creating a Glue connection that is used internally by the AWS blueprint. My database runs on an EC2 instance in a different AWS account. I set up the connection using JDBC with the connection string "jdbc:sqlserver://SRV_IP:1433;database=db_name", but the test connection failed with the following error: Check that your connection definition references your JDBC database with correct URL syntax, username, and password. The TCP/IP connection to the host SRV_IP, port 1433 has failed. Error: "Connection timed out: no further information. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall.". I have already configured/tested the following: 1) the connection's security group has an inbound rule for the port, 2) the subnet used in the connection has a route table pointing to a NAT gateway, 3) I am able to connect to the DB over the internet, so there is no issue with the DB/EC2 security group. Please advise. I am referring to https://aws.amazon.com/blogs/big-data/integrating-aws-lake-formation-with-amazon-rds-for-sql-server/ except that my DB is self-managed on EC2. Regards, Nikhil Shah
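A hedged way to confirm whether the Glue-side subnet can actually reach the SQL Server port is a plain TCP connect test run from that same subnet and security group (for example in a small Glue Python shell job); the host below is a placeholder:

```python
import socket

# Placeholder host; use the private IP that the JDBC connection string points at.
HOST = "10.0.1.25"
PORT = 1433

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(10)
try:
    sock.connect((HOST, PORT))
    print(f"TCP connection to {HOST}:{PORT} succeeded")
except OSError as exc:
    # A timeout here mirrors the Glue error and points at routing, security
    # groups, NACLs, or cross-account connectivity rather than credentials.
    print(f"TCP connection to {HOST}:{PORT} failed: {exc}")
finally:
    sock.close()
```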
2 answers · 0 votes · 21 views · asked 6 days ago
Hello, within a Step Function I want to pass parameters received from the input to a glue:listCrawlers state to run the related crawlers, and after running all crawlers I also need to pass the same input parameters to a Choice state that starts Glue jobs. Because the output of glue:listCrawlers is the crawler names, I cannot use OutputPath and still keep the input parameters. Can anyone help me with how to receive the input parameters in the Choice state? The state machine is run with this JSON input:

```
{
  "Comment": "Run Glue workflow",
  "tags": { "client": "test", "environment": "qa" },
  "glue": { "client": "test", "environment": "qa", "partitionDate": "2023-03-2" }
}
```

Step Function definition (truncated):

```
{
  "Comment": "A utility state machine to run all Glue Crawlers that match tags",
  "StartAt": "Pass",
  "States": {
    "Pass": {
      "Type": "Pass",
      "Next": "Get Glue Crawler List based on Client, Environment"
    },
    "Get Glue Crawler List based on Client, Environment": {
      "Next": "Run Glue Crawlers for all LOB",
      "OutputPath": "$.CrawlerNames",
      "Parameters": { "Tags.$": "$.tags" },
      "Resource": "arn:aws:states:::aws-sdk:glue:listCrawlers",
      "Retry": [
        { "BackoffRate": 5, "ErrorEquals": ["States.ALL"], "IntervalSeconds": 2, "MaxAttempts": 3 }
      ],
      "Type": "Task"
    },
    "Run Glue Crawlers for all LOB": {
      "Iterator": {
        "StartAt": "Run Glue Crawler for each LOB",
        "States": {
          "Run Glue Crawler for each LOB": {
            "Catch": [
              { "ErrorEquals": ["States.ALL"], "Next": "Crawler Success" }
            ],
            "Next": "Crawler Success",
            "Parameters": {
              "Input": { "crawler_name.$": "$" },
              "StateMachineArn": "arn:aws:states:ca-central-1:467688788830:stateMachine:run_each_crawler"
            },
            "Resource": "arn:aws:states:::states:startExecution.sync:2",
            "Retry": [
              { "BackoffRate": 5, "ErrorEquals": ["States.ALL"], "IntervalSeconds": 2, "MaxAttempts": 1 }
            ],
            "Type": "Task"
          },
          "Crawler Success": { "Type": "Succeed" }
        }
      },
      "ResultPath": null,
      "Type": "Map",
      "Next": "Choice",
      "MaxConcurrency": 4
    },
    "Choice": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.environment", "StringEquals": "qa", "Next": "start glue-qa-dh" },
        { "Variable": "$.environment", "StringEquals": "uat", "Next": "start glue-uat-dh" },
```
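One hedged way to keep the original input available to the Choice state is to replace the OutputPath on the listCrawlers task with a ResultPath, so the API result is merged into the execution input instead of replacing it; the Map state then reads the crawler names through ItemsPath, and since it already has "ResultPath": null the untouched input flows on to the Choice state, which can read the environment from $.glue.environment. A sketch of only the affected fields, under those assumptions (the Map state's Iterator and the Retry blocks are unchanged and omitted here):

```
"Get Glue Crawler List based on Client, Environment": {
  "Type": "Task",
  "Resource": "arn:aws:states:::aws-sdk:glue:listCrawlers",
  "Parameters": { "Tags.$": "$.tags" },
  "ResultPath": "$.crawlerList",
  "Next": "Run Glue Crawlers for all LOB"
},
"Run Glue Crawlers for all LOB": {
  "Type": "Map",
  "ItemsPath": "$.crawlerList.CrawlerNames",
  "ResultPath": null,
  "MaxConcurrency": 4,
  "Next": "Choice"
},
"Choice": {
  "Type": "Choice",
  "Choices": [
    { "Variable": "$.glue.environment", "StringEquals": "qa", "Next": "start glue-qa-dh" },
    { "Variable": "$.glue.environment", "StringEquals": "uat", "Next": "start glue-uat-dh" }
  ]
}
```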
1 answer · 0 votes · 25 views · asked 6 days ago
Hi community, I am trying to perform an ETL job using AWS Glue. Our data is stored in MongoDB Atlas, inside a VPC, and our AWS account is connected to MongoDB Atlas using VPC peering. To perform the ETL job in AWS Glue, I first created a connection using the VPC details and the MongoDB Atlas URI, along with the username and password. The connection is used by the AWS Glue crawlers to extract the schema into AWS Data Catalog tables, and this connection works! However, when I attempt to perform the actual ETL job using the following PySpark code:

```
# My temp variables
source_database = "d*********a"
source_table_name = "main_businesses"
source_mongodb_db_name = "main"
source_mongodb_collection = "businesses"

glueContext.create_dynamic_frame.from_catalog(
    database=source_database,
    table_name=source_table_name,
    additional_options={"database": source_mongodb_db_name, "collection": source_mongodb_collection},
)
```

the connection times out, and for some reason MongoDB Atlas is blocking the connection from the ETL job. It is as if the ETL job uses the connection differently than the crawler does. Maybe the ETL job is not running inside our AWS VPC that is peered with the MongoDB Atlas VPC (VPC peering is not possible?). Does anyone have any idea what might be going on or how I can fix this? Thank you!
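One hedged possibility is that the job itself is not associated with the Glue connection, in which case it does not run inside the peered VPC the way the crawler does. A minimal sketch that checks this through boto3, assuming placeholder job and connection names:

```python
import boto3

glue = boto3.client("glue")

JOB_NAME = "my-mongo-etl-job"          # placeholder
CONNECTION_NAME = "mongodb-atlas-vpc"  # placeholder: the connection the crawler uses

# A crawler picks up the VPC, subnet, and security groups from its connection,
# but a job only does so if the connection is attached to the job itself.
job = glue.get_job(JobName=JOB_NAME)["Job"]
attached = job.get("Connections", {}).get("Connections", [])
print("Connections attached to the job:", attached)

if CONNECTION_NAME not in attached:
    print(f"{CONNECTION_NAME} is not attached; add it under the job's connections "
          "so the job runs inside the peered VPC.")
```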
1 answer · 0 votes · 20 views · asked 6 days ago
My files are CSV files with 3 tab-separated fields. The built-in CSV classifier creates a schema of 3 strings for the 3 attributes: a:string, b:string, c:string. However, my last attribute **c** is a JSON string. I would like to know whether it is possible, using a custom classifier, to create an extra attribute **endpoint** that results from some pattern matching (grok or regex). For example, if the JSON string **c** looks like the one below:
```
{"log":{"timestampReceived":"2023-03-10 01:43:24,563 +0000","component":"ABC","subp":"0","eventStream":"DEF","endpoint":"sales"}}
```
I would like to extract endpoint:sales into a new attribute of my table and use it as a partition, ending up with something like: a:string, b:string, c:string, **endpoint**:string (partitioned).
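If a classifier alone cannot do this, one hedged alternative is to derive the column in the ETL job by parsing the JSON field and writing the output partitioned by it. A minimal PySpark sketch; the column names a, b, c follow the post, while the paths and output format are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = SparkSession.builder.getOrCreate()

# Read the tab-separated files; the input path is a placeholder.
df = (
    spark.read
    .option("sep", "\t")
    .csv("s3://my-bucket/raw/", schema="a STRING, b STRING, c STRING")
)

# Pull log.endpoint out of the JSON string held in column c.
df = df.withColumn("endpoint", get_json_object(col("c"), "$.log.endpoint"))

# Write partitioned by the derived column so it becomes a partition key in the catalog.
(
    df.write
    .mode("overwrite")
    .partitionBy("endpoint")
    .parquet("s3://my-bucket/curated/table_with_endpoint/")
)
```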
0 answers · 0 votes · 17 views · asked 6 days ago