Questions tagged with AWS Glue

Browse through the questions and answers listed below or filter and sort to narrow down your results.

My requirement: I want to query an Iceberg table that lives in another AWS account. Say I am a user in account A and want to query account B's Iceberg tables registered in account B's Glue Data Catalog.

I followed the steps from the AWS docs on [Glue cross-account access](https://docs.aws.amazon.com/athena/latest/ug/security-iam-cross-account-glue-catalog-access.html) and [S3 cross-account access](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example2.html) to attach resource policies to account B's Glue catalog and to the S3 bucket where the data is stored. The policies grant the required permissions to the principal `account A:root`, meaning any user in account A should be able to query. I also attached the corresponding Glue and S3 policies to the account A user.

In Athena I then created a data source of type Glue with the catalog ID set to account B's account ID, and I can see all of account B's Glue databases and tables. But when I query a table, e.g. `select * from table`, I get the error `HIVE_METASTORE_ERROR: Table storage descriptor is missing SerDe info`. I can query the table fine from within account B, although its SerDeInfo is empty there as well.
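A quick way to see exactly what Athena is complaining about is to pull the table definition from account B's catalog and inspect its storage descriptor. This is a minimal diagnostic sketch, assuming the account A user's boto3 credentials; the region, account ID, database, and table names are placeholders:

```
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed region

# CatalogId lets you read another account's catalog if its resource policy allows it.
resp = glue.get_table(
    CatalogId="111122223333",        # account B's account ID (placeholder)
    DatabaseName="my_database",      # placeholder
    Name="my_iceberg_table",         # placeholder
)

table = resp["Table"]
print(table.get("Parameters", {}).get("table_type"))   # Iceberg tables usually carry table_type=ICEBERG
print(table["StorageDescriptor"].get("SerdeInfo"))     # an empty/missing SerdeInfo is what Athena reports
```

For Iceberg tables Athena normally resolves the schema from the Iceberg metadata files rather than from a SerDe, so if `table_type` is not set to `ICEBERG` (or the table was registered without the Iceberg parameters), a cross-account query can fall back to the Hive path and fail with this error. That is an assumption worth verifying against the table definition printed above.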
0
answers
0
votes
35
views
asked 19 days ago
The [Glue JDBC Connection documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc) states:

> If you already have a JDBC connection defined, you can reuse the configuration properties defined in it, such as: url, user and password; so you don't have to specify them in the code as connection options. To do so, use the following connection properties:
> * "useConnectionProperties": Set it to "true" to indicate you want to use the configuration from a connection.
> * "connectionName": Enter the connection name to retrieve the configuration from, the connection must be defined in the same region as the job.

There is no further documentation of how to use these properties. I have tried passing them as keyword arguments to `glueContext.create_dynamic_frame.from_options()`, but the method still throws an error if `url` is not specified in `connection_options`:

```
dbtable = 'schema.table'
query = f"select top 1 * from {dbtable}"
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    useConnectionProperties="true",
    connectionName="My-Glue-Connection",
    connection_options={
        "dbtable": dbtable,
        "sampleQuery": query
    })

Py4JJavaError: An error occurred while calling o212.getDynamicFrame.
: java.util.NoSuchElementException: key not found: url
	at scala.collection.MapLike$class.default(MapLike.scala:228)
	at scala.collection.AbstractMap.default(Map.scala:59)
	at scala.collection.MapLike$class.apply(MapLike.scala:141)
	at scala.collection.AbstractMap.apply(Map.scala:59)
	at com.amazonaws.services.glue.util.JDBCWrapper$.apply(JDBCUtils.scala:913)
	at com.amazonaws.services.glue.util.JDBCWrapper$.apply(JDBCUtils.scala:909)
	at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:943)
	at com.amazonaws.services.glue.DataSource$class.getDynamicFrame(DataSource.scala:97)
	at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:709)
	...
```

The same error occurs if I pass the two properties via `connection_options`:

```
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    connection_options={
        "useConnectionProperties": "true",
        "connectionName": "IDAP-Glue-Connection",
        "dbtable": dbtable,
        "sampleQuery": query
    })

Py4JJavaError: An error occurred while calling o123.getDynamicFrame.
: java.util.NoSuchElementException: key not found: url
...
```

How is this feature intended to be used? The only methods I've found to read from a JDBC database through a Glue Connection are really unwieldy; I would expect tighter integration of Glue Connections in Glue jobs.
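One workaround (not necessarily the intended use of `useConnectionProperties`) is to pull the connection's properties with `extract_jdbc_conf` and pass the JDBC url and credentials explicitly. This is a sketch under those assumptions; the connection name, database name, and table are placeholders, and the url suffix for selecting the database may need adjusting for your SQL Server setup:

```
# Sketch: read via an existing Glue Connection by extracting its JDBC properties.
conf = glueContext.extract_jdbc_conf(connection_name="My-Glue-Connection")  # placeholder name

dbtable = "schema.table"  # placeholder
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    connection_options={
        "url": conf["url"] + ";databaseName=mydb",  # database name is an assumption; adjust as needed
        "user": conf["user"],
        "password": conf["password"],
        "dbtable": dbtable,
    },
)
```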
1
answers
0
votes
22
views
asked 20 days ago
Sample code (minus boilerplate):

```
conn = glueContext.extract_jdbc_conf(connection_name="My-Glue-Connection")
for i in conn.items():
    print(i)
```

Output from a notebook:

```
('enforceSSL', 'false')
('skipCustomJDBCCertValidation', 'false')
('url', 'jdbc:sqlserver://0.0.0.0:1433')
('customJDBCCertString', '')
('user', 'test')
('customJDBCCert', '')
('password', 'xxx')
('vendor', 'sqlserver')
```

Output from a job run:

```
('fullUrl', 'jdbc:sqlserver://0.0.0.0:1433/mydb')
('enforceSSL', 'false')
('skipCustomJDBCCertValidation', 'false')
('url', 'jdbc:sqlserver://0.0.0.0:1433')
('customJDBCCertString', '')
('user', 'test')
('customJDBCCert', '')
('password', 'xxx')
('vendor', 'sqlserver')
```

Note that `fullUrl` (and thus the name of the database to connect to) appears only in the job run; it is not available when using this method in a notebook.
2
answers
0
votes
15
views
asked 20 days ago
AWS provides AWS Glue Data Quality, powered by DQDL. Is there a DQDL example for time-series sensor data? AWS also offers the "Data Quality and Insights Report" in SageMaker under Data Wrangler, but that is not a great fit either, IMHO. Is there a tool in AWS that can provide a custom data-quality report for time-series data? I ask because the time-series data is simple, usually having the fields [Device, Timestamp, Sensorname, Value], where the sensors can be unevenly spaced time series, etc. There are too many options but nothing complete.
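I am not aware of an AWS-published DQDL sample specifically for time-series sensor data, but a starting point could be a ruleset written against the [Device, Timestamp, Sensorname, Value] shape. A minimal sketch, assuming a Glue Data Catalog table named `sensor_readings` in a database named `iot`; the rule mix and thresholds are illustrative assumptions, not AWS guidance:

```
import boto3

glue = boto3.client("glue")

# Illustrative DQDL rules for a [Device, Timestamp, Sensorname, Value] table.
ruleset = """
Rules = [
    IsComplete "Device",
    IsComplete "Timestamp",
    IsComplete "Sensorname",
    Completeness "Value" > 0.95,
    ColumnValues "Value" between -50 and 150
]
"""

glue.create_data_quality_ruleset(
    Name="sensor-readings-dq",   # placeholder name
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "iot", "TableName": "sensor_readings"},  # placeholders
)
```

As far as I know, DQDL has no built-in rule for detecting unevenly spaced or missing timestamps, so gap detection would still need a custom Spark or Athena check on top of this.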
0
answers
0
votes
15
views
jinman
asked 20 days ago
I'm running many Glue jobs. How can we set a limit or an alarm if a job runs for more than 6 hours or uses more than 10 DPU-hours? Are there any CloudWatch metrics available on which an alarm can be set for cost-monitoring purposes?
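Two mechanisms may cover this: the Glue job `Timeout` setting (in minutes, e.g. 360), which stops a run after 6 hours, and a CloudWatch alarm on the job's `glue.driver.aggregate.elapsedTime` metric, which is emitted in the `Glue` namespace when job metrics are enabled. There is no ready-made DPU-hours metric that I know of; that would need a metric-math expression (elapsed time multiplied by the job's DPU count) or Cost Explorer. Below is a sketch of the alarm; the job name, dimensions, and threshold are assumptions to verify against the metrics your jobs actually emit:

```
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when a Glue job's elapsed time (milliseconds) exceeds 6 hours.
# Requires job metrics (--enable-metrics) to be turned on for the job.
cloudwatch.put_metric_alarm(
    AlarmName="glue-job-runtime-over-6h",
    Namespace="Glue",
    MetricName="glue.driver.aggregate.elapsedTime",
    Dimensions=[
        {"Name": "JobName", "Value": "my-glue-job"},   # placeholder job name
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "gauge"},            # check the emitted Type dimension in the console
    ],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=6 * 60 * 60 * 1000,
    ComparisonOperator="GreaterThanThreshold",
)
```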
1
answers
0
votes
29
views
asked 20 days ago
I am transforming my table by adding new columns using the SQL Query transform in AWS Glue Studio. [visual diagram for transformation]

- SQL alias: study
- Existing schema from the Data Catalog: study id, patient id, patient age
- New column to add: AccessionNo
- Desired transformed schema: study id, patient id, patient age, AccessionNo
- SQL query: **alter table study add columns (AccessionNo int)**

The error it gives: `pyspark.sql.utils.AnalysisException: Invalid command: 'study' is a view not a table.; line 2 pos 0; 'AlterTable V2SessionCatalog(spark_catalog), default.study, 'UnresolvedV2Relation [study], V2SessionCatalog(spark_catalog), default.study, [org.apache.spark.sql.connector.catalog.TableChange$AddColumn@1e7cbfec]`

I looked at the official AWS doc for the SQL transform, which says queries should be in Spark SQL syntax, and my query is in Spark SQL syntax: https://docs.aws.amazon.com/glue/latest/ug/transforms-sql.html

What is the exact issue, and how can I resolve it? Thanks.
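As the error itself says, the SQL transform exposes the upstream node as a temporary view named after the alias (`study`), and Spark does not allow `ALTER TABLE` on a view. A way to add the column is to express the output schema as a `SELECT` instead. A minimal sketch of the equivalent Spark SQL, run here through `spark.sql()` for illustration; in the Glue Studio SQL transform node you would enter only the quoted query text, and the typed null placeholder is an assumption to replace with a real expression if one exists:

```
# 'study' is a temporary view created by the SQL transform, so add the column via SELECT.
df = spark.sql("""
    SELECT *,
           CAST(NULL AS INT) AS AccessionNo   -- placeholder value for the new column
    FROM study
""")
df.printSchema()
```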
1
answers
0
votes
48
views
Prabhu
asked 20 days ago
Hello, I'm creating a Glue job using a Jupyter notebook, and I'm currently using Ray as the ETL type. After running the job once, I noticed I can no longer save my notebook or push it to a repository, because the Glue version was downgraded to 3.0 in the Job Details for no apparent reason, and I have no way to change it back to 4.0.
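In an interactive-sessions notebook the `%glue_version 4.0` magic in the first cell controls the session version, but if the saved job definition itself is stuck on 3.0, one possible workaround is to rewrite the job definition through the API. This is only a sketch: the job name is a placeholder, and because `UpdateJob` replaces the whole definition, the set of fields that must be dropped or kept may differ for your job:

```
import boto3

glue = boto3.client("glue")

job_name = "my-notebook-job"  # placeholder

# UpdateJob replaces the job definition, so start from the current one.
job = glue.get_job(JobName=job_name)["Job"]

# Build a JobUpdate from the existing definition, dropping fields UpdateJob does not accept.
job_update = {
    k: v for k, v in job.items()
    if k not in ("Name", "CreatedOn", "LastModifiedOn", "AllocatedCapacity", "MaxCapacity")
}
job_update["GlueVersion"] = "4.0"

glue.update_job(JobName=job_name, JobUpdate=job_update)
```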
1
answers
0
votes
30
views
asked 21 days ago
I'm running a Glue job to fetch records from Microsoft SQL Server, but the job keeps running and never shows any results. The job is scheduled with the G.2X worker type, 5 workers, and auto scaling. Logs: `23/02/27 09:02:45 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources`
2
answers
0
votes
30
views
asked 22 days ago
I am trying to join two tables in CSV format (saved in an S3 bucket). The target is an empty folder in S3. Every time, I get the following error: **AnalysisException: Cannot resolve column name "device_id" among ()**, where **device_id** is the unique ID used for joining the tables. Please help.
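The empty list after `among ()` usually means the frame was read with no columns at all, which with CSVs in S3 is often a header problem (the header row not being parsed, or the path pointing at an empty prefix). A minimal sketch of reading the CSVs with headers enabled and checking the schema before joining; the S3 paths are placeholders:

```
# Read the CSVs with the header row treated as column names, then join on device_id.
left = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/left/"]},    # placeholder path
    format="csv",
    format_options={"withHeader": True, "separator": ","},
)
right = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/right/"]},   # placeholder path
    format="csv",
    format_options={"withHeader": True, "separator": ","},
)

left.printSchema()   # device_id should appear here; if not, the join cannot resolve it
joined = left.toDF().join(right.toDF(), on="device_id", how="inner")
```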
0
answers
0
votes
20
views
asked 22 days ago
I have programmatically defined an EventBridge rule to send an event when a crawler completes:

```
response = event_client.put_rule(
    Name="newmyrule",
    EventPattern='{"detail-type": ["Glue Crawler State Change"],"source": ["aws.glue"],"detail": {"crawlerName":["'+crawler_name+'"],"state": ["Succeeded"]}}'
)
print("put_rule=" + str(response))

put_target_response = event_client.put_targets(
    Rule='newmyrule',
    Targets=[{
        'Id': 'mylambdafn',
        'Arn': 'arn:aws:lambda:us-west-1:xxxxxxxxxxxxx:function:mylambdafn'
    }]
)
enable_rule_response = event_client.enable_rule(Name='newmyrule')
```

I have also defined the crawler through boto3:

```
create_crawler_response = glue.create_crawler(
    Name=crawler_name,
    Role='arn:aws:iam::xxxxxxxxxx:role/ravi-glue-access',
    DatabaseName='noah-ingest',
    # TablePrefix="",
    Targets={'S3Targets': [{'Path': s3_target}]},
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'DELETE_FROM_DATABASE'
    }
)
```

It looks similar to rules defined through the console, but it results in FailedInvocations. How do I fix this? Thanks, Ravi.
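When a rule created through the console works but the same rule created with `put_rule`/`put_targets` shows FailedInvocations, a common cause is that the Lambda function's resource policy was never granted to EventBridge: the console adds that permission automatically, the API does not. A sketch of the missing call, reusing the function and rule names from the question:

```
import boto3

lambda_client = boto3.client("lambda", region_name="us-west-1")

# Allow the EventBridge rule to invoke the target Lambda function.
lambda_client.add_permission(
    FunctionName="mylambdafn",
    StatementId="allow-eventbridge-newmyrule",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=response["RuleArn"],   # the ARN returned by put_rule above
)
```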
2
answers
0
votes
29
views
asked 24 days ago
I want to join two tables. I have the tables in CSV format stored in an S3 bucket.

1. Is AWS Glue Studio the right option?
2. What is the correct procedure?
3. What IAM permissions are required?
4. Where do I see the joined table's output?

Please throw some light on this.
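Glue Studio can do this with a visual Join node (two S3 or Data Catalog sources, a Join, and an S3 target), and the job role typically needs S3 read on the source prefixes, S3 write on the target prefix, and the AWSGlueServiceRole managed policy. As a rough sketch of what the generated script boils down to, with placeholder bucket paths and an assumed join key column `id`:

```
from awsglue.transforms import Join

# Two CSV sources read from S3 (paths are placeholders); the header row becomes the column names.
table_a = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/table_a/"]},
    format="csv",
    format_options={"withHeader": True},
)
table_b = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/table_b/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Join on an assumed shared key column "id".
joined = Join.apply(table_a, table_b, "id", "id")

# Write the joined output to an S3 folder, where it can be inspected or crawled for Athena.
glueContext.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/joined-output/"},
    format="csv",
)
```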
2
answers
0
votes
41
views
asked 24 days ago