Questions tagged with AWS Glue

Hello, AWS announced last December that Glue crawlers can now create native Delta Lake tables (https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/). We tested it and it works fine. However, we would prefer not to use crawlers. Is a crawler currently the only way to create a native Delta Lake table? Is it planned to allow this through the table creation screen in the Glue console? As a side note, it looks like Terraform is still missing a "CreateNativeDeltaTable" option in its latest provider (they have an open issue for that). Thanks. Cheers, Fabrice
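For context, the crawler-based route we tested looks roughly like this through boto3 (the crawler, role, database, and bucket names below are placeholders, not our real setup); what we are after is an equivalent that skips the crawler entirely:

```python
# Rough sketch of the crawler approach that works for us today; names are placeholders.
# CreateNativeDeltaTable is the flag referenced in the December announcement.
import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="delta-native-crawler",
    Role="arn:aws:iam::111122223333:role/my-glue-crawler-role",
    DatabaseName="my_database",
    Targets={
        "DeltaTargets": [
            {
                "DeltaTables": ["s3://my-bucket/delta/my_table/"],
                "WriteManifest": False,
                "CreateNativeDeltaTable": True,
            }
        ]
    },
)
glue.start_crawler(Name="delta-native-crawler")
```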
2
answers
0
votes
49
views
asked a month ago
I am planning to use catalogPartitionPredicate in one of my projects, but I am unable to handle one scenario. Details: 1. Partition columns: year, month, and day. 2. catalogPartitionPredicate: year>='2021' and month>='12'. When the year changes to 2022 (2022-01-01) and I still want to read data from 2021-12-01 onward, the expression cannot handle it, because 2022 partitions with months earlier than '12' no longer satisfy the predicate. I tried concatenating the partition keys, but that didn't work. Is there any way to get to_date functionality, or any other workaround for this scenario?
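For illustration, the closest we have gotten with plain string comparisons is splitting the predicate at the year boundary (database and table names below are placeholders); what we would really like is something like to_date over the three partition columns:

```python
# Hypothetical OR-based workaround: partition values are strings, so the year
# rollover has to be handled explicitly instead of with a date comparison.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
    additional_options={
        "catalogPartitionPredicate": "(year = '2021' AND month >= '12') OR year >= '2022'"
    },
)
```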
1
answers
0
votes
46
views
asked a month ago
I was able to include DatabricksJDBC42.jar in my Glue Docker container used for local development ([link](https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/)). I am able to reach the host from a Jupyter notebook, but I am getting an SSL error: `Py4JJavaError: An error occurred while calling o80.load. : java.sql.SQLException: [Databricks][DatabricksJDBCDriver](500593) Communication link failure. Failed to connect to server. Reason: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target.` My connection string looks like this: `.option("url","jdbc:databricks://host.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/111111111111111/1111-111111-abcdefghi;AuthMech=3;UseNativeQuery=0;StripCatalogName=0;")\ .option("dbtable","select 1")\ .option("driver", "com.databricks.client.jdbc.Driver")\ .load()` I used the same JDBC string in the code uploaded to our live account, and there the AWS Glue job runs and executes the queries in dbtable just fine. It's only in the local Docker Glue development container that we get this SSL error. I tried adding separate **options** for sslConnection and sslCertLocation and tried placing the cert files in /root/.aws as well as in the Jupyter notebook folder. The cert shows up in directory listings and is correctly assigned, but the JDBC connection still fails with the SSL error. Has anyone seen this before, or have a suggestion for next steps? Thanks.
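For readability, here is the same read laid out as a block, including the extra SSL options we experimented with (identifiers are anonymized as above, the `spark.read.format("jdbc")` prefix is assumed since it was cut off, and the cert path is just where we happened to put the file):

```python
# Same connection attempt as the inline snippet, reformatted; sslConnection and
# sslCertLocation were experiments on our side and may not be what the driver expects.
df = (
    spark.read.format("jdbc")
    .option(
        "url",
        "jdbc:databricks://host.cloud.databricks.com:443/default;transportMode=http;ssl=1;"
        "httpPath=sql/protocolv1/o/111111111111111/1111-111111-abcdefghi;"
        "AuthMech=3;UseNativeQuery=0;StripCatalogName=0;",
    )
    .option("dbtable", "select 1")
    .option("driver", "com.databricks.client.jdbc.Driver")
    .option("sslConnection", "true")
    .option("sslCertLocation", "/home/glue_user/workspace/cert.pem")
    .load()
)
```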
2
answers
0
votes
55
views
asked a month ago
I'm trying to load two target tables from a single source. The source has data for EMP NAME and ADDRESS. Target table A has EMP ID (auto-generated PK) and EMP NAME. Table B has EMP ID (foreign key), ADDRESS ID (auto-generated PK), and ADDRESS. How do I load these two tables using AWS Glue? There are no proper notes anywhere for this; can you help clarify?
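A rough sketch of the split we are aiming for is below. Everything here is illustrative: `source_dyf` stands for the DynamicFrame read from the source, and the table, column, and connection names are made up. The open question is the last step: how to get the auto-generated EMP ID from table A back for the foreign key in table B.

```python
# Illustrative only: split the source into the columns each target table needs.
emp_names = source_dyf.select_fields(["EMP_NAME"])              # -> table A (EMP_ID generated by the DB)
addresses = source_dyf.select_fields(["EMP_NAME", "ADDRESS"])   # -> table B, still needs EMP_ID from A

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=emp_names,
    catalog_connection="my_target_connection",
    connection_options={"dbtable": "TABLE_A", "database": "my_db"},
)
# How to populate TABLE_B with the EMP_ID foreign key generated in TABLE_A is the part
# we cannot find documented anywhere.
```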
2
answers
0
votes
21
views
asked a month ago
Attempting to run a very trivial Glue script locally via Docker, I can't seem to connect to a MySQL database that is also running in Docker. My Docker setup is:

```yaml
version: '3.7'
services:
  glue:
    container_name: "dev-glue"
    image: amazon/aws-glue-libs:glue_libs_3.0.0_image_01-arm64
    ports:
      - "4040:4040"
      - "18080:18080"
    volumes:
      - ~/.aws:/home/glue_user/.aws
      - /workspace:/home/glue_user/workspace/
    environment:
      - "AWS_PROFILE=$AWS_PROFILE"
      - "AWS_REGION=us-west-2"
      - "AWS_DEFAULT_REGION=us-west-2"
      - "DISABLE_SSL=true"
    stdin_open: true

  mysql:
    image: mysql:8.0
    container_name: 'dev-mysql'
    command: --default-authentication-plugin=mysql_native_password
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: password
    volumes:
      - mysql-db:/var/lib/mysql
    ports:
      - '3306:3306'

volumes:
  mysql-db:
```

And according to the documentation found here, the following should work without a problem:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())


def df_mysql(glue_context: GlueContext, schema: str, table: str):
    connection = {
        "url": f"jdbc:mysql://dev-mysql/{schema}",
        "dbtable": table,
        "user": "root",
        "password": "password",
        "customJdbcDriverS3Path": "s3://my-bucket/mysql-connector-java-8.0.17.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"
    }

    data_frame: DynamicFrame = glue_context.create_dynamic_frame.from_options(
        connection_type="mysql", connection_options=connection
    )
    data_frame.printSchema()


df_mysql(glueContext, "my_schema", "my_table")
```

However, this fails with:

```
Traceback (most recent call last):
  File "/home/glue_user/workspace/local/local_mysql.py", line 25, in <module>
    df_mysql(glueContext, "my_schema", "my_table")
  File "/home/glue_user/workspace/local/local_mysql.py", line 19, in df_mysql
    data_frame: DynamicFrame = glue_context.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection)
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/dynamicframe.py", line 608, in from_options
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/context.py", line 228, in create_dynamic_frame_from_options
  File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/data_source.py", line 36, in getFrame
  File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o45.getDynamicFrame.
: java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:102)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:102)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:102)
	at org.apache.spark.sql.jdbc.glue.GlueJDBCOptions.<init>(GlueJDBCOptions.scala:14)
	at org.apache.spark.sql.jdbc.glue.GlueJDBCOptions.<init>(GlueJDBCOptions.scala:17)
	at org.apache.spark.sql.jdbc.glue.GlueJDBCSource$.createRelation(GlueJDBCSource.scala:29)
	at com.amazonaws.services.glue.util.JDBCWrapper.tableDF(JDBCUtils.scala:878)
	at com.amazonaws.services.glue.util.NoCondition$.tableDF(JDBCUtils.scala:86)
	at com.amazonaws.services.glue.util.NoJDBCPartitioner$.tableDF(JDBCUtils.scala:172)
	at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:967)
	at com.amazonaws.services.glue.DataSource.getDynamicFrame(DataSource.scala:99)
	at com.amazonaws.services.glue.DataSource.getDynamicFrame$(DataSource.scala:99)
	at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:714)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)
```

What's interesting here is that it's failing due to `java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver`, but if I change the config to use something else, it fails with the same error message. Is there some other environment variable or configuration I'm supposed to set in order for this to work?
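One thing I have been wondering, purely as an assumption on my part and not something from the docs: whether `customJdbcDriverS3Path` is actually fetched when running locally, and whether pointing Spark at a locally downloaded copy of the connector would behave differently. Roughly:

```python
# Untested idea: put a local copy of the MySQL connector jar on the Spark classpath
# before creating the GlueContext; the jar path is wherever it is mounted in the container.
from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

jar = "/home/glue_user/workspace/jars/mysql-connector-java-8.0.17.jar"
conf = (
    SparkConf()
    .set("spark.driver.extraClassPath", jar)
    .set("spark.executor.extraClassPath", jar)
)
glueContext = GlueContext(SparkContext.getOrCreate(conf))
```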
1
answers
0
votes
54
views
asked a month ago
Hello, we are using the AWS Docker container for Glue (available [here](https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/)) and are trying to connect to a Databricks JDBC connection using DatabricksJDBC42.jar (available [here](https://docs.databricks.com/integrations/jdbc-odbc-bi.html)). We placed the jar file both in the same folder as the Jupyter notebook and in the C:/.aws/ folder. When we try to connect we get the error "java.lang.ClassNotFoundException: com.databricks.client.jdbc.Driver". We have used a DB2 driver without issue, using the same setup. Also, when we upload the jar to AWS and attach it to the Glue job as an --extra-jars parameter, it works fine. Has anyone gotten this to work successfully?
3
answers
0
votes
73
views
asked a month ago
Trying to DROP TABLE in Athena and getting the following error: [ErrorCategory:USER_ERROR, ErrorCode:PERMISSION_ERROR], Detail:Amazon Athena experienced a permission error. Please provide proper permission and submitting the query again. If the issue reoccurs, contact AWS support for further assistance. I also tried deleting from Glue under tables and got this error: Failed to delete all tables. AccessDeniedException: Insufficient Lake Formation permission(s): Required Drop on [table name] I went to Lake Formation and granted myself SUPER on the database I'm trying to delete from but I'm still getting the same errors.
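For reference, is an explicit table-level DROP grant what's missing here? A sketch of what I mean (this is a guess on my part, not something I have verified, and the principal ARN, database, and table names are placeholders):

```python
# Guess at the missing grant: explicit DROP on the table itself for my principal,
# rather than only SUPER on the database.
import boto3

lakeformation = boto3.client("lakeformation")
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/my-analyst-role"},
    Resource={"Table": {"DatabaseName": "my_database", "Name": "my_table"}},
    Permissions=["DROP"],
)
```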
1
answers
0
votes
33
views
AWS
asked a month ago
We have a step function for orchestrating various Glue jobs. The Glue jobs have a built-in retry mechanism and we currently have them set to retry once. In the case where the job _fails_ the first time but succeeds on the Glue Job retry (not the SFN task retry), the step function thinks the task has already failed. Here's an example of the task as defined in SFN:

```
{
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Parameters": {
    "JobName.$": ...,
    "Arguments": { ... }
  },
  "Next": "Notify Success",
  "ResultPath": null,
  "Catch": [
    {
      "ErrorEquals": [ "States.ALL" ],
      "Next": "Notify Failure"
    }
  ]
}
```

The job fails, and even has `"Attempt": 0` in the `cause` field of the failure event. Is there a way to "Catch" on this? Or another method of having the step function wait for the Glue Job to complete its retries? We could have the SFN manage all of the retries, but I'd rather not do that as there's a lot of delay between SFN<>Glue.
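For completeness, the SFN-managed retry we would rather avoid would look roughly like this on the task (the interval, attempt count, and backoff are just examples):

```
"Retry": [
  {
    "ErrorEquals": [ "States.ALL" ],
    "IntervalSeconds": 120,
    "MaxAttempts": 1,
    "BackoffRate": 1.0
  }
]
```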
1
answers
0
votes
24
views
asked a month ago
Hello all, I am trying to create a small Python shell job to connect to an Oracle DB. In the code I have `import oracledb`. Since this module is not available in the job environment, I ran `pip install oracledb` on my local system, and a folder named oracledb was generated in C:\Dev\Python3.11\Lib\site-packages. I then copied that folder to an S3 bucket and passed the S3 path, like s3://XXXXXXXX/oracledb/. Now when I run the job I get the error below: CommandFailedException: Library file doesn't exist: /tmp/glue-python-libs-jmz3/
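From what I have read about --extra-py-files (please correct me if this is wrong), I think the job expects a .whl or .egg file rather than a copied site-packages folder. Something like the sketch below is what I am guessing at; the job name, role ARN, script location, and wheel file name are only examples:

```python
# Guess at the intended setup: upload the oracledb wheel to S3 and reference it via
# --extra-py-files instead of the copied folder. Names and paths are examples only.
import boto3

glue = boto3.client("glue")
glue.update_job(
    JobName="my-oracle-python-shell-job",
    JobUpdate={
        "Role": "arn:aws:iam::111122223333:role/my-glue-role",
        "Command": {
            "Name": "pythonshell",
            "ScriptLocation": "s3://XXXXXXXX/scripts/oracle_job.py",
            "PythonVersion": "3.9",
        },
        "DefaultArguments": {
            "--extra-py-files": "s3://XXXXXXXX/libs/oracledb-1.4.2-cp39-cp39-manylinux_2_17_x86_64.whl"
        },
    },
)
```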
2
answers
0
votes
51
views
asked a month ago
### Issue

`relationalize` is coercing timestamp columns of nested child tables to string. Note I've tested this using Glue 4.0 locally with the AWS-provided `amazon/aws-glue-libs:glue_libs_4.0.0_image_01` Docker image; however, it is derived from a live Glue issue we have. Our real example is much more complicated, with more levels of nesting, and as such re-casting to timestamp is a possible workaround but really not ideal.

### Minimal Reproduction Script

```
from pyspark.context import SparkContext
from pyspark import SparkConf
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, ArrayType
from datetime import datetime

sc = SparkContext.getOrCreate(SparkConf())
glueContext = GlueContext(sc)
spark = glueContext.spark_session

schema: StructType = StructType(
    [
        StructField("parent_id", StringType(), True),
        StructField("timestamp_col", TimestampType(), True),
        StructField(
            "child",
            ArrayType(
                StructType(
                    [
                        StructField("child_id", StringType(), True),
                        StructField("child_field", StringType(), True),
                        StructField("child_timestamp", TimestampType(), True),
                    ]
                ),
                True,
            ),
            True,
        ),
    ]
)

input_data = {
    "parent_id": "test_id",
    "updatedAt": datetime(2022, 4, 1, 20, 50, 2, 515000),
    "child": [
        {
            "_id": "test_child_id",
            "child_field": "test_child_field",
            "child_timestamp": datetime(2022, 4, 6, 19, 0, 0),
        },
    ],
}
data = [tuple(input_data.values())]

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
"""
root
 |-- parent_id: string (nullable = true)
 |-- timestamp_col: timestamp (nullable = true)
 |-- child: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- child_id: string (nullable = true)
 |    |    |-- child_field: string (nullable = true)
 |    |    |-- child_timestamp: timestamp (nullable = true)  <- Correct type
"""

dynamic_frame = DynamicFrame.fromDF(df, glueContext, "test")
dfc = dynamic_frame.relationalize("root", "./tmp/")

root = dfc.select("root").toDF()
root.printSchema()
"""
root
 |-- parent_id: string (nullable = true)
 |-- timestamp_col: timestamp (nullable = true)  <- Correct type
 |-- child: long (nullable = true)
"""

viewers = dfc.select("root_child").toDF()
viewers.printSchema()
"""
root
 |-- id: long (nullable = true)
 |-- index: integer (nullable = true)
 |-- child.val.child_field: string (nullable = true)
 |-- child.val.child_timestamp: string (nullable = true)  <- Only incorrect type
"""
```

As called out in the comments alongside the printed output, `child.val.child_timestamp` has been coerced to a string. Its value is just the string representation of the timestamp.

### Question

As a user this looks like a bug, since the `relationalize` documentation makes no mention of modifying datatypes. Is there a way to control this behavior or prevent it?
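For concreteness, the re-cast workaround mentioned above looks roughly like this in our case (the column name needs backticks because of the dots `relationalize` puts in it). It works for one column but does not scale nicely to our deeply nested real schema:

```
# The re-cast workaround: add a properly typed column next to the string one that
# relationalize produced. Fine for one column, painful across many nested levels.
from pyspark.sql.functions import col

viewers_fixed = viewers.withColumn(
    "child_timestamp_ts",
    col("`child.val.child_timestamp`").cast("timestamp"),
)
viewers_fixed.printSchema()
```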
1
answers
1
votes
28
views
asked a month ago
I need advice on which approach is considered best practice when writing data to S3 and populating metadata in the Glue catalog. Are there any pros and cons? 1. `df.write.parquet("s3://path_to_my_database_and_table")` and then run a crawler on the same S3 path. 2. `df.write.mode("overwrite").saveAsTable("my_database.my_table")`. 3. `df.write.mode("overwrite").saveAsTable("my_database.my_table")` and run a crawler on the already-created table. I would appreciate it if you would share your experience. Thank you.
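To make option 1 concrete, this is roughly what I mean (the crawler name is a placeholder, and the crawler is assumed to target the same prefix the parquet is written to):

```python
# Option 1: write parquet to the table prefix, then trigger a crawler that targets it.
import boto3

df.write.parquet("s3://path_to_my_database_and_table")
boto3.client("glue").start_crawler(Name="my_table_crawler")
```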
2
answers
0
votes
23
views
asked a month ago
Each supplier provides me with their product data in their own format, so I need to map them in order to import the data into Magento. As the product data needs to be updated frequently, I'm wondering whether AWS Glue DataBrew would be a suitable service to map these different Excel files into the same unified Magento format. I would configure it once, and whenever I get an updated file I can just run the Glue job again. Would you recommend Glue (DataBrew) for this, or is it really only meant for data preparation for analytics?
0
answers
0
votes
19
views
Max
asked a month ago