Questions tagged with AWS Glue
Hello,
AWS recently announced that Glue crawlers can now create native Delta Lake tables (last December: https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/). We tested it and it works fine. However, we would prefer not to use crawlers.
Is this currently the only way to create a native Delta Lake table? Are there plans to allow this through the Glue console's table creation screen?
As a side note, it looks like Terraform is still missing a "CreateNativeDeltaTable" option in its latest provider (they have an open issue for that).
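In case it helps the discussion, this is roughly what we are experimenting with instead of a crawler: registering the table directly with the Glue `CreateTable` API. This is only a sketch; the table parameter keys we set are an assumption based on what the crawler produces, and the bucket, database, and table names are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Sketch: register an existing Delta table in the Glue Data Catalog without a crawler.
# The Parameters keys ("table_type", "spark.sql.sources.provider") are assumptions,
# not a documented contract; all names and paths are placeholders.
glue.create_table(
    DatabaseName="my_database",
    TableInput={
        "Name": "my_delta_table",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "table_type": "delta",                      # assumed marker for a native Delta table
            "spark.sql.sources.provider": "delta",
        },
        "StorageDescriptor": {
            "Location": "s3://my-bucket/path/to/delta-table/",
            "Columns": [
                {"Name": "id", "Type": "string"},       # placeholder schema
            ],
        },
    },
)
```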
Thanks.
Cheers,
Fabrice
I am planning to use catalogPartitionPredicate in one of my projects, but there is one scenario I am unable to handle. Details below:
1. Partition columns: year, month & day
2. catalogPartitionPredicate: `year>='2021' and month>='12'`
Once the year rolls over to 2022 (e.g. 2022-01-01), I still want to read data from 2021-12-01 onwards, but this expression no longer works: the 2022 partitions are excluded because their month values fail `month>='12'`. I tried concatenating the partition keys, but that didn't work.
Is there any way to get to_date-style functionality in the predicate, or any other workaround for this scenario?
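For what it's worth, the workaround I am currently considering is to express the range as a disjunction over the partition columns rather than a single conjunctive filter. A rough sketch (database and table names are placeholders, and I have not verified it covers every edge case):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# "From 2021-12-01 onwards" expressed without to_date():
# either anything in 2021-12, or anything from 2022 onwards.
predicate = "(year = '2021' AND month = '12') OR (year >= '2022')"

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",      # placeholder
    table_name="my_table",       # placeholder
    additional_options={"catalogPartitionPredicate": predicate},
)
```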
I was able to include DatabricksJDBC42.jar in my Glue Docker container used for local development ([link](https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/)).
I can reach the host using a Jupyter notebook, but I am getting an SSL-type error:
```
Py4JJavaError: An error occurred while calling o80.load.
: java.sql.SQLException: [Databricks][DatabricksJDBCDriver](500593) Communication link failure. Failed to connect to server. Reason: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target.
```
My connection string looks like this:
```python
.option("url", "jdbc:databricks://host.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/111111111111111/1111-111111-abcdefghi;AuthMech=3;UseNativeQuery=0;StripCatalogName=0;") \
.option("dbtable", "select 1") \
.option("driver", "com.databricks.client.jdbc.Driver") \
.load()
```
I used the same JDBC string in the code uploaded to our live account, and the AWS Glue job runs and executes the queries in dbtable just fine. It's only in the local Docker Glue development container that we get this SSL error.
I tried adding separate **options** for sslConnection and sslCertLocation, and tried placing the certificate files in /root/.aws as well as in the Jupyter notebook folder. The cert shows up in directory listings and is correctly referenced, but the JDBC connection still fails with the SSL error.
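Since the PKIX error suggests the driver's JVM cannot build a trust chain, the next thing I plan to try is pointing the Spark driver JVM at a truststore containing the Databricks certificate, rather than passing the cert as a JDBC option. A rough sketch, with placeholder paths and no guarantee this is the right knob for this image:

```python
from pyspark.sql import SparkSession

# Assumes the Databricks root certificate has already been imported into a JKS
# truststore inside the container; the path and password below are placeholders.
truststore = "/home/glue_user/workspace/certs/databricks-truststore.jks"

spark = (
    SparkSession.builder
    .config(
        "spark.driver.extraJavaOptions",
        f"-Djavax.net.ssl.trustStore={truststore} "
        "-Djavax.net.ssl.trustStorePassword=changeit",
    )
    .getOrCreate()
)
```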
Anyone see this before or have a suggestion for next steps?
Thanks.
I am trying to load two target tables from a single source. The source has data for EMP NAME and ADDRESS. Target table A has EMP ID (auto-generated PK) and EMP NAME; table B has EMP ID (foreign key), ADDRESS ID (auto-generated PK), and ADDRESS.
How can I load these two tables using AWS Glue?
I can't find clear documentation anywhere for this; can you help clarify?
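In case it helps frame an answer, the direction I have been sketching is: write the parent table first, read the generated keys back, then join them onto the child rows before writing table B. This is only a rough sketch with placeholder connection details, assuming a JDBC target that generates the IDs on insert; I don't know if it is the recommended pattern.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

jdbc_url = "jdbc:mysql://my-host:3306/my_db"            # placeholder target
props = {"user": "user", "password": "password"}         # placeholder credentials

source = spark.read.parquet("s3://my-bucket/source/")    # placeholder source with EMP_NAME, ADDRESS

# 1) Write the parent rows; EMP_ID is generated by the database on insert.
source.select("EMP_NAME").distinct() \
    .write.jdbc(jdbc_url, "TABLE_A", mode="append", properties=props)

# 2) Read the generated EMP_IDs back and join them onto the child rows.
table_a = spark.read.jdbc(jdbc_url, "TABLE_A", properties=props)
child = source.join(table_a, on="EMP_NAME", how="inner").select("EMP_ID", "ADDRESS")

# 3) Write table B; ADDRESS_ID is again generated by the database.
child.write.jdbc(jdbc_url, "TABLE_B", mode="append", properties=props)
```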
Attempting to run a very trivial Glue script locally via Docker I can't seem to connect to a mysql database running also in docker.
My docker setup is:
```yaml
version: '3.7'
services:
  glue:
    container_name: "dev-glue"
    image: amazon/aws-glue-libs:glue_libs_3.0.0_image_01-arm64
    ports:
      - "4040:4040"
      - "18080:18080"
    volumes:
      - ~/.aws:/home/glue_user/.aws
      - /workspace:/home/glue_user/workspace/
    environment:
      - "AWS_PROFILE=$AWS_PROFILE"
      - "AWS_REGION=us-west-2"
      - "AWS_DEFAULT_REGION=us-west-2"
      - "DISABLE_SSL=true"
    stdin_open: true
  mysql:
    image: mysql:8.0
    container_name: 'dev-mysql'
    command: --default-authentication-plugin=mysql_native_password
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: password
    volumes:
      - mysql-db:/var/lib/mysql
    ports:
      - '3306:3306'
volumes:
  mysql-db:
```
And according to the documentation found here, the following should work without a problem:
```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

def df_mysql(glue_context: GlueContext, schema: str, table: str):
    connection = {
        "url": f"jdbc:mysql://dev-mysql/{schema}",
        "dbtable": table,
        "user": "root",
        "password": "password",
        "customJdbcDriverS3Path": "s3://my-bucket/mysql-connector-java-8.0.17.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"
    }
    data_frame: DynamicFrame = glue_context.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection)
    data_frame.printSchema()

df_mysql(glueContext, "my_schema", "my_table")
```
However this fails with
```
Traceback (most recent call last):
File "/home/glue_user/workspace/local/local_mysql.py", line 25, in <module>
df_mysql(glueContext, "my_schema", "my_table")
File "/home/glue_user/workspace/local/local_mysql.py", line 19, in df_mysql
data_frame: DynamicFrame = glue_context.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection)
File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/dynamicframe.py", line 608, in from_options
File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/context.py", line 228, in create_dynamic_frame_from_options
File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/data_source.py", line 36, in getFrame
File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o45.getDynamicFrame.
: java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:102)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:102)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:102)
at org.apache.spark.sql.jdbc.glue.GlueJDBCOptions.<init>(GlueJDBCOptions.scala:14)
at org.apache.spark.sql.jdbc.glue.GlueJDBCOptions.<init>(GlueJDBCOptions.scala:17)
at org.apache.spark.sql.jdbc.glue.GlueJDBCSource$.createRelation(GlueJDBCSource.scala:29)
at com.amazonaws.services.glue.util.JDBCWrapper.tableDF(JDBCUtils.scala:878)
at com.amazonaws.services.glue.util.NoCondition$.tableDF(JDBCUtils.scala:86)
at com.amazonaws.services.glue.util.NoJDBCPartitioner$.tableDF(JDBCUtils.scala:172)
at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:967)
at com.amazonaws.services.glue.DataSource.getDynamicFrame(DataSource.scala:99)
at com.amazonaws.services.glue.DataSource.getDynamicFrame$(DataSource.scala:99)
at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:714)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
```
What's interesting here is that it's failing with `java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver`, but if I change the config to use a different driver class it fails with the same kind of error.
Is there some other environment variable or configuration I'm supposed to set for this to work?
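For anyone else hitting this: the next thing I plan to try is putting the connector jar on the local Spark classpath explicitly when the context is created, since `customJdbcDriverS3Path` may behave differently inside the container than in a real Glue job. A sketch only; the jar path is a placeholder, and I have not verified whether the S3-path option can simply be dropped this way.

```python
from awsglue.context import GlueContext
from pyspark import SparkConf
from pyspark.context import SparkContext

# Placeholder path: the MySQL connector jar copied into the mounted workspace.
jar_path = "/home/glue_user/workspace/jars/mysql-connector-java-8.0.17.jar"

conf = (
    SparkConf()
    .set("spark.jars", jar_path)                   # ship the jar with the Spark session
    .set("spark.driver.extraClassPath", jar_path)  # and expose it to the driver's class loader
)
glueContext = GlueContext(SparkContext.getOrCreate(conf=conf))

# Assumption (unverified): with the jar on the classpath, customJdbcDriverS3Path
# may no longer be needed locally.
connection = {
    "url": "jdbc:mysql://dev-mysql/my_schema",
    "dbtable": "my_table",
    "user": "root",
    "password": "password",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver",
}
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql", connection_options=connection
)
dyf.printSchema()
```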
Hello,
We are using the AWS Docker container for Glue (available [here](https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/)) and we are trying to connect to a Databricks JDBC connection using DatabricksJDBC42.jar (available [here](https://docs.databricks.com/integrations/jdbc-odbc-bi.html)). We placed the jar file in the same folder as the Jupyter notebook, and have also placed it in the C:/.aws/ folder. When we try to connect we get the error "java.lang.ClassNotFoundException: com.databricks.client.jdbc.Driver".
We have used the DB2 driver without issue using the same setup. Also, when we upload the jar to AWS and attach it to the Glue job via the --extra-jars parameter, it works fine.
Has anyone gotten this to successfully work?
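One diagnostic that might help narrow this down (a sketch; nothing here is specific to our setup): check from the notebook whether the driver class is actually visible to the JVM that Spark started, since copying the jar next to the notebook does not by itself put it on the driver's classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# If this raises java.lang.ClassNotFoundException, the jar is not on the driver's
# classpath, regardless of where it sits on disk.
spark.sparkContext._jvm.java.lang.Class.forName("com.databricks.client.jdbc.Driver")
```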
Trying to DROP TABLE in Athena and getting the following error:
[ErrorCategory:USER_ERROR, ErrorCode:PERMISSION_ERROR], Detail:Amazon Athena experienced a permission error. Please provide proper permission and submitting the query again. If the issue reoccurs, contact AWS support for further assistance.
I also tried deleting from Glue under tables and got this error:
Failed to delete all tables.
AccessDeniedException: Insufficient Lake Formation permission(s): Required Drop on [table name]
I went to Lake Formation and granted myself Super on the database I'm trying to delete from, but I'm still getting the same errors.
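In case it matters, this is roughly how an explicit table-level grant would look via boto3; it's only a sketch with placeholder principal, database, and table names, and I'm not sure whether DROP on the table (rather than Super on the database) is actually the missing piece.

```python
import boto3

lf = boto3.client("lakeformation")

# Sketch: grant DROP on the specific table to the IAM principal running the query.
# All identifiers below are placeholders.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111111111111:role/my-analyst-role"},
    Resource={"Table": {"DatabaseName": "my_database", "Name": "my_table"}},
    Permissions=["DROP"],
)
```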
We have a step function for orchestrating various Glue jobs. The Glue jobs have a built-in retry mechanism and we currently have them set to retry once. In the case where the job _fails_ the first time but succeeds on the Glue Job retry (not the SFN task retry), the step function thinks the task has already failed.
Here's an example of the task as defined in SFN:
```
{
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Parameters": {
    "JobName.$": ...,
    "Arguments": {
      ...
    }
  },
  "Next": "Notify Success",
  "ResultPath": null,
  "Catch": [
    {
      "ErrorEquals": [
        "States.ALL"
      ],
      "Next": "Notify Failure"
    }
  ]
}
```
The job fails, and even has `"Attempt": 0` in the `cause` field of the failure event. Is there a way to "Catch" on this? Or another method of having the step function wait for the Glue Job to complete its retries?
We could have the SFN manage all of the retries, but I'd rather not do that as there's a lot of delay between SFN<>Glue.
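For reference, the fallback I'd rather avoid (setting the Glue job's MaxRetries to 0 and letting the state machine retry the `startJobRun.sync` task itself) would look roughly like this, sketched as a Python dict; the interval, attempt, and backoff values are placeholders.

```python
# Sketch of the Step Functions-side retry I'd rather not rely on:
# the Glue job has MaxRetries = 0, and the task retries on failure instead.
glue_task_retry = {
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 60,   # placeholder
            "MaxAttempts": 1,        # placeholder
            "BackoffRate": 2.0,      # placeholder
        }
    ]
}
```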
Hello All,
I am trying to create a small Python shell job to connect to an Oracle DB.
In the code I have:
`import oracledb`
Since this module is not available in the job by default, I ran `pip install oracledb` on my local system, and a folder named oracledb was generated in C:\Dev\Python3.11\Lib\site-packages.
I then copied that folder to an S3 bucket and passed the S3 path, like s3://XXXXXXXX/oracledb/.
Now when I run the job I get the error: CommandFailedException: Library file doesn't exist: /tmp/glue-python-libs-jmz3/
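If it helps, the direction I am exploring next is to package the dependency as a wheel and pass the .whl file via `--extra-py-files` instead of copying the site-packages folder. A sketch with placeholder names; I have not confirmed this is the expected setup for a Python shell job with oracledb.

```python
import boto3

glue = boto3.client("glue")

# Sketch: point the Python shell job at a .whl in S3 rather than a copied folder.
# The job name and the wheel file name/platform are placeholders.
glue.start_job_run(
    JobName="my-oracle-python-shell-job",
    Arguments={
        "--extra-py-files": "s3://XXXXXXXX/libs/oracledb-<version>-<platform>.whl",
    },
)
```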
### Issue
`relationalize` is coercing timestamp columns of nested child tables to string.
Note: I've tested this using Glue 4.0 locally with the AWS-provided `amazon/aws-glue-libs:glue_libs_4.0.0_image_01` Docker image; however, it is derived from a live Glue issue we have. Our real example is much more complicated, with more levels of nesting, so re-casting to timestamp is a possible workaround but really not ideal.
### Minimal Reproduction Script
```python
from pyspark.context import SparkContext
from pyspark import SparkConf
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, ArrayType
from datetime import datetime

sc = SparkContext.getOrCreate(SparkConf())
glueContext = GlueContext(sc)
spark = glueContext.spark_session

schema: StructType = StructType(
    [
        StructField("parent_id", StringType(), True),
        StructField("timestamp_col", TimestampType(), True),
        StructField(
            "child",
            ArrayType(
                StructType(
                    [
                        StructField("child_id", StringType(), True),
                        StructField("child_field", StringType(), True),
                        StructField("child_timestamp", TimestampType(), True),
                    ]
                ),
                True,
            ),
            True,
        ),
    ]
)

input_data = {
    "parent_id": "test_id",
    "updatedAt": datetime(2022, 4, 1, 20, 50, 2, 515000),
    "child": [
        {
            "_id": "test_child_id",
            "child_field": "test_child_field",
            "child_timestamp": datetime(2022, 4, 6, 19, 0, 0),
        },
    ],
}
data = [tuple(input_data.values())]

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
"""
root
 |-- parent_id: string (nullable = true)
 |-- timestamp_col: timestamp (nullable = true)
 |-- child: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- child_id: string (nullable = true)
 |    |    |-- child_field: string (nullable = true)
 |    |    |-- child_timestamp: timestamp (nullable = true) <- Correct type
"""

dynamic_frame = DynamicFrame.fromDF(df, glueContext, "test")
dfc = dynamic_frame.relationalize("root", "./tmp/")

root = dfc.select("root").toDF()
root.printSchema()
"""
root
 |-- parent_id: string (nullable = true)
 |-- timestamp_col: timestamp (nullable = true) <- Correct type
 |-- child: long (nullable = true)
"""

viewers = dfc.select("root_child").toDF()
viewers.printSchema()
"""
root
 |-- id: long (nullable = true)
 |-- index: integer (nullable = true)
 |-- child.val.child_field: string (nullable = true)
 |-- child.val.child_timestamp: string (nullable = true) <- Only incorrect type
"""
```
As called out in the comments with the printed output, `child.val.child_timestamp` has been coerced to a string. Its value is just the string representation of the Timestamp.
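For completeness, the re-cast workaround mentioned above looks like this when continuing the script; a sketch only, since in our real job it would have to be repeated for every nested timestamp column.

```python
from pyspark.sql.functions import col

# Cast the relationalized child column back to a timestamp.
# Backticks are required because the flattened column name contains dots.
viewers_fixed = viewers.withColumn(
    "child.val.child_timestamp",
    col("`child.val.child_timestamp`").cast("timestamp"),
)
viewers_fixed.printSchema()
```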
### Question
As a user this looks like a bug, since the `relationalize` documentation makes no mention of modifying data types. Is there a way to control this behavior or prevent it?
Need advice on which approach is considered best practice when writing data to S3 and populating metadata in the Glue catalog.
Are there any pros and cons? (I've also sketched a fourth variant below the list.)
1. `df.write.parquet("s3://path_to_my_database_and_table")` and then run a crawler on the same S3 path.
2. `df.write.mode("overwrite").saveAsTable("my_database.my_table")`
3. `df.write.mode("overwrite").saveAsTable("my_database.my_table")` and then run a crawler on the already created tables.
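The fourth variant I've been considering, in case anyone has experience with it: writing through a Glue sink with catalog updates enabled, so neither a crawler nor saveAsTable is needed. A sketch with placeholder bucket, database, and table names; I have not benchmarked it against the options above.

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Stand-in for the real DataFrame being written.
df = glue_context.spark_session.createDataFrame([(1, "a")], ["id", "val"])
dyf = DynamicFrame.fromDF(df, glue_context, "out")

# Write Parquet to S3 and create/update the catalog table in the same step.
sink = glue_context.getSink(
    connection_type="s3",
    path="s3://my-bucket/my_database/my_table/",   # placeholder
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
)
sink.setCatalogInfo(catalogDatabase="my_database", catalogTableName="my_table")
sink.setFormat("glueparquet")
sink.writeFrame(dyf)
```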
I would appreciate it if you would share your experience.
Thank you.
Each supplier provides me with their product data in their own format, so I need to map the files in order to import the data into Magento.
As the product data needs to be updated frequently, I'm wondering whether AWS Glue (DataBrew) would be a suitable service for mapping these different Excel files into the same unified Magento format. I would configure it once, and whenever I receive an updated file I could simply run the Glue job again.
Would you recommend Glue (DataBrew) for this, or is it really only meant for data-analysis preparation?