Questions tagged with AWS Glue
Hello All,
I am working on a Glue PySpark script. In this script I read data from a table and store it in a PySpark DataFrame. Now I want to add a new column whose value is calculated by passing the existing columns to a Lambda function and returning the result.
So is it possible to call the Lambda service from a Glue script?
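A minimal sketch of the idea, assuming a Lambda function (called "my-calc-function" here, a placeholder) that accepts the column values as JSON and returns a "result" field:
```python
import json

import boto3
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def call_lambda(col_a, col_b):
    # One Lambda invocation per row is slow; batching (e.g. a pandas UDF)
    # would scale better, but this shows the basic call.
    client = boto3.client("lambda")
    response = client.invoke(
        FunctionName="my-calc-function",  # placeholder function name
        Payload=json.dumps({"col_a": col_a, "col_b": col_b}).encode("utf-8"),
    )
    return json.loads(response["Payload"].read())["result"]

call_lambda_udf = udf(call_lambda, StringType())

# df is the DataFrame read from the table earlier in the job
df = df.withColumn("new_column", call_lambda_udf(df["col_a"], df["col_b"]))
```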
Hello,
I'm stuck on this error and I can't find anything helpful.
I'm trying to migrate data from S3 to Redshift.
Note: I crawled both, and both tables are in my Glue databases.
But when I run the job, I get this error:
An error occurred while calling o131.pyWriteDynamicFrame. Exception thrown in awaitResult:
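For reference, a minimal sketch of the general shape of such a catalog-read / Redshift-write job (the database, table, connection, and bucket names are placeholders, not my actual ones):
```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the crawled S3 table from the Glue Data Catalog.
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_catalog_db",
    table_name="s3_source_table",
)

# Write to Redshift through a Glue connection; the temp dir is required
# because Glue stages the data in S3 before the Redshift COPY.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=source,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "public.target_table", "database": "dev"},
    redshift_tmp_dir="s3://my-temp-bucket/redshift-temp/",
)
```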
I have a Glue job which reads from a Glue Catalog table in Hudi format, and every read returns the whole dataset, while I expect only the first run to return the data and all subsequent runs to return an empty dataset (given there were no changes to the source Hudi dataset).
My Hudi config is the following:
```python
hudi_config = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': config['sort_key'],
    'hoodie.datasource.write.partitionpath.field': config['partition_field'],
    'hoodie.datasource.hive_sync.partition_fields': config['partition_field'],
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.HiveStylePartitionValueExtractor',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.assume_date_partitioning': 'false',
    'hoodie.datasource.write.recordkey.field': config['primary_key'],
    'hoodie.table.name': config['hudi_table'],
    'hoodie.datasource.hive_sync.database': config['target_database'],
    'hoodie.datasource.hive_sync.table': config['hudi_table'],
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.cleaner.commits.retained': 10,
    'path': f"s3://{config['target_bucket']}{config['s3_path']}",
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',  # noqa: E501
    'hoodie.bulkinsert.shuffle.parallelism': 200,
    'hoodie.upsert.shuffle.parallelism': 200,
    'hoodie.insert.shuffle.parallelism': 200,
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    # 'hoodie.datasource.write.operation': "insert"
    'hoodie.datasource.write.operation': "upsert"
}
```
The reading part looks like this:
```python
Read_node1 = glueContext.create_data_frame.from_catalog(
    database="gluedatabase",
    table_name="table",
    transformation_ctx="Read_node1"
)
AWSGlueDataCatalog_node = DynamicFrame.fromDF(Read_node1, glueContext, "AWSGlueDataCatalog_node")
```
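For comparison, a sketch of what a Hudi incremental read would look like via the plain Spark reader (I have not confirmed this is the intended approach, and the begin-instant commit time is a placeholder):
```python
# Incremental queries return only records committed after the given instant,
# which is closer to the "only new data" behaviour described above.
incremental_options = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20230101000000",  # placeholder commit time
}

incremental_df = spark.read.format("hudi") \
    .options(**incremental_options) \
    .load(f"s3://{config['target_bucket']}{config['s3_path']}")
```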
The result of Read_node1 is written to an S3 bucket and always generates the same file.
Thank you
On the AWS Glue console page, when trying to run a job with parameters, the job parameters section is highly unstable while adding a key and value, so I am unable to add any key-value pair. Has anyone faced the same issue?
hello,
AWS recently announced that Glue crawlers can now create native Delta Lake tables (last December, https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/). We tested it and it works fine. However, we would like to avoid using crawlers.
Is this currently the only way to create a native Delta Lake table? Is it planned to allow this through the Glue console's table creation screen?
As a side note, it looks like Terraform is still missing a "CreateNativeDeltaTable" option in its latest provider (they have an open issue for that).
Thanks.
Cheers,
Fabrice
I am planning to use catalogPartitionPredicate in one of my projects, but I am unable to handle one of the scenarios. Below are the details:
1. Partition columns: Year, Month & Day
2. catalogPartitionPredicate: `year>='2021' and month>='12'`
If the year rolls over to 2022 (2022-01-01) and I want to read data from 2021-12-01 onward, the expression can't handle it, because `month>='12'` filters out the early-2022 partitions. I tried concatenating the partition keys, but that didn't work.
Is there any way to implement to_date-like functionality, or any other workaround to handle this scenario?
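A sketch of a boundary-safe predicate that spells out the two cases instead of relying on a single month comparison (the database and table names are placeholders; the column names follow the predicate above):
```python
# Either a later year, or the boundary year from the starting month onward.
predicate = "(year > '2021') OR (year = '2021' AND month >= '12')"

dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",   # placeholder
    table_name="my_table",    # placeholder
    additional_options={"catalogPartitionPredicate": predicate},
)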
I was able to include DatabricksJDBC42.jar in the Glue Docker container I use for local development ([link](https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/)).
I am able to reach the host from a Jupyter notebook, but I am getting an SSL-type error:
`Py4JJavaError: An error occurred while calling o80.load.
: java.sql.SQLException: [Databricks][DatabricksJDBCDriver](500593) Communication link failure. Failed to connect to server. Reason: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target.`
My connection string looks like this:
`.option("url","jdbc:databricks://host.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/111111111111111/1111-111111-abcdefghi;AuthMech=3;UseNativeQuery=0;StripCatalogName=0;")\
.option("dbtable","select 1")\
.option("driver", "com.databricks.client.jdbc.Driver")\
.load()`
I used the same JDBC string in code uploaded to our live account, and the AWS Glue job runs and executes the queries in dbtable just fine. It's only in the local Docker Glue development container that we get this SSL error.
I tried adding separate **option** entries for sslConnection and sslCertLocation, and placed the cert files in /root/.aws as well as in the Jupyter notebook folder. The cert shows up in directory listings and is correctly referenced, but the JDBC connection still fails with the SSL error.
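One approach I have seen suggested for PKIX errors (not yet verified in this container) is pointing the driver JVM at a truststore that contains the endpoint's CA certificate before the SparkContext is created; the truststore path and password below are placeholders:
```python
from awsglue.context import GlueContext
from pyspark import SparkConf
from pyspark.context import SparkContext

# Assumes the Databricks CA certificate has been imported into this truststore.
conf = SparkConf().set(
    "spark.driver.extraJavaOptions",
    "-Djavax.net.ssl.trustStore=/home/glue_user/certs/truststore.jks "
    "-Djavax.net.ssl.trustStorePassword=changeit",
)
glueContext = GlueContext(SparkContext.getOrCreate(conf=conf))
```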
Anyone see this before or have a suggestion for next steps?
Thanks.
Trying to load two tables from a single source. The source has data for EMP NAME and ADDRESS. Target table A has EMP ID (auto-generated PK) and EMP NAME; target table B has EMP ID (foreign key), ADDRESS ID (auto-generated PK), and ADDRESS.
Now, how do I load these two tables using AWS Glue?
I can't find proper notes anywhere for this; can you help clarify?
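A rough sketch of the job shape I have in mind (connection, table, and column names are placeholders; resolving the generated EMP ID for the foreign key would still need a lookup or staging step after table A is loaded, and table B would follow the same write pattern):
```python
from awsglue.transforms import SelectFields

# source_dyf is the DynamicFrame read from the single source.
# Split it into one frame per target table and let the database
# generate the surrogate keys on insert.
emp_frame = SelectFields.apply(frame=source_dyf, paths=["emp_name"])
address_frame = SelectFields.apply(frame=source_dyf, paths=["emp_name", "address"])

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=emp_frame,
    catalog_connection="my-target-db",  # placeholder Glue connection
    connection_options={"dbtable": "table_a", "database": "mydb"},
)
```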
Attempting to run a very trivial Glue script locally via Docker, I can't seem to connect to a MySQL database that is also running in Docker.
My Docker setup is:
```yaml
version: '3.7'

services:
  glue:
    container_name: "dev-glue"
    image: amazon/aws-glue-libs:glue_libs_3.0.0_image_01-arm64
    ports:
      - "4040:4040"
      - "18080:18080"
    volumes:
      - ~/.aws:/home/glue_user/.aws
      - /workspace:/home/glue_user/workspace/
    environment:
      - "AWS_PROFILE=$AWS_PROFILE"
      - "AWS_REGION=us-west-2"
      - "AWS_DEFAULT_REGION=us-west-2"
      - "DISABLE_SSL=true"
    stdin_open: true

  mysql:
    image: mysql:8.0
    container_name: 'dev-mysql'
    command: --default-authentication-plugin=mysql_native_password
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: password
    volumes:
      - mysql-db:/var/lib/mysql
    ports:
      - '3306:3306'

volumes:
  mysql-db:
```
And according to the documentation found here, the following should work without a problem:
```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())


def df_mysql(glue_context: GlueContext, schema: str, table: str):
    connection = {
        "url": f"jdbc:mysql://dev-mysql/{schema}",
        "dbtable": table,
        "user": "root",
        "password": "password",
        "customJdbcDriverS3Path": "s3://my-bucket/mysql-connector-java-8.0.17.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"
    }
    data_frame: DynamicFrame = glue_context.create_dynamic_frame.from_options(
        connection_type="mysql", connection_options=connection
    )
    data_frame.printSchema()


df_mysql(glueContext, "my_schema", "my_table")
```
However, this fails with:
```
Traceback (most recent call last):
File "/home/glue_user/workspace/local/local_mysql.py", line 25, in <module>
df_mysql(glueContext, "my_schema", "my_table")
File "/home/glue_user/workspace/local/local_mysql.py", line 19, in df_mysql
data_frame: DynamicFrame = glue_context.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection)
File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/dynamicframe.py", line 608, in from_options
File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/context.py", line 228, in create_dynamic_frame_from_options
File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/data_source.py", line 36, in getFrame
File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o45.getDynamicFrame.
: java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:102)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:102)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:102)
at org.apache.spark.sql.jdbc.glue.GlueJDBCOptions.<init>(GlueJDBCOptions.scala:14)
at org.apache.spark.sql.jdbc.glue.GlueJDBCOptions.<init>(GlueJDBCOptions.scala:17)
at org.apache.spark.sql.jdbc.glue.GlueJDBCSource$.createRelation(GlueJDBCSource.scala:29)
at com.amazonaws.services.glue.util.JDBCWrapper.tableDF(JDBCUtils.scala:878)
at com.amazonaws.services.glue.util.NoCondition$.tableDF(JDBCUtils.scala:86)
at com.amazonaws.services.glue.util.NoJDBCPartitioner$.tableDF(JDBCUtils.scala:172)
at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:967)
at com.amazonaws.services.glue.DataSource.getDynamicFrame(DataSource.scala:99)
at com.amazonaws.services.glue.DataSource.getDynamicFrame$(DataSource.scala:99)
at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:714)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
```
What's interesting here is that it's failing due to `java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver`, but if I change the config to use something else, it fails with the same error message.
Is there some other environment variable or configuration I'm supposed to set in order for this to work?
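One thing I am considering (an assumption on my part, not verified): putting the driver jar on the Spark classpath inside the container directly, instead of relying on customJdbcDriverS3Path; the local jar path is a placeholder:
```python
from awsglue.context import GlueContext
from pyspark import SparkConf
from pyspark.context import SparkContext

# Assumes mysql-connector-java-8.0.17.jar has been copied into the container.
conf = SparkConf().set(
    "spark.jars", "/home/glue_user/workspace/jars/mysql-connector-java-8.0.17.jar"
)
glueContext = GlueContext(SparkContext.getOrCreate(conf=conf))
```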
Hello,
We are using the AWS Docker container for Glue (available [here](https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/)) and are trying to connect to a Databricks JDBC endpoint using DatabricksJDBC42.jar (available [here](https://docs.databricks.com/integrations/jdbc-odbc-bi.html)). We placed the jar file both in the same folder as the Jupyter notebook and in the C:/.aws/ folder. When we try to connect, we get the error "java.lang.ClassNotFoundException: com.databricks.client.jdbc.Driver".
We have used the DB2 driver without issue with the same setup. Also, when we upload the jar to AWS and attach it to the Glue job via the --extra-jars parameter, it works fine.
Has anyone gotten this to successfully work?
Trying to DROP TABLE in Athena and getting the following error:
[ErrorCategory:USER_ERROR, ErrorCode:PERMISSION_ERROR], Detail:Amazon Athena experienced a permission error. Please provide proper permission and submitting the query again. If the issue reoccurs, contact AWS support for further assistance.
I also tried deleting the table from Glue, under Tables, and got this error:
Failed to delete all tables.
AccessDeniedException: Insufficient Lake Formation permission(s): Required Drop on [table name]
I went to Lake Formation and granted myself SUPER on the database I'm trying to delete from, but I'm still getting the same errors.
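One thought (an assumption on my part): the database-level SUPER grant may not cover tables that already existed, so granting DROP on the table itself could behave differently. A boto3 sketch with placeholder names and principal ARN:
```python
import boto3

lakeformation = boto3.client("lakeformation")

# Placeholder principal ARN, database name, and table name.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/me"},
    Resource={"Table": {"DatabaseName": "my_database", "Name": "my_table"}},
    Permissions=["DROP"],
)
```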
We have a step function for orchestrating various Glue jobs. The Glue jobs have a built-in retry mechanism, and we currently have them set to retry once. In the case where the job _fails_ the first time but succeeds on the Glue job retry (not the SFN task retry), the step function considers the task to have failed.
Here's an example of the task as defined in SFN:
```json
{
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Parameters": {
    "JobName.$": ...,
    "Arguments": {
      ...
    }
  },
  "Next": "Notify Success",
  "ResultPath": null,
  "Catch": [
    {
      "ErrorEquals": [
        "States.ALL"
      ],
      "Next": "Notify Failure"
    }
  ]
}
```
The job fails, and the failure event even has `"Attempt": 0` in the `cause` field. Is there a way to "Catch" on this? Or is there another way to have the step function wait for the Glue job to complete its retries?
We could have the SFN manage all of the retries, but I'd rather not do that, as there's a lot of delay between SFN and Glue.