Questions tagged with AWS Glue
I want to join two tables. I have the tables in CSV format stored in an S3 bucket.
1. Is AWS Glue Studio the right option?
2. What is the correct procedure?
3. What are the IAM permissions required?
4. Where can I see the joined table output?
Please shed some light on this.
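For reference, a minimal sketch of what such a join could look like in a Glue PySpark job (Glue Studio generates similar code). The bucket paths, join key, and output location are hypothetical placeholders; the job's IAM role would also need S3 read/write on those prefixes plus the usual Glue service permissions.
```
import sys
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read both CSV datasets from S3 (paths are placeholders)
orders = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/orders/"]},
    format="csv",
    format_options={"withHeader": True},
)
customers = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/customers/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Join on a common key column (hypothetical column name)
joined = Join.apply(orders, customers, "customer_id", "customer_id")

# Write the joined result back to S3; inspect it there, or crawl it and query via Athena
glueContext.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/joined-output/"},
    format="csv",
)
job.commit()
```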
I have a working lab setup where a Glue job extracts all data from a single DynamoDB table to S3 in JSON format. This was done with the super simple setup using the AWS Glue DynamoDB connector, all through the Glue visual editor. I plan to run the job daily to refresh the data. The job is set up with Glue 3.0 & Python 3. Two questions:
1. I assume I need to purge/delete the S3 objects from the previous ETL job each night - how is this done within Glue, or do I need to handle it outside of Glue?
2. I would like to update the job to limit the data sent to S3 to only include DynamoDB records that have a specific key/value (status <> 'completed'), so that I am not loading all of the DynamoDB data into my target. I don't care if the job has to get ALL of the DynamoDB table during extract and then filter it out during the transform phase; if there is a way to selectively get data during the extract phase, even better.
If anyone could advise with a simple example (like the sketch below), I would appreciate it. While I have looked for a little bit, I haven't found much quality educational material, so I am happy to take any suggestions there as well (other than the AWS documentation - I have that, but need some initial direction/reference/101 hands-on).
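A rough sketch of how both pieces might look inside the Glue script, using the GlueContext purge helper before the write and a Filter transform after the DynamoDB read. Table, bucket, and field names are hypothetical placeholders.
```
from awsglue.transforms import Filter
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the full DynamoDB table (table name is a placeholder)
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={"dynamodb.input.tableName": "my_table"},
)

# Keep only records whose status is not 'completed' (transform-phase filter)
not_completed = Filter.apply(frame=dyf, f=lambda row: row["status"] != "completed")

# Purge the previous night's output before writing the fresh extract
# (retentionPeriod=0 removes everything under the prefix immediately)
glueContext.purge_s3_path("s3://my-bucket/dynamo-export/", {"retentionPeriod": 0})

glueContext.write_dynamic_frame.from_options(
    frame=not_completed,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/dynamo-export/"},
    format="json",
)
```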
Hey, I am running into this error during a data ingest job. The job has worked in the past with many different files, but this one refuses to ingest.
I am curious if anyone has overcome it. The error seems like some sort of threading issue, and the job can't write data.
CloudWatch logs:
```
2023-02-23 19:13:01,888 ERROR [shutdown-hook-0] util.Utils (Logging.scala:logError(94)): Uncaught exception in thread shutdown-hook-0
java.lang.ExceptionInInitializerError
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createAndTrack(TemporaryDirectoriesGenerator.java:125)
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createTemporaryDirectories(TemporaryDirectoriesGenerator.java:149)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:356)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1125)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1105)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:994)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2424)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2390)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2353)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.copyFromLocalFile(EmrFileSystem.java:568)
at com.amazonaws.services.glue.LogPusher.upload(LogPusher.scala:27)
at org.apache.spark.util.ShutdownHookManagerWrapper$.$anonfun$addLogPusherHook$2(ShutdownHookManagerWrapper.scala:9)
at org.apache.spark.util.ShutdownHookManagerWrapper$.$anonfun$addLogPusherHook$2$adapted(ShutdownHookManagerWrapper.scala:9)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.util.ShutdownHookManagerWrapper$.$anonfun$addLogPusherHook$1(ShutdownHookManagerWrapper.scala:9)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at scala.util.Try$.apply(Try.scala:209)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalStateException: Shutdown in progress
at java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:66)
at java.lang.Runtime.addShutdownHook(Runtime.java:203)
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoryShutdownHook.<clinit>(TemporaryDirectoryShutdownHook.java:18)
... 31 more
```
Hi,
I am using GlueETL version Spark 3.0 with Python version 
The ETL job has only 2 steps. I am using CodeGenConfiguration to auto-create the Spark script from my service backend.
```
"{\"sink-node-1\":{\"nodeId\":\"sink-node-1\",\"dataPreview\":false,\"previewAmount\":0,\"inputs\":[\"source-node-1\"],\"name\":\"organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e__id=62789a5b-78dd-4d41-ae96-1447674861a6__type=GDL\",\"generatedNodeName\":\"organization_id47e04d2824d347e79911bbfc071c754e__id62789a5b78dd4d41ae961447674861a6__typeGDL_sinknode1\",\"classification\":\"DataSink\",\"type\":\"S3\",\"streamingBatchInterval\":100,\"format\":\"parquet\",\"compression\":\"snappy\",\"path\":\"s3://x-bucket/event_etl_data/source_id=glueetl/schema_id=etl_raw_event/pipeline_id=fcf172f2-1cd1-4f9d-bdce-62b3b0c26696/organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e/model_name=test-sql-with-database-namez__version=None/\",\"partitionKeys\":[[\"year\"],[\"month\"],[\"day\"],[\"hour\"]],\"schemaChangePolicy\":{\"enableUpdateCatalog\":false,\"updateBehavior\":null,\"database\":null,\"table\":null},\"updateCatalogOptions\":\"none\",\"calculatedType\":\"\"},\"source-node-1\":{\"nodeId\":\"source-node-1\",\"dataPreview\":false,\"previewAmount\":0,\"inputs\":[],\"name\":\"organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e__id=9cf97c7b-ced8-4096-a7c3-2ca3560e0fd0__type=SNOWFLAKE\",\"generatedNodeName\":\"organization_id47e04d2824d347e79911bbfc071c754e__id9cf97c7bced84096a7c32ca3560e0fd0__typeSNOWFLAKE_sourcenode1\",\"classification\":\"DataSource\",\"type\":\"Connector\",\"isCatalog\":false,\"connectorName\":\"SNOWFLAKE\",\"connectionName\":\"organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e__id=9cf97c7b-ced8-4096-a7c3-2ca3560e0fd0__type=SNOWFLAKE\",\"connectionType\":\"custom.jdbc\",\"outputSchemas\":[],\"connectionTable\":null,\"query\":\"SELECT \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"A\\\" AS \\\"inputs__A\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"B\\\" AS \\\"inputs__B\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"C\\\" AS \\\"outputs__C\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"ID\\\" AS \\\"feedback_id\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"D\\\" AS \\\"timestamp\\\", year(SYSDATE()) AS \\\"year\\\", month(SYSDATE()) AS \\\"month\\\", day(SYSDATE()) AS \\\"day\\\", hour(SYSDATE()) AS \\\"hour\\\", SYSDATE() AS \\\"log_timestamp\\\" FROM \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\" ORDER BY \\\"D\\\"\",\"additionalOptions\":{\"filterPredicate\":\"\",\"partitionColumn\":null,\"lowerBound\":null,\"upperBound\":null,\"numPartitions\":null,\"jobBookmarkKeys\":[],\"jobBookmarkKeysSortOrder\":\"ASC\",\"dataTypeMapping\":{},\"filterPredicateArg\":[],\"dataTypeMappingArg\":[]},\"calculatedType\":\"\"}}"
```

As you can see, I am using the Snowflake JDBC connector and simply using S3DirectTarget to write the Parquet files to the S3 destination. However, any NULL values in numeric columns of the source table end up as 0.0, and there is no way for me to tell whether these are actual 0.0s or falsely converted 0.0s. Without modifying the PySpark script (my backend service depends on CodeGenConfiguration), is there a way to make sure the NULL values do not get falsely converted?
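For what it's worth, one way to verify which rows are genuinely NULL at the source, outside the generated job, is a short standalone PySpark read over plain JDBC. The account URL, credentials, and table below are placeholders, and this assumes the Snowflake JDBC driver jar is on the classpath.
```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("null-check").getOrCreate()

# Read the same table directly over JDBC (connection details are placeholders)
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:snowflake://myaccount.snowflakecomputing.com/?db=ETL_DEMO&warehouse=MY_WH")
    .option("driver", "net.snowflake.client.jdbc.SnowflakeDriver")
    .option("user", "MY_USER")
    .option("password", "MY_PASSWORD")
    .option("dbtable", '"ETL_DEMO"."PUBLIC"."EXAMPLE_TABLE"')
    .load()
)

# Count real NULLs vs literal zeros in the numeric column of interest
df.select(
    F.count(F.when(F.col("C").isNull(), 1)).alias("null_count"),
    F.count(F.when(F.col("C") == 0.0, 1)).alias("zero_count"),
).show()
```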
Thanks,
Kyle
Hi,
I built an Iceberg table that uses Glue as the Hive catalog. Team members I work with want to connect to it using Spark. They either run Spark locally on their laptops and want to read the table, or they have Spark running locally in an Airflow task on an EC2 instance and want to connect to it.
Is it possible to configure Spark that is not running on Glue or EMR to connect to Glue as the Hive metastore? If so, some examples would be appreciated.
We set this conf when running Iceberg: "spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory".
Is this a JAR I can add to any Spark application that allows it to connect to AWS Glue as the Hive metastore, or does it only work on EMR?
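In case it helps frame the question: for Spark running outside EMR/Glue, the route I have seen most often is Iceberg's own GlueCatalog rather than the Hive client factory. A minimal sketch follows; the package versions and warehouse path are assumptions, and AWS credentials/region must be available in the environment.
```
from pyspark.sql import SparkSession

# Package versions below are examples; pick ones matching your Spark/Scala build.
spark = (
    SparkSession.builder.appName("iceberg-glue-catalog")
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3,"
        "software.amazon.awssdk:bundle:2.20.160",
    )
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-bucket/warehouse/")  # placeholder
    .getOrCreate()
)

# Glue databases/tables appear under the "glue" catalog name configured above.
spark.sql("SELECT * FROM glue.my_database.my_iceberg_table LIMIT 10").show()
```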
In my case, when I query information_schema.columns in Athena, the result does not include the [Comment] column.
Is there any update on this, or is it just a temporary error?
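As a possible cross-check (assuming the comments are stored in the Glue Data Catalog), the column comments can be read directly via the Glue API. Database and table names below are placeholders.
```
import boto3

glue = boto3.client("glue")

# Fetch the table definition straight from the Glue Data Catalog
table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]

# Print each column with the comment stored in the catalog (may be absent)
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col.get("Comment"))
```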
I am following what is mentioned in the link below to launch the Spark history server locally and run the Spark UI, but I get an error when starting the container.
> https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html
Did anyone face the same issue? Please help.
```
2023-02-22 17:54:07 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
2023-02-22 17:54:07 23/02/22 22:54:07 INFO HistoryServer: Started daemon with process name: 1@514d84090bb7
2023-02-22 17:54:07 23/02/22 22:54:07 INFO SignalUtils: Registering signal handler for TERM
2023-02-22 17:54:07 23/02/22 22:54:07 INFO SignalUtils: Registering signal handler for HUP
2023-02-22 17:54:07 23/02/22 22:54:07 INFO SignalUtils: Registering signal handler for INT
2023-02-22 17:54:07 23/02/22 22:54:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-02-22 17:54:08 23/02/22 22:54:08 INFO SecurityManager: Changing view acls to: root
2023-02-22 17:54:08 23/02/22 22:54:08 INFO SecurityManager: Changing modify acls to: root
2023-02-22 17:54:08 23/02/22 22:54:08 INFO SecurityManager: Changing view acls groups to:
2023-02-22 17:54:08 23/02/22 22:54:08 INFO SecurityManager: Changing modify acls groups to:
2023-02-22 17:54:08 23/02/22 22:54:08 INFO SecurityManager:** SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2023-02-22 17:54:08 23/02/22 22:54:08 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin **permissions:
2023-02-22 17:54:08 Exception in thread "main" java.lang.reflect.InvocationTargetException
2023-02-22 17:54:08 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2023-02-22 17:54:08 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2023-02-22 17:54:08 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2023-02-22 17:54:08 at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2023-02-22 17:54:08 at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:300)
2023-02-22 17:54:08 at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
2023-02-22 17:54:08 Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
2023-02-22 17:54:08 at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
2023-02-22 17:54:08 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
2023-02-22 17:54:08 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
2023-02-22 17:54:08 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
2023-02-22 17:54:08 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
2023-02-22 17:54:08 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
2023-02-22 17:54:08 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
2023-02-22 17:54:08 at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:116)
2023-02-22 17:54:08 at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:88)
2023-02-22 17:54:08 ... 6 more
```
I want to create an EventBridge event to trigger a Glue job, but when I create a Glue trigger there is no option for EventBridge (on the legacy page the option exists, but it is blocked). I have CloudTrail enabled. Where is the problem? Is this option still available?
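For context, my understanding (an assumption, not a confirmed answer) is that EVENT-type triggers are created as part of a Glue workflow, typically via the API/CLI rather than the trigger console page, with an EventBridge rule targeting that workflow. A rough boto3 sketch with placeholder names:
```
import boto3

glue = boto3.client("glue")

# An EVENT trigger must belong to a workflow (names below are placeholders)
glue.create_trigger(
    Name="start-my-job-on-event",
    WorkflowName="my-workflow",
    Type="EVENT",
    Actions=[{"JobName": "my-glue-job"}],
    # Fire after 1 matching event, or after a 900-second batching window
    EventBatchingCondition={"BatchSize": 1, "BatchWindow": 900},
)
```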
I am trying to delete entries from my Lake Formation governed table. I ran the commands via the SDK and it all looked successful, but the linked Athena table still sees the data that was supposedly deleted. Deleting the S3 objects afterwards (since DeleteObject against the governed table doesn't touch S3) now throws errors in Athena because the expected files are missing.
Is there something wrong with my process of deleting from Lake Formation governed tables?
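For comparison, this is roughly the sequence I would expect for a governed-table delete, wrapped in a transaction that is committed at the end (my assumption being that a missing CommitTransaction could explain Athena still seeing the old objects). Names and URIs are placeholders.
```
import boto3

lf = boto3.client("lakeformation")

# All governed-table writes happen inside a transaction
tx = lf.start_transaction(TransactionType="READ_AND_WRITE")["TransactionId"]

# Remove the object reference from the governed table's manifest
lf.update_table_objects(
    DatabaseName="my_database",
    TableName="my_governed_table",
    TransactionId=tx,
    WriteOperations=[
        {"DeleteObject": {"Uri": "s3://my-bucket/data/part-00000.parquet"}}
    ],
)

# Athena only sees the change once the transaction is committed
lf.commit_transaction(TransactionId=tx)
```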
Hi,
I don't see Glue DataBrew in Terraform's AWS provider.
I do see that it's supported in CloudFormation (https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-databrew-job.html).
Can anyone help me locate it please?
Hello All,
I am trying to implement the solution mentioned at the link below:
https://medium.com/analytics-vidhya/multithreading-parallel-job-in-aws-glue-a291219123fb
In this solution they show AWS logs containing the scheduler settings. I am not sure where I can find these complete logs.
I am running the Glue job from the console, and there I see 3 types of logs:
- All logs
- Output logs
- Error logs
When I open "All logs", I don't see anything.
"Output logs" show what I am printing in my script, plus something related to the PySpark application.
"Error logs" I am not sure about.
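In case it's useful, those console links map to CloudWatch log groups (for the default setup, /aws-glue/jobs/output and /aws-glue/jobs/error, with the job run ID as the log stream name, as far as I know). A small boto3 sketch for pulling a run's log events directly, with a placeholder run ID:
```
import boto3

logs = boto3.client("logs")

job_run_id = "jr_0123456789abcdef"  # placeholder: the JobRunId from the console or get_job_runs

# Driver/executor stdout for the run ("Output logs" in the console)
events = logs.get_log_events(
    logGroupName="/aws-glue/jobs/output",
    logStreamName=job_run_id,
    startFromHead=True,
)
for e in events["events"]:
    print(e["message"], end="")
```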
I'm running a Glue 4.0 job with some local algorithmic processing. I tested this on my local instance and it works fine.
`from sklearn.model_selection import StratifiedGroupKFold, RandomizedSearchCV`
But when I run it on Glue, it gives me an exception:
```
ImportError: cannot import name 'StratifiedGroupKFold' from 'sklearn.model_selection' (/home/spark/.local/lib/python3.10/site-packages/sklearn/model_selection/__init__.py)
```
Glue 4.0 does ship `scikit-learn==1.1.3`, which is compatible with the version on my local instance according to https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html, so I am not sure why this happens.
**Update I**
A little bit weird: I printed the sklearn version in the Glue job and it shows `scikit-learn==0.24.2`, which doesn't match the official doc. Is there a version mismatch?
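A minimal sketch of such a version check inside the job script, also printing where the module is loaded from (to see which site-packages directory actually wins):
```
import sklearn

# Log the version and the install location that the Glue runtime actually resolves
print("scikit-learn version:", sklearn.__version__)
print("loaded from:", sklearn.__file__)
```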
**Update II**
I tried appending the configs below to force-upgrade the scikit-learn version, but that's not a perfect solution since the library versions still don't match.
```
--additional-python-modules: scikit-learn
--python-modules-installer-option: --upgrade
```
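(If the goal is to match the documented runtime exactly, my understanding is that the module can also be pinned, e.g. `--additional-python-modules: scikit-learn==1.1.3`, though I have not verified that this resolves the mismatch.)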