Questions tagged with AWS Glue


Browse through the questions and answers listed below or filter and sort to narrow down your results.

I want to join two tables. I have the tables in CSV format stored in an S3 bucket.

1. Is AWS Glue Studio the right option?
2. What is the correct procedure?
3. What IAM permissions are required?
4. Where can I see the joined table output?

Please shed some light on this.
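A minimal sketch of one way to do the join in a Glue job script (Glue Studio's visual Join transform generates code along these lines); the bucket paths and the join column are hypothetical. For IAM, the job role typically needs the AWSGlueServiceRole managed policy plus s3:GetObject/s3:PutObject/s3:ListBucket on the buckets involved, and the joined output lands at whatever S3 path you give the target node.

```
from awsglue.context import GlueContext
from awsglue.transforms import Join
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read both CSV tables from S3 (hypothetical paths; header row assumed)
orders = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/orders/"]},
    format="csv",
    format_options={"withHeader": True},
)
customers = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/customers/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Join on a shared key column (hypothetical column name)
joined = Join.apply(orders, customers, "customer_id", "customer_id")

# Write the joined table back to S3; this output path is where you look for the result
glue_context.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/joined-output/"},
    format="parquet",
)
```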
2 answers · 0 votes · 41 views · asked 24 days ago
I have a working lab setup in which a Glue job extracts all data from a single DynamoDB table to S3 in JSON format. This was done with the super simple setup using the AWS Glue DynamoDB connector, all through the Glue visual editor. I plan to run the job daily to refresh the data. The job is set up with Glue 3.0 & Python 3. Two questions:

1. I assume I need to purge/delete the S3 objects from the previous ETL job each night. How is this done within Glue, or do I need to handle it outside of Glue?
2. I would like to update the job to limit the data sent to S3 to only the DynamoDB records that have a specific key/value (status <> 'completed'), so that I am not loading all of the DynamoDB data into my target. I don't care if the job has to get ALL of the DynamoDB table during the extract and then filter it out during the transform phase; if there is a way to selectively get data during the extract phase, even better.

If anyone could advise with a simple example (like the sketch below), I would appreciate it. While I have looked for a little while, I haven't found much quality educational material, so I am happy to take any suggestions there as well (other than the AWS documentation, which I have, but I need some initial direction/reference/101 hands-on).
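A minimal sketch of both pieces in a Glue 3.0 Python script, assuming hypothetical bucket and table names: `purge_s3_path` clears the previous night's output from within the job, and a `Filter` transform drops the completed records after the extract (as far as I know, the simple DynamoDB connector reads the full table, so the predicate is applied in the transform phase rather than pushed down).

```
from awsglue.context import GlueContext
from awsglue.transforms import Filter
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# 1. Purge last night's output before writing the new extract
#    (retentionPeriod=0 removes everything under the prefix immediately)
glue_context.purge_s3_path(
    "s3://my-bucket/dynamo-export/",
    options={"retentionPeriod": 0},
)

# Extract the full DynamoDB table (hypothetical table name)
table = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={"dynamodb.input.tableName": "my_table"},
)

# 2. Keep only records whose status is not 'completed'
not_completed = Filter.apply(frame=table, f=lambda row: row["status"] != "completed")

# Load the filtered records to S3 as JSON
glue_context.write_dynamic_frame.from_options(
    frame=not_completed,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/dynamo-export/"},
    format="json",
)
```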
0 answers · 0 votes · 26 views · Damian · asked a month ago
Hey, I am running into this error during a data ingest job. The job has worked in the past with many different files, but this one refuses to ingest. I am curious whether anyone has overcome it. The error seems like some sort of threading issue, and the job can't write data. CloudWatch logs:
```
2023-02-23 19:13:01,888 ERROR [shutdown-hook-0] util.Utils (Logging.scala:logError(94)): Uncaught exception in thread shutdown-hook-0
java.lang.ExceptionInInitializerError
    at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createAndTrack(TemporaryDirectoriesGenerator.java:125)
    at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createTemporaryDirectories(TemporaryDirectoriesGenerator.java:149)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:356)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1125)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1105)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:994)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
    at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2424)
    at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2390)
    at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2353)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.copyFromLocalFile(EmrFileSystem.java:568)
    at com.amazonaws.services.glue.LogPusher.upload(LogPusher.scala:27)
    at org.apache.spark.util.ShutdownHookManagerWrapper$.$anonfun$addLogPusherHook$2(ShutdownHookManagerWrapper.scala:9)
    at org.apache.spark.util.ShutdownHookManagerWrapper$.$anonfun$addLogPusherHook$2$adapted(ShutdownHookManagerWrapper.scala:9)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.util.ShutdownHookManagerWrapper$.$anonfun$addLogPusherHook$1(ShutdownHookManagerWrapper.scala:9)
    at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
    at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
    at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
    at scala.util.Try$.apply(Try.scala:209)
    at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalStateException: Shutdown in progress
    at java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:66)
    at java.lang.Runtime.addShutdownHook(Runtime.java:203)
    at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoryShutdownHook.<clinit>(TemporaryDirectoryShutdownHook.java:18)
    ... 31 more
```
0 answers · 0 votes · 12 views · jbates · asked a month ago
Hi, I am using GlueETL version Spark 3.0 with Python version ![Glue Job Details](/media/postImages/original/IMGyPWz2XIS_-4GohAdHXVpw) The ETL job has only 2 steps. I am using CodeGenConfiguration to auto-create the Spark script from my service backend. ``` "{\"sink-node-1\":{\"nodeId\":\"sink-node-1\",\"dataPreview\":false,\"previewAmount\":0,\"inputs\":[\"source-node-1\"],\"name\":\"organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e__id=62789a5b-78dd-4d41-ae96-1447674861a6__type=GDL\",\"generatedNodeName\":\"organization_id47e04d2824d347e79911bbfc071c754e__id62789a5b78dd4d41ae961447674861a6__typeGDL_sinknode1\",\"classification\":\"DataSink\",\"type\":\"S3\",\"streamingBatchInterval\":100,\"format\":\"parquet\",\"compression\":\"snappy\",\"path\":\"s3://x-bucket/event_etl_data/source_id=glueetl/schema_id=etl_raw_event/pipeline_id=fcf172f2-1cd1-4f9d-bdce-62b3b0c26696/organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e/model_name=test-sql-with-database-namez__version=None/\",\"partitionKeys\":[[\"year\"],[\"month\"],[\"day\"],[\"hour\"]],\"schemaChangePolicy\":{\"enableUpdateCatalog\":false,\"updateBehavior\":null,\"database\":null,\"table\":null},\"updateCatalogOptions\":\"none\",\"calculatedType\":\"\"},\"source-node-1\":{\"nodeId\":\"source-node-1\",\"dataPreview\":false,\"previewAmount\":0,\"inputs\":[],\"name\":\"organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e__id=9cf97c7b-ced8-4096-a7c3-2ca3560e0fd0__type=SNOWFLAKE\",\"generatedNodeName\":\"organization_id47e04d2824d347e79911bbfc071c754e__id9cf97c7bced84096a7c32ca3560e0fd0__typeSNOWFLAKE_sourcenode1\",\"classification\":\"DataSource\",\"type\":\"Connector\",\"isCatalog\":false,\"connectorName\":\"SNOWFLAKE\",\"connectionName\":\"organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e__id=9cf97c7b-ced8-4096-a7c3-2ca3560e0fd0__type=SNOWFLAKE\",\"connectionType\":\"custom.jdbc\",\"outputSchemas\":[],\"connectionTable\":null,\"query\":\"SELECT \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"A\\\" AS \\\"inputs__A\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"B\\\" AS \\\"inputs__B\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"C\\\" AS \\\"outputs__C\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"ID\\\" AS \\\"feedback_id\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"D\\\" AS \\\"timestamp\\\", year(SYSDATE()) AS \\\"year\\\", month(SYSDATE()) AS \\\"month\\\", day(SYSDATE()) AS \\\"day\\\", hour(SYSDATE()) AS \\\"hour\\\", SYSDATE() AS \\\"log_timestamp\\\" FROM \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\" ORDER BY \\\"D\\\"\",\"additionalOptions\":{\"filterPredicate\":\"\",\"partitionColumn\":null,\"lowerBound\":null,\"upperBound\":null,\"numPartitions\":null,\"jobBookmarkKeys\":[],\"jobBookmarkKeysSortOrder\":\"ASC\",\"dataTypeMapping\":{},\"filterPredicateArg\":[],\"dataTypeMappingArg\":[]},\"calculatedType\":\"\"}}" ``` ![Enter image description here](/media/postImages/original/IMcyL_gd9eR4GxbOeCOYE-lA) As you can see, I am using the Snowflake JDBC connector, and simply using S3DirectTarget to write the parquet files to S3 destination. However, any NULL values of numeric columns in the source table ends up with 0.0, and there is no way for me to tell whether these are actual 0.0s or falsely converted 0.0s. Without modifying the PySpark script since my backend service is dependent on CodeGenConfiguration, is there a way to make sure the NULL values do not get falsely converted? Thanks, Kyle
1 answer · 0 votes · 10 views · asked a month ago
Hi, I built an Iceberg table that uses Glue as the Hive catalog. Team members I work with want to connect to it using Spark. They either run Spark locally on their laptops and want to read the table, or they have Spark running locally in an Airflow task on an EC2 instance and want to connect to it. Is it possible to configure Spark that is not running on Glue or EMR to connect to Glue as the Hive metastore? If so, some examples would be appreciated. We set this conf when running Iceberg: "spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory". Is this a JAR I can add to any Spark application to let it connect to AWS Glue as the Hive metastore, or does it only work on EMR?
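One route that does not depend on the EMR client-factory JAR is Iceberg's own Glue catalog implementation from the iceberg-aws module, which works from any Spark distribution as long as the jars and AWS credentials are available. A sketch for a local PySpark session follows; the package versions, catalog name, and warehouse path are assumptions to adapt.

```
from pyspark.sql import SparkSession

# Local Spark session reading an Iceberg table registered in the Glue Data Catalog.
# Versions below are examples; match them to your Spark/Iceberg versions.
spark = (
    SparkSession.builder
    .appName("local-iceberg-glue")
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0,"
        "software.amazon.awssdk:bundle:2.17.257,"
        "software.amazon.awssdk:url-connection-client:2.17.257",
    )
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-bucket/warehouse/")  # hypothetical
    .getOrCreate()
)

# Database/table names are hypothetical; credentials come from the default
# AWS provider chain (environment variables, profile, or instance role).
spark.sql("SELECT * FROM glue.my_database.my_table LIMIT 10").show()
```

To my knowledge, the AWSGlueDataCatalogHiveClientFactory class you reference ships preinstalled on EMR (there is an open-source Glue Hive metastore client you can build yourself), so for laptops and EC2 the Iceberg GlueCatalog route above tends to be simpler.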
1 answer · 0 votes · 25 views · Thomas · asked a month ago
In my case, when I query information_schema.columns in Athena, the result does not include the [Comment] column. Is there any update on this, or is it just a temporary error?
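Not an answer to whether anything changed on the Athena side, but if you only need the column comments, one workaround sketch is to read them from the Glue Data Catalog directly with boto3 (database and table names below are hypothetical):

```
import boto3

glue = boto3.client("glue")

# Read column metadata straight from the Glue Data Catalog instead of information_schema
table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]
for col in table["StorageDescriptor"]["Columns"]:
    # "Comment" is only present when a comment was set on the column
    print(col["Name"], col["Type"], col.get("Comment", ""))
```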
1 answer · 0 votes · 29 views · asked a month ago
I am following the steps in the documentation below to launch the Spark history server locally and run the Spark UI, but I am getting an error when starting the container: https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html Did anyone face the same issue? Please help.
```
2023-02-22 17:54:07 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
2023-02-22 17:54:07 23/02/22 22:54:07 INFO HistoryServer: Started daemon with process name: 1@514d84090bb7
2023-02-22 17:54:07 23/02/22 22:54:07 INFO SignalUtils: Registering signal handler for TERM
2023-02-22 17:54:07 23/02/22 22:54:07 INFO SignalUtils: Registering signal handler for HUP
2023-02-22 17:54:07 23/02/22 22:54:07 INFO SignalUtils: Registering signal handler for INT
2023-02-22 17:54:07 23/02/22 22:54:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-02-22 17:54:08 23/02/22 22:54:08 INFO SecurityManager: Changing view acls to: root
2023-02-22 17:54:08 23/02/22 22:54:08 INFO SecurityManager: Changing modify acls to: root
2023-02-22 17:54:08 23/02/22 22:54:08 INFO SecurityManager: Changing view acls groups to:
2023-02-22 17:54:08 23/02/22 22:54:08 INFO SecurityManager: Changing modify acls groups to:
2023-02-22 17:54:08 23/02/22 22:54:08 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2023-02-22 17:54:08 23/02/22 22:54:08 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions:
2023-02-22 17:54:08 Exception in thread "main" java.lang.reflect.InvocationTargetException
2023-02-22 17:54:08     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2023-02-22 17:54:08     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2023-02-22 17:54:08     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2023-02-22 17:54:08     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2023-02-22 17:54:08     at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:300)
2023-02-22 17:54:08     at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
2023-02-22 17:54:08 Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
2023-02-22 17:54:08     at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
2023-02-22 17:54:08     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
2023-02-22 17:54:08     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
2023-02-22 17:54:08     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
2023-02-22 17:54:08     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
2023-02-22 17:54:08     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
2023-02-22 17:54:08     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
2023-02-22 17:54:08     at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:116)
2023-02-22 17:54:08     at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:88)
2023-02-22 17:54:08     ... 6 more
```
Accepted Answer · AWS Glue · 1 answer · 0 votes · 55 views · asked a month ago
I want to create an EventBridge event to trigger a Glue job, but when I create a Glue trigger there is no option for EventBridge (on the legacy page the option is there, but it is blocked). I have CloudTrail enabled. Where is the problem? Is this option still available?
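My understanding (worth verifying against the current docs) is that the EventBridge integration works at the Glue workflow level rather than on a standalone job: you wrap the job in a workflow whose starting trigger has type EVENT, then point an EventBridge rule at the workflow. A hedged boto3 sketch, with all names, the event pattern, and the role ARN hypothetical:

```
import boto3

glue = boto3.client("glue")
events = boto3.client("events")

# Workflow whose starting trigger is event-driven; the job it runs is hypothetical.
glue.create_workflow(Name="my-workflow")
glue.create_trigger(
    Name="my-event-trigger",
    WorkflowName="my-workflow",
    Type="EVENT",
    Actions=[{"JobName": "my-glue-job"}],
)

# EventBridge rule that starts the workflow. The CloudTrail-based S3 pattern below is
# just an example, and the target role needs permission to notify Glue of the event.
events.put_rule(
    Name="start-my-workflow",
    EventPattern=(
        '{"source": ["aws.s3"], "detail-type": ["AWS API Call via CloudTrail"],'
        ' "detail": {"eventName": ["PutObject"],'
        ' "requestParameters": {"bucketName": ["my-bucket"]}}}'
    ),
)
events.put_targets(
    Rule="start-my-workflow",
    Targets=[{
        "Id": "glue-workflow",
        "Arn": "arn:aws:glue:us-east-1:123456789012:workflow/my-workflow",
        "RoleArn": "arn:aws:iam::123456789012:role/my-eventbridge-to-glue-role",
    }],
)
```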
1 answer · 0 votes · 44 views · jarxrr · asked a month ago
I am trying to delete entries from my Lake Formation governed table. I ran the commands via the SDK, and it all looked successful, but the linked Athena table still shows the data that was supposedly deleted. Deleting the S3 objects afterwards (since DeleteObject against the governed table doesn't touch S3) now causes errors in Athena because the expected files are missing. Is there something wrong with my process of deleting from Lake Formation governed tables?
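For comparison, here is a sketch of the delete flow as I understand the governed-table transaction APIs: the DeleteObject write operations only become visible to Athena once the transaction is committed, and until then the table's manifest keeps pointing at the S3 objects. All names, the URI, and the ETag below are hypothetical, and the exact required fields should be checked against the boto3 lakeformation documentation.

```
import boto3

lf = boto3.client("lakeformation")

# Deletes against a governed table happen inside a transaction.
txn = lf.start_transaction(TransactionType="READ_AND_WRITE")["TransactionId"]

# Remove the object from the governed table's manifest (hypothetical database/table/URI).
lf.update_table_objects(
    DatabaseName="my_database",
    TableName="my_governed_table",
    TransactionId=txn,
    WriteOperations=[{
        "DeleteObject": {
            "Uri": "s3://my-bucket/data/part-00000.parquet",
            "ETag": "0123456789abcdef0123456789abcdef",  # ETag of the S3 object
        }
    }],
)

# Until this commit succeeds, Athena keeps reading the old manifest -- and the manifest,
# not S3 itself, is what tells Athena which files it expects to find.
lf.commit_transaction(TransactionId=txn)
```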
1 answer · 0 votes · 27 views · rf · asked a month ago
Hi, I don't see Glue DataBrew in Terraform's AWS provider. I do see that it's supported in CloudFormation (https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-databrew-job.html). Can anyone help me locate it please?
1 answer · 0 votes · 76 views · asked a month ago
Hello all, I am trying to implement the solution described at the link below: https://medium.com/analytics-vidhya/multithreading-parallel-job-in-aws-glue-a291219123fb In that solution they show AWS logs with the scheduler settings, and I can't figure out where to find these complete logs. I am running the Glue job from the console, where I see three types of logs:
- All logs: when I open this, I don't get anything.
- Output logs: shows whatever I print in my script, plus something related to the PySpark application.
- Error logs: I am not sure what this contains.
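For what it's worth, here is a tiny sketch of how I understand those three streams map to a Glue job; the log group names and the continuous-logging flag are my assumptions about the Glue defaults.

```
import logging
import sys

# Driver stdout (print) surfaces under "Output logs" (/aws-glue/jobs/output).
# Driver/executor stderr -- Spark's log4j output, including the scheduler messages,
# plus Python logging configured to stderr -- surfaces under "Error logs"
# (/aws-glue/jobs/error). "All logs" points at the continuous-logging group
# (/aws-glue/jobs/logs-v2) and stays empty unless continuous logging is enabled
# via the --enable-continuous-cloudwatch-log job parameter.

logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logger = logging.getLogger(__name__)

print("this line ends up in the Output logs stream")
logger.info("this line ends up in the Error logs stream, next to the Spark scheduler output")
```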
1 answer · 0 votes · 47 views · asked a month ago
I'm running a Glue 4.0 job with some local algorithmic processing. I tested this on my local instance and it works fine. `from sklearn.model_selection import StratifiedGroupKFold, RandomizedSearchCV` But when I run it on Glue, it gives me an exception:
```
ImportError: cannot import name 'StratifiedGroupKFold' from 'sklearn.model_selection' (/home/spark/.local/lib/python3.10/site-packages/sklearn/model_selection/__init__.py)
```
Glue 4.0 is documented to include `scikit-learn==1.1.3`, which is compatible with the version on my local instance according to https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html. I'm not sure why this happens.
**Update I** This is a little bit weird. I tried printing the sklearn version in the Glue job, and it shows `scikit-learn==0.24.2`, which doesn't match the official doc. Was there a mismatch?
**Update II** I tried appending the configs below to force-upgrade the scikit-learn version, but this is not a perfect solution since the library versions still don't match.
```
--additional-python-modules: scikit-learn
--python-modules-installer-option: --upgrade
```
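Until the underlying version question is sorted out, one interim approach is to pin the exact version rather than relying on `--upgrade`, so the job environment matches what the code was tested against. A sketch using boto3; the job name is hypothetical, and 1.1.3 is the version the Glue 4.0 docs list.

```
import boto3

glue = boto3.client("glue")

# Pin the exact scikit-learn version for this run instead of upgrading to "latest",
# so StratifiedGroupKFold (added in scikit-learn 1.0) is available.
glue.start_job_run(
    JobName="my-glue-4-job",
    Arguments={
        "--additional-python-modules": "scikit-learn==1.1.3",
    },
)
```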
1 answer · 0 votes · 29 views · asked a month ago