Questions tagged with AWS Glue

I am trying to create a rule in EventBridge to trigger a Glue workflow when a file of a specific format is uploaded under the desired prefix of an S3 bucket.

```
{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["s3.amazonaws.com"],
    "eventName": ["PutObject"],
    "requestParameters": {
      "bucketName": ["my-bucket"],
      "key": [{ "prefix": "folder1/folder2" }],
      "FileName": [{ "suffix": ".xlsx" }]
    }
  }
}
```

When I upload a file such as s3://my-bucket/folder1/folder2/folder3/test.xlsx, the Glue workflow is not triggered. Can someone help me fix this event pattern so it triggers the workflow for a specific file type?
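A hedged suggestion rather than a confirmed fix: the CloudTrail record for PutObject carries the object path in requestParameters.key, and I am not aware of a FileName field there, so the suffix filter above may never match anything. A minimal sketch that filters on key instead, assuming wildcard matching is available for your event bus:

```
{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["s3.amazonaws.com"],
    "eventName": ["PutObject"],
    "requestParameters": {
      "bucketName": ["my-bucket"],
      "key": [{ "wildcard": "folder1/folder2/*.xlsx" }]
    }
  }
}
```

If wildcard matching is not available, a plain `"key": [{ "suffix": ".xlsx" }]` filter still restricts the rule to .xlsx objects, at the cost of losing the prefix restriction.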
2 answers · 0 votes · 29 views · asked 12 days ago
We have a use case to create a Glue scheduled trigger (timezone: UTC-3).
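Glue cron schedules are evaluated in UTC, so a trigger meant for a UTC-3 timezone has to be shifted by three hours when the expression is written. A minimal boto3 sketch, assuming a hypothetical job name and a daily 08:00 local (11:00 UTC) run:

```
import boto3

glue = boto3.client("glue")

# Job name and schedule are hypothetical placeholders.
# Glue evaluates cron expressions in UTC: 08:00 at UTC-3 is 11:00 UTC.
glue.create_trigger(
    Name="daily-0800-utc-minus-3",
    Type="SCHEDULED",
    Schedule="cron(0 11 * * ? *)",
    Actions=[{"JobName": "my-etl-job"}],
    StartOnCreation=True,
)
```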
1 answer · 0 votes · 13 views · asked 12 days ago
New to Glue and Athena. I have a great toy example by an AWS community builder working, but in my real use case I want to capture all the fields from the 'detail' section of an EventBridge event and have columns created for them. The detail is nested multiple levels deep, and I can't figure out the schema discovery process. I tried posting a text file to S3 and having a Glue crawler work on it, but no luck. Thanks in advance.
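One way to get a look at the nested schema before fighting the crawler (a sketch under assumptions, not a confirmed fix): let Spark infer the structure directly from the raw JSON in S3 and inspect the column tree it produces. The S3 path below is a hypothetical placeholder, and this assumes the events are written as one JSON object per line, which is also what a Glue crawler with the default JSON classifier expects:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Point this at the prefix holding the raw EventBridge events (hypothetical path).
df = spark.read.json("s3://my-bucket/eventbridge-raw/")

# Spark infers nested structs for the 'detail' payload; this prints the
# column tree the catalog table would need to reproduce.
df.printSchema()

# Promote the nested 'detail' fields to top-level columns for easier querying.
df.select("detail.*").show(truncate=False)
```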
1 answer · 0 votes · 10 views · asked 12 days ago
I am using the IAM role AWSGlueServiceRole created in AWS Glue and tried to create a crawler to run on an S3 source. The error I get is: The following crawler failed to create: "abc". Here is the most recent error message: Account XXX is denied access. I also tried another role that I created with the policies below, but I still get the same error.

- AmazonS3FullAccess
- AWSGlueServiceRole
- AdministratorAccess
- AWSGlueConsoleFullAccess
- AWSGlueSchemaRegistryFullAccess
- AWSGlueDataBrewServiceRole

![Enter image description here](/media/postImages/original/IMPaOXZmKjQvWt8ZZZxtHSmw)
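For reference, a minimal boto3 sketch of crawler creation with the usual permission suspects called out in comments; the role ARN, database, and path are hypothetical placeholders, and this is not a confirmed diagnosis of the error above:

```
import boto3

glue = boto3.client("glue")

# Creating a crawler passes the role to Glue, so the principal you are signed
# in as also needs iam:PassRole on that role. If the account uses Lake
# Formation, the principal may additionally need Lake Formation permissions
# on the target database, which can also surface as a "denied access" error.
glue.create_crawler(
    Name="abc",
    Role="arn:aws:iam::111122223333:role/AWSGlueServiceRole-crawler",
    DatabaseName="my_database",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/data/"}]},
)
```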
2 answers · 0 votes · 30 views · asked 13 days ago
I am trying to use AWS Glue Studio to build a simple ETL workflow. Basically, I have a bunch of `csv` files in different directories in S3. I want those CSVs to be accessible via a database and have chosen Redshift for the job. The directories will be updated every day with new CSV files. The file structure is:

```
YYYY-MM-DD (e.g. 2023-03-07)
|---- groupName1
|     |---- groupName1.csv
|---- groupName2
|     |---- groupName2.csv
...
|---- groupNameN
|     |---- groupNameN.csv
```

We will be keeping historical data, so every day I will have a new date-based directory. I've read that AWS Glue can automatically copy data on a schedule, but I can't see my Redshift databases or tables (screenshot below). I'm using my AWS admin account and I do have the `AWSGlueConsoleFullAccess` permission (screenshot below).

![Enter image description here](/media/postImages/original/IMLGj4xk83RSiWw_X-q368iA) ![Enter image description here](/media/postImages/original/IMdY_iM6ckSMOvvFgXb7FRsw)
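Once a catalog table over the CSV prefixes and a Glue connection to the Redshift cluster exist, the job side of this is fairly compact. A minimal sketch, assuming hypothetical catalog database/table names, a hypothetical Glue connection named redshift-connection with network access to the cluster, and a temp S3 prefix Glue can use for staging the load:

```
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Catalog database/table names are hypothetical; a crawler over the daily
# date-based prefixes would populate them.
source = glue_context.create_dynamic_frame.from_catalog(
    database="csv_landing",
    table_name="daily_groups",
)

# "redshift-connection" is a hypothetical Glue connection pointing at the
# Redshift cluster; the temp dir is used for staging the COPY.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=source,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "public.daily_groups", "database": "dev"},
    redshift_tmp_dir="s3://my-bucket/glue-temp/",
)

job.commit()
```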
1 answer · 0 votes · 11 views · asked 14 days ago
Hi All, I have some issues when running my Glue job. I landed my pipe-delimited CSV file in an S3 bucket and, after running a crawler pointing to the folder where the file is placed, a Glue catalog table is created. However, when I try to read the data (code below) from the catalog table in a Glue job for additional processing and conversion to Parquet, it is not picking up all the records.

```
dyf = glueContext.create_dynamic_frame.from_catalog(
    database=DATABASE,
    table_name=table_name,
    transformation_ctx="dyf-" + table_name,
)

rows = dyf.count()
print(f"DataFrame records count : {rows}")
```

Can someone please suggest what could be the reason for the missing records? I see that there are three columns in the catalog table with an incorrect data type (bigint in place of string). I went and manually corrected the data types and set infer_schema = True in the above code, but the job is still not picking up the correct number of records.
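One way to narrow the missing-records question down (a sketch under assumptions, not a confirmed fix): read the same files straight from S3 with explicit CSV options and compare the count with the catalog-based read. If the direct read returns the expected number of rows, the catalog table's classification settings (delimiter, header, quoting) are the likely culprit. The path and options below are hypothetical:

```
dyf_direct = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/landing/pipe-files/"]},
    format="csv",
    format_options={
        "separator": "|",     # pipe-delimited source
        "withHeader": True,   # first row holds the column names
        "quoteChar": '"',
    },
)
print(f"Direct S3 read count: {dyf_direct.count()}")
```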
0 answers · 0 votes · 33 views · asked 14 days ago by Pradeep
I am getting an error when running an AWS Glue job with a data quality check: ModuleNotFoundError: No module named 'awsgluedq'. Can anyone help? Thanks!
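As far as I know, the awsgluedq module ships only inside Glue Spark jobs on recent Glue versions (3.0/4.0); it is not available in Python shell jobs or in a plain local Python environment, which is one common cause of this import error. A minimal sketch of how the import is normally used, with a hypothetical database, table, and ruleset:

```
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Database/table names are hypothetical placeholders.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
)

# A tiny DQDL ruleset; "id" is a hypothetical column name.
ruleset = """Rules = [ ColumnCount > 0, IsComplete "id" ]"""

dq_results = EvaluateDataQuality.apply(
    frame=dyf,
    ruleset=ruleset,
    publishing_options={"dataQualityEvaluationContext": "dq_check"},
)
```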
0 answers · 0 votes · 12 views · asked 14 days ago by Ali
We have a Jupyter notebook Glue job and we're calling start_job_run from a Python Lambda with boto3. We would like to change the job's defaults for MaxConcurrentRuns (1) and MaxRetries (3). Since it is a notebook we need to use magics; we already tried magics and also tried passing Arguments from the Lambda that runs the job, but nothing seems to work. How should we set those settings to guarantee that the notebook job never retries and has a concurrency of 5, i.e. at most 5 runs of the job at the same time?
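MaxRetries and ExecutionProperty.MaxConcurrentRuns are properties of the job definition, not of an individual run, so as far as I can tell they cannot be set through notebook magics or through the Arguments passed to start_job_run; they have to be updated on the job itself. A hedged boto3 sketch of that update, assuming a hypothetical job name; the exact set of read-only fields that must be dropped from the get_job response before calling update_job can vary with how the job is configured:

```
import boto3

glue = boto3.client("glue")
job_name = "my-notebook-job"  # hypothetical placeholder

# update_job expects a full JobUpdate, so start from the current definition.
job = glue.get_job(JobName=job_name)["Job"]

# Drop fields that JobUpdate does not accept (adjust as needed: for example,
# keep MaxCapacity only if the job does not use WorkerType/NumberOfWorkers).
for field in ("Name", "CreatedOn", "LastModifiedOn", "AllocatedCapacity", "MaxCapacity"):
    job.pop(field, None)

job["ExecutionProperty"] = {"MaxConcurrentRuns": 5}  # allow up to 5 parallel runs
job["MaxRetries"] = 0                                # never retry automatically

glue.update_job(JobName=job_name, JobUpdate=job)
```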
1 answer · 0 votes · 15 views · asked 17 days ago by jarvy
Available Amazon SageMaker kernels include [the following Spark kernels](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-kernels.html):

- PySpark (SparkMagic) with Python 3.7
- Spark (SparkMagic) with Python 3.7
- Spark Analytics 1.0
- Spark Analytics 2.0

And at re:Invent 2022 there was [an announcement](https://aws.amazon.com/about-aws/whats-new/2022/09/sagemaker-studio-supports-glue-interactive-sessions/) that "SageMaker Studio now supports Glue Interactive Sessions": you use "the built-in Glue PySpark or Glue Spark kernel for your Studio notebook to initialize interactive, serverless Spark sessions." It seems like the benefits of using one of the Glue Spark kernels are that you can "quickly browse the Glue data catalog, run large queries, and interactively analyze and prepare data using Spark, right in your Studio notebook." But can't you already do all that with the existing SparkMagic kernels? In other words, how do you choose between using one of the existing SparkMagic kernels in SageMaker Studio notebooks and using the new Glue Interactive Sessions feature?
1 answer · 0 votes · 24 views · asked 18 days ago by AWS
Hey guys! I am trying to read a large amount of data (about 45 GB across 5,500,000 files) in S3 and rewrite it into a partitioned folder (another folder inside the same bucket), but I am facing this error: Exception in User Class: com.amazonaws.SdkClientException : Unable to execute HTTP request: readHandshakeRecord. When I try it with just one file in the same folder, it works. Do you have any idea what the problem could be? Code (running with 60 DPUs, Glue 4.0):

```
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    // Read every file under the raw prefix, grouping small files within partitions.
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "s3",
      format = "parquet",
      options = JsonOptions("""{"paths": ["s3://bucket/raw-folder"], "recurse": true, "groupFiles": "inPartition", "useS3ListImplementation": true}""")
    ).getDynamicFrame()

    // Rewrite as snappy-compressed Parquet into the partitioned folder.
    glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions("""{"path": "s3://bucket/partition-folder"}"""),
      format = "parquet",
      formatOptions = JsonOptions("""{"compression": "snappy", "blockSize": 268435456, "pageSize": 1048576, "useGlueParquetWriter": true}""")
    ).writeDynamicFrame(dynamicFrame.repartition(10))
  }
}
```

Best
1 answer · 0 votes · 51 views · asked 18 days ago by lp_evan
I ran a crawler to load all my S3 CSV files into the Glue Data Catalog. Now I want to create a Glue job to execute ETL (create and drop temporary tables, select and insert data into tables in the Data Catalog). But in the Glue job, which is a Python shell job, I have to split my SQL statements and execute them one by one. With the following code, I got an error.

```
client = boto3.client('athena')
client.start_query_execution(
    QueryString = """drop table if exists database1.temptable1
    ;CREATE EXTERNAL TABLE IF NOT EXISTS temptable1(id int )
    """,
    ResultConfiguration = config,
)
```

Is there any way to run multiple SQL statements in a Glue job?
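Athena's StartQueryExecution accepts a single statement per call, so as far as I know the statements do have to run one at a time; the usual workaround is to loop over them and poll each execution before starting the next. A minimal sketch, with a hypothetical output location and table location:

```
import time
import boto3

athena = boto3.client("athena")

# Hypothetical result location for Athena query output.
config = {"OutputLocation": "s3://my-bucket/athena-results/"}

statements = [
    "DROP TABLE IF EXISTS database1.temptable1",
    "CREATE EXTERNAL TABLE IF NOT EXISTS database1.temptable1 (id int) "
    "LOCATION 's3://my-bucket/temptable1/'",
]

for sql in statements:
    # One statement per StartQueryExecution call.
    qid = athena.start_query_execution(
        QueryString=sql, ResultConfiguration=config
    )["QueryExecutionId"]

    # Wait for the statement to finish before running the next one.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena statement did not succeed: {sql}")
```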
0 answers · 0 votes · 23 views · asked 19 days ago
Data layer is not my thing and I need some guidance. I created a Glue crawler to catalog compressed JSON files stored in an AWS S3 bucket. I recently learned that I can use Athena to query the Glue database directly. When I do select * from *table-name*, it starts to load but then errors with a long string of stuff: HIVE_METASTORE_ERROR: Error: : expected at the position 407 of 'struct<http:struct<status_code:int,url_details:struct<path:string,queryString:struct<prefix:string,versioning:string,logging:string,encoding-type:string,nodes:string,policy:string,acl:string,policyStatus:string,replication:string,notification:string,tagging:string,website:string,encryption:string,size:string,limit:string,hash:string,accelerate:string,publicAccessBlock:string,code:string,protocol:string,G%EF%BF%BD%EF%BF%BD%EF%BF%BD\,%EF%BF%BD%EF%BF%BD%EF%BF%BD`~%EF%BF%BD%00%EF%BF%BD%EF%BF%BD{%EF%BF%BD%D5%96%EF%BF%BDw%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3C:string,cors:string,object- etc etc etc. I can load one small table, but the others fail.
0 answers · 0 votes · 34 views · asked 19 days ago