Questions tagged with AWS Glue
AWS Glue header doesn't work properly
Hello. I am using AWS Glue, but it does not automatically recognize the header row. As a result, the header is included in the data and queries do not work properly. The header is in Korean. I tried a classifier, but it didn't work. Is there a solution? Also, I use Athena to get the results.
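One possible direction for the question above, sketched under assumptions: Athena honors the `skip.header.line.count` table property for CSV-backed tables, so the header row can be skipped at query time even if it was not detected during cataloging. The helper below only builds the DDL string; the database and table names are hypothetical, and whether a later crawler run preserves the property should be checked in your own setup.

```python
def skip_header_statement(database: str, table: str, header_lines: int = 1) -> str:
    """Build an Athena DDL statement that tells the table to skip header rows.

    The names passed in are placeholders; nothing here talks to AWS.
    """
    if header_lines < 0:
        raise ValueError("header_lines must be zero or positive")
    return (
        f"ALTER TABLE {database}.{table} "
        f"SET TBLPROPERTIES ('skip.header.line.count'='{header_lines}')"
    )


if __name__ == "__main__":
    # Hypothetical names; paste the printed statement into the Athena query editor.
    print(skip_header_statement("sales_db", "orders"))
```

The statement is plain text, so it can be run from the Athena console or submitted through whichever client you already use.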
Trying to create a CSV file in S3 using Glue with MongoDB as the data source
I have installed a MongoDB server on a t2.micro instance. I was able to connect to it successfully with MongoDB Compass, without an SSH tunnel, using just authentication and the public IP. I then created a crawler and ran it against the source; the crawler successfully created a table, and I can see the names of all the fields. Now I am trying to create a Glue job, but I keep getting this error: An error occurred while calling o96.getDynamicFrame. scala.collection.immutable.HashMap$HashTrieMap cannot be cast to java.lang.String. I can successfully run another Glue job on sample JSON data sitting in S3. Job ID: Job Run - jr_19c00d6ff707bd8af110e007a207d9d92d0f64e41dacffa98250398b57dbf30b. I have been stuck on this error for two days now. Any help will be highly appreciated.
Glue incremental load
I am loading data from Amazon RDS (MySQL database) to Redshift using AWS Glue ETL and the Data Catalog, but I can't figure out how to do incremental loading (upsert). Is there a way to create a filter or parameter on a date column while reading from the source database, so that only yesterday's data is loaded, without using bookmarks?
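A minimal sketch of the date filter asked about above, assuming the source table has a timestamp or date column to filter on. The table and column names are hypothetical. The helper only builds a half-open "yesterday" window as a SQL string; how the query is handed to the JDBC read (for example through a query-style connection option such as `sampleQuery`) is an assumption to verify against the Glue documentation for your Glue version.

```python
from datetime import date, timedelta
from typing import Optional


def yesterday_query(table: str, date_column: str, today: Optional[date] = None) -> str:
    """Return a SELECT that covers exactly the day before `today`.

    A half-open range (>= start, < end) avoids missing rows with a time
    component close to midnight.
    """
    today = today or date.today()
    start = today - timedelta(days=1)
    return (
        f"SELECT * FROM {table} "
        f"WHERE {date_column} >= '{start.isoformat()}' "
        f"AND {date_column} < '{today.isoformat()}'"
    )


if __name__ == "__main__":
    # Hypothetical table and column names.
    print(yesterday_query("orders", "updated_at"))
```

The upsert half of the question is separate: the filtered rows would still need to be merged into Redshift, for instance through a staging table.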
Athena | Cross Account Access to Connector
I am trying to run a cross-account Athena query:
1. I can list the default catalog ARN "glue:arn:aws:glue:us-east-1:999999999999:catalog" fine (https://docs.aws.amazon.com/athena/latest/ug/security-iam-cross-account-glue-catalog-access.html).
2. I am unclear how to reference other connector catalogs.
3. The standard Presto "show catalogs" doesn't work.
How do I reference and query a catalog across accounts? Note that the second catalog is a connector (Hive) registered in Athena.
Glue, SQL Server and Table Location with spaces
Hi, I'm trying to use Glue to read a table from SQL Server (I have the schema). This table has a "Location" containing a space (I cannot change the "Location" in AWS). This generates an error: py4j.protocol.Py4JJavaError: An error occurred while calling o96.getDynamicFrame. : com.microsoft.sqlserver.jdbc.SQLServerException: Sintaxis incorrecta cerca de la palabra clave 'of'. (In English: "Incorrect syntax near the keyword 'of'.") After investigating, I found that the space is the cause. I tried to add  but I couldn't. I also cannot change the database. Any ideas? Thanks for the help.
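For context on the error above: in T-SQL, an identifier containing spaces has to be delimited, typically with square brackets, and a literal `]` inside the name is escaped by doubling it. The sketch below shows only that quoting rule; the table name is hypothetical, and whether Glue lets you supply a quoted name or a custom query for this table is an assumption to test in your job.

```python
def quote_tsql_identifier(name: str) -> str:
    """Wrap a SQL Server identifier in brackets, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def select_all(schema: str, table: str) -> str:
    """Build a SELECT whose schema and table names are safely delimited."""
    return f"SELECT * FROM {quote_tsql_identifier(schema)}.{quote_tsql_identifier(table)}"


if __name__ == "__main__":
    # Hypothetical table name containing spaces and the keyword "of".
    print(select_all("dbo", "Bill of Materials"))
```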
Glue Crawler cannot classify SNAPPY compressed JSON files
I have a KFH application that writes Snappy-compressed JSON files to an S3 bucket. I also have a Glue crawler that creates a schema from that bucket. However, when I enable Snappy compression, the crawler classifies the table as UNKNOWN; it cannot detect that the files are in JSON format. According to the doc below, the Glue crawler supports Snappy compression with JSON files, but I wasn't able to get it working. https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html#classifier-built-in
I also thought it might be related to the file extension and tried the names below, but that didn't work either.
Original:
```
|-----s3://my-bucket/my-table/day=01/file1.snappy
```
(1)
```
|-----s3://my-bucket/my-table/day=01/file1.snappy.json
```
(2)
```
|-----s3://my-bucket/my-table/day=01/file1.json.snappy
```
Thanks.
get_connection timeout in AWS Glue job
I am following the article below to perform updates in MySQL using pymysql: https://awstip.com/aws-etl-glue-job-insert-update-support-7a396db832b. However, the job appears to time out on this line: connection = glue_client.get_connection(Name="<My Connection>"). I do not see any exception in the logs. Test connection works fine. The same connection also worked when I used it in another, insert-only job created with the Visual editor.
Transfer Family vs Lambda Function for file transfer from SFTP server
Hello Community, I am trying to decide between two options for transferring files from an SFTP server to an S3 bucket to feed my Glue jobs, because AWS Glue doesn't support loading data from other cloud applications or file storage systems. The choices are AWS Transfer Family or an AWS Lambda function that connects to the remote server and moves the files into the S3 bucket/folder that becomes the source for my integrations. I would greatly appreciate any insights into this scenario, including the advantages and drawbacks of choosing one over the other. Have you faced any bottlenecks using either of these services for file transfer? Which is more cost-effective, assuming we have gigabytes of data (files)? Thank you. Best, Tharun
S3 job bookmarks implementation
Does AWS provide implementation details for the S3 bookmarking logic in Glue? I have a bucket with tens of thousands of partitions (year, month, day, device_id), and each file inside a partition holds a number of events. When I run a job, how does the bookmarking logic call the S3 APIs to determine which files need to be processed? I understand that it uses ListObjects or ListObjectsV2 and checks the modified time of each file, but my concern is how Glue optimizes this listing behaviour when there are millions of files. I would have thought that it might use the `objectCount` or `recordCount` properties of each partition to check whether there are new objects to process before calling ListObjects, but I ran some tests and confirmed that this does not occur: if I upload a file to S3 and re-run the job without running the crawler, it still picks up the new files (which have not yet been seen by the crawler or added as aggregate metadata to the partition properties).
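To make the behaviour described above concrete, here is a toy model of "list everything, then keep objects modified after the last bookmark". It is not Glue's actual implementation, which the question notes is undocumented; it only illustrates why such a scheme needs a full listing and no crawler metadata. The entries mimic the `Key` and `LastModified` fields of a ListObjectsV2 response, and the keys and timestamps are made up.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List


def new_objects_since(objects: Iterable[Dict], bookmark: datetime) -> List[str]:
    """Return keys whose LastModified is strictly later than the bookmark."""
    return [obj["Key"] for obj in objects if obj["LastModified"] > bookmark]


if __name__ == "__main__":
    # Made-up listing entries and bookmark time.
    listing = [
        {"Key": "year=2023/month=01/day=01/device_id=a/f1.json",
         "LastModified": datetime(2023, 1, 1, 10, 0, tzinfo=timezone.utc)},
        {"Key": "year=2023/month=01/day=02/device_id=a/f2.json",
         "LastModified": datetime(2023, 1, 2, 10, 0, tzinfo=timezone.utc)},
    ]
    last_run = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(new_objects_since(listing, last_run))
```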
Runaway Glue jobs leading to the exception "Task allocated capacity exceeded limit" (Glue ErrorCode 400 InvalidInputException)?
My Glue jobs used the default 48-hour timeout (which I was not aware of initially), and they ended up stuck in a delay loop testing for a specific file in a particular S3 bucket that never got created. Now, when I run a simple Hello World type of Glue job, it consistently fails with the following error:
```
JobName:test and JobRunId:jr_6eb6af04d2a560f71d935ab3fca35504d7fdb99b748c0e0266e71402ced4437f_attempt_3 failed to execute with exception Task allocated capacity exceeded limit. (Service: AWSGlueJobExecutor; Status Code: 400; Error Code: InvalidInputException; Request ID: 7e43f436-4ca4-403e-a50f-8a15672ea2ef; Proxy: null)
```
I suspect this error occurs because Glue job tasks are still running and the allocated capacity limit is therefore being exceeded, although I have not seen any CloudWatch logs being updated in the last 24 hours.
**Questions:**
**1)** Is this error caused by Glue jobs that may still be running in the background?
**2)** Is there a way to list and kill these still-running Glue jobs to free up the resources? I have already tried the AWS CLI with `aws glue batch-get-jobs --job-names ...`, but had no luck listing them.
I have now updated my Glue job timeout to 60 minutes in my Terraform code as a safeguard. Any help or guidance will be appreciated, thank you.
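On question 2, a hedged sketch: job definitions and job runs are separate resources, so run state comes from the Glue `GetJobRuns` API rather than from `batch-get-jobs`, and runs can be stopped with `BatchStopJobRun`. The helpers below take a client object as a parameter (in practice `boto3.client("glue")`); the batch size of 25 reflects my understanding of the per-call limit and should be treated as an assumption, as should the choice of which states count as active.

```python
from typing import List

# States treated as "still consuming capacity"; adjust if your case differs.
ACTIVE_STATES = {"STARTING", "RUNNING"}


def find_active_run_ids(glue_client, job_name: str) -> List[str]:
    """Page through get_job_runs and collect IDs of runs in an active state."""
    run_ids: List[str] = []
    kwargs = {"JobName": job_name}
    while True:
        page = glue_client.get_job_runs(**kwargs)
        for run in page.get("JobRuns", []):
            if run.get("JobRunState") in ACTIVE_STATES:
                run_ids.append(run["Id"])
        token = page.get("NextToken")
        if not token:
            return run_ids
        kwargs["NextToken"] = token


def stop_active_runs(glue_client, job_name: str, batch_size: int = 25) -> List[str]:
    """Request a stop for every active run of the job; returns the IDs targeted."""
    run_ids = find_active_run_ids(glue_client, job_name)
    for start in range(0, len(run_ids), batch_size):
        glue_client.batch_stop_job_run(
            JobName=job_name, JobRunIds=run_ids[start:start + batch_size]
        )
    return run_ids
```

With boto3 installed and credentials configured, usage would look like `stop_active_runs(boto3.client("glue"), "test")`, repeated for each job name that may have runaway runs.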