Glue Crawler cannot classify SNAPPY compressed JSON files
I have a Kinesis Firehose (KFH) application that puts Snappy-compressed JSON files into an S3 bucket, and a Glue crawler that creates a schema from that bucket. However, the crawler classifies the table as UNKNOWN when I activate Snappy compression; it cannot detect that the files are in JSON format. According to the doc below, the Glue crawler supports Snappy compression with JSON files, but I wasn't able to achieve it: https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html#classifier-built-in

I also thought it might be related to the file extension and tried the names below, but that didn't work either:

Original:
```
|-----s3://my-bucket/my-table/day=01/file1.snappy
```
(1)
```
|-----s3://my-bucket/my-table/day=01/file1.snappy.json
```
(2)
```
|-----s3://my-bucket/my-table/day=01/file1.json.snappy
```
Thanks.
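For reference, this is roughly how Snappy compression is enabled on the Firehose side (a sketch assuming boto3; the stream name, role ARN, and bucket ARN are placeholders):

```python
import boto3

firehose = boto3.client("firehose")

# Delivery stream that writes Snappy-compressed objects under my-table/ in S3.
firehose.create_delivery_stream(
    DeliveryStreamName="my-json-stream",          # placeholder name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/MyFirehoseRole",
        "BucketARN": "arn:aws:s3:::my-bucket",
        "Prefix": "my-table/",
        "CompressionFormat": "Snappy",            # HADOOP_SNAPPY is the other Snappy option
    },
)
```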
get_connection timeout in AWS Glue job
I am following the article below to do updates in MySQL using pymysql: https://awstip.com/aws-etl-glue-job-insert-update-support-7a396db832b. However, it looks like the job is timing out on the line below:
```
connection = glue_client.get_connection(Name="<My Connection>")
```
I do not see any exception in the logs. Test connection works fine. Also, the same connection worked when I used it in another job for insert-only, created from the visual editor.
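For context, here is roughly the pattern I'm following from that article (a sketch; the region, connection name, and database are placeholders, and the JDBC URL parsing is only illustrative):

```python
import boto3
import pymysql

glue_client = boto3.client("glue", region_name="us-east-1")  # placeholder region

# This is the call that appears to hang inside the Glue job.
response = glue_client.get_connection(Name="<My Connection>")
props = response["Connection"]["ConnectionProperties"]

# Illustrative parsing of a URL like jdbc:mysql://host:3306/mydb
host = props["JDBC_CONNECTION_URL"].split("//")[1].split(":")[0]

connection = pymysql.connect(
    host=host,
    user=props["USERNAME"],
    password=props["PASSWORD"],
    database="mydb",        # placeholder database name
    connect_timeout=10,
)
```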
Transfer Family vs Lambda Function for file transfer from SFTP server
Hello Community, I am trying to narrow down to one of two options for transferring files from an SFTP server to an S3 bucket, to help my Glue jobs, because AWS Glue doesn't support data loads from other cloud applications or file storage. So I am trying to choose either AWS Transfer Family or an AWS Lambda function that can connect to the remote server and move the files into the S3 bucket/folder, which then becomes the source for my integrations. I would greatly appreciate it if you could share some insights into this scenario, and the advantages and drawbacks of choosing one over the other. Have you faced any bottlenecks using either of these services for file transfer? Which is more cost-effective, say we have gigabytes of data (files)? Thank you. Best, Tharun
S3 job bookmarks implementation
Does AWS provide implementation details for the S3 bookmarking logic in Glue? I have a bucket with tens of thousands of partitions (year, month, day, device_id), and each file inside a partition holds a number of events.

When I run a job, how does the bookmarking logic call into the S3 APIs to determine which files need to be processed? I understand that it uses ListObjects or ListObjectsV2 and checks the modified time of each file, but my concern is: when there are millions of files, how does Glue optimize this listing behaviour?

I would have thought that perhaps it uses the `objectCount` or `recordCount` properties of each partition to check first whether there are new objects to be processed before calling ListObjects, but I just ran some tests and confirmed that this does not occur. I.e., if I upload a file to S3 and re-run the job without running the crawler, it still picks up the new files (which have not yet been picked up by the crawler, nor added as aggregate metadata to the partition properties).
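For context, here is a minimal sketch of how I'm wiring up bookmarks in the job (database and table names are placeholders), in case that affects how the listing is optimized:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Bookmark state is keyed off transformation_ctx; database/table are placeholders.
events = glue_context.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_events",
    transformation_ctx="source_events",
)

# ... transforms and writes go here ...

job.commit()  # persists the bookmark so previously processed files are skipped next run
```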
Runaway Glue jobs leading to "Task allocated capacity exceeded limit" exception (Glue ErrorCode 400 InvalidInputException)?
My Glue jobs assumed the default 48-hour timeout (which I was not aware of initially), and they ended up in a delayed polling loop, testing for a specific file in a particular S3 bucket that never got created. So now, when I run a simple, basic Hello World type of Glue job, it consistently fails with the following error:
```
JobName:test and JobRunId:jr_6eb6af04d2a560f71d935ab3fca35504d7fdb99b748c0e0266e71402ced4437f_attempt_3 failed to execute with exception Task allocated capacity exceeded limit. (Service: AWSGlueJobExecutor; Status Code: 400; Error Code: InvalidInputException; Request ID: 7e43f436-4ca4-403e-a50f-8a15672ea2ef; Proxy: null)
```
I'm thinking this error is down to Glue job tasks possibly still running and therefore the allocated capacity limit being exceeded, although I do not see any CloudWatch logs being updated now, after 24 hours.

**Questions:**

**1)** Is this error occurring because the Glue jobs are maybe still running in the background?

**2)** Is there a way to list and kill these still-running Glue jobs to free up the resources? I have already tried with the AWS CLI (`aws glue batch-get-jobs --job-names ...`), but no joy in listing them.

I have now updated my Glue job timeout to 60 minutes within my Terraform code as a safeguard. Any help or guidance will be appreciated, thank you.
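For question **2)**, this is roughly what I was expecting to be able to do (a sketch assuming boto3; the job name is taken from the error message above):

```python
import boto3

glue = boto3.client("glue")

# List recent runs for the job and pick out the ones still holding capacity.
runs = glue.get_job_runs(JobName="test")["JobRuns"]
active = [r["Id"] for r in runs
          if r["JobRunState"] in ("STARTING", "RUNNING", "STOPPING", "WAITING")]

# Ask Glue to stop those runs and free the allocated capacity.
if active:
    glue.batch_stop_job_run(JobName="test", JobRunIds=active)
```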
Glue: Using S3 ObjectCreated events with Crawler Catalog Target
I'm attempting to create a crawler using a pre-existing table defined in the Data Catalog, which describes a table stored in S3. I would like to use the "CRAWL_EVENT_MODE" recrawl policy, but this appears to be available only for S3 targets in the crawler, not for Data Catalog tables that have underlying S3 storage. Is there a way around this?

I need to have the table defined in the Data Catalog first, because there is no self-describing schema in the source objects, and the crawler produces an incorrect schema when I allow it to create the table from scratch. I would also like to use S3 events to optimize the crawler behaviour and achieve near-real-time latency. Thanks.
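For reference, this is roughly what I'm trying to express via the API (a sketch assuming boto3; the crawler name, role ARN, database, and table are placeholders), and the RecrawlPolicy line is the part that doesn't appear to be supported with catalog targets:

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="my-catalog-crawler",                                    # placeholder
    Role="arn:aws:iam::123456789012:role/MyCrawlerRole",          # placeholder
    DatabaseName="my_db",
    Targets={
        "CatalogTargets": [
            {"DatabaseName": "my_db", "Tables": ["my_existing_table"]}
        ]
    },
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVENT_MODE"},        # what I'd like to use
)
```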
Synchronous Glue Job in Step Function is slow to recognize completion of Glue Job
I am using a Step Function to execute a Glue job. The Step Function is set to run the job in synchronous mode; however, there is usually a 2-4 minute lag between Glue job completion and the point at which the Step Function considers the Glue job complete and moves to the next step. For example, the Glue job's last run took 15 minutes, but the Step Function spent 19 minutes on this step. Has anyone else experienced this? Is my only option to execute in async mode and poll more often for completion?
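For reference, the task state I'm using looks roughly like this (expressed here as a Python dict; the job name and the next state are placeholders):

```python
import json

# Sketch of the synchronous (".sync") Glue integration in Amazon States Language.
glue_task_state = {
    "RunGlueJob": {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": "my-glue-job"},   # placeholder job name
        "Next": "NextStep",                         # placeholder next state
    }
}

print(json.dumps(glue_task_state, indent=2))
```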
Glue: UPDATE_IN_DATABASE is not working
Hello All, We have set up a few AWS Glue jobs that read data from a database and write it to S3 files in Parquet format. Along the way, we also populate the Glue Data Catalog. We had created a Glue Data Catalog table with 5 columns (say A, B, C, D, and E). Now, after 2 months, the team decided to drop columns D and E and rename C to c. I tried to use updateBehavior="UPDATE_IN_DATABASE", but that didn't work. The Parquet files that we were writing to S3 had the updated data, i.e. A, B, and c; but the Glue Data Catalog table had the new columns while the old columns were still present, just without any data. The code is like below:
```
silver_target = self.glue_context.getSink(
    path=silverLakeLocation + table_name,
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["pyear"],
    enableUpdateCatalog=True,
    transformation_ctx="silver_target")
```
If the Glue catalog table is deleted first, then I see the updated table with the removed columns. What should I do to update the table so that the old columns are removed too? Please advise.
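For completeness, the fuller write path around that sink looks roughly like this (a sketch; the catalog database name and the dynamic frame variable are placeholders, the rest comes from the job shown above):

```python
# Continuing from the snippet above; "silver_db" and dynamic_frame are placeholders.
silver_target = self.glue_context.getSink(
    path=silverLakeLocation + table_name,
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["pyear"],
    enableUpdateCatalog=True,
    transformation_ctx="silver_target")
silver_target.setCatalogInfo(catalogDatabase="silver_db", catalogTableName=table_name)
silver_target.setFormat("glueparquet")
silver_target.writeFrame(dynamic_frame)
```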
Custom classifier for AWS Glue crawler
I have a set of files in my S3 bucket that use the ASCII 31 (unit separator) character as the delimiter. I am using a crawler to read these files and create the tables in the AWS Glue catalog. I tried specifying it as a custom delimiter in a classifier, but with no luck, since this is a non-printable character. What is the best way to incorporate this delimiter within a crawler?
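For reference, this is roughly what I attempted (a sketch assuming boto3; the classifier name is a placeholder, and whether the API accepts a non-printable delimiter is exactly what I'm unsure about):

```python
import boto3

glue = boto3.client("glue")

glue.create_classifier(
    CsvClassifier={
        "Name": "unit-separator-files",   # placeholder name
        "Delimiter": "\x1f",              # ASCII 31, the unit separator
        "QuoteSymbol": '"',
        "ContainsHeader": "UNKNOWN",      # PRESENT | ABSENT | UNKNOWN
    }
)
```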
AWS Glue with CSV source data that changes over time
We have data that is being dumped into S3 every hour, with a basic Glue crawler running that enables us to query this data in Athena. The problem we're facing is that the source data is changing over time (columns added & removed) and the crawler doesn't seem to recognise this. Data from newer datasets is being put into columns that positionally align with the earlier datasets, rather than being placed into columns based on the column name (e.g. if the first dataset has columns A, C & D, and a new dataset has columns A, B, C & D, the new column B data shows up in column "C" in Athena, and the new column C data shows up in column "D"). How can I fix this so that we can see all the columns, with the data properly assigned to each column based on its header name?
Glue output to Stream?
I am relatively new to AWS and have been researching using Glue for a specific use case. What I would like to do is use Glue to rip apart a file into its component records and then push those records individually onto a queue or stream (thinking of EventBridge with Lambda resolvers that tackle the different record types published upstream). Based on the documentation I'm seeing, it looks like while Glue can now consume a stream of data, it doesn't seem to have the ability to output processed data to a stream.

I had considered creating a Lambda to rip apart the file and then publish the records onto the stream, but the files can be large and might exceed the limitations of a Lambda (time/size), so I was thinking that Glue would be a better solution (not to mention it includes many ETL functions related to data cleansing, profiling, etc. that I'd like to take advantage of). If not Glue, is there a more appropriate solution to my problem? Any suggestions would be welcomed. TIA -Steve
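One pattern I've been sketching, in case it clarifies what I'm after, is publishing from within the Glue job itself via the Kinesis API (a rough sketch; the stream name and the `record_type` field are placeholders):

```python
import json

import boto3

def publish_partition(rows):
    """Push one Spark partition's records to a Kinesis stream in batches of 500."""
    kinesis = boto3.client("kinesis")
    batch = []
    for row in rows:
        batch.append({
            "Data": json.dumps(row.asDict()).encode("utf-8"),
            "PartitionKey": str(row["record_type"]),   # placeholder field
        })
        if len(batch) == 500:                          # PutRecords max batch size
            kinesis.put_records(StreamName="my-record-stream", Records=batch)
            batch = []
    if batch:
        kinesis.put_records(StreamName="my-record-stream", Records=batch)

# Assuming `df` is the DataFrame produced after the Glue job parses the file:
# df.foreachPartition(publish_partition)
```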