Trying to create a CSV file in S3 using Glue with MongoDB as the data source
I have installed a MongoDB server on a t2.micro instance. I was able to connect to it successfully with MongoDB Compass using just authentication and the public IP, without an SSH tunnel. I then created a crawler and ran it against the source; the crawler successfully created a table and I can see the names of all of the fields. Now I am trying to make a Glue job, but I keep getting this error: An error occurred while calling o96.getDynamicFrame. scala.collection.immutable.HashMap$HashTrieMap cannot be cast to java.lang.String. I am able to run another Glue job successfully on sample JSON data sitting in S3. Job run ID: jr_19c00d6ff707bd8af110e007a207d9d92d0f64e41dacffa98250398b57dbf30b. I have been stuck on this error for two days now; any help will be highly appreciated.
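This cast error often shows up when the MongoDB documents contain nested fields (maps/structs), which a flat CSV write cannot represent. As a hedged sketch (the database, table, and S3 paths below are placeholders, not the actual names from this job), one way to flatten the data before writing CSV is Glue's Relationalize transform:

```python
# Hedged sketch: flatten nested MongoDB documents before writing CSV.
# All database/table names and S3 paths are placeholders.
import sys
from awsglue.transforms import Relationalize
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read via the table the crawler created (hypothetical names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="mongodb_db",        # placeholder
    table_name="my_collection",   # placeholder
    transformation_ctx="source",
)

# Nested documents/arrays become separate flat tables; the root table is CSV-safe.
flattened = Relationalize.apply(
    frame=dyf,
    staging_path="s3://my-bucket/tmp/",  # placeholder
    name="root",
    transformation_ctx="relationalize",
)

glue_context.write_dynamic_frame.from_options(
    frame=flattened.select("root"),
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},  # placeholder
    format="csv",
)
job.commit()
```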
`aws` command installed with the awscli library inside a Python venv on Windows invokes a Python OUTSIDE the venv
For awscli 1.25.86, installing on a freshly minted Windows EC2 instance (Windows Server 2022 Dataserver), I did this:

1. Selected my home directory (`cd`).
2. Installed `pyenv` (e.g., via PowerShell using https://github.com/pyenv-win/pyenv-win#power-shell). This said it didn't succeed but seems to have fully installed pyenv; that is not the bug. (I had to open a new PowerShell to see the effect of having installed `pyenv`.)
3. Told `pyenv` to install Python 3.8 (`pyenv install 3.8.10`).
4. Selected Python 3.8 globally (`pyenv global 3.8.10`).
5. Created a virtual environment (`pyenv exec python -m venv myvenv`).
6. Entered the venv (`myvenv\scripts\activate`).
7. Installed `awscli` (`pip install awscli`).
8. Tried to invoke `awscli` (`aws --version`). This gives the message `File association not found for extension .py`, which is an ignorable problem, followed by an error that is the bug I'm reporting:

```
Traceback (most recent call last):
  File "C:\Users\Andrea\GitHub\Submit4DN\s4dn_venv\Scripts\aws.cmd", line 50, in <module>
    import awscli.clidriver
ModuleNotFoundError: No module named 'awscli'
```

After studying this problem, I believe I know its source, and am pretty sure it's in the `awscli` library. The library installs `myvenv\scripts\aws.cmd`, which implements the `aws` command inside the virtual environment, but that script sniffs around for a `python` to invoke and finds one _outside_ of the virtual environment. The problem isn't that it tries to get out of the virtual environment; it's just apparently oblivious to the presence of one, and so it isn't picky about which python it finds. It successively seeks `python.cmd`, `python.bat`, and `python.exe` (see line 7 of `myvenv\scripts\aws.cmd`) but finds `python.cmd` first, and that is not inside the virtual environment. Had it checked `python.exe` first, it would have found the one in the virtual environment. If you swap the order of `(cmd bat exe)` on line 7 of `aws.cmd` so that it searches `(exe bat cmd)`, it will invoke the python within the virtual env and so will find the `awscli` that was just installed within the virtual environment. That's not necessarily the right fix; it still feels fragile. But it seems to me that this proves it's the locus of the problem.

Another partial workaround is to install `awscli` outside of the virtual environment by doing `deactivate`, then `pip install awscli`, then `myvenv/scripts/activate`, and then finally trying `aws --version`. That works, _except_ that if you change to another version of Python globally via pyenv, the `aws` command within the venv will break again unless you reinstall `awscli` in each globally selected Python.

I don't have a good fix to suggest because I'm not current on writing Windows shell scripts, but I imagine it involves a different way of discovering Python that gives strong preference to a venv if one is active, e.g., by noticing there is a `%VIRTUAL_ENV%` in effect and just invoking `python` (since virtual envs always have a `python`), or `%VIRTUAL_ENV%\scripts\python` if you want to be doubly sure.

Note that I was able to reproduce this problem on my Windows 10 Professional desktop at home as well, so it's nothing specific to the EC2 instance; that's just a way to show that this problem can be demonstrated in a clean environment. The problem seems pretty definitely to be in the `awscli` library.
Whatever solution you pick, I hope this illustrates the issue clearly enough that you can quickly issue some sort of fix to the `awscli` library because the present situation is just plain broken and this is impacting some instructions we're trying to give some users about how to access our system remotely. I'd rather not be advising users to edit scripts they got from elsewhere, nor do I want to supply alternate scripts for them to use. Things should just work.
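A minimal sketch of the discovery order the report suggests, written in Python purely for illustration (the real fix would live in the `aws.cmd` batch script, and none of this is the actual awscli code): prefer the active venv's interpreter via `VIRTUAL_ENV`, then fall back to whatever PATH yields, trying `.exe` before `.bat`/`.cmd`.

```python
# Illustrative sketch only (not the actual aws.cmd fix): resolve which Python
# to invoke, giving strong preference to an active virtual environment.
import os
import shutil

def find_python() -> str:
    venv = os.environ.get("VIRTUAL_ENV")
    if venv:
        candidate = os.path.join(venv, "Scripts", "python.exe")
        if os.path.exists(candidate):
            return candidate
    # Fall back to PATH, trying python.exe before python.bat / python.cmd.
    for name in ("python.exe", "python.bat", "python.cmd"):
        found = shutil.which(name)
        if found:
            return found
    raise FileNotFoundError("No Python interpreter found")

print(find_python())
```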
Redshift Increment identity key by 1 when loading a table
I have a serverless database set up in Redshift, created a table, and am now trying to load that table from a .csv file I have uploaded to an S3 bucket. When I created the table I set the primary key as an identity column as follows: `customerid integer NOT NULL identity(0,1)`. When I load the table using the COPY command, the key starts at 64 and increments by 128 rather than starting at 1 and incrementing by 1; for example, my customerid field has values of 64, 192, 320, 448, etc. I've read in numerous articles that this is due to compression and parallelism. I've tried including the COMPUPDATE OFF option as part of my COPY command, but that did not change the results. I've truncated my table each time before reloading it to reset the seed. How can I load a table and have the identity key start at 1 and increment by 1?
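One commonly suggested workaround (a sketch, not a guaranteed fix) is to COPY into a staging table without the identity column and then renumber explicitly on insert. The sketch below uses the Redshift Data API against a serverless workgroup; the workgroup, database, IAM role, bucket, and column names are all placeholders, and it assumes the target's customerid column accepts explicit values (with a strict IDENTITY column Redshift may reject explicit inserts, in which case EXPLICIT_IDS on a COPY into the final table is the other route).

```python
# Hedged workaround sketch: load into a staging table without the identity
# column, then renumber on insert. All names/ARNs below are placeholders.
import boto3

client = boto3.client("redshift-data")

sqls = [
    # Staging table without the identity column (column list is a placeholder).
    "CREATE TEMP TABLE customers_stage (firstname varchar(50), lastname varchar(50))",
    # Load only the data columns from S3 (bucket and IAM role are placeholders).
    "COPY customers_stage FROM 's3://my-bucket/customers.csv' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/my-copy-role' CSV IGNOREHEADER 1",
    # Renumber explicitly so customerid starts at 1 and increments by 1.
    "INSERT INTO customers (customerid, firstname, lastname) "
    "SELECT ROW_NUMBER() OVER (ORDER BY lastname, firstname), firstname, lastname "
    "FROM customers_stage",
]

# Run all statements in one transaction so the temp table survives across them.
response = client.batch_execute_statement(
    WorkgroupName="my-serverless-workgroup",  # placeholder
    Database="dev",                           # placeholder
    Sqls=sqls,
)
print(response["Id"])
```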
Glue Crawler cannot classify SNAPPY compressed JSON files
I have a Kinesis Firehose (KFH) application that puts Snappy-compressed JSON files into an S3 bucket. I also have a Glue crawler that creates a schema from that bucket. However, the crawler classifies the table as UNKNOWN when I enable Snappy compression; it cannot detect that the files are in JSON format. According to the doc below, the Glue crawler supports Snappy compression for JSON files, but I wasn't able to make it work: https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html#classifier-built-in I also thought it might be related to the file extension and tried the names below, but that didn't work either. Original: ``` |-----s3://my-bucket/my-table/day=01/file1.snappy ``` (1) ``` |-----s3://my-bucket/my-table/day=01/file1.snappy.json ``` (2) ``` |-----s3://my-bucket/my-table/day=01/file1.json.snappy ``` Thanks.
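One thing worth checking (a hedged diagnostic, not a fix) is how the objects are actually compressed: writers can produce either raw Snappy or a framed/Hadoop Snappy container, and the framing can determine whether the built-in JSON classifier can read the payload. The bucket and key below are placeholders, and the script needs the python-snappy package installed.

```python
# Hedged diagnostic sketch: check whether an object decompresses as raw Snappy.
# If it does not, it may use a framed/Hadoop Snappy container instead.
# Requires: pip install boto3 python-snappy
import boto3
import snappy

s3 = boto3.client("s3")
body = s3.get_object(
    Bucket="my-bucket",                       # placeholder
    Key="my-table/day=01/file1.snappy",       # placeholder
)["Body"].read()

try:
    text = snappy.decompress(body).decode("utf-8")
    print("Raw snappy; start of payload:", text[:200])
except Exception as exc:
    print("Not raw snappy (possibly framed/Hadoop snappy):", exc)
```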
S3 job bookmarks implementation
Does AWS provide implementation details for the S3 bookmarking logic in Glue? I have a bucket with tens of thousands of partitions (year, month, day, device_id), and each file inside a partition holds a number of events. When I run a job, how does the bookmarking logic call into the S3 APIs to determine which files need to be processed? I understand that it uses ListObjects or ListObjectsV2 and checks the modified time of each file, but my concern is: when there are millions of files, how does Glue optimize this listing behaviour? I would have thought that it might use the `objectCount` or `recordCount` properties of each partition to check first whether there are new objects to be processed before calling ListObjects, but I just ran some tests and confirmed that this does not occur, i.e. if I upload a file to S3 and re-run the job without running the crawler, it still picks up the new files (which have not yet been picked up by the crawler, nor added as aggregate metadata to the partition properties).
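AWS does not publish the listing internals, but for reference, here is a minimal sketch of the pieces the bookmark state is keyed off in a job script: the transformation_ctx on the source plus job.init()/job.commit(). The database/table names are placeholders; boundedFiles is an optional connection option that caps how many files a single run picks up, which can help when the backlog of unprocessed files is very large.

```python
# Minimal sketch of the scaffolding job bookmarks rely on. Names are placeholders.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # loads the stored bookmark state

events = glue_context.create_dynamic_frame.from_catalog(
    database="events_db",             # placeholder
    table_name="device_events",       # placeholder
    transformation_ctx="events_src",  # bookmark state is tracked per transformation_ctx
    additional_options={"boundedFiles": "50000"},  # optional: bound files per run
)

# ... transforms and writes ...

job.commit()  # persists the new bookmark covering the files seen this run
```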
Runaway Glue jobs leading to "Task allocated capacity exceeded limit" exception? (Glue ErrorCode 400 InvalidInputException)
My Glue jobs assumed the default 48-hour timeout (which I was not aware of initially) and ended up in a delayed loop, repeatedly testing for a specific file in a particular S3 bucket that never got created. Now, when I run a simple, basic Hello World type of Glue job, it consistently fails with the following error: ``` JobName:test and JobRunId:jr_6eb6af04d2a560f71d935ab3fca35504d7fdb99b748c0e0266e71402ced4437f_attempt_3 failed to execute with exception Task allocated capacity exceeded limit. (Service: AWSGlueJobExecutor; Status Code: 400; Error Code: InvalidInputException; Request ID: 7e43f436-4ca4-403e-a50f-8a15672ea2ef; Proxy: null) ``` I'm thinking this error is down to Glue job tasks possibly still running and therefore the allocated capacity limit being exceeded, although I do not see any CloudWatch logs being updated now after 24 hours. **Questions:** **1)** Is this error occurring because the Glue jobs are maybe still running in the background? **2)** Is there a way to list and kill these still-running Glue jobs to free up the resources? I have already tried with the AWS CLI (`aws glue batch-get-jobs --job-names ...`), but had no joy listing them there. I have now updated my Glue job timeout to 60 minutes within my Terraform code as a safeguard. Any help or guidance will be appreciated, thank you.
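On question 2: `batch-get-jobs` returns job definitions rather than runs, which may be why nothing useful came back. Below is a hedged boto3 sketch that lists a job's in-flight runs and requests that they stop; the job name is a placeholder.

```python
# Hedged sketch: list a Glue job's active runs and ask Glue to stop them.
import boto3

glue = boto3.client("glue")
job_name = "test"  # placeholder

running, token = [], None
while True:
    kwargs = {"JobName": job_name}
    if token:
        kwargs["NextToken"] = token
    page = glue.get_job_runs(**kwargs)
    for run in page["JobRuns"]:
        if run["JobRunState"] in ("STARTING", "RUNNING", "STOPPING"):
            running.append(run["Id"])
    token = page.get("NextToken")
    if not token:
        break

if running:
    glue.batch_stop_job_run(JobName=job_name, JobRunIds=running)
    print("Requested stop for:", running)
else:
    print("No active runs found for", job_name)
```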
SYNTAX_ERROR while running query in athena editor after triggering crawler
Hi folks, I got the below error while running a query in the Athena editor after triggering the crawler: SYNTAX_ERROR: line 1:8: SELECT * not allowed from relation that has no columns. This query ran against the "demodb" database, unless qualified by the query. **Note**: the table got populated in the workgroup, but when I try to preview the table (which should run the query) I get the syntax error. Any help would be great, thanks.
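That Athena message usually means the crawler wrote a table with zero columns into the Data Catalog (often a classification problem). A quick, hedged way to confirm is to inspect the table the crawler created; the table name below is a placeholder.

```python
# Hedged check: does the crawler-created catalog table actually have columns?
import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="demodb", Name="my_table")["Table"]  # table name is a placeholder

cols = table["StorageDescriptor"].get("Columns", [])
print("Columns:", [(c["Name"], c["Type"]) for c in cols])
print("Classification:", table.get("Parameters", {}).get("classification"))
print("SerDe:", table["StorageDescriptor"].get("SerdeInfo", {}))
```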
Glue: Using S3 ObjectCreated events with Crawler Catalog Target
I'm attempting to create a crawler against a pre-existing table defined in the Data Catalog, which describes a table stored in S3. I would like to use the CRAWL_EVENT_MODE recrawl policy, but this appears to be available only for S3 targets in the crawler, not for Data Catalog tables that have underlying S3 storage. Is there a way around this? I need to have the table defined in the Data Catalog first, because there is no self-describing schema in the source objects, and the crawler produces an incorrect schema when I allow it to create the table from scratch. I would also like to use S3 events to optimize the crawler behaviour and achieve near-real-time latency. Thanks.
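For comparison, here is a hedged sketch of the S3-target configuration where CRAWL_EVENT_MODE (with an SQS event queue) is available; the names, paths and ARNs are placeholders. This is not a way to attach event mode to a catalog target, just the shape of the configuration being asked about.

```python
# Hedged sketch: S3-target crawler with event-mode recrawl. All names/ARNs are placeholders.
import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="events-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="my_db",
    Targets={
        "S3Targets": [
            {
                "Path": "s3://my-bucket/my-table/",
                "EventQueueArn": "arn:aws:sqs:us-east-1:123456789012:my-crawler-queue",
            }
        ]
    },
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVENT_MODE"},
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
)
```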
Glue: UPDATE_IN_DATABASE is not working
Hello all, we have set up a few AWS Glue jobs that read data from a database and write it to S3 files in Parquet format. Along the way, we also update the Glue Data Catalog. We had created a Glue Data Catalog table with 5 columns (say A, B, C, D and E). Now, after 2 months, the team decided to drop columns D and E and rename C to c. I tried to use updateBehavior="UPDATE_IN_DATABASE", but that didn't work: the Parquet files we write to S3 have the updated data, i.e. A, B and c, but the Glue Data Catalog table has the new columns while the old columns are still present, just without any data. The code is like this: `silver_target = self.glue_context.getSink(path=silverLakeLocation + table_name, connection_type="s3", updateBehavior="UPDATE_IN_DATABASE", partitionKeys=["pyear"], enableUpdateCatalog=True, transformation_ctx="silver_target")`. If the Glue catalog table is deleted first, then I see the updated table with the columns removed. What should I do to update the table so that the old columns are removed too? Please advise.
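For reference, a hedged sketch of the full getSink write sequence, with updateBehavior spelled UPDATE_IN_DATABASE and the catalog database/table set explicitly via setCatalogInfo before writing; all database, table and path names below are placeholders, not the actual job's values.

```python
# Hedged sketch of the complete getSink sequence for catalog-updating writes.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# dyf stands in for the DynamicFrame carrying the new schema (A, B, c).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="bronze_db", table_name="source_table", transformation_ctx="src"  # placeholders
)

sink = glue_context.getSink(
    path="s3://my-silver-bucket/my_table/",   # placeholder
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",      # note the exact spelling
    partitionKeys=["pyear"],
    enableUpdateCatalog=True,
    transformation_ctx="silver_target",
)
sink.setCatalogInfo(catalogDatabase="silver_db", catalogTableName="my_table")  # placeholders
sink.setFormat("glueparquet")
sink.writeFrame(dyf)
job.commit()
```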
Custom classifier for AWS Glue crawler
I have a set of files in my S3 bucket that use ASCII 31 (unit separator) as the delimiter. I am using a crawler to read these files and create the tables in the AWS Glue catalog. I tried using a custom delimiter in the classifiers, but with no luck, since this is a non-printable character. What is the best way to incorporate this delimiter within a crawler?
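If the classifier cannot be made to accept the non-printable delimiter, one hedged workaround is to skip classification and read the files directly in a Glue/Spark job, passing ASCII 31 as the separator; the S3 path and header option below are placeholders.

```python
# Hedged workaround sketch: read unit-separated files directly with Spark,
# bypassing the crawler's classifier. Path and header handling are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("sep", "\x1f")                   # ASCII 31, the unit separator
    .option("header", "true")                # adjust if the files have no header row
    .csv("s3://my-bucket/unit-separated/")   # placeholder path
)
df.printSchema()
```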