Questions tagged with Amazon Athena

Browse through the questions and answers listed below or filter and sort to narrow down your results.

I tried to UNLOAD() data from MySQL to S3 using Athena engine version 3 and got an error: __INVALID_SESSION_PROPERTY: Unknown session property ***.manifest_location__ (*** is my data source name). The same statement works with Athena engine version 2, and I can't find any information about this error.
```
UNLOAD (SELECT * FROM database.table)
TO 's3://my-bucket/table/'
WITH (format = 'PARQUET', compression = 'SNAPPY')
```
0 answers · 0 votes · 68 views · asked a month ago
I have a 10 GB dataset loaded in a PySpark DataFrame.
```
df.coalesce(1).write.mode('overwrite').parquet("s3://xxxxxxxxxx-eu-west-1-athena-results-bucket-h1snx89wnc/output-data-parquet2")
```
Yesterday, the Parquet file was created in 6 to 7 minutes. Today, the write won't even finish; I am disconnected from the AWS console before it completes (so at least 45+ minutes). Is that possible, or did I do something wrong? (The source file hasn't changed.)
1 answer · 0 votes · 46 views · lalvaro · asked 2 months ago
Has anyone encountered this error? INVALID_FUNCTION_ARGUMENT: Invalid format: "trnNo"
1 answer · 0 votes · 15 views · asked 2 months ago
I'm following this tutorial to perform text analytics using UDFs: https://aws.amazon.com/blogs/machine-learning/translate-and-analyze-text-using-sql-functions-with-amazon-athena-amazon-translate-and-amazon-comprehend/ The example that does not save the results works fine in Athena v.3:
```
USING EXTERNAL FUNCTION detect_sentiment_all(col1 VARCHAR, lang VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf',
      EXTERNAL FUNCTION detect_entities_all(col1 VARCHAR, lang VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf'
SELECT *,
       detect_sentiment_all(review_body, language) AS sentiment,
       detect_entities_all(review_body, language) AS entities
FROM amazon_reviews_with_language
WHERE language IN ('ar', 'hi', 'ko', 'zh-TW', 'ja', 'zh', 'de', 'pt', 'en', 'it', 'fr', 'es')
```
However, the example that creates a table to save the results does not work in Athena v.3, although it works in Athena v.2:
```
CREATE TABLE amazon_reviews_with_text_analysis WITH (format='parquet') AS
USING EXTERNAL FUNCTION detect_sentiment_all(col1 VARCHAR, lang VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf',
      EXTERNAL FUNCTION detect_entities_all(col1 VARCHAR, lang VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf'
SELECT *,
       detect_sentiment_all(review_body, language) AS sentiment,
       detect_entities_all(review_body, language) AS entities
FROM amazon_reviews_with_language
WHERE language IN ('ar', 'hi', 'ko', 'zh-TW', 'ja', 'zh', 'de', 'pt', 'en', 'it', 'fr', 'es')
```
It fails with the error: line 2:1: mismatched input 'USING'. Expecting: '(', <query>. How can I save the results into a table when using an external function in Athena v.3?
1 answer · 0 votes · 67 views · MikeW · asked 2 months ago
Hello, I am currently using ServiceNow table dumps exported as JSON array files to S3. The S3 folder path structure is similar to the following:
```
aws s3 ls s3://bucket/data/
    PRE data1-2022-09-22/
    PRE data2-2022-09-22/
    PRE data3-2022-09-22/
```
Each folder contains a series of JSON files in their own schema. Each file contains an array of nested JSON objects, in the following structure:
```
[
  {
    "obj1": { "display_value": "Change", "value": "Change" },
    "obj2": { "display_value": "Cancel all future Tasks", "value": "cancel" },
    "obj3": { "display_value": "2021-01-21 15:00:01", "value": "2021-01-21 15:00:01" }
  },
  {
    "obj1": { "display_value": "Change", "value": "Change" },
    "obj2": { "display_value": "Cancel all future Tasks", "value": "cancel" },
    "obj3": { "display_value": "2021-01-07 20:36:34", "value": "2021-01-07 20:36:34" }
  },
  {
    "obj1": { "display_value": null, "value": "" },
    "obj2": { "display_value": "Cancel all future Tasks", "value": "cancel" },
    "obj3": { "display_value": "2021-02-11 19:40:40", "value": "2021-02-11 19:40:40" }
  }
]
```
As shown in the sample data, some of the values may be empty or null, but the overall structure is the same. When using a custom Glue JSON classifier, I am able to split the nested JSON objects into individual columns, where each column's data type is inferred as a struct, i.e.:
```
{
  "obj3": {
    "display_value": "string",
    "value": "string"
  }
}
```
Without a custom Glue JSON classifier, the schema is inferred as a single column (named array) whose data type is an array containing all of the nested objects with their elements and types. I note that using a custom classifier in Glue may not actually be the best approach here; it may be preferable to use no custom classifier and instead UNNEST the data from the array structure with an Athena query [1], using a CTAS to load it to S3. Currently, I am seeking a way to unnest the JSON objects such that (for the example above) the output would show all of the nested values in individual rows (additional columns and values from the sample data not shown):
```
obj1, obj2, obj3
{display_value="Change", value="Change"}, {display_value="Cancel all future Tasks", value="cancel"}, {display_value="2021-01-21 15:00:01", value="2021-01-21 15:00:01"}
{display_value="Change", value="Change"}, {display_value="Cancel all future Tasks", value="cancel"}, {display_value="2021-01-07 20:36:34", value="2021-01-07 20:36:34"}
{display_value=null, value=""}, {display_value="Cancel all future Tasks", value="cancel"}, {display_value="2021-02-11 19:40:40", value="2021-02-11 19:40:40"}
```
May I have some guidance on constructing such an Athena query to produce this kind of output for the sample data, either with the custom Glue classifier or without it?
##### Sources
[1] https://docs.aws.amazon.com/athena/latest/ug/flattening-arrays.html
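For reference, a minimal sketch of the UNNEST approach from [1], assuming the no-classifier schema where the whole file maps to a single column named array of ROW elements; the table name below is hypothetical:
```
-- Sketch only: hypothetical table name; assumes a single column "array"
-- whose elements are ROWs with fields obj1, obj2, obj3.
SELECT item.obj1, item.obj2, item.obj3
FROM servicenow_data1
CROSS JOIN UNNEST("array") AS t(item)
```
Wrapping a query like this in a CTAS would then persist the flattened rows back to S3.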
0 answers · 0 votes · 51 views · AWS · asked 2 months ago
Hi everyone, I am creating a simple data lake with S3, Glue, and Athena through Terraform (infrastructure as code), managed by Lake Formation. I grant all the permissions needed in both IAM and Lake Formation for the workflow roles and the Athena users. However, from time to time (I deploy and destroy the infrastructure many times in the dev environment) I get an error when I try to query the Glue tables in Athena: **Permission denied on S3 path: s3://XXX/XX. This query ran against the "XXXX" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 0adbb706-4500-4c3b-8f0d-3174fa4e12b9.** When it occurs, the error affects all the tables in the database. When I redeploy the same code, the error might not occur at all. I have searched AWS documentation and the internet but have not found the cause or a solution. Has anyone encountered this error or does anyone know the reason? Is there a solution for this problem? By the way, I don't have KMS encryption yet; I just have S3 SSE enabled at the moment. Thank you very much in advance. Feng
1 answer · 0 votes · 131 views · asked 2 months ago
After running the query below in Athena, I am getting an error:
```
SELECT REPEAT('*', LENGTH('testing')) AS "e7cfff0a-f4d3-4fd9-a6cc-c5f7af177074"
FROM "default"."all_flights"
GROUP BY REPEAT('*', LENGTH('testing'))
LIMIT 500
```
```
SYNTAX_ERROR: line 2:3: Unexpected parameters (varchar(1), bigint) for function repeat. Expected: repeat(T, integer) T
This query ran against the "amtestdb" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 0f51f40e-66e1-4512-aa22-ed36b4f7bee6
```
I'm wondering why this query doesn't run as expected.
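One hedged workaround sketch, assuming the only issue is that LENGTH returns a bigint while repeat expects an integer count, is to cast the length explicitly:
```
-- Sketch only: same query with the count cast to integer.
SELECT REPEAT('*', CAST(LENGTH('testing') AS integer)) AS "e7cfff0a-f4d3-4fd9-a6cc-c5f7af177074"
FROM "default"."all_flights"
GROUP BY REPEAT('*', CAST(LENGTH('testing') AS integer))
LIMIT 500
```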
1 answer · 0 votes · 24 views · AWS · asked 2 months ago
Hi, I am getting the error below when writing data to an Iceberg table that I'm partitioning on a column called device_id. The device_id column holds a set of GUIDs stored as strings. I checked whether the error is caused by device_id being null for some records, but that is not the case. Any help would be greatly appreciated. GENERIC_INTERNAL_ERROR: Unsupported partition transform: 1000: device_id: void(1)
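For context: in Iceberg's v1 table spec, a dropped partition field is replaced by a void transform, so it may be worth checking whether the table's current partition spec still carries an identity transform on device_id. As a point of comparison only, a minimal Athena DDL sketch (hypothetical names) for an Iceberg table identity-partitioned on device_id:
```
-- Sketch only: hypothetical database, table, column, and S3 names.
CREATE TABLE my_db.device_events (
  device_id string,
  event_time timestamp,
  payload string)
PARTITIONED BY (device_id)
LOCATION 's3://my-bucket/device_events/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
```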
0 answers · 0 votes · 20 views · asked 2 months ago
I want to use Athena to allow users to troubleshoot and find specific data. I'm storing sensitive information, so I'd like to limit the user's queries against the tables. For example, the user can search for a specific field value, but they can only return 100 records at a time. I also would like to prevent exporting the data, but I don't see an appropriate policy to prevent this. Saved parameterized queries and prepared queries seem to offer a way to do this, but I don't see a way to *only* limit the user to these saved queries. I also don't want them to be able to write ad-hoc queries that can get at more data than the saved or prepared queries allow. I see the "per query data usage control" setting in the Workgroup to limit the data per query, but the minimum I can set this to is 10 MB, which is larger than I'd like to set it. In short, without creating a middle-layer API, is there a way to lock down the Athena console to prevent users from bulk-querying all the data and limit the user to pre-saved queries?
1 answer · 0 votes · 40 views · asked 2 months ago
I have an array stored in an S3 bucket that looks like this:
```
[
  {
    "bucket_name": "ababa",
    "bucket_creation_date": "130999",
    "additional_data": {
      "bucket_acl": [
        {
          "Grantee": { "DisplayName": "abaabbb", "ID": "abaaaa", "Type": "CanonicalUser" },
          "Permission": "FULL_CONTROL"
        }
      ],
      "bucket_policy": {
        "Version": "2012-10-17",
        "Id": "abaaa",
        "Statement": [
          {
            "Sid": "iddd",
            "Effect": "Allow",
            "Principal": { "Service": "logging.s3.amazonaws.com" },
            "Action": "s3:PutObject",
            "Resource": "aarnnn"
          },
          {
            "Effect": "Deny",
            "Principal": "*",
            "Action": [ "s3:GetBucket*", "s3:List*", "s3:DeleteObject*" ],
            "Resource": [ "arn:aws:s3:::1111-aaa/*", "arn:aws:s3:::1111-bbb" ],
            "Condition": { "Bool": { "aws_SecureTransport": "false" } }
          }
        ]
      },
      "public_access_block_configuration": {
        "BlockPublicAcls": true,
        "IgnorePublicAcls": true,
        "BlockPublicPolicy": true,
        "RestrictPublicBuckets": true
      },
      "website_hosting": {},
      "bucket_tags": [ { "Key": "keyyy", "Value": "valueee" } ]
    },
    "processed_data": {}
  },
  .......................
]
```
NOTE: some of the fields may be a string/array/struct depending on the data we get (e.g., Action can be an array or a string).

END GOAL: I want to query this data, check multiple conditions, and then create a field inside processed_data and set it to true/false based on the result, using AWS Glue.

Example: for each object inside the array, I want to check:
```
1- if bucket_acl has Grantee.Type=CanonicalUser and Permission=FULL_CONTROL
AND
2- if bucket_policy has a Statement that contains Effect=Allow and Principal=* and Action = ...... and Resource = ...... and Condition is empty
AND
3- website_hosting is empty
and then create a field inside processed_data and set it to true if the above conditions are satisfied, e.g.
processed_data: { isPublic: True }
```
Approaches I tried:

1- I saved the data to S3 in Parquet format using aws-wrangler/aws-pandas for faster querying and then read it in AWS Glue using a Glue DynamicFrame:
```
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={"paths": ["s3://abaabbb/abaaaaa/"], "recurse": True},
    transformation_ctx="S3bucket_node1",
)
S3bucket_node1.printSchema()
S3bucket_node1.show()
```
Output:
```
root
|-- bucket_name: string
|-- bucket_creation_date: string
|-- additional_data: string
|-- processed_data: string

{"bucket_name": "abaaaa", "bucket_creation_date": "139999", "additional_data": "{'bucket_acl': [{'Grantee': {'DisplayName': 'abaaaaaa', 'ID': 'abaaa', 'Type': 'CanonicalUser'}, 'Permission': 'FULL_CONTROL'}], 'bucket_policy': {}, 'public_access_block_configuration': {'BlockPublicAcls': True, 'IgnorePublicAcls': True, 'BlockPublicPolicy': True, 'RestrictPublicBuckets': True}, 'website_hosting': {}, 'bucket_tags': []}", "processed_data": "{}"}
```
Everything comes back as a string; it seems these libraries don't fully support nested data types.

2- I saved the data as-is (in JSON) using the PutObject API and then read it in AWS Glue using a Glue DynamicFrame:
```
piece1 = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": True},
    connection_type="s3",
    format="json",
    connection_options={"paths": ["s3://raghav-test-df/raghav3.json"], "recurse": True},
    transformation_ctx="S3bucket_node1",
)
piece1.printSchema()
piece1.show()
piece1.count()
```
Output:
```
root

0
```
I get no schema and a count of 0.

3- I tried reading the data with a Spark DataFrame:
```
sparkDF = spark.read.option("inferSchema", "true").option("multiline", "true").json("s3://ababa/abaa.json")
sparkDF.printSchema()
sparkDF.count()
sparkDF.show()
```
Output:
```
root
|-- additional_data: struct (nullable = true)
| |-- bucket_acl: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Grantee: struct (nullable = true)
| | | | |-- DisplayName: string (nullable = true)
| | | | |-- ID: string (nullable = true)
| | | | |-- Type: string (nullable = true)
| | | |-- Permission: string (nullable = true)
| |-- bucket_policy: struct (nullable = true)
| | |-- Id: string (nullable = true)
| | |-- Statement: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Action: string (nullable = true)
| | | | |-- Condition: struct (nullable = true)
| | | | | |-- Bool: struct (nullable = true)
| | | | | | |-- aws:SecureTransport: string (nullable = true)
| | | | | |-- StringEquals: struct (nullable = true)
| | | | | | |-- AWS:SourceAccount: string (nullable = true)
| | | | | | |-- AWS:SourceArn: string (nullable = true)
| | | | | | |-- aws:PrincipalAccount: string (nullable = true)
| | | | | | |-- s3:x-amz-acl: string (nullable = true)
| | | | |-- Effect: string (nullable = true)
| | | | |-- Principal: string (nullable = true)
| | | | |-- Resource: string (nullable = true)
| | | | |-- Sid: string (nullable = true)
| | |-- Version: string (nullable = true)
| |-- bucket_tags: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Key: string (nullable = true)
| | | |-- Value: string (nullable = true)
| |-- public_access_block_configuration: struct (nullable = true)
| | |-- BlockPublicAcls: boolean (nullable = true)
| | |-- BlockPublicPolicy: boolean (nullable = true)
| | |-- IgnorePublicAcls: boolean (nullable = true)
| | |-- RestrictPublicBuckets: boolean (nullable = true)
|-- bucket_creation_date: string (nullable = true)
|-- bucket_name: string (nullable = true)
```
This gives the schema and the correct count, but some of the fields can have different data types (e.g., Action can be a string or an array) and Spark defaults them to string; I think querying the data on multiple conditions using SQL will be too complex. Do I need to change my approach or do something else? I am stuck here. Can someone please help me achieve the end goal?
1 answer · 0 votes · 84 views · asked 2 months ago
The error message I'm getting: Error message not found: ATHENA_CLIENT_ERROR. Can't find bundle for base name com.simba.athena.athena.jdbc42.messages, locale en_US. We have a data lake architecture that we stood up on AWS S3. When I run queries against the tables in the Curated db in Athena, I get results. When I copy the same query and paste it into the custom SQL function in Tableau, it gives me an error. This issue is affecting our business and needs to be resolved as soon as possible. Please send me an answer if you have previously dealt with this kind of issue. Important details: I have the Simba JDBC 4.2 driver. I have an Athena properties file pointing to our S3 location. I have Java 8.0 installed with JAVA_HOME set. I have * access, meaning I have all access in AWS. I am able to connect to tables in the database. I am able to view all the tables in the database. I have also made a couple of dashboards using this database.
0 answers · 0 votes · 21 views · asked 2 months ago
The error message I'm getting: Error message not found: ATHENA_CLIENT_ERROR. Can't find bundle for base name com.simba.athena.athena.jdbc42.messages, locale en_US. We have a data lake architecture that we stood up on AWS S3. When I run queries against the tables in the Curated db in Athena, I get results. When I copy the same query and paste it into the custom SQL function in Tableau, it gives me an error. This issue is affecting our business and needs to be resolved as soon as possible. Please send me an answer if you have previously dealt with this kind of issue. Other details: I have * access, meaning I have all access in AWS.
1 answer · 0 votes · 32 views · asked 2 months ago