
Questions tagged with Amazon Athena



Can't get Partitions to work with my Glue Data Catalog

I have S3 files that are uploaded to a single bucket. There are no folders or anything like that; it's just one file per hour uploaded to this bucket. I run a Glue ETL job on these files, do some transformations, and insert the data into a Glue Data Catalog table stored in a different bucket. I can then query that table with Athena, and that works.

What I would like to do is store the files in the S3 location of the Data Catalog table as YEAR/MONTH/DAY, using partitions. Even though the SOURCE data is just files uploaded every hour with no partitions, I want to store them in the Data Catalog WITH partitions. So I extracted the YEAR, MONTH, and DAY from the files during the Glue ETL job, created columns in my Data Catalog table accordingly, and marked them as partitions: Partition 1 YEAR, Partition 2 MONTH, Partition 3 DAY. The proper values are in these columns, and I have verified that.

After creating the partitions I ran MSCK REPAIR TABLE on the table, and it came back with "Query Ok." I then ran my Glue ETL job. When I look in the S3 bucket I do not see folders created, just regular r-part files. When I click on the table schema it shows the columns YEAR, MONTH, DAY marked as partitions, but when I click on View Partitions it just shows: year, month, day, "No partitions found."

What do I need to do? These are just CSV files. I can't control the process that is uploading the raw data to S3; it is just going to store hourly files in a bucket. I can control the ETL job and the Data Catalog. When I try to query after creating the partitions and running MSCK REPAIR TABLE, there is no data returned, yet I can go into the Data Catalog bucket, pull up one of the r-part files, and the data is there.
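A note on the write side (illustrative only, not from the original post): the year/month/day folders have to be produced by the ETL job itself; marking columns as partition keys in the catalog does not reorganize the files. A minimal sketch of a Glue sink that writes Hive-style partition folders, assuming placeholder database, table, path, and column names:

```
# Hypothetical sketch: write the job output partitioned by year/month/day.
# Database, table, bucket path, and column names are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Assumes the frame already has year/month/day columns extracted upstream.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_source_db", table_name="hourly_files"
)

glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://my-catalog-bucket/my_table/",
        "partitionKeys": ["year", "month", "day"],  # produces year=/month=/day= prefixes
    },
    format="csv",
)
```

MSCK REPAIR TABLE only registers partitions that already exist in S3 as `key=value` prefixes, so until the job writes that layout there is nothing for it to find.
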
0 answers · 0 votes · 6 views · asked 14 hours ago

Athena Error: Permission Denied on S3 Path.

I am trying to execute Athena queries from a Lambda function, but I am getting this error:

`Athena Query Failed to run with Error Message: Permission denied on S3 path: s3://bkt_logs/apis/2020/12/16/14`

The bucket `bkt_logs` is the bucket used by AWS Glue crawlers to crawl through all the sub-folders and populate the Athena table I am querying. `bkt_logs` is also an encrypted bucket. These are the policies that I have assigned to the Lambda:

```
[
  {
    "Action": ["s3:Get*", "s3:List*", "s3:PutObject", "s3:DeleteObject"],
    "Resource": "arn:aws:s3:::athena-query-results/*",
    "Effect": "Allow",
    "Sid": "AllowS3AccessToSaveAndReadQueryResults"
  },
  {
    "Action": ["s3:*"],
    "Resource": "arn:aws:s3:::bkt_logs/*",
    "Effect": "Allow",
    "Sid": "AllowS3AccessForGlueToReadLogs"
  },
  {
    "Action": [
      "athena:GetQueryExecution",
      "athena:StartQueryExecution",
      "athena:StopQueryExecution",
      "athena:GetWorkGroup",
      "athena:GetDatabase",
      "athena:BatchGetQueryExecution",
      "athena:GetQueryResults",
      "athena:GetQueryResultsStream",
      "athena:GetTableMetadata"
    ],
    "Resource": ["*"],
    "Effect": "Allow",
    "Sid": "AllowAthenaAccess"
  },
  {
    "Action": ["glue:GetTable", "glue:GetDatabase", "glue:GetPartitions"],
    "Resource": ["*"],
    "Effect": "Allow",
    "Sid": "AllowGlueAccess"
  },
  {
    "Action": ["kms:CreateGrant", "kms:DescribeKey", "kms:Decrypt"],
    "Resource": ["*"],
    "Effect": "Allow",
    "Sid": "AllowKMSAccess"
  }
]
```

What seems to be wrong here? What should I do to resolve this issue?
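One thing that may be worth ruling out (an assumption, not a confirmed diagnosis): the policy grants `s3:*` on `arn:aws:s3:::bkt_logs/*`, which covers objects, but `s3:ListBucket` and `s3:GetBucketLocation` are evaluated against the bucket ARN itself, and the KMS statement must cover the key that actually encrypts `bkt_logs`. A rough sketch of a bucket-level statement, written as a Python dict so it can be attached with boto3 (role and policy names are placeholders):

```
# Hypothetical extra inline policy: bucket-level access for bkt_logs.
# Role name and policy name are placeholders, not from the original question.
import json
import boto3

iam = boto3.client("iam")

bucket_level_statement = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowListBucketOnLogsBucket",
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": "arn:aws:s3:::bkt_logs",  # bucket ARN, no /*
        }
    ],
}

iam.put_role_policy(
    RoleName="my-lambda-execution-role",        # placeholder
    PolicyName="bkt-logs-bucket-level-access",  # placeholder
    PolicyDocument=json.dumps(bucket_level_statement),
)
```
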
1 answer · 0 votes · 34 views · asked 3 days ago

_temp AWS Lake Formation blueprint pipeline tables appear to an IAM user in the Athena editor although I didn't give this user permission on them

_temp Lake Formation blueprint pipeline tables appear to an IAM user in the Athena editor, although I didn't give this user permission on them. Below is the policy granted to this IAM user; also, in the Lake Formation permissions, I didn't give this user any permissions on the _temp tables:

{ "Version": "2012-10-17", "Statement": [ { "Sid": "Stmt1652364721496", "Action": [ "athena:BatchGetNamedQuery", "athena:BatchGetQueryExecution", "athena:GetDataCatalog", "athena:GetDatabase", "athena:GetNamedQuery", "athena:GetPreparedStatement", "athena:GetQueryExecution", "athena:GetQueryResults", "athena:GetQueryResultsStream", "athena:GetTableMetadata", "athena:GetWorkGroup", "athena:ListDataCatalogs", "athena:ListDatabases", "athena:ListEngineVersions", "athena:ListNamedQueries", "athena:ListPreparedStatements", "athena:ListQueryExecutions", "athena:ListTableMetadata", "athena:ListTagsForResource", "athena:ListWorkGroups", "athena:StartQueryExecution", "athena:StopQueryExecution" ], "Effect": "Allow", "Resource": "*" }, { "Effect": "Allow", "Action": [ "glue:GetDatabase", "glue:GetDatabases", "glue:BatchDeleteTable", "glue:GetTable", "glue:GetTables", "glue:GetPartition", "glue:GetPartitions", "glue:BatchGetPartition" ], "Resource": [ "*" ] }, { "Sid": "Stmt1652365282568", "Action": "s3:*", "Effect": "Allow", "Resource": [ "arn:aws:s3:::queryresults-all", "arn:aws:s3:::queryresults-all/*" ] }, { "Effect": "Allow", "Action": [ "lakeformation:GetDataAccess" ], "Resource": [ "*" ] } ] }
1 answer · 0 votes · 8 views · asked 8 days ago

Sync DynamoDB to S3

What is the best way to sync my DynamoDB tables to S3, so that I can perform serverless 'big data' queries using Athena? The data must be kept in sync without any intervention. The frequency of sync would depend on the cost, ideally daily but perhaps weekly. I have had this question a long time. I will cover what I have considered, and why I don't like the options.

1) AWS Glue Elastic Views. Sounds like this will do the job with no code, but it was announced 18 months ago and there have been no updates since. It's not generally available, and there is no information on when it might be.

2) Use DynamoDB native backup following this blog: https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/. I actually already use this method for one-off data transfers that I kick off manually and then configure in Athena. I have two issues with this option. The first is that, to my knowledge, the export cannot be scheduled natively. The blog suggests using the CLI to kick off exports, and I assume the writer intends for the CLI to be scheduled on a cron job somewhere; I don't run any servers for this, though I imagine I could do it via a scheduled Lambda with an SDK. The second issue is that the export path in S3 always includes a unique export ID. This means I can't configure the Athena table to point to a static location for the data and just switch over to the new data after a scheduled export. Perhaps I could write another Lambda to move the data to a static location after the export has finished, but it seems a shame to have to do so much work, and I've not seen that covered anywhere before.

3) I can use Data Pipeline as described in https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html. This post is more about backing data up than making it accessible to Athena.

I feel like this use case must be so common, and yet none of the ideas I've seen online are really complete. I was wondering if anyone had any ideas or experiences that would be useful here?
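On option 2, the scheduled-Lambda idea is small to sketch with boto3; this is a hedged illustration only (table ARN and bucket are placeholders, point-in-time recovery must be enabled on the table, and the export still lands under a generated export ID prefix, so the static-location concern remains):

```
# Hypothetical scheduled Lambda (e.g. triggered by an EventBridge rule) that
# kicks off a DynamoDB export to S3. Table ARN and bucket are placeholders.
import boto3

dynamodb = boto3.client("dynamodb")

def handler(event, context):
    response = dynamodb.export_table_to_point_in_time(
        TableArn="arn:aws:dynamodb:eu-west-1:123456789012:table/my-table",  # placeholder
        S3Bucket="my-dynamodb-exports",                                     # placeholder
        S3Prefix="my-table/",
        ExportFormat="DYNAMODB_JSON",
    )
    return response["ExportDescription"]["ExportArn"]
```
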
2 answers · 0 votes · 9 views · asked 15 days ago

ErrorCode: INTERNAL_ERROR_QUERY_ENGINE

I am trying to create a view on a table that has more than 5 billion rows, using the following query:

with date_month as (select distinct as_on_month from prod_edp_db_trusted.keyword_metrics_serp_dataforseo_v2_details where as_on_date >= date_add('Month',-2,current_date) ),

sim_data as (select *,date as sim_date,substring(date,1,4)||substring(date,6,2) as sim_yr_month from ( select tenant,locale,search_engine,device_type,url,keyword,traffic,cast(date as varchar(255)) as date,cast(organic_rank as decimal(38,0)) as organic_rank,create_date ,row_number() over(partition by tenant,locale,search_engine,device_type,url,keyword order by date desc,create_date) as rn from prod_edp_db_trusted.simulated_score_v3_4 WHERE save_type='simulation' ) where rn=1 ),

serp_data as ( select * from( select sim.tenant,sim.locale,sim.search_engine,sim.device_type,sim.url,sim.keyword,sim.traffic,sim.organic_rank as sim_rank,sim.sim_date as simulation_date,sim.sim_yr_month,srp.position as rank, srp_mn.as_on_month as serp_year_month, srp.as_on_date as serp_as_on_date, row_number() over(partition by sim.tenant,sim.locale,sim.search_engine,sim.device_type,sim.url,sim.keyword,srp_mn.as_on_month order by srp.as_on_date desc,srp.created_date) as rn from sim_data sim join date_month as srp_mn on srp_mn.as_on_month<=sim.sim_yr_month or srp_mn.as_on_month>=sim.sim_yr_month left join prod_edp_db_trusted.keyword_metrics_serp_dataforseo_v2_details srp on srp.locale=sim.locale and srp.search_engine=sim.search_engine and srp.device_type=sim.device_type and srp.url=sim.url and srp.keyword=sim.keyword --and srp.as_on_date>=sim.sim_date
and srp.as_on_month=srp_mn.as_on_month and srp.as_on_date>=date_add('Month',-2,date_trunc('Month',current_date)) ) where rn=1 ),

sv_data AS ( select tenant,locale,url,search_engine,keyword,search_volume,sv_yr_month from ( select LKP.tenant ,MAIN.locale ,lkp.url ,MAIN.search_engine ,MAIN.keyword ,MAIN.count as search_volume ,MAIN.as_on_month as sv_yr_month ,row_number() over(partition by lkp.tenant,main.locale,lkp.url,main.search_engine,main.keyword ,as_on_month order by as_on_date desc,created_date) rn from prod_edp_db_trusted.keyword_metrics_search_volume_dataforseo_v2 MAIN JOIN (SELECT tenant,locale,url,search_engine,keyword,Min(SIM_YR_MONTH) AS SIM_YR_MONTH FROM sim_data GROUP BY 1,2,3,4,5) LKP ON MAIN.locale=LKP.locale AND MAIN.search_engine=LKP.search_engine and main.keyword=LKP.keyword where MAIN.as_on_month IN(SELECT * FROM date_month ) ) where rn=1 ),

base_dataset AS ( select srp.tenant,srp.locale,srp.search_engine,srp.device_type,srp.url,srp.keyword,srp.sim_rank,srp.rank,srp.serp_year_month,srp.traffic ,srp.serp_as_on_date,srp.simulation_date,srp.SIM_YR_MONTH ,sv.search_volume from serp_data srp left join sv_data sv on srp.tenant=sv.tenant and srp.locale=sv.locale and srp.search_engine=sv.search_engine and srp.keyword=sv.keyword and srp.url=sv.url and srp.serp_year_month=sv.sv_yr_month ),

sim_wavg as ( (select * from base_dataset where sim_yr_month=serp_year_month) union (select tenant,locale,search_engine,device_type,url,keyword,sim_rank,rank,serp_year_month,traffic ,serp_as_on_date,simulation_date,SIM_YR_MONTH ,search_volume from (select *,row_number() over(partition by tenant,locale,search_engine,device_type,url,keyword,simulation_date, SIM_YR_MONTH order by serp_year_month) base_rn from base_dataset where sim_yr_month<(select min(as_on_month) from date_month))t3 where base_rn=1 ) ),

sim_latest_mnth as ( select * FROM ( select *,row_number() over(partition by tenant,locale,search_engine,device_type,url,keyword ,simulation_date, SIM_YR_MONTH order by serp_year_month desc) base_rn from base_dataset ) t3 where base_rn=1 ),

final_base AS ( select sim.tenant,sim.locale,sim.search_engine,sim.device_type,sim.url,sim.keyword,sim.sim_rank,sim.search_volume ,lkp.serp_year_month,lkp.serp_as_on_date,lkp.RANK as SERP_Rank,lkp.traffic ,sim.simulation_date,sim.SIM_YR_MONTH,lkp.search_volume as month_sv ,lst.search_volume as latest_month_sv,sim.sim_rank-lst.rank as rank_diff from sim_wavg sim left join base_dataset lkp on sim.tenant=lkp.tenant and sim.locale=lkp.locale and sim.search_engine=lkp.search_engine and sim.device_type=lkp.device_type and sim.url=lkp.url and sim.keyword=lkp.keyword left join sim_latest_mnth lst on sim.tenant=lst.tenant and sim.locale=lst.locale and sim.search_engine=lst.search_engine and sim.device_type=lst.device_type and sim.url=lst.url and sim.keyword=lst.keyword )

select tenant,locale,search_engine,device_type,url,keyword ,search_volume as "Simulation Month Search Volume",simulation_date,latest_month_sv,traffic as "SIMULATION MONTH TRAFFIC" ,sim_rank as "Rank as on Last Simulation",rank_diff as "Rank Difference With Latest Month" ,kw_imp['202204'] as Current_Month ,kw_imp['202203'] as Last_Month ,kw_imp['202202'] as "2nd_Last_Month" ,kw_imp['202201'] as "3rd_Last_Month" FROM ( select tenant,locale,search_engine,device_type,url,keyword,search_volume,latest_month_sv,simulation_date,traffic,sim_rank ,rank_diff ,map_agg(serp_year_month,SERP_Rank) as kw_imp from final_base group by tenant,locale,search_engine,device_type,url,keyword,search_volume,latest_month_sv,simulation_date,traffic,sim_rank ,rank_diff ) order by tenant,locale,search_engine,device_type,url,keyword;
0 answers · 0 votes · 5 views · asked 24 days ago

AWS:InstanceInformation folder created in s3 by Resource Data Sync cannot be queried by Athena because it has an invalid schema with duplicate columns.

[After resolving my first issue](https://repost.aws/questions/QUXOInFRr1QrKfR0Bh9wVglA/aws-glue-not-properly-crawling-s-3-bucket-populated-by-resource-data-sync-specifically-aws-instance-information-is-not-made-into-a-table) with getting a resource data sync set up, I've now run into another issue with the same folder.

When a resource data sync is created, it creates 13 folders following a structure like `s3://resource-data-sync-bucket/AWS:*/accountid=*/regions=*/resourcetype=*/instance.json`. When running the Glue crawler over this, a schema is created where partitions are made for each subpath with an `=` in it. This works fine for most of the data, except for the path starting with `AWS:InstanceInformation`. The instance information JSON files ALSO contain a "resourcetype" field, as can be seen here:

```
{"PlatformName":"Microsoft Windows Server 2019 Datacenter","PlatformVersion":"10.0.17763","AgentType":"amazon-ssm-agent","AgentVersion":"3.1.1260.0","InstanceId":"i","InstanceStatus":"Active","ComputerName":"computer.name","IpAddress":"10.0.0.0","ResourceType":"EC2Instance","PlatformType":"Windows","resourceId":"i-0a6dfb4f042d465b2","captureTime":"2022-04-22T19:27:27Z","schemaVersion":"1.0"}
```

As a result, there are now two "resourcetype" columns in the "aws_instanceinformation" table schema, and attempts to query that table result in the error `HIVE_INVALID_METADATA: Hive metadata for table is invalid: Table descriptor contains duplicate columns`.

I've worked around this issue by removing the offending field and setting the crawler to ignore schema updates, but this doesn't seem like a great long-term solution, since any changes made by AWS to the schema will be ignored. Is this a known issue with using this solution? Are there any plans to change how the AWS:InstanceInformation documents are structured so duplicate columns aren't created?
0 answers · 0 votes · 2 views · asked 25 days ago

AWS Glue not properly crawling s3 bucket populated by "Resource Data Sync" -- specifically, "AWS: InstanceInformation" is not made into a table

I set up an S3 bucket that collects inventory data from multiple AWS accounts using the Systems Manager "Resource Data Sync". I was able to set up the data syncs to feed into the single bucket without issue, and the Glue crawler was created automatically. Now that I'm trying to query the data in Athena, I noticed there is an issue with how the crawler is parsing the data in the bucket.

The folder "AWS:InstanceInformation" is not being turned into a table. Instead, it is turning all of the "region=us-east-1/" and "test.json" sub-items into tables, which are, obviously, not queryable. To illustrate further, each of the following paths is being turned into its own table:

* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=12345679012/region=us-east-1
* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=12345679012/test.json
* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=23456790123/region=us-east-1
* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=23456790123/test.json
* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=34567901234/region=us-east-1
* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=34567901234/test.json

This is ONLY happening with the "AWS:InstanceInformation" folder. All of the other folders (e.g. "AWS:DetailedInstanceInformation") are being properly turned into tables. Since all of this data was populated automatically, I'm assuming that we are dealing with a bug? Is there anything I can do to fix this?
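One setting that may be worth checking (an assumption, not a confirmed cause): Glue crawlers accept a JSON configuration that controls how tables are grouped, and forcing compatible schemas under a prefix into a single table can stop the one-table-per-sub-path behavior. A rough boto3 sketch with a placeholder crawler name:

```
# Hypothetical: update the crawler so compatible schemas are grouped into one
# table instead of a table per sub-folder. Crawler name is a placeholder.
import json
import boto3

glue = boto3.client("glue")

configuration = {
    "Version": 1.0,
    "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
}

glue.update_crawler(
    Name="resource-data-sync-crawler",        # placeholder
    Configuration=json.dumps(configuration),  # Glue expects a JSON string here
)
```
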
1 answer · 0 votes · 1 view · asked 25 days ago

Athena returns "FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. null"

Following the Well-Architected Labs "200: Cost and Usage Analysis", I get the following error when adding partitions in the Athena Query Editor:

```
MSCK REPAIR TABLE `cost_optimization_10XXXXXXXX321`;
```

and it returned the following error:

> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. null
> This query ran against the "costfubar" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 856e146a-8b13-4175-8cd8-692eef6d3fa5

The table was created correctly in Glue with:

```
Name                    cost_optimization_10XXXXXXXXX21
Description
Database                costfubar
Classification          parquet
Location                s3://cost-optimization-10XXXXXXX321//
Connection
Deprecated              No
Last updated            Wed Apr 20 16:46:28 GMT-500 2022
Input format            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Output format           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Serde serialization lib org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
Serde parameters        serialization.format 1
Table properties        sizeKey 4223322
                        objectCount 4
                        UPDATED_BY_CRAWLER costfubar
                        CrawlerSchemaSerializerVersion 1.0
                        recordCount 335239
                        averageRecordSize 27
                        exclusions ["s3://cost-optimization-107457606321/**.json","s3://cost-optimization-1XXXXXXXX21/**.csv","s3://cost-optimization-107457606321/**.sql","s3://cost-optimization-1XXXXXXXX321/**.gz","s3://cost-optimization-107457606321/**.zip","s3://cost-optimization-107457606321/**/cost_and_usage_data_status/*","s3://cost-optimization-107457606321/**.yml"]
                        CrawlerSchemaDeserializerVersion 1.0
                        compressionType none
                        typeOfData file
```

and has the following partitions shown in Glue:

```
partition_0                partition_1                year  month
detailed-cur-1XXXXXXXX57   detailed-cur-1XXXXXXXX57   2018  12
detailed-cur-1XXXXXXXXX57  detailed-cur-1XXXXXXXXX57  2022  4
detailed-cur-1XXXXXXXXX57  detailed-cur-1XXXXXXXXX57  2018  11
detailed-cur-1XXXXXXXX57   detailed-cur-1XXXXXXXX57   2018  10
```
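If MSCK REPAIR TABLE keeps failing with the DDLTask error, one way to narrow it down is to register a single partition explicitly and see whether that succeeds; a hedged sketch using boto3 (table name, partition values, S3 locations, and the results bucket are placeholders; the database name is taken from the error message):

```
# Hypothetical: register one partition explicitly instead of MSCK REPAIR TABLE.
# Table name, partition values, and S3 paths below are placeholders.
import boto3

athena = boto3.client("athena")

ddl = """
ALTER TABLE cost_optimization_table ADD IF NOT EXISTS
PARTITION (partition_0 = 'detailed-cur', partition_1 = 'detailed-cur', year = '2022', month = '4')
LOCATION 's3://my-cur-bucket/detailed-cur/detailed-cur/year=2022/month=4/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "costfubar"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
)
```
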
2 answers · 0 votes · 33 views · asked a month ago

Best way to overcome HIVE_PARTITION_SCHEMA_MISMATCH error in Athena while preserving structure of structs?

I ran the following Amazon Athena query on a table created by AWS Glue, which had crawled an Amazon S3 export of Synthea data from Amazon HealthLake:

```
SELECT * FROM "glue01234567890_fhir_export_abcdefghijklmnopqrst";
```

That resulted in this error:

> HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'code' in table 'demo.glue01234567890_fhir_export_abcdefghijklmnopqrst' is declared as type 'struct<coding:array<struct<system:string,code:string,display:string>>,text:string>', but partition 'partition_0=ImagingStudy' declared column 'id' as type 'string'. This query ran against the "demo" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id

I saw [one answer](https://repost.aws/questions/QU1vPk1069Q5qg4iyuQWTk6Q/while-importing-s-3-data-into-quicksight-from-glue-database-getting-the-following-error) suggesting to edit the crawler, go to output --> configuration, and select "Update all new and existing partitions with metadata from the table." However, that didn't resolve the error for me.

When I edit the schema manually and change column 'code' from `struct` to `string`, the error goes away, and Athena brings my attention to the next mismatch:

> partition 'partition_0=Immunization' declared column 'patient' as type 'struct<reference:string>'.

I don't think, however, that manually replacing all structs with strings will get me the results I want, because I need Glue and Athena to be aware of the fields inside the structs.

What is the best approach to overcoming this error while preserving the structure of the data, in this situation where I want to make the HealthLake export queryable by Athena? I understand I'll want to add a step to the ETL in Glue to first convert to Parquet or ORC for Athena performance reasons, but right now I'm dealing with a small sample dataset and just want to focus on getting a minimum viable end-to-end flow going.
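For what it's worth, the console option mentioned above corresponds to a crawler configuration flag that can also be set through the API; a hedged sketch follows (the crawler name is a placeholder, and this alone may not fix the mismatch if partition schemas were already written before the change):

```
# Hypothetical: make the crawler overwrite partition metadata with the table's
# schema, the API equivalent of "Update all new and existing partitions with
# metadata from the table". Crawler name is a placeholder.
import json
import boto3

glue = boto3.client("glue")

configuration = {
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
    },
}

glue.update_crawler(
    Name="healthlake-export-crawler",  # placeholder
    Configuration=json.dumps(configuration),
)
```

The crawler has to re-run after the change, and partition metadata created before the change may still need to be cleared before the mismatch disappears.
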
1 answer · 0 votes · 11 views · asked a month ago

Athena query consistently fails with HIVE_CURSOR_ERROR: Failed to read ORC file

I am seeing Athena queries over a bucket containing ORC files fail with the error message 'HIVE_CURSOR_ERROR: Failed to read ORC file'. Any query over the entirety of the data in the bucket fails. A specific example query is:

```
SELECT * FROM reachcounts_outbound WHERE calculation='a8d9458d-83e2-4e94-b272-3dbcd91296a0'
```

where `calculation` is set up as a partition in the `reachcounts_outbound` table (which is backed by an S3 bucket, unscoreit-reachcounts-outbound).

I've validated that the file referenced by the error message is a valid ORC file by downloading it and running `orc-tools data` on it, and the contents are what I'd expect. I've downloaded other ORC files in the bucket and compared them; they have the same schema, that schema is what I'd expect it to be, and it matches the schema I've defined for the table.

I've tried deleting the individual file referenced when the error message first appeared. However, the query continues to fail with the same message, pointing at a different file in the bucket. However, if I specify a LIMIT clause of any number under 1597894 on the query above, it will succeed.

I've tried running MSCK REPAIR TABLE on the reachcounts_outbound table. This did not change anything.

The query ID of a request that caused a failure is `54480f27-1992-40f7-8240-17cc622f91db`. Thanks!

Update: The ORC files that are rejected all appear to have exactly 10,000 rows, which is the stride size for the file.
0 answers · 0 votes · 5 views · asked 2 months ago

Advice for best database/data storage for historical data

Hi, I'm doing some research to find the best place to centralize lots of data logs generated by my application, considering pricing, performance, and scalability. Today all my application data, including logs, is stored in an Oracle database, but I'm thinking of moving all the LOG-related data out of it to reduce its size and not worry about storage performance, etc. Just put everything in an "infinite" storage apart from my actual database, using CDC or a regular batch process.

**Below are some needs:**

- Only inserts are necessary (no updates or deletes)
- Customers will need access to this historic data
- Well-defined pattern of access (one or two indexes at maximum)
- Latencies of a few seconds are OK
- Avoid infrastructure, DBA, and performance bottlenecks long term
- Infinite retention period (meaning I don't want to worry about performance issues or storage size in the long term, but something that can handle a few terabytes of data)

**Use case example:** Historical sales orders by item ( id_item | id_customer | qty_sold | date_inserted ... ), approximately 50 million records per day, where I would need to see the historical data by item and by customer, for example (two dimensions).

I've done some research on the options below:

**S3 + Athena** -> Put everything on S3, no worries about infrastructure or performance issues; however, as I need to query by item and customer, it would probably be necessary to break files by item or customer and generate millions of partitions to avoid high costs from searching every file, etc.

**Postgres** -> Not sure if it could become a performance bottleneck once tables get too big, even with partition strategies.

**DynamoDB** -> Not sure if it's a good alternative for historical data in terms of pricing, given that seconds of latency are OK.

**MongoDB / DocumentDB** -> Not very familiar with it (I'd prefer a SQL-type language), but I know it has good scalability.

**Cassandra** -> Don't know it very well.

**Time-series DBs such as InfluxDB, Timestream, etc.** -> Don't know them very well, but they seem appropriate for time series.

What option would you choose? Sorry in advance if I'm saying something wrong or impossible :) Thank you!
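For the S3 + Athena option, the partitioned layout doesn't have to be hand-rolled; as a rough illustration only (all names are placeholders, and partitioning by date rather than by item/customer is an assumption that depends on the query pattern), awswrangler can append partitioned Parquet and keep the Glue table updated in one call:

```
# Hypothetical: append daily sales-order logs to S3 as partitioned Parquet and
# register/update the Glue table so Athena can query it. Names are placeholders.
import awswrangler as wr
import pandas as pd

df = pd.DataFrame(
    {
        "id_item": [101, 102],
        "id_customer": [1, 2],
        "qty_sold": [3, 5],
        "date_inserted": ["2022-05-01", "2022-05-01"],
    }
)

wr.s3.to_parquet(
    df=df,
    path="s3://my-sales-history/orders/",  # placeholder bucket
    dataset=True,
    mode="append",
    partition_cols=["date_inserted"],      # partition choice is an assumption
    database="sales_history",              # placeholder Glue database
    table="orders",
)
```
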
1 answer · 0 votes · 5 views · asked 2 months ago

HIVE_BAD_DATA: Error parsing field value '2021-10-31 22:00:00.0' for field 3: For input string: "2021-10-31 22:00:00.0"

I am reading files from S3 and using a Glue ETL job to populate a Data Catalog table. My S3 files look like this:

```
VID,ALTID,VTYPE,TIME,TIMEGMT,VALUE
ABC, ABC, DATAFEED,31-10-2021 22:00:00,01-11-2021 02:00:00, 11775.685
```

The schema read in looks like this:

```
root
|-- VID: string
|-- ALTID: string
|-- VTYPE: string
|-- TIME: string
|-- TIMEGMT: string
|-- VALUE: string
```

I am changing fields 3 and 4 from strings to timestamps, which matches the schema of my Data Catalog table. I am also doing a few other transformations. I am transforming them like so:

```
df = df.withColumn("time", to_timestamp("time", 'dd-MM-yyyy HH:mm:ss'))
df = df.withColumn("timegmt", to_timestamp("timegmt", 'dd-MM-yyyy HH:mm:ss'))
```

When I try to read the data with Athena, I get an error:

`HIVE_BAD_DATA: Error parsing field value '2021-10-31 22:00:00.0' for field 3: For input string: "2021-10-31 22:00:00.0"`

The Data Catalog table schema looks like this:

| Column Name | Data Type |
| --- | --- |
| vid | string |
| altid | string |
| vtype | string |
| time | timestamp |
| timegmt | timestamp |
| value | int |
| filename | string |

And the line in the `run-1647806179090-part-r-00000` file it is choking on looks like this:

```
vid,altid,vtype,time,timegmt,value,filename
ABC,ABC, DATAFEED,"2021-10-31 22:00:00.0","2021-11-01 02:00:00.0",11775,"lf_buf_20211101_005904.csv"
```

Does anyone know why it would throw this error? I believe, according to the documentation, this is the correct timestamp format.
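One possibility to check (an assumption about the table definition, which isn't shown): if the Data Catalog table uses OpenCSVSerDe, Athena expects TIMESTAMP values in CSV data as numeric UNIX epochs rather than formatted strings, which would fit the "For input string" wording. A hedged sketch of converting the parsed timestamps to epoch milliseconds before writing, continuing from the `df` above:

```
# Hypothetical workaround: emit epoch milliseconds instead of formatted strings.
# Assumes df is the same DataFrame as in the snippet above.
from pyspark.sql.functions import col, to_timestamp, unix_timestamp

df = df.withColumn("time", to_timestamp("time", "dd-MM-yyyy HH:mm:ss"))
df = df.withColumn("timegmt", to_timestamp("timegmt", "dd-MM-yyyy HH:mm:ss"))

# unix_timestamp() returns seconds; multiply to get milliseconds as a bigint.
df = df.withColumn("time", (unix_timestamp(col("time")) * 1000).cast("long"))
df = df.withColumn("timegmt", (unix_timestamp(col("timegmt")) * 1000).cast("long"))
```
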
1 answer · 0 votes · 14 views · asked 2 months ago

DataSourceArn error for Athena QuickSight in CloudFormation template

AWSTemplateFormatVersion: '2010-09-09'
Description: 'Creating QuickSight data source'
Resources:
  QuickSightDataSource:
    Type: AWS::QuickSight::DataSource
    Properties:
      AwsAccountId: !Ref AWS::AccountId
      Name: Testing Data Source
      Type: ATHENA
      DataSourceId: testing-data-source
      DataSourceParameters:
        AthenaParameters:
          Workgroup: primary
      Permissions:
        - Actions:
            - quicksight:DescribeDataSource
            - quicksight:DescribeDataSourcePermissions
            - quicksight:PassDataSource
          Principal: !Sub
            - 'arn:aws:quicksight:us-east-1:${Account}:user/default/my-user-name'
            - Account: !Ref AWS::AccountId
  QSDataSet:
    DependsOn: QuickSightDataSource
    Type: AWS::QuickSight::DataSet
    Properties:
      AwsAccountId: !Ref 'AWS::AccountId'
      DataSetId: 'QSDataSet-test'
      ImportMode: SPICE
      Name: Dataset-test
      Permissions:
        - Actions:
            - 'quicksight:CreateDataSet'
            - 'quicksight:DeleteDataSet'
            - 'quicksight:DescribeDataSet'
            - 'quicksight:DescribeDataSetPermissions'
            - 'quicksight:PassDataSet'
            - 'quicksight:UpdateDataSet'
            - 'quicksight:UpdateDataSetPermissions'
          Principal: !Sub
            - 'arn:aws:quicksight:us-east-1:${AWS::AccountId}:user/default/my-user-name'
            - Account: !Ref 'AWS::AccountId'
      PhysicalTableMap:
        downtime_data_json:
          DataSourceArn: !GetAtt QuickSightDataSource.Arn
          InputColumns:
            - Name: downtime
              Type: INTEGER
          Schema: XXX???

I have a data source and data tables in Athena and want to refer to them in the above template. I am not finding DataSourceArn; the template above throws the error "Model validation failed (#: extraneous key [DataSourceArn] is not permitted)". Can anyone please help me work this out?
0 answers · 0 votes · 1 view · asked 2 months ago

LakeFormation assumed role cannot access s3 objects during Athena requests ("Permission denied on S3 path" )

I have Delta Lake tables (using the symlink text input format) catalogued in Glue, stored in an S3 bucket, with all their resources tagged with Lake Formation tags (for tag-based governance). The problem is that, although the users can see the database, tables, and metadata within Athena's catalogue, they cannot perform queries against the specific tables because of "Permission denied on S3 path" errors.

Lake Formation has the data location registered for the data lake bucket, with the AWSServiceRoleForLakeFormationDataAccess role. And this role has IAM permissions automatically added to the resources:

```
LakeFormationDataAccessServiceRolePolicy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListAllMyBuckets"],
      "Resource": ["arn:aws:s3:::*"]
    }
  ]
}
```

and

```
LakeFormationDataAccessPolicyForS3
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "LakeFormationDataAccessPermissionsForS3",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::MYBUCKET/*"]
    },
    {
      "Sid": "LakeFormationDataAccessPermissionsForS3ListBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::MYBUCKET"]
    }
  ]
}
```

I have also tried registering the data location with a role with admin permissions (Action "*" and Resources "*"), but even so the same error is thrown.

Looking through the CloudTrail logs, I found that Lake Formation passes custom policies to the role when running AssumeRole:

```
"policy": "{\n \"Version\": \"2012-10-17\",\n \"Statement\": [\n {\n \"Action\": [\"s3:GetObject\"],\n \"Effect\": \"Allow\",\n \"Resource\": [\"arn:aws:s3:::MYBUCKET\",\n \"Condition\": {\"ForAnyValue:StringLike\":{\"s3:prefix\":[\"MYTABLE/_symlink_format_manifest\",\"MYTABLE/_symlink_format_manifest/*\"]}}\n },\n {\n \"Action\": [\"kms:Decrypt\"],\n \"Effect\": \"Allow\",\n \"Resource\": [\"*\"],\n \"Condition\": {\"StringEquals\":{\"kms:ViaService\":[\"s3.us-east-2.amazonaws.com\"]}}\n } ]\n}" }
```

This looks like a malformed JSON string being passed to the assumed role. Could this be causing the errors I'm having? And has anyone had this issue before?

PS: I have manually removed ACL control over the S3 bucket and objects. Still the same behavior. The error is not shown if I remove the data location, and Athena ignores Lake Formation.
1 answer · 0 votes · 100 views · asked 4 months ago

Athena QuickSight query times out

I am using Athena to connect to my DocumentDB database with the intention of visualising the data in QuickSight. I can see the connection is correct, as the database name and table name load properly. When I go to create an analysis, or even preview the data, it times out, and a refresh just causes the same result. The table only contains 40 or so documents, so it's not an issue of the amount of data. I have tried with direct query and SPICE.

This is the error returned when attempting to reconnect from QuickSight:

```
Error Details
Error type: SQL_EXCEPTION
Error details: [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. GENERIC_USER_ERROR: Encountered an exception[com.amazonaws.SdkClientException] from your LambdaFunction[arn:aws:lambda:eu-west-1:717816899369:function:documentdb-connector] executed in context[retrieving meta-data] with message[Unable to execute HTTP request: Connect to s3.eu-west-1.amazonaws.com:443 [s3.eu-west-1.amazonaws.com/52.218.57.91] failed: connect timed out] [Execution ID: 56486351-0cba-4e81-9d9f-3ff4722c198f]
Ingestion Id: 061d9b7e-c7f3-49a7-88ba-531551f6cc9a
```

The SQL_EXCEPTION description states that it can be caused by a timeout, so I figure that's the issue:

`SQL_EXCEPTION – A general SQL error occurred. This error can be caused by query timeouts, resource constraints, unexpected data definition language (DDL) changes before or during a query, and other database errors. Check your database settings and your query, and try again.`

Could there be an issue with the Lambda function being used by Athena to connect to DocumentDB? Looking at the logs, the connection seems to be working. I thought QuickSight would just be able to pull in the data now that everything is linked up. Any advice is appreciated.
0 answers · 0 votes · 9 views · asked 4 months ago

AWS Data Wrangler Athena Query Failure

I'm attempting to run an Athena query using AWS Data Wrangler, which is failing with an error indicating a column cannot be resolved. However, when I view the failing query in the Athena console and rerun it there, it succeeds. I have a feeling it has something to do with the formatting of table or column names, but I've been unsuccessful in finding a workaround. When I attempt the same query using vanilla boto3, it also fails with the same error. Here is an example of the code which fails, followed by the same successful query run in the console (with some names anonymized for security). Adding to the strangeness further, if I remove the joins, it will error indicating the first column in the select cannot be resolved, but the same query runs successfully in the console.

### Python Code

```
import awswrangler as wr

query = """
SELECT
    t1.description,
    t1.model_number as model,
    t1.p_id as p_id,
    t1.created_at as createdAt,
    t1.updated_at as updatedAt,
    t3.description as platform,
    t2.name as module
FROM database_with_underscores.first_table t1
LEFT JOIN database_with_underscores.other_table t2 ON t2.id = t1.module_id
LEFT JOIN database_with_underscores.lasttable t3 ON t1.platform_id = t3.id
"""

df = wr.athena.read_sql_query(query, workgroup="primary", database="database_with_underscores")
```

### Python Result

`awswrangler.exceptions.QueryFailed: SYNTAX_ERROR: line 19:5: Column 't2.id' cannot be resolved`

### AWS Athena Console

```
SELECT
    t1.description,
    t1.model_number as model,
    t1.p_id as p_id,
    t1.created_at as createdAt,
    t1.updated_at as updatedAt,
    t3.description as platform,
    t2.name as module
FROM database_with_underscores.first_table t1
LEFT JOIN database_with_underscores.other_table t2 ON t2.id = t1.module_id
LEFT JOIN database_with_underscores.lasttable t3 ON t1.platform_id = t3.id
```

### Console Result

Successful

Link to a diff of the execution where you can see the query is identical but one succeeds and the other fails: https://imgur.com/a/V9QSbIn
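One difference between the console and awswrangler that may be worth ruling out (an assumption, not a confirmed cause): `read_sql_query` defaults to `ctas_approach=True`, which wraps the SQL in a CTAS statement, so the statement Athena actually executes is not identical to the one pasted into the console. Disabling it is a quick test:

```
# Hypothetical test: run the same query without the CTAS wrapper.
import awswrangler as wr

df = wr.athena.read_sql_query(
    query,  # same SQL string as in the snippet above
    database="database_with_underscores",
    workgroup="primary",
    ctas_approach=False,  # defaults to True
)
```
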
2 answers · 0 votes · 21 views · asked 4 months ago