By using AWS re:Post, you agree to the Terms of Use

Questions tagged with Amazon Athena

Sort by most recent

Browse through the questions and answers listed below or filter and sort to narrow down your results.

Query Athena from s3 database - remove metadata/corrupted data

Hello, I was following along with the tutorials for connecting Tableau to Amazon Athena and got hung up when running the query and returning the expected result. I downloaded the student-db.csv from https://github.com/aws-samples/amazon-athena-tableau-integration and uploaded the csv to a S3 bucket that I created. I can create the database within Athena however when I create a table either with the bulk add or directly from the query editor and preview with a query the data gets corrupted. and includes unexpected characters and unexpected/unnecessary punctuations and sometimes all the data is aggregated into a single column and also contains metadata such as "1 ?20220830_185102_00048_tnqre"0 2 ?hive" 3 Query Plan* 4 Query Plan2?varchar8 @H?P?". Also with my Athena - Tableau connected receiving the same issues when I preview the table that was created with Athena and stored in my bucket. ``` CREATE EXTERNAL TABLE IF NOT EXISTS student( `school` string, `country` string, `gender` string, `age` string, `studytime` int, `failures` int, `preschool` string, `higher` string, `remotestudy` string, `health` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://jj2-test-bucket/' TBLPROPERTIES ( 'has_encrypted_data'='false', 'skip.header.line.count'='1', 'transient_lastDdlTime'='1595149168') ``` ``` SELECT * FROM "studentdb"."student" limit 10; ``` ![Enter image description here](/media/postImages/original/IMVPFHEHtDTo2aPRoGcWSBBg)
1
answers
0
votes
27
views
asked a month ago

Querying Latest Available Partition

I am building an ETL pipeline using primarily state machines, Athena, and the Glue catalog. In general things work in the following way: 1. A table, partitioned by "version", exists in the Glue Catalog. The table represents the output destination of some ETL process. 2. A step function (managed by some other process) executes "INSERT INTO" athena queries. The step function supplies a "version" that is used as part of the "INSERT INTO" query so that new data can be appended into the table defined in (1). The table contains all "versions" - it's a historical table that grows over time. My question is: What is a good way of exposing a view/table that allows someone (or something) to query only the latest "version" partition for a given historically partitioned table? I've looked into other table types AWS offers, including Governed tables and Iceberg tables. Each seems to have some incompatibility with our existing or planned future architecture: 1. Governed tables do not support writes via athena insert queries. Only Glue ETL/Spark seems to be supported at the moment. 2. Iceberg tables do not support Lake Formation data filters (which we'd like to use in the future to control data access) 3. Iceberg tables also seem to have poor performance. Anecdotally, it can take several seconds to insert a very small handful of rows to a given iceberg table. I'd worry about future performance when we want to insert a million rows. Any guidance would be appreciated!
1
answers
0
votes
50
views
asked a month ago