Query Athena from S3 database - remove metadata/corrupted data


Hello, I was following along with the tutorial for connecting Tableau to Amazon Athena and got hung up when running the query: it does not return the expected result. I downloaded student-db.csv from https://github.com/aws-samples/amazon-athena-tableau-integration and uploaded the CSV to an S3 bucket that I created. I can create the database within Athena, but when I create a table (either with the bulk add or directly from the query editor) and preview it with a query, the data comes back corrupted: it includes unexpected characters and punctuation, sometimes all the data is aggregated into a single column, and it also contains metadata such as "1 ?20220830_185102_00048_tnqre"0 2 ?hive" 3 Query Plan* 4 Query Plan2?varchar8 @H?P?". My Athena-to-Tableau connection hits the same issue when I preview the table that was created with Athena and stored in my bucket.

CREATE EXTERNAL TABLE IF NOT EXISTS student(
  `school` string, 
  `country` string, 
  `gender` string, 
  `age` string, 
  `studytime` int, 
  `failures` int, 
  `preschool` string, 
  `higher` string, 
  `remotestudy` string, 
  `health` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://jj2-test-bucket/'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'skip.header.line.count'='1', 
  'transient_lastDdlTime'='1595149168');

SELECT * FROM "studentdb"."student" LIMIT 10;


1 Answer

The solution is to create a separate S3 bucket to house the query results. Additionally, when connecting Tableau, you must set the S3 Staging Directory to the location of the query-result bucket rather than to the S3 bucket that contains your raw data/CSV.
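One way to confirm this diagnosis is Athena's "$path" pseudo-column, which reports the S3 object each row was read from. This is a diagnostic sketch reusing the database and table names from the question; if Athena's own query-result files appear in the output, they are being scanned as table data:

-- List every S3 object the table currently scans.
-- Query-result files (e.g. <query-id>.csv and <query-id>.csv.metadata)
-- showing up here means Athena output is mixed in with the raw data.
SELECT DISTINCT "$path"
FROM "studentdb"."student";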

answered 2 years ago
AWS
EXPERT
reviewed 2 years ago
  • As explained in the answer, based on the data you posted it seems that additional data is being written to the path of your table. You have to make sure that only the actual data for your table is stored in that location.

    You may want to create a folder "student" in your bucket, move the raw data into it, and change the DDL of the table as follows (a verification step is shown after the statement):

    CREATE EXTERNAL TABLE IF NOT EXISTS student(
      `school` string, 
      `country` string, 
      `gender` string, 
      `age` string, 
      `studytime` int, 
      `failures` int, 
      `preschool` string, 
      `higher` string, 
      `remotestudy` string, 
      `health` string)
    ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY ',' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://jj2-test-bucket/student/'
    TBLPROPERTIES (
      'has_encrypted_data'='false', 
      'skip.header.line.count'='1', 
      'transient_lastDdlTime'='1595149168');
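    After moving the CSV under s3://jj2-test-bucket/student/, a quick way to verify the fix (a sketch reusing the names from the question) is to drop the old table definition and re-create it. DROP TABLE on an external table removes only the catalog entry, not the data in S3:

    -- Dropping an EXTERNAL table leaves the CSV in S3 untouched.
    DROP TABLE IF EXISTS studentdb.student;

    -- Re-run the CREATE EXTERNAL TABLE statement above, then preview:
    SELECT * FROM "studentdb"."student" LIMIT 10;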
